Stream Clustering: Efficient Kernel-Based Approximation Using Importance Sampling. - Citegraph

Paper Info

Title
Stream Clustering: Efficient Kernel-Based Approximation Using Importance Sampling.

Abstract
Stream clustering methods, which group continuous, temporally ordered dynamic data instances, have been used in a number of applications such as stock market analysis, network analysis, and cosmological analysis. Most of the popular stream clustering algorithms are linear in nature, i.e. they assume that the data is linearly separable in the input space and use measures such as the Euclidean distance to define the inter-point similarity. Though these linear clustering algorithms are efficient, they do no achieve acceptable cluster quality on real-world data. Kernel-based clustering algorithms, which use non-linear similarity measures, yield better cluster quality, but are unsuitable for clustering data streams due to their high running time and memory complexity. We propose an efficient kernel-based clustering algorithm, called the Approximate Stream Kernel k-means, which uses importance sampling to sample a subset of the data stream, and clusters the entire stream based on each data point's similarity to the sampled data points in real-time. Every time a data point is sampled, the kernel matrix representing the similarity between the sampled points is updated, and projected to a low dimensional space spanned by its top eigenvectors. The data points are clustered in this low-dimensional space using the k-means algorithm. Thus, the Approximate Stream Kernel k-means algorithm performs clustering in linear time using kernel-based similarity. We show that only a small number of points need to be sampled from the data stream, and the resulting approximation error is well-bounded. Using several large benchmark data sets to simulate data streams, we demonstrate that the proposed algorithm achieves a significant speedup over other kernel-based clustering algorithms with minimal loss in cluster quality. We also demonstrate the practical applicability of the proposed algorithm by using it to efficiently find trending topics in the Twitter stream data.

Year	DOI	Venue
2015	10.1109/ICDMW.2015.31	ICDM Workshops
Keywords	Field	DocType
Stream clustering, Kernel clustering, k-means, Twitter stream	Fuzzy clustering,Data mining,CURE data clustering algorithm,Data stream mining,Computer science,Artificial intelligence,Cluster analysis,Canopy clustering algorithm,Data stream clustering,Correlation clustering,Pattern recognition,Constrained clustering,Machine learning	Conference
Citations	PageRank	References
0	0.34	14
Authors
3

Authors (3 rows)

Cited by (0 rows)

References (14 rows)

Name	Order	Citations	PageRank
Radha Chitta	1	177	8.01
Rong Jin	2	6206	334.26
Anil Jain	3	33507	3334.84

1