Title
Stream Clustering: Efficient Kernel-Based Approximation Using Importance Sampling.
Abstract
Stream clustering methods, which group continuous, temporally ordered dynamic data instances, have been used in a number of applications such as stock market analysis, network analysis, and cosmological analysis. Most of the popular stream clustering algorithms are linear in nature, i.e. they assume that the data is linearly separable in the input space and use measures such as the Euclidean distance to define the inter-point similarity. Though these linear clustering algorithms are efficient, they do no achieve acceptable cluster quality on real-world data. Kernel-based clustering algorithms, which use non-linear similarity measures, yield better cluster quality, but are unsuitable for clustering data streams due to their high running time and memory complexity. We propose an efficient kernel-based clustering algorithm, called the Approximate Stream Kernel k-means, which uses importance sampling to sample a subset of the data stream, and clusters the entire stream based on each data point's similarity to the sampled data points in real-time. Every time a data point is sampled, the kernel matrix representing the similarity between the sampled points is updated, and projected to a low dimensional space spanned by its top eigenvectors. The data points are clustered in this low-dimensional space using the k-means algorithm. Thus, the Approximate Stream Kernel k-means algorithm performs clustering in linear time using kernel-based similarity. We show that only a small number of points need to be sampled from the data stream, and the resulting approximation error is well-bounded. Using several large benchmark data sets to simulate data streams, we demonstrate that the proposed algorithm achieves a significant speedup over other kernel-based clustering algorithms with minimal loss in cluster quality. We also demonstrate the practical applicability of the proposed algorithm by using it to efficiently find trending topics in the Twitter stream data.
Year
DOI
Venue
2015
10.1109/ICDMW.2015.31
ICDM Workshops
Keywords
Field
DocType
Stream clustering, Kernel clustering, k-means, Twitter stream
Fuzzy clustering,Data mining,CURE data clustering algorithm,Data stream mining,Computer science,Artificial intelligence,Cluster analysis,Canopy clustering algorithm,Data stream clustering,Correlation clustering,Pattern recognition,Constrained clustering,Machine learning
Conference
Citations 
PageRank 
References 
0
0.34
14
Authors
3
Name
Order
Citations
PageRank
Radha Chitta11778.01
Rong Jin26206334.26
Anil Jain3335073334.84