Title
Sparse Kernel Clustering of Massive High-Dimensional Data sets with Large Number of Clusters
Abstract
In clustering applications involving documents and images, in addition to the large number of data points (N) and their high dimensionality (d), the number of clusters (C) into which the data need to be partitioned is also large. Kernel-based clustering algorithms, which have been shown to perform better than linear clustering algorithms, have high running time complexity in terms of N, d and C. We propose an efficient sparse kernel k-means clustering algorithm, which incrementally samples the most informative points from the data set using importance sampling, and constructs a sparse kernel matrix using these sampled points. Each row in this matrix corresponds to a data point's similarity with its p-nearest neighbors among the sampled points (p -- N). This sparse kernel matrix is used to perform clustering and obtain the cluster labels. This combination of sampling and sparsity reduces both the running time and memory complexity of kernel clustering. In order to further enhance its efficiency, the proposed algorithm projects the data on to the top C eigenvectors of the sparse kernel matrix and clusters these eigenvectors using a modified k-means algorithm. The running time of the proposed sparse kernel k-means algorithm is linear in N and d, and logarithmic in C. We show analytically that only a small number of points need to be sampled from the data set, and the resulting approximation error is well-bounded. We demonstrate, using several large high-dimensional text and image data sets, that the proposed algorithm is significantly faster than classical kernel-based clustering algorithms, while maintaining clustering quality.
Year
DOI
Venue
2015
10.1145/2809890.2809896
PIKM@CIKM
Keywords
Field
DocType
Big data, Kernel clustering, Importance sampling, Sparsity
Canopy clustering algorithm,CURE data clustering algorithm,Data stream clustering,Pattern recognition,Correlation clustering,Computer science,Determining the number of clusters in a data set,Artificial intelligence,Cluster analysis,Variable kernel density estimation,Single-linkage clustering
Conference
Citations 
PageRank 
References 
0
0.34
16
Authors
3
Name
Order
Citations
PageRank
Radha Chitta11778.01
Anil Jain2335073334.84
Rong Jin36206334.26