Sparse Kernel Clustering of Massive High-Dimensional Data sets with Large Number of Clusters - Citegraph

Paper Info

Title
Sparse Kernel Clustering of Massive High-Dimensional Data sets with Large Number of Clusters

Abstract
In clustering applications involving documents and images, in addition to the large number of data points (N) and their high dimensionality (d), the number of clusters (C) into which the data need to be partitioned is also large. Kernel-based clustering algorithms, which have been shown to perform better than linear clustering algorithms, have high running time complexity in terms of N, d and C. We propose an efficient sparse kernel k-means clustering algorithm, which incrementally samples the most informative points from the data set using importance sampling, and constructs a sparse kernel matrix using these sampled points. Each row in this matrix corresponds to a data point's similarity with its p-nearest neighbors among the sampled points (p -- N). This sparse kernel matrix is used to perform clustering and obtain the cluster labels. This combination of sampling and sparsity reduces both the running time and memory complexity of kernel clustering. In order to further enhance its efficiency, the proposed algorithm projects the data on to the top C eigenvectors of the sparse kernel matrix and clusters these eigenvectors using a modified k-means algorithm. The running time of the proposed sparse kernel k-means algorithm is linear in N and d, and logarithmic in C. We show analytically that only a small number of points need to be sampled from the data set, and the resulting approximation error is well-bounded. We demonstrate, using several large high-dimensional text and image data sets, that the proposed algorithm is significantly faster than classical kernel-based clustering algorithms, while maintaining clustering quality.

Year	DOI	Venue
2015	10.1145/2809890.2809896	PIKM@CIKM
Keywords	Field	DocType
Big data, Kernel clustering, Importance sampling, Sparsity	Canopy clustering algorithm,CURE data clustering algorithm,Data stream clustering,Pattern recognition,Correlation clustering,Computer science,Determining the number of clusters in a data set,Artificial intelligence,Cluster analysis,Variable kernel density estimation,Single-linkage clustering	Conference
Citations	PageRank	References
0	0.34	16
Authors
3

Authors (3 rows)

Cited by (0 rows)

References (16 rows)

Name	Order	Citations	PageRank
Radha Chitta	1	177	8.01
Anil Jain	2	33507	3334.84
Rong Jin	3	6206	334.26

1