Title
Speeding-up the kernel k-means clustering method: A prototype based hybrid approach
Abstract
Kernel k-means clustering method has been proved to be effective in identifying non-isotropic and linearly inseparable clusters in the input space. However, this method is not a suitable one for large datasets because of its quadratic time complexity with respect to the size of the dataset. This paper presents a simple prototype based hybrid approach to speed-up the kernel k-means clustering method for large datasets. The proposed method works in two stages. First, the dataset is partitioned into a number of small grouplets by using the leaders clustering method which takes the size of each grouplet, called the threshold t, as an input parameter. The conventional leaders clustering method is modified such that these grouplets are formed in the kernel induced feature space, but each grouplet is represented by a pattern (called its leader) in the input space. The dataset is re-indexed according to these grouplets. Later, the kernel k-means clustering method is applied over the set of leaders to derive a partition of the leaders set. Finally, each leader is replaced by its group to get a partition of the entire dataset. The time complexity as well as space complexity of the proposed method is O(n+p^2), where p is the number of leaders. The overall running time and the quality of the clustering result depends on the threshold t and the order in which the dataset is scanned. This paper presents a study on how the input parameter t affects the overall running time and the clustering quality obtained by the proposed method. Further, both theoretically and experimentally it has been shown how the order of scanning of the dataset affects the clustering result. The proposed method is also compared with the other recent methods that are proposed to speed-up the kernel k-means clustering method. Experimental study with several real world as well as synthetic datasets shows that, for an appropriate value of t, the proposed method can significantly reduce the computation time but with a small loss in clustering quality, particularly for large datasets.
Year
DOI
Venue
2013
10.1016/j.patrec.2012.11.009
Pattern Recognition Letters
Keywords
Field
DocType
entire dataset,input space,computation time,hybrid approach,large datasets,clustering result,proposed method work,input parameter,recent method,clustering quality
Fuzzy clustering,Canopy clustering algorithm,CURE data clustering algorithm,Data stream clustering,Correlation clustering,Pattern recognition,Artificial intelligence,Constrained clustering,Cluster analysis,Mathematics,Single-linkage clustering
Journal
Volume
Issue
ISSN
34
5
0167-8655
Citations 
PageRank 
References 
10
0.51
25
Authors
3
Name
Order
Citations
PageRank
T. Hitendra Sarma1262.16
P. Viswanath214811.77
B. Eswara Reddy37811.25