Title
Efficient MapReduce Kernel k-Means for Big Data Clustering.
Abstract
Data clustering is an unsupervised learning task that has found applications in many scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. A classic clustering algorithm is k-Means. It is very popular; however, it cannot handle cases in which the clusters are not linearly separable. Kernel k-Means is a state-of-the-art clustering algorithm that employs the kernel trick to perform clustering in a higher-dimensional space, thus overcoming the limitations of classic k-Means regarding the nonlinear separability of the input data. It has recently received a distributed implementation, named Trimmed Kernel k-Means, that follows the MapReduce distributed computing model and was built on the Spark cluster computing framework. In addition to performing the computations in a distributed manner, Trimmed Kernel k-Means also trims the kernel matrix in order to reduce memory requirements and improve performance. Each row of the kernel matrix is trimmed by estimating the cardinality of the cluster to which the corresponding sample belongs and removing the kernel matrix entries that connect the sample to samples that probably belong to another cluster. In this paper, we present a distributed clustering scheme, based on Trimmed Kernel k-Means, that employs subsampling in order to efficiently cluster extremely large datasets. The results indicate that the proposed method runs much faster than the original Trimmed Kernel k-Means, while still providing clustering performance competitive with other state-of-the-art kernel approaches.
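As a rough illustration of the kernel trick mentioned in the abstract, the following is a minimal single-machine sketch of plain kernel k-Means in NumPy. It is not the Trimmed Kernel k-Means of the paper: it performs no trimming, no subsampling, and no MapReduce or Spark distribution. The function names `rbf_kernel` and `kernel_kmeans`, the choice of an RBF kernel, and all parameters are assumptions made for this example only.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise squared Euclidean distances, then the Gaussian (RBF) kernel.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def kernel_kmeans(K, k, n_iter=50, seed=0):
    # K is the full n x n kernel matrix (illustrative; no trimming is done).
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)  # random initial assignment
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            m = members.sum()
            if m == 0:
                continue  # empty cluster: its distance stays infinite
            # Feature-space distance to the cluster mean, from kernel values only:
            # ||phi(x_i) - mu_c||^2
            #   = K_ii - (2/m) * sum_j K_ij + (1/m^2) * sum_{j,l} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels
```

Calling `kernel_kmeans(rbf_kernel(X, gamma=1.0), k=2)` on an `n x d` array `X` returns one integer cluster label per row. The full kernel matrix costs O(n^2) memory, which is the cost that trimming and subsampling, as described in the abstract, aim to reduce.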
Year
2016
DOI
10.1145/2903220.2903255
Venue
SETN
Keywords
Kernel k-Means, clustering, Big Data, distributed computing, MapReduce
Field
Data mining, Radial basis function kernel, Kernel embedding of distributions, Computer science, Tree kernel, Polynomial kernel, Artificial intelligence, Cluster analysis, String kernel, Kernel method, Variable kernel density estimation, Machine learning
DocType
Conference
Citations
3
PageRank
0.41
References
16
Authors
4
Name                 Order  Citations  PageRank
Nikolaos Tsapanos    1      26         3.87
Anastasios Tefas     2      2055       177.05
Nikolaos Nikolaidis  3      108        10.31
Ioannis Pitas        4      6478       626.09