Efficient MapReduce Kernel k-Means for Big Data Clustering. - Citegraph

Paper Info

Title
Efficient MapReduce Kernel k-Means for Big Data Clustering.

Abstract
Data clustering is an unsupervised learning task that has found many applications in various scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. A classic clustering algorithm is the so-called k-Means. It is very popular; however, it cannot handle cases in which the clusters are not linearly separable. Kernel k-Means is a state-of-the-art clustering algorithm that employs the kernel trick to perform clustering in a higher-dimensional space, thus overcoming the limitations of classic k-Means regarding the nonlinear separability of the input data. It has recently received a distributed implementation, named Trimmed Kernel k-Means, following the MapReduce distributed computing model. In addition to performing the computations in a distributed manner, Trimmed Kernel k-Means also trims the kernel matrix in order to reduce the memory requirements and improve performance. Each row of the kernel matrix is trimmed by estimating the cardinality of the cluster that the corresponding sample belongs to and removing the kernel matrix entries that connect the sample to samples that probably belong to another cluster. The Spark cluster computing framework was used for the distributed implementation. In this paper, we present a distributed clustering scheme based on Trimmed Kernel k-Means that employs subsampling in order to efficiently perform clustering on an extremely large dataset. The results indicate that the proposed method runs much faster than the original Trimmed Kernel k-Means, while still providing clustering performance competitive with other state-of-the-art kernel approaches.
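
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal single-machine sketch of plain Kernel k-Means, not the paper's Trimmed Kernel k-Means and not a MapReduce or Spark implementation. It omits kernel matrix trimming and subsampling entirely. The function names `rbf_kernel` and `kernel_kmeans`, the choice of an RBF kernel, and the parameters `gamma`, `n_iter`, and `seed` are assumptions made for illustration. The squared feature-space distance from sample i to the centroid of cluster c is computed from kernel entries alone as K_ii - (2/|c|) sum_{j in c} K_ij + (1/|c|^2) sum_{j,l in c} K_jl.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """Dense RBF kernel matrix (illustrative kernel choice, assumed here)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(d2, 0.0))


def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Plain Kernel k-Means on a precomputed kernel matrix K (n x n).

    Hypothetical helper: no trimming, no subsampling, no distribution.
    """
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)  # random initial assignment
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)  # empty clusters stay at infinity
        for c in range(k):
            mask = labels == c
            size = mask.sum()
            if size == 0:
                continue
            # (1/|c|) * sum_j K_ij for every sample i
            cross = K[:, mask].sum(axis=1) / size
            # (1/|c|^2) * sum_{j,l} K_jl, identical for every sample
            within = K[np.ix_(mask, mask)].sum() / size ** 2
            dist[:, c] = diag - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    return labels


if __name__ == "__main__":
    # Toy data: two groups of identical points, far apart.
    X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10.0])
    K = rbf_kernel(X, gamma=0.1)
    print(kernel_kmeans(K, k=2))
```

The sketch stores the full n x n kernel matrix, which is the memory cost that the abstract says trimming and subsampling are meant to reduce.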

Year	DOI	Venue
2016	10.1145/2903220.2903255	SETN
Keywords	Field	DocType
Kernel k-Means, clustering, Big Data, distributed computing, MapReduce	Data mining, Radial basis function kernel, Kernel embedding of distributions, Computer science, Tree kernel, Polynomial kernel, Artificial intelligence, Cluster analysis, String kernel, Kernel method, Variable kernel density estimation, Machine learning	Conference
Citations	PageRank	References
3	0.41	16
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Nikolaos Tsapanos	1	26	3.87
Anastasios Tefas	2	2055	177.05
Nikolaos Nikolaidis	3	108	10.31
Ioannis Pitas	4	6478	626.09
