Title
Big Data Clustering With Kernel K-Means: Resources, Time And Performance
Abstract
Data clustering is an unsupervised learning task that has found many applications in various scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. A classic clustering algorithm is the so-called k-Means. It is very popular; however, it cannot handle cases in which the clusters are not linearly separable. Kernel k-Means is a state-of-the-art clustering algorithm that employs the kernel trick to perform clustering in a higher-dimensional space, thus overcoming the limitations of classic k-Means regarding the non-linear separability of the input data. With respect to the challenges of Big Data research, a field that has established itself in the last few years and involves performing tasks on extremely large amounts of data, several adaptations of Kernel k-Means have been proposed, each of which has different requirements in processing power and running time, while also incurring different trade-offs in performance. In this paper, we present several issues and techniques involving the use of Kernel k-Means for Big Data clustering and how the combination of each component in a clustering framework fares in terms of resources, time and performance. We use experimental results to evaluate several combinations and provide a recommendation on how to approach a Big Data clustering problem.
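To make the kernel trick mentioned in the abstract concrete, the following is a minimal sketch of plain (exact) Kernel k-Means in Python with NumPy. It is not the paper's implementation and does not cover the approximate or distributed variants the paper evaluates. The function names rbf_kernel and kernel_kmeans, the choice of an RBF kernel, and the gamma parameter are all assumptions made for illustration. The sketch relies on the standard identity that the squared feature-space distance between a sample and a cluster mean can be computed from kernel matrix entries alone.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Illustrative RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(d2, 0.0))


def kernel_kmeans(K, labels, k, n_iter=100):
    """Exact Kernel k-Means on a precomputed n x n kernel matrix K.

    labels holds the initial cluster assignment (integers in [0, k)).
    Returns the assignment after convergence or after n_iter iterations.
    """
    labels = np.asarray(labels).copy()
    n = K.shape[0]
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            m = members.sum()
            if m == 0:
                continue  # an empty cluster attracts no samples
            # ||phi(x_i) - mu_c||^2
            #   = K_ii - (2/m) * sum_j K_ij + (1/m^2) * sum_{j,l} K_jl,
            # with j and l ranging over the members of cluster c.
            dist[:, c] = (
                diag
                - 2.0 * K[:, members].sum(axis=1) / m
                + K[np.ix_(members, members)].sum() / m ** 2
            )
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels
```

Note that this exact form stores the full n x n kernel matrix, so memory grows quadratically with the number of samples. That cost is the kind of resource constraint that motivates the Big Data adaptations the abstract refers to.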
Year
2018
DOI
10.1142/S0218213018600060
Venue
International Journal on Artificial Intelligence Tools
Keywords
Big data, kernel k-means, data clustering, approximate kernel k-means, Apache Spark, distributed computation
DocType
Journal
Volume
27
Issue
4
ISSN
0218-2130
Citations
0
PageRank
0.34
References
0
Authors
4
Name, Order, Citations, PageRank
Nikolaos Tsapanos, 1, 26, 3.87
Anastasios Tefas, 2, 2055, 177.05
Nikolaos Nikolaidis, 3, 108, 10.31
Ioannis Pitas, 4, 6478, 626.09