Abstract | ||
---|---|---|
Data clustering is an unsupervised learning task with applications in many scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. A classic clustering algorithm is k-Means. It is very popular, but it cannot handle cases in which the clusters are not linearly separable. Kernel k-Means is a state-of-the-art clustering algorithm that employs the kernel trick to perform clustering in a higher-dimensional space, thus overcoming the limitations of classic k-Means on input data that are not linearly separable. Big Data research, a field that has established itself in the last few years, involves performing tasks on extremely large amounts of data. To meet its challenges, several adaptations of Kernel k-Means have been proposed, each with different requirements in processing power and running time and with different trade-offs in performance. In this paper, we present several issues and techniques involving the use of Kernel k-Means for Big Data clustering and examine how each combination of components in a clustering framework fares in terms of resources, time, and performance. We use experimental results to evaluate several combinations and provide a recommendation on how to approach a Big Data clustering problem. |
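
The kernel trick mentioned in the abstract can be illustrated with a minimal, single-machine sketch. It is not the paper's implementation and does not cover the approximate or distributed variants the paper evaluates. The quadratic kernel, the two-ring toy data, and every function name below (`quadratic_kernel`, `gram_matrix`, `assign`, `kernel_kmeans`, `ring`) are assumptions chosen for illustration. The sketch relies only on the standard identity that the squared feature-space distance from a sample to a cluster mean can be computed from kernel values alone.

```python
import math


def quadratic_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel, k(x, y) = (x . y)^2 (an illustrative choice)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot * dot


def gram_matrix(points, kernel):
    """Full n x n kernel matrix; this quadratic cost is what limits scalability."""
    return [[kernel(p, q) for q in points] for p in points]


def assign(K, labels, n_clusters):
    """One assignment step: move each sample to the nearest cluster mean in feature space.

    Squared distance of sample i to the mean of cluster C, using kernel values only:
        K[i][i] - 2/|C| * sum_{j in C} K[i][j] + 1/|C|^2 * sum_{j,l in C} K[j][l]
    """
    n = len(K)
    members = [[j for j in range(n) if labels[j] == c] for c in range(n_clusters)]
    # The last term depends only on the cluster, so compute it once per cluster.
    within = [
        sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
        for m in members
    ]
    new_labels = []
    for i in range(n):
        best, best_dist = labels[i], math.inf
        for c, m in enumerate(members):
            if not m:  # skip empty clusters
                continue
            cross = sum(K[i][j] for j in m) / len(m)
            dist = K[i][i] - 2.0 * cross + within[c]
            if dist < best_dist:
                best, best_dist = c, dist
        new_labels.append(best)
    return new_labels


def kernel_kmeans(K, labels, n_clusters, max_iter=100):
    """Repeat the assignment step until the labels stop changing."""
    for _ in range(max_iter):
        new_labels = assign(K, labels, n_clusters)
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def ring(radius, n):
    """n evenly spaced points on a circle centred at the origin."""
    return [
        (radius * math.cos(2 * math.pi * t / n), radius * math.sin(2 * math.pi * t / n))
        for t in range(n)
    ]


# Two concentric rings: not linearly separable in the input space.
points = ring(1.0, 8) + ring(3.0, 8)
K = gram_matrix(points, quadratic_kernel)

# Start from a partition with one outer-ring point deliberately mislabeled.
init = [0] * 8 + [1] * 8
init[8] = 0
labels = kernel_kmeans(K, init, n_clusters=2)
print(labels)
```

Both rings share the same centroid in the input space, so plain k-Means cannot tell them apart by distance to a cluster mean. Under the quadratic kernel the ring radius becomes a linear feature, which is why the assignment step can separate them. The sketch stores the full kernel matrix, so memory and time grow quadratically with the number of samples.
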

Year | DOI | Venue |
---|---|---|
2018 | 10.1142/S0218213018600060 | International Journal on Artificial Intelligence Tools |

Keywords | DocType | Volume |
---|---|---|
Big data, kernel k-means, data clustering, approximate kernel k-means, Apache Spark, distributed computation | Journal | 27 |

Issue | ISSN | Citations |
---|---|---|
4 | 0218-2130 | 0 |

PageRank | References | Authors |
---|---|---|
0.34 | 0 | 4 |

Name | Order | Citations | PageRank |
---|---|---|---|
Nikolaos Tsapanos | 1 | 26 | 3.87 |
Anastasios Tefas | 2 | 2055 | 177.05 |
Nikolaos Nikolaidis | 3 | 108 | 10.31 |
Ioannis Pitas | 4 | 6478 | 626.09 |