Title
On the performance of high dimensional data clustering and classification algorithms
Abstract
Machine learning tasks often need to be performed on large volumes of data. These tasks have applications in fields such as pattern recognition, data mining, bioinformatics, and recommendation systems. Here we evaluate the performance of four clustering algorithms and two classification algorithms supported by Mahout within two different cloud runtimes, Hadoop and Granules. Our benchmarks use the same Mahout backend code, ensuring a fair comparison. The differences between these implementations stem from how the Hadoop and Granules runtimes (1) support and manage the lifecycle of individual computations, and (2) orchestrate the exchange of data between different stages of the computational pipeline during successive iterations of the clustering algorithm. We include an analysis of our results for each of these algorithms in a distributed setting, as well as a discussion of measures for failure recovery.
Year
2013
DOI
10.1016/j.future.2012.05.026
Venue
Future Generation Comp. Syst.
Keywords
computational pipeline, different stage, data mining, failure recovery, fair comparison, mahout backend code, different cloud runtimes, individual computation, classification algorithm, granules runtimes, clustering algorithm, high dimensional data, granules, machine learning, clustering, classification
Field
Recommender system, Data mining, Clustering high-dimensional data, Computer science, Implementation, Cluster analysis, Statistical classification, Cloud computing, Computation, Distributed computing
DocType
Journal
Volume
29
Issue
4
ISSN
0167-739X
Citations
21
PageRank
0.91
References
17
Authors
2
Name
Order
Citations
PageRank
Kathleen Ericson1503.82
Shrideep Pallickara283792.72