On the performance of high dimensional data clustering and classification algorithms - Citegraph

Paper Info

Title
On the performance of high dimensional data clustering and classification algorithms

Abstract
There is often a need to perform machine learning tasks on voluminous amounts of data. These tasks have applications in fields such as pattern recognition, data mining, bioinformatics, and recommendation systems. Here we evaluate the performance of four clustering algorithms and two classification algorithms supported by Mahout within two different cloud runtimes, Hadoop and Granules. Our benchmarks use the same Mahout backend code, ensuring a fair comparison. The differences between these implementations stem from how the Hadoop and Granules runtimes (1) support and manage the lifecycle of individual computations, and (2) orchestrate the exchange of data between different stages of the computational pipeline during successive iterations of the clustering algorithm. We include an analysis of our results for each of these algorithms in a distributed setting, as well as a discussion of measures for failure recovery.

Year	DOI	Venue
2013	10.1016/j.future.2012.05.026	Future Generation Comp. Syst.
Keywords	Field	DocType
computational pipeline,different stage,data mining,failure recovery,fair comparison,mahout backend code,different cloud runtimes,individual computation,classification algorithm,granules runtimes,clustering algorithm,high dimensional data,granules,machine learning,clustering,classification	Recommender system,Data mining,Clustering high-dimensional data,Computer science,Implementation,Cluster analysis,Statistical classification,Cloud computing,Computation,Distributed computing	Journal
Volume	Issue	ISSN
29	4	0167-739X
Citations	PageRank	References
21	0.91	17
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Kathleen Ericson	1	50	3.82
Shrideep Pallickara	2	837	92.72
