Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce. - Citegraph

Paper Info

Title
Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce.

Abstract
Density Peaks (DP) is a recently proposed clustering algorithm that has distinctive advantages over existing clustering algorithms. It has already been used in a wide range of applications. However, DP requires computing the distance between every pair of input points, therefore incurring quadratic computation overhead, which is prohibitive for large data sets. In this paper, we study efficient distributed algorithms for DP. We first show that a naïve MapReduce solution (Basic-DDP) has high communication and computation overhead. Then, we propose LSH-DDP, an approximate algorithm that exploits Locality Sensitive Hashing for partitioning data, performs local computation, and aggregates local results to approximate the final results. We address several challenges in employing LSH for DP. We leverage the characteristics of DP to deal with the fact that some of the result values cannot be directly approximated in local partitions. We present formal analysis of LSH-DDP, and show that the approximation quality and the runtime can be controlled by tuning the parameters of LSH-DDP. Experimental results on both a local cluster and EC2 show that LSH-DDP achieves a factor of 1.7–70x speedup over the naïve Basic-DDP and 2x speedup over the state-of-the-art EDDPC approach, while returning comparable cluster results. Compared to the popular K-means clustering, LSH-DDP also has comparable or better performance. Furthermore, LSH-DDP could achieve even higher efficiency with a lower accuracy requirement.

Year	DOI	Venue
2017	10.1109/TKDE.2016.2609423	IEEE Trans. Knowl. Data Eng.
Keywords	DocType	Volume
Clustering algorithms,Approximation algorithms,Algorithm design and analysis,Partitioning algorithms,Machine learning algorithms,Density measurement	Conference	28
Issue	ISSN	Citations
12	1041-4347	13
PageRank	References	Authors
0.57	22	3

Authors (3 rows)

Cited by (13 rows)

References (22 rows)

Name	Order	Citations	PageRank
Yanfeng Zhang	1	170	15.56
Shimin Chen	2	96	8.64
Ge YU	3	1313	175.88

1