Title
Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce.
Abstract
Density Peaks (DP) is a recently proposed clustering algorithm that has distinctive advantages over existing clustering algorithms. It has already been used in a wide range of applications. However, DP requires computing the distance between every pair of input points, therefore incurring quadratic computation overhead, which is prohibitive for large data sets. In this paper, we study efficient distributed algorithms for DP. We first show that a naïve MapReduce solution (Basic-DDP) has high communication and computation overhead. Then, we propose LSH-DDP, an approximate algorithm that exploits Locality Sensitive Hashing for partitioning data, performs local computation, and aggregates local results to approximate the final results. We address several challenges in employing LSH for DP. We leverage the characteristics of DP to deal with the fact that some of the result values cannot be directly approximated in local partitions. We present formal analysis of LSH-DDP, and show that the approximation quality and the runtime can be controlled by tuning the parameters of LSH-DDP. Experimental results on both a local cluster and EC2 show that LSH-DDP achieves a factor of 1.7–70x speedup over the naïve Basic-DDP and 2x speedup over the state-of-the-art EDDPC approach, while returning comparable cluster results. Compared to the popular K-means clustering, LSH-DDP also has comparable or better performance. Furthermore, LSH-DDP could achieve even higher efficiency with a lower accuracy requirement.
Year
DOI
Venue
2017
10.1109/TKDE.2016.2609423
IEEE Trans. Knowl. Data Eng.
Keywords
DocType
Volume
Clustering algorithms,Approximation algorithms,Algorithm design and analysis,Partitioning algorithms,Machine learning algorithms,Density measurement
Conference
28
Issue
ISSN
Citations 
12
1041-4347
13
PageRank 
References 
Authors
0.57
22
3
Name
Order
Citations
PageRank
Yanfeng Zhang117015.56
Shimin Chen2968.64
Ge YU31313175.88