Title
Scalable Hierarchical Agglomerative Clustering
Abstract
ABSTRACTThe applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to non-parametric DP-Means objective. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters show that our method finds better quality of clusters than the current state-of-the-art.
Year
DOI
Venue
2021
10.1145/3447548.3467404
Knowledge Discovery and Data Mining
Keywords
DocType
Citations 
Clustering, Hierarchical Clustering
Conference
0
PageRank 
References 
Authors
0.34
3
12
Name
Order
Citations
PageRank
Nicholas Monath1199.68
Kumar Avinava Dubey200.68
Guru Guruganesh300.34
Manzil Zaheer416023.65
Amr Ahmed5174392.13
Andrew Kachites McCallumzy6192031588.22
Gökhan Mergen700.34
Marc A. Najork82538278.16
Mert Terzihan900.34
Bryon Tjanaka1000.34
Yuan Wang118234.72
Yuchen Wu1200.68