Abstract | ||
---|---|---|
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of the k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings. |
Year | Venue | Keywords |
---|---|---|
2012 | PVLDB | proposed initialization algorithm k-means,parallelizing k-means,scalable k-means,real-world large-scale data,constant number,popular data,initialization algorithm,good initialization,passes suffices,massive data,proper initialization |
DocType | Volume | Issue |
Journal | 5 | 7 |
ISSN | Citations | PageRank |
Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 7, pp.
622-633 (2012) | 35 | 1.29 |
References | Authors | |
31 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Bahman Bahmani | 1 | 434 | 16.71 |
Benjamin Moseley | 2 | 554 | 50.11 |
Andrea Vattani | 3 | 171 | 11.45 |
Ravi Kumar | 4 | 13932 | 1642.48 |
Sergei Vassilvitskii | 5 | 2750 | 139.31 |