Abstract
---

K-means is one of the most influential and utilized machine learning algorithms today. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink k-means in terms of modern architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. We use this scheme to develop two modules, k$||$means$_T$ and SEM-k$||$means$_T^{RC}$. k$||$means$_T$ is optimized for NUMA architectures and radically improves performance on datasets that fit in memory. SEM-k$||$means$_T^{RC}$ improves the scalability of k-means on a memory budget by using semi-external memory, SSDs, and a minimalistic triangle-inequality pruning algorithm. SEM-k$||$means$_T^{RC}$ scales to billions of points on a single machine, using a fraction of the resources that distributed in-memory systems require. We show that k$||$means$_T$ outperforms commercial products such as H$_2$O, Dato, and Spark's MLlib by up to an order of magnitude on $O(10^7)$-point datasets. Furthermore, SEM-k$||$means$_T^{RC}$ effortlessly clusters $\geq O(10^8)$-point datasets while outperforming distributed in-memory frameworks.
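To make the triangle-inequality pruning mentioned in the abstract concrete, here is a minimal sketch in the spirit of Elkan/Hamerly-style pruning for Lloyd's k-means. This is not the paper's SEM-k$||$means$_T^{RC}$ implementation (which operates in semi-external memory); the function name `kmeans_ti` and its parameters are hypothetical, chosen only for illustration.

```python
# Minimal sketch of triangle-inequality pruning in Lloyd's k-means.
# This is an illustrative simplification, NOT the paper's algorithm.
import numpy as np

def kmeans_ti(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # s[c] = half the distance from centroid c to its nearest other
        # centroid. By the triangle inequality, a point within s[c] of c
        # cannot be closer to any other centroid, so it keeps its label.
        cc = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
        np.fill_diagonal(cc, np.inf)
        s = 0.5 * cc.min(axis=1)
        d_own = np.linalg.norm(X - centers[assign], axis=1)
        stale = d_own > s[assign]  # only these points need a full k-way scan
        if stale.any():
            d_all = np.linalg.norm(X[stale, None] - centers[None, :], axis=2)
            assign[stale] = d_all.argmin(axis=1)
        for c in range(k):  # recompute centroids from current assignments
            pts = X[assign == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    return centers, assign
```

The pruning rests on one fact: if $d(x, c) \leq \frac{1}{2} \min_{c' \neq c} d(c, c')$, then $d(x, c') \geq d(c, c') - d(x, c) \geq d(x, c)$ for every other centroid $c'$, so the point's assignment cannot change and the remaining $k-1$ distance computations are skipped.
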
Year | Venue | Field
---|---|---
2016 | arXiv: Distributed, Parallel, and Cluster Computing | Synchronization, Parameterized complexity, Spark (mathematics), Computer science, Parallel computing, Real-time computing, Triangle inequality, Cluster analysis, Scalability, Auxiliary memory, Distributed computing, Computation

DocType | Volume | Citations
---|---|---
Journal | abs/1606.08905 | 0

PageRank | References | Authors
---|---|---
0.34 | 7 | 5
Name | Order | Citations | PageRank |
---|---|---|---|
Disa Mhembere | 1 | 63 | 5.42 |
Da Zheng | 2 | 6 | 2.49 |
Joshua T. Vogelstein | 3 | 273 | 31.99 |
Carey E. Priebe | 4 | 505 | 108.56 |
Randal C. Burns | 5 | 8 | 4.19 |