Abstract
---

K-means is one of the most influential and utilized machine learning algorithms today. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink k-means in terms of modern architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. We use this scheme to develop two modules, k$||$means$_T$ and SEM-k$||$means$_T^{RC}$. k$||$means$_T$ is optimized for NUMA architectures and radically improves performance on datasets that fit in memory. SEM-k$||$means$_T^{RC}$ improves the scalability of k-means on a memory budget by using semi-external memory, SSDs, and a minimalistic triangle-inequality pruning algorithm. SEM-k$||$means$_T^{RC}$ scales to billions of points on a single machine, using a fraction of the resources that distributed in-memory systems require. We show that k$||$means$_T$ outperforms commercial products such as H$_2$O, Dato, and Spark's MLlib by up to an order of magnitude on $O(10^7)$-point datasets. Furthermore, SEM-k$||$means$_T^{RC}$ effortlessly clusters $\geq O(10^8)$-point datasets while outperforming distributed in-memory frameworks.
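To make the triangle-inequality pruning mentioned in the abstract concrete, here is a minimal sketch in the spirit of Elkan/Hamerly-style pruning for Lloyd's k-means. This is not the paper's SEM-k$||$means$_T^{RC}$ implementation (which operates in semi-external memory); the function name `kmeans_ti` and its parameters are hypothetical, chosen only for illustration.

```python
# Minimal sketch of triangle-inequality pruning in Lloyd's k-means.
# This is an illustrative simplification, NOT the paper's algorithm.
import numpy as np

def kmeans_ti(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # s[c] = half the distance from centroid c to its nearest other
        # centroid. By the triangle inequality, a point within s[c] of c
        # cannot be closer to any other centroid, so it keeps its label.
        cc = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
        np.fill_diagonal(cc, np.inf)
        s = 0.5 * cc.min(axis=1)
        d_own = np.linalg.norm(X - centers[assign], axis=1)
        stale = d_own > s[assign]  # only these points need a full k-way scan
        if stale.any():
            d_all = np.linalg.norm(X[stale, None] - centers[None, :], axis=2)
            assign[stale] = d_all.argmin(axis=1)
        for c in range(k):  # recompute centroids from current assignments
            pts = X[assign == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    return centers, assign
```

The pruning rests on one fact: if $d(x, c) \leq \frac{1}{2} \min_{c' \neq c} d(c, c')$, then $d(x, c') \geq d(c, c') - d(x, c) \geq d(x, c)$ for every other centroid $c'$, so the point's assignment cannot change and the remaining $k-1$ distance computations are skipped.
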
Year | Venue | Field
---|---|---
2016 | arXiv: Distributed, Parallel, and Cluster Computing | Synchronization, Parameterized complexity, Spark (mathematics), Computer science, Parallel computing, Real-time computing, Triangle inequality, Cluster analysis, Scalability, Auxiliary memory, Distributed computing, Computation

DocType | Volume | Citations
---|---|---
Journal | abs/1606.08905 | 0

PageRank | References | Authors
---|---|---
0.34 | 7 | 5
Name | Order | Citations | PageRank |
---|---|---|---|
Disa Mhembere | 1 | 63 | 5.42 |
Da Zheng | 2 | 6 | 2.49 |
Joshua T. Vogelstein | 3 | 273 | 31.99 |
Carey E. Priebe | 4 | 505 | 108.56 |
Randal C. Burns | 5 | 8 | 4.19 |