Abstract | ||
---|---|---|
We study the distributed computing setting in which there are multiple servers, each holding a set of points, that wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low-dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve problems such as k-means clustering and low-rank approximation. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for k-means clustering and related problems. Our empirical study on real-world data shows a speedup of orders of magnitude while preserving communication cost, with only a negligible degradation in solution quality. Some of the techniques we develop, such as a general transformation from a constant success probability subspace embedding to a high success probability subspace embedding with a dimension and sparsity independent of the success probability, may be of independent interest. |

Year | Venue | Field |
---|---|---|
2014 | ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 27 (NIPS 2014) | Mathematical optimization, Embedding, Subspace topology, Computer science, Server, Low-rank approximation, Artificial intelligence, Cluster analysis, Machine learning, Principal component analysis, Empirical research, Speedup |

DocType | Volume | ISSN |
---|---|---|
Journal | 27 | 1049-5258 |

Citations | PageRank | References |
---|---|---|
32 | 1.11 | 17 |

Authors | ||
---|---|---|
4 | ||

Name | Order | Citations | PageRank |
---|---|---|---|
Maria-Florina Balcan | 1 | 1445 | 105.01 |
Vandana Kanchanapally | 2 | 32 | 1.11 |
Yingyu Liang | 3 | 393 | 31.39 |
David P. Woodruff | 4 | 32 | 1.11 |