Abstract | ||
---|---|---|
This paper studies the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix A ∈ Rm×n, a rank parameter k<rank(A), and an accuracy parameter 0<ε<1, we want to output an m×k orthonormal matrix U for which ||A-UUTA||2F≤(1+ε)||A-Ak||2F where Ak∈Rm×n is the best rank-k approximation to A.
Our contributions are summarized as follows: 1. In the arbitrary partition distributed model of Kannan et al. (COLT 2014), each of s machines holds a matrix Ai and A=ΣAi. Each machine should output U. Kannan et al. achieve O(skm/ε)+poly(sk/ε) words (of O(log(nm)) bits) communication. We obtain the improved bound of O(skm)+poly(sk/ε) words, and show an optimal (up to low order terms) Ω(skm) lower bound. This resolves an open question in the literature. A poly(ε-1) dependence is known to be required, but we separate this dependence from m.
2. In a more specific distributed model where each server receives a subset of columns of A, we bypass the above lower bound when A is φ-sparse in each column. Here we obtain an O(skφ/ε)+poly(sk/ε) word protocol. Our communication is independent of the matrix dimensions, and achieves the guarantee that each server, in addition to outputting U, outputs a subset of O(k/ε) columns of A containing a U in its span (that is, for the first time, we solve distributed column subset selection). Additionally, we show a matching Ω(skφ/ε) lower bound for distributed column subset selection. Achieving our communication bound when A is sparse in general but not sparse in each column, is impossible.
3. In the streaming model of computation, in which the columns of the matrix A arrive one at a time, an algorithm of Liberty (KDD, 2013) with an improved analysis by Ghashami and Phillips (SODA, 2014) achieves O(km/ε) "real numbers" space complexity. We improve this result, since our one-pass streaming PCA algorithm achieves an O(km/ε)+poly(k/ε) word space upper bound. This almost matches a known Ω(km/ε) bit lower bound of Woodruff (NIPS, 2014). We show that with two passes over the columns of A one can achieve an O(km)+poly(k/ε) word space upper bound; another lower bound of Woodruff (NIPS, 2014) shows that this is optimal for any constant number of passes (up to the poly(k/ε) term and the distinction between words versus bits).
4. Finally, in turnstile streams, in which we receive entries of A one at a time in an arbitrary order, we describe an algorithm with O((m+n)kε-1) words of space. This improves the O((m+n)ε-2)kε-2) upper bound of Clarkson and Woodruff (STOC 2009), and matches their Ω((m+n)kε-1) word lower bound.
Notably, our results do not depend on the condition number or any singular value gaps of A.
|
Year | DOI | Venue |
---|---|---|
2016 | 10.1145/2897518.2897646 | STOC '16: Symposium on Theory of Computing
Cambridge
MA
USA
June, 2016 |
Keywords | Field | DocType |
low rank matrix decomposition,singular value decomposition,principal component analysis,column subset selection,distributed,streaming,sparse,lower bounds | Discrete mathematics,Singular value decomposition,Condition number,Combinatorics,Singular value,Upper and lower bounds,Matrix (mathematics),Model of computation,Partition (number theory),Real number,Mathematics | Conference |
ISSN | ISBN | Citations |
0737-8017 | 978-1-4503-4132-5 | 17 |
PageRank | References | Authors |
0.61 | 32 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Christos Boutsidis | 1 | 610 | 33.37 |
David P. Woodruff | 2 | 2156 | 142.38 |
Peilin Zhong | 3 | 99 | 10.36 |