Optimal principal component analysis in distributed and streaming models. - Citegraph

Paper Info

Title
Optimal principal component analysis in distributed and streaming models.

Abstract
This paper studies the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix A ∈ Rm×n, a rank parameter k<rank(A), and an accuracy parameter 0<ε<1, we want to output an m×k orthonormal matrix U for which \|\|A-UUTA\|\|2F≤(1+ε)\|\|A-Ak\|\|2F where Ak∈Rm×n is the best rank-k approximation to A. Our contributions are summarized as follows: 1. In the arbitrary partition distributed model of Kannan et al. (COLT 2014), each of s machines holds a matrix Ai and A=ΣAi. Each machine should output U. Kannan et al. achieve O(skm/ε)+poly(sk/ε) words (of O(log(nm)) bits) communication. We obtain the improved bound of O(skm)+poly(sk/ε) words, and show an optimal (up to low order terms) Ω(skm) lower bound. This resolves an open question in the literature. A poly(ε-1) dependence is known to be required, but we separate this dependence from m. 2. In a more specific distributed model where each server receives a subset of columns of A, we bypass the above lower bound when A is φ-sparse in each column. Here we obtain an O(skφ/ε)+poly(sk/ε) word protocol. Our communication is independent of the matrix dimensions, and achieves the guarantee that each server, in addition to outputting U, outputs a subset of O(k/ε) columns of A containing a U in its span (that is, for the first time, we solve distributed column subset selection). Additionally, we show a matching Ω(skφ/ε) lower bound for distributed column subset selection. Achieving our communication bound when A is sparse in general but not sparse in each column, is impossible. 3. In the streaming model of computation, in which the columns of the matrix A arrive one at a time, an algorithm of Liberty (KDD, 2013) with an improved analysis by Ghashami and Phillips (SODA, 2014) achieves O(km/ε) "real numbers" space complexity. We improve this result, since our one-pass streaming PCA algorithm achieves an O(km/ε)+poly(k/ε) word space upper bound. This almost matches a known Ω(km/ε) bit lower bound of Woodruff (NIPS, 2014). We show that with two passes over the columns of A one can achieve an O(km)+poly(k/ε) word space upper bound; another lower bound of Woodruff (NIPS, 2014) shows that this is optimal for any constant number of passes (up to the poly(k/ε) term and the distinction between words versus bits). 4. Finally, in turnstile streams, in which we receive entries of A one at a time in an arbitrary order, we describe an algorithm with O((m+n)kε-1) words of space. This improves the O((m+n)ε-2)kε-2) upper bound of Clarkson and Woodruff (STOC 2009), and matches their Ω((m+n)kε-1) word lower bound. Notably, our results do not depend on the condition number or any singular value gaps of A.

Year	DOI	Venue
2016	10.1145/2897518.2897646	STOC '16: Symposium on Theory of Computing Cambridge MA USA June, 2016
Keywords	Field	DocType
low rank matrix decomposition,singular value decomposition,principal component analysis,column subset selection,distributed,streaming,sparse,lower bounds	Discrete mathematics,Singular value decomposition,Condition number,Combinatorics,Singular value,Upper and lower bounds,Matrix (mathematics),Model of computation,Partition (number theory),Real number,Mathematics	Conference
ISSN	ISBN	Citations
0737-8017	978-1-4503-4132-5	17
PageRank	References	Authors
0.61	32	3

Authors (3 rows)

Cited by (17 rows)

References (32 rows)

Name	Order	Citations	PageRank
Christos Boutsidis	1	610	33.37
David P. Woodruff	2	2156	142.38
Peilin Zhong	3	99	10.36

1