Title
On the sample complexity of subspace clustering with missing data
Abstract
Subspace clustering is a useful tool for analyzing large complex data, but in many relevant applications missing data are common. Existing theoretical analysis of this problem shows that subspace clustering from incomplete data is possible, but that analysis requires the number of samples (i.e., partially observed vectors) to be super-polynomial in the dimension d. Such huge sample sizes are unnecessary when no data are missing and uncommon in applications. There are two main contributions in this paper. First, it is shown that if subspaces have rank at most r and the number of partially observed vectors greater than dr+1 (times a poly-logarithmic factor), then with high probability the true subspaces are the only subspaces that agree with the observed data. We may conclude that subspace clustering may be possible without impractically large sample sizes and that we can certify the output of any subspace clustering algorithm by checking its fit to the observed data. The second main contribution is a novel EM-type algorithm for subspace clustering with missing data. We demonstrate and compare it to several other algorithms. Experiments with simulated and real data show that such algorithms work well in practice.
Year
DOI
Venue
2014
10.1109/SSP.2014.6884630
Statistical Signal Processing
Keywords
Field
DocType
computational complexity,data analysis,expectation-maximisation algorithm,pattern clustering,EM-type algorithm,large complex data analysis,missing data,partially observed vectors,poly-logarithmic factor,sample complexity,subspace clustering algorithm,super-polynomial,Matrix Completion,Subspace Clustering
Canopy clustering algorithm,Fuzzy clustering,CURE data clustering algorithm,Clustering high-dimensional data,Data stream clustering,Correlation clustering,Pattern recognition,Artificial intelligence,Missing data,Cluster analysis,Mathematics
Conference
Citations 
PageRank 
References 
11
0.70
13
Authors
3
Name
Order
Citations
PageRank
Daniel L. Pimentel-Alarcón1376.30
Robert Nowak27309672.50
Laura Balzano341027.51