Abstract | ||
---|---|---|
Principal Components Analysis (PCA) is the predominant linear dimensionality reduction technique, and has been widely applied on datasets in all scientific domains. We consider, both theoretically and empirically, the topic of unsupervised feature selection for PCA, by leveraging algorithms for the so-called Column Subset Selection Problem (CSSP). In words, the CSSP seeks the "best" subset of exactly k columns from an m x n data matrix A, and has been extensively studied in the Numerical Linear Algebra community. We present a novel two-stage algorithm for the CSSP. From a theoretical perspective, for small to moderate values of k, this algorithm significantly improves upon the best previously-existing results [24, 12] for the CSSP. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace that is spanned by the top k "eigenfeatures." |
Year | DOI | Venue |
---|---|---|
2008 | 10.1145/1401890.1401903 | KDD |
Keywords | Field | DocType |
k landmark feature,application domain,unsupervised feature selection,novel two-stage algorithm,k column,principal components analysis,n data,top k,modern statistical data analysis,document-term data,object-feature matrix,data matrix,numerical linear algebra,data analysis,random sampling,feature selection,pca,principal component analysis | Data mining,Dimensionality reduction,Feature selection,Computer science,Matrix (mathematics),Artificial intelligence,Subspace topology,Pattern recognition,Sampling (statistics),Landmark,Numerical linear algebra,Machine learning,Principal component analysis | Conference |
Citations | PageRank | References |
48 | 2.48 | 23 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Christos Boutsidis | 1 | 610 | 33.37 |
Michael W. Mahoney | 2 | 3297 | 218.10 |
Petros Drineas | 3 | 2165 | 201.55 |