Title
Unsupervised feature selection for principal components analysis
Abstract
Principal Components Analysis (PCA) is the predominant linear dimensionality reduction technique, and has been widely applied on datasets in all scientific domains. We consider, both theoretically and empirically, the topic of unsupervised feature selection for PCA, by leveraging algorithms for the so-called Column Subset Selection Problem (CSSP). In words, the CSSP seeks the "best" subset of exactly k columns from an m x n data matrix A, and has been extensively studied in the Numerical Linear Algebra community. We present a novel two-stage algorithm for the CSSP. From a theoretical perspective, for small to moderate values of k, this algorithm significantly improves upon the best previously-existing results [24, 12] for the CSSP. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace that is spanned by the top k "eigenfeatures."
Year
DOI
Venue
2008
10.1145/1401890.1401903
KDD
Keywords
Field
DocType
k landmark feature,application domain,unsupervised feature selection,novel two-stage algorithm,k column,principal components analysis,n data,top k,modern statistical data analysis,document-term data,object-feature matrix,data matrix,numerical linear algebra,data analysis,random sampling,feature selection,pca,principal component analysis
Data mining,Dimensionality reduction,Feature selection,Computer science,Matrix (mathematics),Artificial intelligence,Subspace topology,Pattern recognition,Sampling (statistics),Landmark,Numerical linear algebra,Machine learning,Principal component analysis
Conference
Citations 
PageRank 
References 
48
2.48
23
Authors
3
Name
Order
Citations
PageRank
Christos Boutsidis161033.37
Michael W. Mahoney23297218.10
Petros Drineas32165201.55