Unsupervised feature selection for principal components analysis - Citegraph

Paper Info

Title
Unsupervised feature selection for principal components analysis

Abstract
Principal Components Analysis (PCA) is the predominant linear dimensionality reduction technique, and has been widely applied on datasets in all scientific domains. We consider, both theoretically and empirically, the topic of unsupervised feature selection for PCA, by leveraging algorithms for the so-called Column Subset Selection Problem (CSSP). In words, the CSSP seeks the "best" subset of exactly k columns from an m x n data matrix A, and has been extensively studied in the Numerical Linear Algebra community. We present a novel two-stage algorithm for the CSSP. From a theoretical perspective, for small to moderate values of k, this algorithm significantly improves upon the best previously-existing results [24, 12] for the CSSP. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace that is spanned by the top k "eigenfeatures."

Year	DOI	Venue
2008	10.1145/1401890.1401903	KDD
Keywords	Field	DocType
k landmark feature,application domain,unsupervised feature selection,novel two-stage algorithm,k column,principal components analysis,n data,top k,modern statistical data analysis,document-term data,object-feature matrix,data matrix,numerical linear algebra,data analysis,random sampling,feature selection,pca,principal component analysis	Data mining,Dimensionality reduction,Feature selection,Computer science,Matrix (mathematics),Artificial intelligence,Subspace topology,Pattern recognition,Sampling (statistics),Landmark,Numerical linear algebra,Machine learning,Principal component analysis	Conference
Citations	PageRank	References
48	2.48	23
Authors
3

Authors (3 rows)

Cited by (48 rows)

References (23 rows)

Name	Order	Citations	PageRank
Christos Boutsidis	1	610	33.37
Michael W. Mahoney	2	3297	218.10
Petros Drineas	3	2165	201.55

1