Title
Novel unsupervised feature filtering of biological data.
Abstract
Many methods have been developed for selecting small informative feature subsets in large noisy data. However, unsupervised methods are scarce. Examples are using the variance of data collected for each feature, or the projection of the feature on the first principal component. We propose a novel unsupervised criterion, based on SVD-entropy, selecting a feature according to its contribution to the entropy (CE) calculated on a leave-one-out basis. This can be implemented in four ways: simple ranking according to CE values (SR); forward selection by accumulating features according to which set produces the highest entropy (FS1); forward selection by accumulating features through the choice of the best CE out of the remaining ones (FS2); backward elimination (BE) of features with the lowest CE. We apply our methods to different benchmarks. In each case we evaluate the success of clustering the data in the selected feature spaces, by measuring Jaccard scores with respect to known classifications. We demonstrate that feature filtering according to CE outperforms the variance method and gene-shaving. There are cases where the analysis, based on a small set of selected features, outperforms the best score reported when all information was used. Our method calls for an optimal size of the relevant feature set. This turns out to be just a few percent of the number of genes in the two Leukemia datasets that we have analyzed. Moreover, the most favored selected genes turn out to have significant GO enrichment in relevant cellular processes.
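The CE criterion and the simple ranking (SR) variant described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define SVD-entropy, so the sketch assumes the common definition (Shannon entropy of the normalized squared singular values, divided by the log of their count). The function names `svd_entropy`, `ce_scores` and `simple_ranking`, and the convention that rows are features and columns are samples, are assumptions made for this example.

```python
import numpy as np


def svd_entropy(A):
    """Normalized SVD-entropy of A (assumed layout: rows = features, columns = samples).

    Assumed definition: v_j = s_j^2 / sum_k s_k^2 and
    E = -(1 / log N) * sum_j v_j * log(v_j), with N the number of singular values.
    A must have at least one nonzero entry.
    """
    s = np.linalg.svd(A, compute_uv=False)   # singular values only
    n = len(s)
    if n < 2:
        return 0.0                           # entropy is trivially zero
    v = s**2 / np.sum(s**2)                  # normalized squared singular values
    v = v[v > 0]                             # skip zero terms (0 * log 0 := 0)
    return float(-np.sum(v * np.log(v)) / np.log(n))


def ce_scores(A):
    """Leave-one-out contribution to the entropy (CE) of every feature (row)."""
    total = svd_entropy(A)
    return np.array([
        total - svd_entropy(np.delete(A, i, axis=0))  # entropy change when row i is removed
        for i in range(A.shape[0])
    ])


def simple_ranking(A, k):
    """SR variant: indices of the k features with the highest CE, best first."""
    ce = ce_scores(A)
    return np.argsort(ce)[::-1][:k]


# Tiny hypothetical data set: 3 features measured on 2 samples.
A = np.array([[5.0, 0.0],
              [5.0, 0.0],
              [0.0, 1.0]])
top = simple_ranking(A, 1)
```

In this toy matrix the third feature is the only one carrying signal in the second sample, so removing it collapses the entropy and it receives the highest CE. The forward-selection (FS1, FS2) and backward-elimination (BE) variants would reuse the same `svd_entropy` and `ce_scores` helpers inside a greedy loop; they are omitted here because the abstract does not specify their stopping rules.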
Year: 2006
DOI: 10.1093/bioinformatics/btl214
Venue: ISMB (Supplement of Bioinformatics)
Keywords: best ce, large noisy data, relevant feature set, biological data, small set, forward selection, selected feature space, novel unsupervised feature filtering, selected feature, small informative feature subsets, ce value, lowest ce, principal component analysis, feature space, principal component, data collection, singular value decomposition
Field: Biological data, Singular value decomposition, Data mining, Ranking, Pattern recognition, Feature (computer vision), Computer science, Filter (signal processing), Artificial intelligence, Jaccard index, Cluster analysis, Principal component analysis
DocType: Conference
Volume: 22
Issue: 14
ISSN: 1367-4811
Citations: 70
PageRank: 3.84
References: 12
Authors: 4
Name             Order  Citations  PageRank
Roy Varshavsky   1      94         7.01
Assaf Gottlieb   2      177        8.98
Michal Linial    3      1502       149.92
David Horn       4      414        51.58