Title
Multistage feature selection approach for high-dimensional cancer data.
Abstract
Cancer is a serious disease and a major cause of death worldwide. DNA methylation (DNAm) is an epigenetic mechanism that regulates gene expression and is useful for the early detection of cancer. The challenge with DNA methylation microarray datasets is that the number of CpG sites far exceeds the number of samples. Recent research has attempted to reduce this high dimensionality with various feature selection techniques. This article proposes a multistage feature selection approach to select the optimal CpG sites from three DNAm cancer datasets (breast, colon and lung). The proposed approach combines three filter feature selection methods: Fisher Criterion, t-test and Area Under the ROC Curve. In addition, as a wrapper feature selection step, we apply genetic algorithms with Support Vector Machine Recursive Feature Elimination (SVM-RFE) as the fitness function and SVM as the evaluator. Using the Incremental Feature Selection (IFS) strategy, subsets of 24, 13 and 27 optimal CpG sites are selected for the breast, colon and lung cancer datasets, respectively. Under fivefold cross-validation on the training datasets, these subsets achieved classification accuracies of 100, 100 and 97.67%, respectively. Moreover, testing on the three independent cancer datasets with these final subsets yielded accuracies of 96.02, 98.81 and 94.51%, respectively. The experimental results demonstrate high classification performance with small feature subsets. Finally, the biological significance of the genes corresponding to these feature subsets is validated using enrichment analysis.
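The filter and IFS stages described in the abstract can be illustrated with a minimal sketch. Several points here are assumptions rather than the authors' method: the abstract does not say how the three filter scores are combined, so the sketch averages their ranks; the genetic-algorithm/SVM-RFE wrapper stage is omitted; and a nearest-centroid classifier stands in for the SVM so that the example depends only on NumPy. All function names are hypothetical, and the data are synthetic.

```python
import numpy as np

def fisher_score(X, y):
    """Fisher criterion per feature: (mean1 - mean0)^2 / (var1 + var0)."""
    a, b = X[y == 0], X[y == 1]
    return (a.mean(0) - b.mean(0)) ** 2 / (a.var(0) + b.var(0) + 1e-12)

def welch_t(X, y):
    """Absolute Welch t-statistic per feature."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    return np.abs(a.mean(0) - b.mean(0)) / (se + 1e-12)

def auc_score(X, y):
    """Per-feature AUC via the Mann-Whitney U statistic, folded so 0.5 is worst."""
    n1, n0 = (y == 1).sum(), (y == 0).sum()
    ranks = X.argsort(0).argsort(0) + 1  # ranks 1..n per column (ties ignored)
    u = ranks[y == 1].sum(0) - n1 * (n1 + 1) / 2
    auc = u / (n1 * n0)
    return np.maximum(auc, 1 - auc)

def combined_ranking(X, y):
    """Assumption: merge the three filters by mean rank (best feature first)."""
    scores = [fisher_score(X, y), welch_t(X, y), auc_score(X, y)]
    mean_rank = np.mean([(-s).argsort().argsort() for s in scores], axis=0)
    return np.argsort(mean_rank, kind="stable")

def cv_accuracy(X, y, k=5, seed=0):
    """k-fold accuracy of a nearest-centroid classifier (stand-in for the SVM)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    correct = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        c0 = X[train][y[train] == 0].mean(0)
        c1 = X[train][y[train] == 1].mean(0)
        d0 = ((X[fold] - c0) ** 2).sum(1)
        d1 = ((X[fold] - c1) ** 2).sum(1)
        correct += ((d1 < d0).astype(int) == y[fold]).sum()
    return correct / len(y)

def incremental_feature_selection(X, y, ranking, max_features=30):
    """Grow the subset one ranked feature at a time; keep the best-scoring prefix."""
    best_acc, best_k = -1.0, 0
    for k in range(1, min(max_features, len(ranking)) + 1):
        acc = cv_accuracy(X[:, ranking[:k]], y)
        if acc > best_acc:
            best_acc, best_k = acc, k
    return ranking[:best_k], best_acc

# Synthetic stand-in for a methylation matrix: 60 samples x 200 "CpG sites",
# of which only the first five differ between the two classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 200))
X[y == 1, :5] += 3.0
ranking = combined_ranking(X, y)
subset, acc = incremental_feature_selection(X, y, ranking)
```

Averaging ranks rather than raw scores keeps the three filters on a common scale. Because the ranking here is computed on the full training set before cross-validation, the reported accuracy is optimistic; an independent test set, as used in the article, is the fairer check.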
Year: 2017
DOI: 10.1007/s00500-016-2439-9
Venue: Soft Comput.
Keywords: DNA methylation (DNAm), CpG sites, Feature selection, Genetic algorithms, Support vector machine (SVM), Incremental feature selection (IFS), Enrichment analysis
Field: Pattern recognition, Feature selection, Computer science, CpG site, Support vector machine, DNA methylation, Fitness function, Artificial intelligence, dNaM, Cancer, Genetic algorithm
DocType: Journal
Volume: 21
Issue: 22
ISSN: 1433-7479
Citations: 2
PageRank: 0.35
References: 23
Authors: 3
Name               Order  Citations  PageRank
Alhasan Alkuhlani  1      2          0.69
Mohammad Nassef    2      14         3.31
Ibrahim Farag      3      20         7.01