Title
Feature Selection And Resampling In Class Imbalance Learning: Which Comes First? An Empirical Study In The Biological Domain
Abstract
Class imbalance arises in many applications of bioinformatics and biomedicine, and dimension reduction in the feature space is often needed when building prediction models on a dataset. When both issues must be handled at once on a skewed (imbalanced) dataset, practitioners and researchers in machine learning may ask: should feature selection be conducted before or after the resampling methods that combat the skewness of the dataset? Although feature selection and class imbalance learning have each been widely studied in the literature, few studies have investigated them jointly. This paper presents a first empirical study of the performance of the two opposing pipelines for binary imbalance learning: feature selection followed by resampling, or resampling followed by feature selection. We carry out the study on 35 publicly available datasets from the biological field, using 9 feature selection methods, 6 resampling approaches for class imbalance learning, and 3 well-known classifiers. Our experiments reveal that there is no consistent winner between the two pipelines, so practitioners should test both in order to derive the best classification model for imbalance learning; in particular, the resampling-before-feature-selection pipeline should not be neglected. We also show, however, that the feature-selection-before-resampling pipeline outperforms the other in more cases than not.
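The two orderings compared in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the mean-difference filter and random oversampling below are simple stand-ins for the 9 feature selection methods and 6 resampling approaches, and every function name (`select_features`, `random_oversample`, `fs_then_resample`, `resample_then_fs`) and the toy data are hypothetical. In a real evaluation, both steps would presumably be fitted on training data only.

```python
import random
from statistics import mean, pstdev


def select_features(X, y, k):
    """Return the indices of the k features whose class means differ most,
    scaled by the feature's overall standard deviation (a stand-in filter).
    Assumes binary labels 0/1 with both classes present."""
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        pos = [v for v, label in zip(col, y) if label == 1]
        neg = [v for v, label in zip(col, y) if label == 0]
        spread = pstdev(col) or 1.0  # guard against constant features
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def random_oversample(X, y, rng):
    """Duplicate randomly chosen minority samples until both classes match."""
    minority = 1 if sum(y) * 2 < len(y) else 0
    idx_min = [i for i, label in enumerate(y) if label == minority]
    n_extra = len(y) - 2 * len(idx_min)  # majority count minus minority count
    keep = list(range(len(y))) + [rng.choice(idx_min) for _ in range(n_extra)]
    return [X[i] for i in keep], [y[i] for i in keep]


def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]


def fs_then_resample(X, y, k, rng):
    """Pipeline A: select features on the original data, then resample."""
    cols = select_features(X, y, k)
    Xr, yr = random_oversample(project(X, cols), y, rng)
    return cols, Xr, yr


def resample_then_fs(X, y, k, rng):
    """Pipeline B: resample first, then select features on the balanced data."""
    Xr, yr = random_oversample(X, y, rng)
    cols = select_features(Xr, yr, k)
    return cols, project(Xr, cols), yr


# Toy imbalanced data: 5 positives, 45 negatives, 4 features;
# only feature 0 is shifted for the positive class.
data_rng = random.Random(0)
y = [1] * 5 + [0] * 45
X = [[data_rng.gauss(2.0 * label, 1.0)] + [data_rng.gauss(0.0, 1.0) for _ in range(3)]
     for label in y]

cols_a, Xa, ya = fs_then_resample(X, y, 2, random.Random(1))
cols_b, Xb, yb = resample_then_fs(X, y, 2, random.Random(1))
print("FS -> resampling selected features:", cols_a)
print("resampling -> FS selected features:", cols_b)
```

The two pipelines may select different feature subsets because the filter in pipeline B sees the duplicated minority samples, which is one reason the ordering can affect the downstream classifier.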
Year
2017
DOI
10.1109/BIBM.2017.8217782
Venue
2017 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM)
DocType
Conference
ISSN
2156-1125
Citations
0
PageRank
0.34
References
0
Authors
3
Name              Order  Citations  PageRank
Chongsheng Zhang  1      60         3.61
Jingjun Bi        2      0          0.34
Paolo Soda        3      407        39.44