Title
Biases in feature selection with missing data.
Abstract
Feature selection matters in two scenarios: (1) prediction, i.e., improving (or minimally degrading) the predictions of a target variable while discarding redundant or uninformative features, and (2) discovery, i.e., identifying features that are truly dependent on the target and may be genuine causes, to be confirmed by experimental verification (for example, in drug target discovery in genomics). In both cases, if variables have a large number of missing values, imputing them may lead to false positives: features that are not associated with the target become dependent on it as a result of imputation. In the first scenario this may not harm prediction, but in the second it leads to the erroneous selection of irrelevant features. In this paper, we study the risk/benefit trade-off of missing value imputation in the context of feature selection, using causal graphs to characterize when structural bias arises. We also investigate situations in which imputing missing values may help reduce false negatives, which can occur when a feature and the target are dependent but the dependency does not reach statistical significance when only complete cases are considered. However, the benefit of fewer false negatives must be balanced against the increased number of false positives. When the target variable is binary and the features are continuous, the t-test is often used for univariate feature selection. We also introduce a de-biased version of the t-test that allows us to reap the benefits of imputation without incurring the penalty of an increased number of false positives.
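The false-positive mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's experiment and does not implement its de-biased t-test. It assumes a feature drawn independently of a binary target, values missing only in one class, and constant imputation with a fill value far from the feature mean. The names `welch_t` and `simulate` and all parameter values are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        va / len(a) + vb / len(b)
    )


def simulate(n=1000, p_missing=0.5, fill_value=0.0, seed=0):
    """Return (t on complete cases, t after constant imputation).

    Assumptions (illustrative only): the feature is N(5, 1) and independent
    of the target y; values go missing only when y == 1; missing entries
    are replaced by `fill_value`.
    """
    rng = random.Random(seed)
    complete = {0: [], 1: []}
    imputed = {0: [], 1: []}
    for i in range(2 * n):
        y = i % 2                      # balanced binary target
        x = rng.gauss(5.0, 1.0)        # drawn independently of y
        missing = (y == 1) and (rng.random() < p_missing)
        if missing:
            imputed[y].append(fill_value)   # imputed value enters the test
        else:
            complete[y].append(x)
            imputed[y].append(x)
    return (
        welch_t(complete[0], complete[1]),
        welch_t(imputed[0], imputed[1]),
    )


t_complete, t_imputed = simulate()
print(f"complete cases: t = {t_complete:.2f}")
print(f"after imputing: t = {t_imputed:.2f}")
```

Under these assumptions the complete-case statistic stays small, while the imputed data show a large class difference even though the feature carries no information about the target. Other imputation schemes can distort the test in different ways, for example by shrinking the variance estimate, so this sketch covers only one such case.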
Year
2019
DOI
10.1016/j.neucom.2018.10.085
Venue
Neurocomputing
Keywords
Feature selection, Missing data, De-biased t-test
Field
Graph, Feature selection, Drug target, Artificial intelligence, Imputation (statistics), Missing data, Univariate, Machine learning, Mathematics, False positive paradox, Binary number
DocType
Journal
Volume
342
ISSN
0925-2312
Citations
2
PageRank
0.35
References
7
Authors
7
Name                      Order  Citations  PageRank
Borja Seijo-Pardo         1      38         2.93
Amparo Alonso-Betanzos    2      885        76.98
Kristin P. Bennett        3      1670       189.26
Verónica Bolón-Canedo     4      476        33.04
Julie Josse               5      88         11.56
Mehreen Saeed             6      87         7.32
Isabelle Guyon            7      11033      1544.34