Title
Rescaling and other forms of unsupervised preprocessing introduce bias into cross-validation.
Abstract
Cross-validation of predictive models is the de-facto standard for model selection and evaluation. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo a preliminary data-dependent transformation, such as feature rescaling or dimensionality reduction, prior to cross-validation. It is widely believed that such a preprocessing stage, if done in an unsupervised manner that does not consider the class labels or response values, has no effect on the validity of cross-validation. In this paper, we show that this belief is not true. Preliminary preprocessing can introduce either a positive or negative bias into the estimates of model performance. Thus, it may lead to sub-optimal choices of model parameters and invalid inference. In light of this, the scientific community should re-examine the use of preliminary preprocessing prior to cross-validation across the various application domains. By default, all data transformations, including unsupervised preprocessing stages, should be learned only from the training samples, and then merely applied to the validation and testing samples.
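The recommendation in the last sentence of the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes a single numeric feature, standardization as the preprocessing step, and contiguous folds, and the helper names (fit_scaler, apply_scaler, kfold_indices, scaled_folds) are hypothetical. The point it shows is that the rescaling parameters are re-learned inside every fold from the training part only, and then applied unchanged to the validation part.

```python
import statistics


def fit_scaler(train_values):
    """Learn rescaling parameters (mean, std) from training samples only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero variance
    return mean, std


def apply_scaler(values, params):
    """Apply previously learned parameters; nothing is re-estimated here."""
    mean, std = params
    return [(v - mean) / std for v in values]


def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds (an assumption)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


def scaled_folds(x, k):
    """Rescale each fold with parameters learned from its training part only."""
    folds = []
    for train_idx, val_idx in kfold_indices(len(x), k):
        params = fit_scaler([x[i] for i in train_idx])
        x_train = apply_scaler([x[i] for i in train_idx], params)
        x_val = apply_scaler([x[i] for i in val_idx], params)
        folds.append((x_train, x_val, params))
    return folds


# Toy data: the outlier 100.0 influences the scaler only in folds
# where it belongs to the training part.
x = [1.0, 2.0, 3.0, 4.0, 100.0, 6.0]
folds = scaled_folds(x, k=3)
```

The pattern the abstract warns against would instead call fit_scaler once on all of x before splitting, so that validation samples influence the parameters used to transform the training samples.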
Year: 2019
Venue: arXiv: Methodology
DocType: Journal
Volume: abs/1901.08974
Citations: 0
PageRank: 0.34
References: 0
Authors: 2
Name             Order  Citations  PageRank
Amit Moscovich   1      0          0.68
Saharon Rosset   2      1087       105.33