Rescaling and other forms of unsupervised preprocessing introduce bias into cross-validation. - Citegraph

Paper Info

Title
Rescaling and other forms of unsupervised preprocessing introduce bias into cross-validation.

Abstract
Cross-validation of predictive models is the de facto standard for model selection and evaluation. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo a preliminary data-dependent transformation, such as feature rescaling or dimensionality reduction, prior to cross-validation. It is widely believed that such a preprocessing stage, if done in an unsupervised manner that does not consider the class labels or response values, has no effect on the validity of cross-validation. In this paper, we show that this belief is not true. Preliminary preprocessing can introduce either a positive or negative bias into the estimates of model performance. Thus, it may lead to sub-optimal choices of model parameters and invalid inference. In light of this, the scientific community should re-examine the use of preliminary preprocessing prior to cross-validation across the various application domains. By default, all data transformations, including unsupervised preprocessing stages, should be learned only from the training samples, and then merely applied to the validation and testing samples.
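
The abstract's closing recommendation can be illustrated with a minimal sketch. It is not the authors' code: the helper names (`fit_scaler`, `apply_scaler`, `nearest_centroid_predict`, `cross_validate`, `make_toy_data`), the toy data, and the nearest-centroid classifier are all assumptions chosen only to keep the example self-contained. The point it shows is that the rescaling parameters are re-learned inside every fold from the training samples alone and then merely applied to the held-out samples.

```python
import random
import statistics


def fit_scaler(rows):
    """Learn per-feature mean and standard deviation from the given rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant features so the division is always defined.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def apply_scaler(rows, scaler):
    """Standardize rows with previously learned parameters (no re-fitting)."""
    means, stds = scaler
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def nearest_centroid_predict(train_X, train_y, test_X):
    """Hypothetical stand-in model: assign each test row to the closest class mean."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [statistics.fmean(c) for c in zip(*members)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [
        min(centroids, key=lambda label: squared_distance(x, centroids[label]))
        for x in test_X
    ]


def cross_validate(X, y, k=5, seed=0):
    """K-fold accuracy with the scaler fitted on each training fold only."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        val_X = [X[i] for i in fold]
        val_y = [y[i] for i in fold]
        # The transformation is learned from the training samples only...
        scaler = fit_scaler(train_X)
        # ...and then merely applied to the training and validation samples.
        predictions = nearest_centroid_predict(
            apply_scaler(train_X, scaler), train_y, apply_scaler(val_X, scaler)
        )
        correct += sum(p == t for p, t in zip(predictions, val_y))
    return correct / len(X)


def make_toy_data(n_per_class=50, seed=1):
    """Illustrative two-class data with features on very different scales."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, 0.0), (1, 6.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0), rng.gauss(0.0, 100.0)])
            y.append(label)
    return X, y


X, y = make_toy_data()
print("cross-validated accuracy:", cross_validate(X, y))
```

The practice the abstract warns against would instead call `fit_scaler` once on all of `X` before splitting into folds, so that the validation samples influence the transformation they are later evaluated under.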

Year	Venue	DocType	Volume	Citations	PageRank	References	Authors
2019	arXiv: Methodology	Journal	abs/1901.08974	0	0.34	0	2

Authors (2 rows)

Name	Order	Citations	PageRank
Amit Moscovich	1	0	0.68
Saharon Rosset	2	1087	105.33
