Abstract | ||
---|---|---|
Motivation: Numerous competing algorithms for prediction in high-dimensional settings have been developed in the statistical and machine-learning literature. Learning algorithms and the prediction models they generate are typically evaluated on the basis of cross-validation error estimates in a few exemplary datasets. However, in most applications, the ultimate goal of prediction modeling is to provide accurate predictions for independent samples obtained in different settings. Cross-validation within exemplary datasets may not adequately reflect performance in the broader application context. Methods: We develop and implement a systematic approach to 'cross-study validation', to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets. We illustrate it via simulations and in a collection of eight estrogen-receptor positive breast cancer microarray gene-expression datasets, where the objective is predicting distant metastasis-free survival (DMFS). We compute the C-index for all pairwise combinations of training and validation datasets. We evaluate several alternatives for summarizing the pairwise validation statistics, and compare these to conventional cross-validation. Results: Our data-driven simulations and our application to survival prediction with eight breast cancer microarray datasets suggest that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation. Furthermore, the ranking of learning algorithms differs, suggesting that algorithms performing best in cross-validation may be suboptimal when evaluated through independent validation. |
Year | DOI | Venue |
---|---|---|
2014 | 10.1093/bioinformatics/btu279 | BIOINFORMATICS |
Keywords | Field | DocType |
algorithms,artificial intelligence,gene expression profiling | Data mining,Pairwise comparison,Ranking,Computer science,Prediction algorithms,Predictive modelling,Bioinformatics,Application Context | Journal |
Volume | Issue | ISSN |
30 | 12 | 1367-4803 |
Citations | PageRank | References |
5 | 0.49 | 14 |
Authors | ||
7 |
Name | Order | Citations | PageRank |
---|---|---|---|
Christoph Bernau | 1 | 7 | 0.87 |
Markus Riester | 2 | 25 | 3.04 |
Anne-Laure Boulesteix | 3 | 945 | 63.22 |
Giovanni Parmigiani | 4 | 174 | 12.46 |
Curtis Huttenhower | 5 | 438 | 30.18 |
Levi Waldron | 6 | 51 | 6.96 |
Lorenzo Trippa | 7 | 7 | 1.00 |
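
The pairwise scheme summarized in the abstract (train on one study, compute the C-index on each other study, then summarize the resulting matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `c_index`, `fit_sign_learner`, `cross_study_matrix`, and `toy_studies` are hypothetical, the learner is a deliberately trivial one-feature rule, the data are made up, and the plain average is only one possible summary (the abstract says several alternatives are evaluated but does not list them).

```python
from statistics import mean


def c_index(times, events, scores):
    """Harrell-style C-index for right-censored data.

    A pair (i, j) is comparable when subject i has an observed event
    strictly before time j. It is concordant when i also has the higher
    risk score; tied scores count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def fit_sign_learner(train):
    """Toy learner (assumption): learn only whether x raises or lowers risk."""
    c = c_index(train["time"], train["event"], train["x"])
    sign = 1.0 if c >= 0.5 else -1.0
    return lambda xs: [sign * x for x in xs]


def cross_study_matrix(studies, fit):
    """Train on each study and validate on every other study.

    Returns a dict keyed by (training study, validation study). The
    diagonal (train == validate) is skipped here; within-study
    cross-validation would be needed to fill it.
    """
    z = {}
    for train_name, train in studies.items():
        model = fit(train)
        for valid_name, valid in studies.items():
            if train_name == valid_name:
                continue
            risk = model(valid["x"])
            z[(train_name, valid_name)] = c_index(valid["time"], valid["event"], risk)
    return z


# Made-up toy studies: one feature x, follow-up time, event indicator.
toy_studies = {
    "A": {"x": [3, 2, 1], "time": [1, 2, 3], "event": [1, 1, 1]},
    "B": {"x": [5, 1, 4, 2], "time": [2, 8, 3, 9], "event": [1, 0, 1, 1]},
    "C": {"x": [1, 2, 3], "time": [1, 2, 3], "event": [1, 1, 1]},
}

z = cross_study_matrix(toy_studies, fit_sign_learner)
csv_summary = mean(z.values())  # simple average of off-diagonal C-indices
```

In this toy setup, study C has the opposite feature-outcome relationship from A and B, so models trained elsewhere transfer poorly to it. That is the kind of between-study heterogeneity a within-study cross-validation estimate would not reveal.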