Title
Leakage in data mining: formulation, detection, and avoidance
Abstract
Deemed "one of the top ten data mining mistakes", leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical i.i.d. assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected.
Year
2011
DOI
10.1145/2382577.2382579
Venue
ACM Transactions on Knowledge Discovery from Data (TKDD) - Special Issue on the Best of SIGKDD 2011
Keywords
data mining target, data mining mistake, defining modeling goal, social network challenge, data mining challenge, major public data mining, real-life data mining project, causal graph modeling concept, data mining problem, new approach, real-life project, data management, leakage, data mining, prediction model, social network, statistical inference, predictive modeling
DocType
Conference
Volume
6
Issue
4
ISSN
1556-4681
Citations
29
PageRank
2.92
References
9
Authors
4
Name
Order
Citations
PageRank
Shachar Kaufman 1363.63
Saharon Rosset 21087105.33
Claudia Perlich 352345.01
Ori Stitelman 41117.82