Title
Snoring: a noise in defect prediction datasets
Abstract
To develop and train defect prediction models, researchers rely on datasets in which a defect is often attributed to the release in which it is discovered. However, a defect is frequently discovered only several releases after its introduction. This can bias the dataset: the intermediate releases are treated as defect-free and only the later ones as defect-prone. We call this phenomenon "sleeping defects". We call "snoring" the phenomenon where classes are affected only by sleeping defects and are therefore treated as defect-free until the defect is discovered. In this paper we analyze, on data from 282 releases of six open source projects from the Apache ecosystem, the magnitude of sleeping defects and of snoring classes. Our results indicate that 1) in all projects, most of the defects slept for more than 20% of the existing releases, and 2) in the majority of the projects, the missing rate is more than 25% even if we remove the last 50% of releases.
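The labeling problem described in the abstract can be illustrated with a minimal sketch. It is not the authors' tooling: the `Defect` record, the function names, and the toy data are hypothetical, and the definition of "missing rate" used here (the share of truly defective class-release pairs that the dataset labels as defect-free) is an assumption made for illustration. The sketch also assumes a defect is present in every release from its introduction up to and including its discovery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    cls: str         # affected class (hypothetical identifier)
    introduced: int  # index of the release that introduced the defect
    discovered: int  # index of the release in which it was discovered


def true_labels(defects):
    """(class, release) pairs that actually contain a defect.

    Assumption: a defect is present from its introduction up to and
    including the release in which it is discovered.
    """
    return {
        (d.cls, r)
        for d in defects
        for r in range(d.introduced, d.discovered + 1)
    }


def observed_labels(defects):
    """(class, release) pairs labeled defective when each defect is
    attributed only to the release in which it was discovered."""
    return {(d.cls, d.discovered) for d in defects}


def snoring(defects):
    """Pairs that are defective in reality but labeled defect-free."""
    return true_labels(defects) - observed_labels(defects)


def missing_rate(defects):
    """Assumed definition: mislabeled defective pairs / all defective pairs."""
    truth = true_labels(defects)
    return len(snoring(defects)) / len(truth) if truth else 0.0


# Toy data: class A has a defect sleeping from release 1 to 4 and another
# found immediately in release 4; class B has one sleeping from 2 to 3.
defects = [Defect("A", 1, 4), Defect("A", 4, 4), Defect("B", 2, 3)]
print(sorted(snoring(defects)))
print(missing_rate(defects))
```

In this toy data, class A in releases 1 to 3 and class B in release 2 are snoring, so four of the six truly defective class-release pairs would be treated as defect-free.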
Year: 2019
DOI: 10.1109/MSR.2019.00019
Venue: Proceedings of the 16th International Conference on Mining Software Repositories
Keywords: dataset bias, defect prediction, fix-inducing changes
Field: Data mining, Computer science
DocType: Conference
ISSN: 2160-1852
ISBN: 978-1-7281-3370-6
Citations: 1
PageRank: 0.35
References: 20
Authors: 3
Name                    Order  Citations  PageRank
Aalok Ahluwalia         1      1          0.35
Davide Falessi          2      504        34.89
Massimiliano Di Penta   3      5703       265.47