Title
An empirical study to investigate oversampling methods for improving software defect prediction using imbalanced data.
Abstract
Software defect prediction is important for identifying defects in the early phases of the software development life cycle. Early identification and removal of defects is crucial to delivering a cost-effective, good-quality software product. Although previous studies have successfully used machine learning techniques for software defect prediction, these techniques yield biased results when applied to imbalanced datasets. An imbalanced dataset has a non-uniform class distribution, with very few instances of one class compared to the other. Using imbalanced datasets leads to off-target predictions for the minority class, which is generally considered more important than the majority class. Handling imbalanced data effectively is therefore crucial to developing a competent defect prediction model. This study evaluates the effectiveness of machine learning classifiers for software defect prediction on twelve imbalanced NASA datasets by applying sampling methods and cost-sensitive classifiers. We investigate five existing oversampling methods, which replicate instances of the minority class, and propose a new method, SPIDER3, by suggesting modifications to the SPIDER2 oversampling method. Furthermore, the work evaluates the performance of MetaCost learners for cost-sensitive learning on imbalanced datasets. The results show that oversampling methods improve the prediction capability of machine learning classifiers. The proposed SPIDER3 method also shows promising results.
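The abstract describes oversampling as replicating instances of the minority class. As a rough illustration of that idea only, the sketch below implements plain random oversampling in Python. It is not SPIDER2, SPIDER3, or necessarily any of the five methods the study evaluates. The function name `random_oversample`, the toy feature rows, and the 8-to-2 class split are all assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Replicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Original rows belonging to this class; replicas are drawn from here.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (assumed): 8 non-defective modules (label 0), 2 defective (label 1).
rows = [[i, i * 2] for i in range(10)]
labels = [0] * 8 + [1] * 2

new_rows, new_labels = random_oversample(rows, labels)
balanced_counts = Counter(new_labels)
```

After the call, both classes have the same number of rows, and every appended row is an exact copy of an original minority-class row. Because the copies carry no new information, the methods named in the abstract are presumably more selective about which instances to replicate. Resampling is normally applied to the training split only, so that evaluation data keeps its original class distribution.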
Year: 2019
DOI: 10.1016/j.neucom.2018.04.090
Venue: Neurocomputing
Keywords: Defect prediction, Imbalanced data, Oversampling methods, MetaCost learners, Machine learning techniques, Procedural metrics
Field: Data set, Oversampling, Software bug, Software, Artificial intelligence, Sampling (statistics), Systems development life cycle, Mathematics, Empirical research, Replicate, Machine learning
DocType: Journal
Volume: 343
ISSN: 0925-2312
Citations: 4
PageRank: 0.38
References: 0
Authors: 2
Name              Order  Citations  PageRank
Ruchika Malhotra  1      533        35.12
Shine Kamal       2      4          0.38