Title
Understanding Machine Learning Software Defect Predictions
Abstract
Software defects are a well-known problem in software development and can cause problems for users and developers alike. As a result, researchers have employed distinct techniques to mitigate the impact of these defects on the source code. One of the most notable techniques is defect prediction using machine learning methods, which can help developers handle defects before they reach the production environment. These studies provide alternative approaches to predicting the likelihood of defects. However, most of these works concentrate on predicting defects from a vast set of software features. Another key issue with the current literature is the lack of a satisfactory explanation of the reasons that drive the software to a defective state. In this work, we use a tree boosting algorithm (XGBoost) that receives as input a training set comprising records of easy-to-compute characteristics of each module and outputs whether the corresponding module is defect-prone. To exploit the link between predictive power and model explainability, we propose a simple model sampling approach that finds accurate models with the minimum set of features. Our principal idea is that features that do not increase predictive power should not be included in the model. Interestingly, the reduced set of features also increases model explainability, which is important for telling developers which features are associated with the more defect-prone modules of the code. We evaluate our models on diverse projects within the Jureczko datasets and show that (i) the features that contribute most to finding the best models may vary depending on the project and (ii) it is possible to find effective models that use few features, leading to better understandability. We believe our results are useful to developers, as we provide the specific software features that influence the defectiveness of the selected projects.
Year
2020
DOI
10.1007/s10515-020-00277-4
Venue
Automated Software Engineering
Keywords
Software defects, Explainable models, Jureczko datasets, SHAP values
DocType
Journal
Volume
27
Issue
3-4
ISSN
0928-8910
Citations
2
PageRank
0.40
References
0
Authors
5
Name | Order | Citations | PageRank
Geanderson Esteves dos Santos | 1 | 2 | 0.40
Eduardo Figueiredo | 2 | 851 | 36.26
Adriano Veloso | 3 | 749 | 54.37
Markos Viggiato | 4 | 4 | 2.78
N. Ziviani | 5 | 29 | 2.56