Title
Learning from Open-Source Projects: An Empirical Study on Defect Prediction
Abstract
The fundamental issue in cross-project defect prediction is selecting the most appropriate training data for creating quality defect predictors. Another concern is whether, from a practical point of view, historical data from open-source projects can be used to create quality predictors for proprietary projects. Current studies have proposed statistical approaches to finding such training data; however, no apparent effort has yet been made to study their success on proprietary data. These methods also apply brute-force techniques, which are computationally expensive. In this work, we introduce a novel data selection procedure that takes into account the similarities between the distributions of the test data and the potential training data. Additionally, we use feature subset selection to increase the similarity between the test and training sets. Our procedure provides a comparable and scalable means of solving the cross-project defect prediction problem for creating quality defect predictors. To evaluate our procedure, we conducted empirical studies comparing it with within-company defect prediction and a relevancy filtering method. We found that our proposed method performs relatively better than the filtering method in terms of both computation cost and prediction performance.
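The abstract does not spell out how distribution similarity is measured, so the following is only a minimal sketch of the general idea of ranking candidate training projects by how closely their metric distributions match the test project. It assumes, purely for illustration, that each project is summarised by per-feature means and standard deviations and that similarity is the Euclidean distance between those summaries. The function names (`profile`, `distance`, `rank_candidates`) and the toy data are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean, pstdev


def profile(rows):
    """Summarise a project as a flat list of per-feature (mean, std) values."""
    columns = list(zip(*rows))
    return [stat(col) for col in columns for stat in (mean, pstdev)]


def distance(p, q):
    """Euclidean distance between two profiles (an assumed similarity measure)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rank_candidates(test_rows, candidates):
    """Return candidate project names, most similar to the test project first."""
    target = profile(test_rows)
    return sorted(
        candidates,
        key=lambda name: distance(profile(candidates[name]), target),
    )


# Toy, made-up metric rows (two features per module); not data from the study.
test_project = [[10.0, 2.0], [12.0, 3.0], [11.0, 2.0]]
candidates = {
    "project_a": [[11.0, 2.0], [12.0, 2.0], [10.0, 3.0]],
    "project_b": [[200.0, 40.0], [180.0, 35.0], [220.0, 50.0]],
}
ranking = rank_candidates(test_project, candidates)
```

In this sketch each candidate is summarised once, so the cost grows with the number of candidate projects rather than with pairwise instance comparisons. In practice the features would likely need to be put on a common scale first, since an unscaled large-valued metric would dominate the distance.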
Year
2013
DOI
10.1109/ESEM.2013.20
Venue
ESEM
Keywords
public domain software,cross project defect prediction problem,data selection procedure,test distribution,open-source project learning,relevancy filtering method,statistical analysis,learning (artificial intelligence),company defect prediction,quality defect predictor creation,computation cost,program debugging,prediction performance,brute force techniques,test-training set similarity,statistical approach,project management,feature subset selection,cross-project,proprietary data,software defect prediction,data similarity,instance selection,training data,proprietary projects,learning artificial intelligence
Field
Training set,Data modeling,Data mining,Computer science,Filter (signal processing),Cross project,Artificial intelligence,Machine learning,Empirical research,Scalability,Computation,Project management
DocType
Conference
Volume
null
Issue
null
ISSN
1938-6451
ISBN
978-0-7695-5056-5
Citations
24
PageRank
0.63
References
22
Authors
4
Name
Order
Citations
PageRank
Zhimin He153635.90
Fayola Peters2240.63
Tim Menzies32886151.44
Ye Yang410318.26