Title
Improving generality and accuracy of existing public development project selection methods: a study on GitHub ecosystem
Abstract
The tools and datasets available in the GitHub ecosystem give researchers the opportunity to study diverse software engineering problems at large scale. However, using such large-scale datasets directly carries many potential threats, and an important one is that GitHub contains many private projects (e.g., homework) and non-development projects (e.g., blogs). Researchers who study the cooperative behavior of developers or the development process of projects need samples that exclude both kinds of projects. To address this problem, we first analyzed the weaknesses of the baseline methods (i.e., selecting top projects) and of extended ML-based methods (Extended_MLMs for short, i.e., models trained on a labeled training dataset using ML algorithms). We then proposed two methods, Enhanced_RFM and Fusion_DL_RFM, to address the weaknesses of Extended_RFM, the Random Forest-based Extended_MLM that performs best among all the Extended_MLMs. The results show that: (1) existing project sample selection methods have a low F-measure and poor generality (i.e., they perform poorly on the testing dataset); (2) Enhanced_RFM outperforms Fusion_DL_RFM in accuracy and stability; and (3) adopting Enhanced_RFM improves the F-measure of Extended_RFM from 0.690 to 0.810 and its precision from 0.559 to 0.785 under cross validation, which indicates that the generality of Extended_RFM is significantly improved.
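The abstract reports precision and F-measure but not recall. The following is a minimal sketch of how the two reported metrics relate, assuming the F-measure is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state which variant was used. The function names `f_measure` and `implied_recall` are illustrative and do not come from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f: float) -> float:
    """Solve F = 2PR / (P + R) for R, given P and F (assumes balanced F1)."""
    denominator = 2 * precision - f
    if denominator <= 0:
        raise ValueError("precision and F-measure are inconsistent under F1")
    return f * precision / denominator


# (precision, F-measure) pairs reported in the abstract under cross validation.
reported = {
    "Extended_RFM": (0.559, 0.690),
    "Extended_RFM with Enhanced_RFM": (0.785, 0.810),
}

for name, (p, f) in reported.items():
    r = implied_recall(p, f)
    print(f"{name}: precision={p:.3f}, F={f:.3f}, implied recall={r:.3f}")
```

Under the F1 assumption, the implied recall is roughly 0.90 before and 0.84 after, which would suggest that the gain in F-measure comes mainly from higher precision. These recall values are derived here and are not reported in the abstract.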
Year: 2022
DOI: 10.1007/s10515-022-00322-4
Venue: Automated Software Engineering
Keywords: Open source software project, GitHub, Public development project
DocType: Journal
Volume: 29
Issue: 1
ISSN: 0928-8910
Citations: 0
PageRank: 0.34
References: 9
Authors: 6
Name           Order  Citations  PageRank
Can Cheng      1      0          0.34
Bing Li        2      380        40.45
Zengyang Li    3      0          0.34
Peng Liang     4      570        49.57
Xiaofeng Han   5      3          1.13
Jiahua Zhang   6      2          5.23