Title | ||
---|---|---|
Improving generality and accuracy of existing public development project selection methods: a study on GitHub ecosystem |
Abstract | ||
---|---|---|
The tools and datasets available in the GitHub ecosystem give researchers the opportunity to study diverse software engineering problems at large scale. However, using large-scale datasets directly carries many potential threats; an important one is that GitHub contains many private projects (e.g., homework) and non-development projects (e.g., blogs). For researchers who study the cooperative behavior of developers or the development process of projects, research samples should contain neither private projects nor non-development projects. To solve this problem, we first analyzed the weaknesses of the baseline methods (i.e., selecting top projects) and of extended ML-based methods (i.e., training models on a labeled training dataset using ML algorithms; Extended_MLMs for short). We then proposed two methods, Enhanced_RFM and Fusion_DL_RFM, to address the weaknesses of Extended_RFM (the Extended_MLM that is based on Random Forest and performs best among all the Extended_MLMs). The results show that: (1) existing project sample selection methods have a low F-measure and poor generality (i.e., they perform poorly on the testing dataset); (2) Enhanced_RFM outperforms Fusion_DL_RFM in accuracy and stability; and (3) by adopting Enhanced_RFM, the F-measure of Extended_RFM improves from 0.690 to 0.810 and its precision improves from 0.559 to 0.785 under cross-validation, which indicates that the generality of Extended_RFM is significantly improved. |
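
The abstract reports precision and F-measure without restating their definitions. As a reminder of how the two relate, the following is a minimal sketch assuming the conventional definitions, with the F-measure taken as the balanced F1 score (`beta = 1`); the abstract does not state which variant the authors use. The function name and the example counts are hypothetical and do not come from the paper.

```python
def precision_recall_f_measure(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-measure) from confusion-matrix counts.

    tp: projects correctly selected as public development projects
    fp: projects wrongly selected (e.g., homework or blog repositories)
    fn: public development projects that were missed
    beta=1.0 gives the balanced F1 score (an assumption, see lead-in).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        # Avoid dividing by zero when nothing was classified correctly.
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f_measure(tp=80, fp=20, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

Because the F-measure combines precision and recall, a selection method that admits many non-development projects (low precision) is penalized even when it misses few true development projects.
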
Year | DOI | Venue |
---|---|---|
2022 | 10.1007/s10515-022-00322-4 | Automated Software Engineering |
Keywords | DocType | Volume |
Open source software project, GitHub, Public development project | Journal | 29 |
Issue | ISSN | Citations |
1 | 0928-8910 | 0 |
PageRank | References | Authors |
0.34 | 9 | 6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Can Cheng | 1 | 0 | 0.34 |
Bing Li | 2 | 380 | 40.45 |
Zengyang Li | 3 | 0 | 0.34 |
Peng Liang | 4 | 570 | 49.57 |
Xiaofeng Han | 5 | 3 | 1.13 |
Jiahua Zhang | 6 | 2 | 5.23 |