Improving generality and accuracy of existing public development project selection methods: a study on GitHub ecosystem - Citegraph

Paper Info

Title
Improving generality and accuracy of existing public development project selection methods: a study on GitHub ecosystem

Abstract
With the tools and datasets available in the GitHub ecosystem, researchers have the opportunity to study diverse software engineering problems on large-scale datasets. However, using such datasets directly poses many potential threats; an important one is that GitHub contains many private projects (e.g., homework) and non-development projects (e.g., blogs). For researchers who want to study the cooperative behavior of developers or the development process of projects, research samples should contain neither private projects nor non-development projects. To solve this problem, we first analyzed the weaknesses of the baseline methods (i.e., selecting top projects) and of extended ML-based methods (i.e., training models on a labeled training dataset using ML algorithms; Extended_MLMs for short). We then proposed two methods, Enhanced_RFM and Fusion_DL_RFM, to address the weaknesses of Extended_RFM (the Extended_MLM that is based on Random Forest and performs best among all the Extended_MLMs). The results show that: (1) existing project sample selection methods have a low F-measure and poor generality (i.e., they perform poorly on the testing dataset); (2) Enhanced_RFM outperforms Fusion_DL_RFM in accuracy and stability; and (3) by adopting Enhanced_RFM, the F-measure of Extended_RFM improves from 0.690 to 0.810 and its precision improves from 0.559 to 0.785 under cross validation, which indicates that the generality of Extended_RFM is significantly improved.
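
The abstract reports results as precision and F-measure but does not say which F-measure variant is used. As a minimal sketch, assuming the balanced F1 score (the harmonic mean of precision and recall), the two metrics relate as follows. The function name and the confusion-matrix counts are hypothetical illustrations, not values from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts.

    Assumes the balanced F1 variant; the abstract does not specify
    which F-measure the authors report.
    """
    # Precision: share of projects predicted "public development" that truly are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true public development projects that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall (0.0 when both are zero).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only, not taken from the paper.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, so a selection method needs both to be high to score well on it.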

Year	2022
DOI	10.1007/s10515-022-00322-4
Venue	Automated Software Engineering
Keywords	Open source software project, GitHub, Public development project
DocType	Journal
Volume	29
Issue	1
ISSN	0928-8910
Citations	0
PageRank	0.34
References	9
Authors	6

Authors (6 rows)

Name	Order	Citations	PageRank
Can Cheng	1	0	0.34
Bing Li	2	380	40.45
Zengyang Li	3	0	0.34
Peng Liang	4	570	49.57
Xiaofeng Han	5	3	1.13
Jiahua Zhang	6	2	5.23
