Title
Recognizing lines of code violating company-specific coding guidelines using machine learning
Abstract
Software developers in large and medium-sized companies work with millions of lines of code in their codebases. Assuring the quality of this code has shifted from simple defect management to proactive assurance of internal code quality. Although static code analysis and code reviews have been at the forefront of research and practice in this area, code reviews remain an effort-intensive and interpretation-prone activity. The aim of this research is to support code reviews by automatically recognizing violations of company-specific coding guidelines in large-scale, industrial source code. In our action research project, we constructed a machine-learning-based tool for code analysis with which software developers and architects in large and medium-sized companies can use a few examples of source code lines violating code/design guidelines (up to 700 lines of code) to train decision-tree classifiers that find similar violations in their codebases (up to 3 million lines of code). Our action research project consisted of (i) understanding the challenges of two large software development companies, (ii) applying the machine-learning-based tool to detect violations of Sun's and Google's coding conventions in the code of three large open source projects implemented in Java, (iii) evaluating the tool on an evolving industrial codebase, and (iv) finding the best learning strategies to reduce the cost of training the classifiers. We achieved an average accuracy of over 99% and an average F-score of 0.80 for the open source projects when using ca. 40K lines of code to train the tool. We obtained a similar average F-score of 0.78 for the industrial code, but this time using only up to 700 lines of code as the training dataset. Finally, we observed that the tool performed visibly better for rules that require understanding only a single line of code or the context of a few lines (often reaching an F-score of 0.90 or higher). Based on these results, we conclude that this approach can give modern software development companies the ability to use examples to teach an algorithm to recognize violations of code/design guidelines and thus increase the number of reviews conducted before a product release. This, in turn, leads to increased quality of the final software.
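Illustrative sketch (not the authors' implementation): the abstract describes training decision-tree classifiers on a small set of labeled code lines and then applying them to flag similar guideline violations in a large codebase. The Python fragment below shows one minimal way such a line-level classifier could be set up with scikit-learn; the training lines, labels, and bag-of-tokens features are assumptions for illustration only.

# Minimal sketch, assuming scikit-learn and a simple bag-of-tokens
# representation of each source code line. Labels: 1 = line violates a
# (hypothetical) company-specific guideline, 0 = compliant line.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Hypothetical training data: a few labeled example lines.
train_lines = [
    "public void foo(){doWork();}",   # e.g. brace/spacing violation
    "public void bar() {",
    "int x=1;",                        # e.g. missing spaces around operator
    "int y = 1;",
]
train_labels = [1, 0, 1, 0]

# Tokenize each line into whitespace-separated code tokens.
vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False)
X_train = vectorizer.fit_transform(train_lines)

# Train a decision-tree classifier on the labeled lines.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, train_labels)

# Apply the classifier to unseen lines from a larger codebase and report
# the F-score against (hypothetical) ground-truth labels.
test_lines = ["void baz(){", "int z = 2;"]
test_labels = [1, 0]
predictions = classifier.predict(vectorizer.transform(test_lines))
print("F-score:", f1_score(test_labels, predictions))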
Year: 2020
DOI: 10.1007/s10664-019-09769-8
Venue: Empirical Software Engineering
Keywords: Measurement, Machine learning, Action research, Code reviews
Field: Codebase, Data mining, Static program analysis, Software engineering, Computer science, Source code, Coding conventions, Software quality, Code review, Software development, Source lines of code
DocType: Journal
Volume: 25
Issue: 1
ISSN: 1382-3256
Citations: 1
PageRank: 0.34
References: 0
Authors: 5
Name               Order   Citations   PageRank
Miroslaw Ochodek   1       96          9.74
Regina Hebig       2       179         24.24
Wilhelm Meding     3       212         18.66
Gert Frost         4       1           0.34
Miroslaw Staron    5       486         52.25