Title | ||
---|---|---|
Is this code written in English? A study of the natural language of comments and identifiers in practice |
Abstract | ||
---|---|---|
Comments and identifiers are the main source of documentation in source code and are therefore integral to the development and maintenance of a program. Because English is the world language, most comments and identifiers are written in English. When they are written in another language, however, a developer who does not know that language will perceive the code as nearly undocumented or even obfuscated. In the absence of industrial data, academia does not know the extent of the problem of non-English comments and identifiers in practice. In this paper, we propose an approach for identifying the natural language of source-code comments and identifiers. Using this approach, we conducted a large-scale study of the natural language of source-code comments and identifiers across multiple open-source and industry systems. The results show that a significant number of the industry projects contain comments and identifiers in more than one language, whereas none of the analyzed open-source systems has this problem. |
Year | DOI | Venue |
---|---|---|
2015 | 10.1109/ICSM.2015.7332491 | ICSME |
Keywords | Field | DocType |
natural language, identifiers, source-code documentation, program development, program maintenance, English, language identification, source-code comments, open-source system, industry system, industry project | World Wide Web, Programming language, World language, Identifier, Systems engineering, Computer science, Internal documentation, Natural language programming, Natural language, Language identification, Documentation, Language industry | Conference |
ISSN | Citations | PageRank |
1063-6773 | 2 | 0.35 |
References | Authors | |
15 | 2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Timo Pawelka | 1 | 2 | 0.35 |
Elmar Juergens | 2 | 743 | 31.07 |