Title
Is this code written in English? A study of the natural language of comments and identifiers in practice
Abstract
Comments and identifiers are the main source of documentation in source code and are therefore an integral part of the development and maintenance of a program. As English is the world language, most comments and identifiers are written in English. However, if they are written in any other language, a developer who does not know that language will perceive the code as almost undocumented or even obfuscated. In the absence of industrial data, academia is not aware of the extent of the problem of non-English comments and identifiers in practice. In this paper, we propose an approach for identifying the natural language of source-code comments and identifiers. Using this approach, we conducted a large-scale study of the natural language of source-code comments and identifiers, analyzing multiple open-source and industry systems. The results show that a significant number of the industry projects contain comments and identifiers in more than one language, whereas none of the analyzed open-source systems has this problem.
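As a rough illustration of what language identification for comments can look like, the sketch below extracts `//` and `#` line comments and guesses their language by counting stopword hits. This is not the approach proposed in the paper, whose details the abstract does not give. The stopword lists, the comment pattern, and the function names `extract_comments` and `guess_language` are all illustrative assumptions.

```python
import re

# Tiny illustrative stopword lists (assumption: not taken from the paper).
STOPWORDS = {
    "english": {"the", "is", "and", "of", "to", "this", "for", "in", "if", "not"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "für", "wenn", "wird", "ein"},
}

# Captures the text of // and # line comments.
# Simplified on purpose: it does not handle block comments or string literals.
COMMENT_RE = re.compile(r"(?://|#)[ \t]*(.*)")


def extract_comments(source: str) -> list[str]:
    """Return the text of each line comment found in the source."""
    return [m.group(1) for m in COMMENT_RE.finditer(source)]


def guess_language(text: str) -> str:
    """Return the language whose stopwords occur most often, or 'unknown'."""
    # Split into runs of letters (Unicode-aware, so umlauts are kept).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Under these assumed lists, a system could be flagged as mixed-language when `guess_language` returns more than one distinct language across its comments. Identifiers would need extra handling, such as splitting camelCase names into words, which this sketch leaves out.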
Year: 2015
DOI: 10.1109/ICSM.2015.7332491
Venue: ICSME
Keywords: natural language, identifiers, source-code documentation, program development, program maintenance, English, language identification, source-code comments, open-source system, industry system, industry project
Field: World Wide Web, Programming language, World language, Identifier, Systems engineering, Computer science, Internal documentation, Natural language programming, Natural language, Language identification, Documentation, Language industry
DocType: Conference
ISSN: 1063-6773
Citations: 2
PageRank: 0.35
References: 15
Authors: 2
Name            Order  Citations  PageRank
Timo Pawelka    1      2          0.35
Elmar Juergens  2      743        31.07