Title
Handling Language Variations in Open Source Bug Reporting Systems
Abstract
Natural language plays a critical role in the design, development and maintenance of software systems. For example, bug reporting systems allow users to submit reports describing observed anomalies in free-form English. However, the free-form aspect makes the detection of duplicate reports a challenge due to the breadth and diversity of language used by individual reporters. Tokenization, stemming and stop word removal are commonly used techniques to normalize and reduce the language space. However, the impact of typographical errors and alternate spellings has not been analyzed in the research literature. Our research indicates that handling language problems during automated bug triage analysis can lead to a boost in performance. We show that the language used in software problem reporting is too specialized to benefit from domain-independent spell checkers or lexical databases. Therefore, we present a novel approach using word distance and neighbor word likelihood measures for detecting and resolving language-based issues in open-source software problem reporting. We evaluate our approach using the complete Firefox repository until March 2012. Our results indicate measurable improvements in duplicate detection results, while reducing the language space for most frequently used words by 30%. Moreover, our method is language-agnostic and does not require a pre-built dictionary, thus making it suitable for use in a variety of systems.
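The abstract names word distance and neighbor word likelihood as the two measures but does not define them. The following is a minimal sketch of how such a pairing could work, not the authors' implementation: it assumes Levenshtein edit distance as the word distance, and substitutes a simple Jaccard overlap of adjacent-token sets for the neighbor word likelihood. The function names, thresholds, and sample reports are all hypothetical.

```python
from collections import Counter, defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def neighbor_counts(reports):
    """Map each token to a Counter of tokens seen directly beside it."""
    neighbors = defaultdict(Counter)
    for report in reports:
        tokens = report.lower().split()
        for left, right in zip(tokens, tokens[1:]):
            neighbors[left][right] += 1
            neighbors[right][left] += 1
    return neighbors


def neighbor_overlap(a, b, neighbors):
    """Jaccard overlap of two tokens' neighbor sets (assumed stand-in
    for a neighbor word likelihood measure)."""
    na, nb = set(neighbors.get(a, ())), set(neighbors.get(b, ()))
    if not na or not nb:
        return 0.0
    return len(na & nb) / len(na | nb)


def propose_merges(reports, max_distance=2, min_overlap=0.5, min_length=4):
    """Map rare tokens to a more frequent token that is both close in
    spelling and used in similar contexts. Thresholds are illustrative."""
    freq = Counter(tok for r in reports for tok in r.lower().split())
    neighbors = neighbor_counts(reports)
    vocab = sorted(freq, key=lambda w: (-freq[w], w))  # most frequent first
    merges = {}
    for i, rare in enumerate(vocab):
        if len(rare) < min_length:
            continue
        for common in vocab[:i]:
            if freq[common] <= freq[rare] or len(common) < min_length:
                continue
            if (edit_distance(rare, common) <= max_distance
                    and neighbor_overlap(rare, common, neighbors) >= min_overlap):
                merges[rare] = common
                break
    return merges


# Hypothetical report summaries, made up for illustration only.
reports = [
    "firefox crashes on startup",
    "firefox crashes on exit",
    "firefox crashes after update",
    "firefx crashes on startup",
    "browser freezes on startup",
]
merges = propose_merges(reports)
print(merges)  # {'firefx': 'firefox'}
```

Because the candidate vocabulary comes from the reports themselves, this sketch needs no pre-built dictionary, which matches the property the abstract claims. The neighbor check is what keeps a short edit distance alone from merging unrelated domain terms.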
Year: 2012
DOI: 10.1109/ISSREW.2012.85
Venue: ISSRE Workshops
Keywords: computational linguistics, natural language processing, program verification, public domain software, software development management, software maintenance, spelling aids, automated bug triage analysis, language variation handling, lexical database, natural language, neighbor word likelihood measure, open source bug reporting system, open source software, resolving language-based issue detection, software development, software maintenance, software system design, spelling checker, stemming technique, stop word removal, tokenization technique, word distance, Alternate Spellings, Duplicate Bug Reports, Software Maintenance, Software Reliability, String Algorithms, Typographical Errors
Field: Tokenization (data security), Computer science, Computational linguistics, Software system, Natural language, Artificial intelligence, Natural language processing, Software maintenance, Typographical error, Software quality, Stop words
DocType: Conference
ISSN: 1071-9458
Citations: 1
PageRank: 0.36
References: 8
Authors: 3
Name | Order | Citations | PageRank
Sean Banerjee | 1 | 96 | 13.42
Musgrove, J. | 2 | 1 | 0.36
Cukic, B. | 3 | 12 | 1.52