Title
Understanding LDA in source code analysis
Abstract
Latent Dirichlet Allocation (LDA) has seen increasing use in the understanding of source code and its related artifacts, in part because of its impressive modeling power. However, this expressive power comes at a cost: the technique includes several tuning parameters whose impact on the resulting LDA model must be carefully considered. An obvious example is the burn-in period; too short a burn-in period leaves excessive echoes of the initial uniform distribution. The aim of this work is to provide insights into the tuning parameters' impact. Doing so improves the comprehension of both 1) researchers who look to exploit the power of LDA in their research and 2) those who interpret the output of LDA-using tools. It is important to recognize that the goal of this work is not to establish values for the tuning parameters, because there is no universal best setting. Rather, appropriate settings depend on the problem being solved, the input corpus (in this case, typically words from the source code and its supporting artifacts), and the needs of the engineer performing the analysis. This work's primary goal is to aid software engineers in their understanding of the LDA tuning parameters by demonstrating numerically and graphically the relationship between the tuning parameters and the LDA output. A secondary goal is to enable more informed setting of the parameters. Results obtained using both production source code and a synthetic corpus underscore the need for a solid understanding of how to configure LDA's tuning parameters.
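As a rough illustration of the kind of tuning parameters the abstract refers to, the following is a minimal sketch of a collapsed Gibbs sampler for LDA in Python. It is not the authors' implementation: the function name `lda_gibbs`, the parameter names (`num_topics`, `alpha`, `beta`, `burn_in`, `num_samples`), and the toy corpus are all assumptions made for illustration. Averaging the topic-word estimates over the post-burn-in sweeps of a single chain is also a simplification chosen here, not something the abstract prescribes.

```python
import random


def lda_gibbs(docs, vocab_size, num_topics, alpha, beta,
              burn_in, num_samples, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs is a list of documents, each a list of integer word ids.
    Returns topic-word distributions averaged over num_samples sweeps
    taken after burn_in discarded sweeps.
    """
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Initial state: each token gets a topic drawn uniformly at random.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    phi_sum = [[0.0] * vocab_size for _ in range(num_topics)]
    for sweep in range(burn_in + num_samples):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
        # Sweeps inside the burn-in period are discarded; later ones are kept.
        if sweep >= burn_in:
            for k in range(num_topics):
                denom = topic_total[k] + vocab_size * beta
                for w in range(vocab_size):
                    phi_sum[k][w] += (topic_word[k][w] + beta) / denom
    return [[p / num_samples for p in row] for row in phi_sum]


# Toy corpus (hypothetical): word ids 0-2 and 3-5 tend to co-occur.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 0], [4, 5, 3, 3]]
phi = lda_gibbs(docs, vocab_size=6, num_topics=2,
                alpha=0.1, beta=0.01, burn_in=50, num_samples=20)
for k, row in enumerate(phi):
    print(k, [round(p, 3) for p in row])
```

Setting `burn_in=0` keeps the very first sweeps, which still reflect the uniform random initialization; increasing it discards them. Varying `alpha`, `beta`, and `num_topics` on the same corpus is one simple way to observe how each parameter shapes the resulting topics.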
Year
2014
DOI
10.1145/2597008.2597150
Venue
ICPC
Keywords
applications, hyper-parameters, latent dirichlet allocation, source code topic models, natural language processing
Field
Data mining, Latent Dirichlet allocation, Source code, Computer science, Uniform distribution (continuous), Exploit, Software, Artificial intelligence, Expressive power, Machine learning, Comprehension
DocType
Conference
Citations
20
PageRank
0.69
References
12
Authors
4
Name | Order | Citations | PageRank
David Binkley | 1 | 2098 | 169.44
Daniel Heinz | 2 | 21 | 2.05
Dawn Lawrie | 3 | 685 | 44.50
Justin Overfelt | 4 | 22 | 1.05