Title
HarpLDA+: Optimizing latent dirichlet allocation for parallel efficiency
Abstract
Latent Dirichlet Allocation (LDA) is a machine learning technique widely used in topic modeling and data analysis. Training large LDA models on big datasets involves dynamic and irregular computation patterns and poses a major challenge to both algorithm optimization and system design. In this paper, we present a comprehensive benchmarking of our novel synchronized LDA training system, HarpLDA+, built on Hadoop and Java. It demonstrates impressive performance compared with three state-of-the-art MPI/C++ based systems: LightLDA, F+NomadLDA, and WarpLDA. HarpLDA+ uses optimized collective communication with timer-based control for load balancing, yielding stable scalability on both shared-memory and distributed systems. Our experiments demonstrate that HarpLDA+ effectively reduces synchronization and communication overhead and outperforms the other three LDA training systems.
Year
2017
DOI
10.1109/BigData.2017.8257932
Venue
2017 IEEE International Conference on Big Data (Big Data)
Keywords
parallel efficiency, latent dirichlet allocation, topic modeling, data analysis, LDA models, big datasets, irregular computation patterns, algorithm optimization, system design, comprehensive benchmarking, MPI/C++ based state-of-the-art systems, shared-memory, distributed systems, LDA training systems, machine learning technique, HarpLDA+, dynamic computation patterns, LDA training system, load balancing, Hadoop, Java
DocType
Conference
ISSN
2639-1589
ISBN
978-1-5386-2716-7
Citations
1
PageRank
0.36
References
0
Authors
12
Name              Order  Citations  PageRank
Bo Peng           1      9          2.91
Bingjing Zhang    2      521        25.17
Langshi Chen      3      1          0.36
Mihai Avram       4      1          0.70
Robert Henschel   5      106        10.85
Craig A. Stewart  6      259        42.68
Shaojuan Zhu      7      1          0.36
Emily Mccallum    8      1          0.36
Lisa Smith        9      1          0.36
Tom Zahniser      10     1          0.36
Jon Omer          11     1          0.36
Judy Qiu          12     3          2.07