Abstract |
---|
Latent Dirichlet Allocation (LDA) is a widely used machine learning technique for topic modeling and data analysis. Training large LDA models on big datasets involves dynamic and irregular computation patterns, posing a major challenge to both algorithm optimization and system design. In this paper, we present a comprehensive benchmarking of our novel synchronized LDA training system HarpLDA+, built on Hadoop and Java. It demonstrates impressive performance compared with three state-of-the-art MPI/C++ based systems: LightLDA, F+NomadLDA, and WarpLDA. HarpLDA+ uses optimized collective communication with timer control for load balancing, leading to stable scalability on both shared-memory and distributed systems. Our experiments demonstrate that HarpLDA+ effectively reduces synchronization and communication overhead and outperforms the other three LDA training systems. |
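The "dynamic and irregular computation patterns" the abstract mentions come from the per-token sampling loop at the heart of LDA training: work per document depends on its length, and the count tables touched depend on the data. The sketch below is a minimal collapsed Gibbs sampling sweep over a hypothetical toy corpus, illustrating standard LDA training in general, not the HarpLDA+ implementation; all variable names and data are illustrative assumptions.

```python
import random

random.seed(42)

K, V = 2, 4                            # topics, vocabulary size
alpha, beta = 0.5, 0.1                 # symmetric Dirichlet priors
docs = [[0, 1, 1, 2], [2, 3, 3, 0]]    # word ids per document (toy corpus)

# count tables: n_{d,k}, n_{w,k}, n_k
doc_topic = [[0] * K for _ in docs]
word_topic = [[0] * K for _ in range(V)]
topic_total = [0] * K

# random initialization of topic assignments z
z = []
for d, doc in enumerate(docs):
    zd = []
    for w in doc:
        k = random.randrange(K)
        zd.append(k)
        doc_topic[d][k] += 1
        word_topic[w][k] += 1
        topic_total[k] += 1
    z.append(zd)

# One Gibbs sweep. The inner loop length varies per document, and the rows
# of word_topic that get read and written depend on the data itself -- this
# data-dependent access pattern is what makes parallel LDA training hard to
# load-balance across workers.
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        old = z[d][i]
        doc_topic[d][old] -= 1
        word_topic[w][old] -= 1
        topic_total[old] -= 1
        # full conditional p(z_i = k | rest), up to a normalizing constant
        p = [(doc_topic[d][k] + alpha) * (word_topic[w][k] + beta)
             / (topic_total[k] + V * beta) for k in range(K)]
        u = random.random() * sum(p)
        k, acc = 0, p[0]
        while acc < u and k < K - 1:
            k += 1
            acc += p[k]
        z[d][i] = k
        doc_topic[d][k] += 1
        word_topic[w][k] += 1
        topic_total[k] += 1

# invariant: the count tables still account for every token after the sweep
assert sum(topic_total) == sum(len(doc) for doc in docs)
print(sum(topic_total))  # 8 tokens in the toy corpus
```

In a distributed setting each worker runs such sweeps on a partition of the documents while the shared `word_topic` counts must be kept in sync, which is where collective communication and load balancing (the focus of HarpLDA+) come into play.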
Year | DOI | Venue |
---|---|---|
2017 | 10.1109/BigData.2017.8257932 | 2017 IEEE International Conference on Big Data (Big Data) |
Keywords | DocType | ISSN |
---|---|---|
parallel efficiency, latent dirichlet allocation, topic modeling, data analysis, LDA models, big datasets, irregular computation patterns, algorithm optimization, system design, comprehensive benchmarking, MPI/C++ based state-of-the-art systems, shared-memory, distributed systems, LDA training systems, machine learning technique, HarpLDA+, dynamic computation patterns, LDA training system, load balancing, Hadoop, Java | Conference | 2639-1589 |
ISBN | Citations | PageRank |
---|---|---|
978-1-5386-2716-7 | 1 | 0.36 |
References | Authors |
---|---|
0 | 12 |
Name | Order | Citations | PageRank |
---|---|---|---|
Bo Peng | 1 | 9 | 2.91 |
Bingjing Zhang | 2 | 521 | 25.17 |
Langshi Chen | 3 | 1 | 0.36 |
Mihai Avram | 4 | 1 | 0.70 |
Robert Henschel | 5 | 106 | 10.85 |
Craig A. Stewart | 6 | 259 | 42.68 |
Shaojuan Zhu | 7 | 1 | 0.36 |
Emily Mccallum | 8 | 1 | 0.36 |
Lisa Smith | 9 | 1 | 0.36 |
Tom Zahniser | 10 | 1 | 0.36 |
Jon Omer | 11 | 1 | 0.36 |
Judy Qiu | 12 | 3 | 2.07 |