Title
Model averaging in distributed machine learning: a case study with Apache Spark
Abstract
The increasing popularity of Apache Spark has attracted many users to put their data into its ecosystem. On the other hand, it has been repeatedly observed in the literature that Spark is slow when it comes to distributed machine learning (ML). One remedy is to switch to specialized systems such as parameter servers, which are claimed to offer better performance, but users then have to undergo the painful procedure of moving data into and out of Spark. In this paper, we investigate the performance bottlenecks of MLlib (an official Spark package for ML) in detail, focusing on its implementation of stochastic gradient descent (SGD), the workhorse behind the training of many ML models. We show that the performance inferiority of Spark is caused by implementation issues rather than fundamental flaws of the bulk synchronous parallel (BSP) model that governs Spark's execution: we can significantly improve Spark's performance by leveraging the well-known "model averaging" (MA) technique in distributed ML. Indeed, model averaging is not limited to SGD, and we further showcase an application of MA to training latent Dirichlet allocation (LDA) models within Spark. Our implementation is not intrusive and requires light development effort. Experimental evaluation results reveal that the MA-based versions of SGD and LDA can be orders of magnitude faster than their counterparts without MA.
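To make the model-averaging idea concrete, below is a minimal, hypothetical Scala/Spark sketch (not the authors' MLlib code) of MA-based SGD for a toy least-squares problem: each partition runs local SGD over its own data, and the driver averages the per-partition models once per round, so every BSP round involves only a single communication step. The object name, toy data, and learning rate are all illustrative assumptions.

// Hypothetical sketch of model averaging (MA) for distributed SGD on Spark.
// Each partition runs local SGD; the driver averages the local models once
// per BSP round (one communication step per round). Not the paper's code.
import org.apache.spark.sql.SparkSession

object ModelAveragingSgdSketch extends Serializable {

  // One local SGD pass over a partition, for a toy least-squares objective.
  def localSgd(points: Iterator[(Array[Double], Double)],
               init: Array[Double],
               lr: Double): Array[Double] = {
    val w = init.clone()
    points.foreach { case (x, y) =>
      val pred = w.zip(x).map { case (wi, xi) => wi * xi }.sum
      val err = pred - y
      for (i <- w.indices) w(i) -= lr * err * x(i)
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MA-SGD-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy data: y = 2*x1 + 3*x2 + 1 (last feature is a bias term).
    val rnd = new scala.util.Random(42)
    val data = sc.parallelize(Seq.fill(1000) {
      val x = Array(rnd.nextDouble(), rnd.nextDouble(), 1.0)
      (x, 2.0 * x(0) + 3.0 * x(1) + 1.0)
    }, 4).cache()

    var model = Array.fill(3)(0.0)
    val lr = 0.1

    for (_ <- 1 to 20) { // communication rounds
      val bcModel = sc.broadcast(model)
      // Local SGD per partition, then average the per-partition models.
      val (sum, cnt) = data
        .mapPartitions(it => Iterator((localSgd(it, bcModel.value, lr), 1L)))
        .reduce { case ((w1, c1), (w2, c2)) =>
          (w1.zip(w2).map { case (a, b) => a + b }, c1 + c2)
        }
      model = sum.map(_ / cnt)
    }

    println(s"Averaged model: ${model.mkString(", ")}")
    spark.stop()
  }
}

The key design point this sketch tries to convey is that MA communicates models once per round rather than gradients once per mini-batch, which is where the reduction in communication cost comes from; the exact scheduling and aggregation strategy in the paper may differ.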
Year
2021
DOI
10.1007/s00778-021-00664-7
Venue
The VLDB Journal
Keywords
Distributed machine learning, Apache Spark MLlib, Generalized linear models, Latent Dirichlet allocation
DocType
Journal
Volume
30
Issue
4
ISSN
1066-8888
Citations
0
PageRank
0.34
References
8
Authors
7
Name            Order   Citations   PageRank
Y Guo           1       0           0.34
Zhipeng Zhang   2       11          2.20
Jiawei Jiang    3       89          14.60
Wentao Wu       4       394         30.53
Ce Zhang        5       803         83.39
Bin Cui         6       1843        124.59
Jianzhong Li    7       63          24.23