Title
Model averaging in distributed machine learning: a case study with Apache Spark
Abstract
The increasing popularity of Apache Spark has attracted many users to put their data into its ecosystem. On the other hand, it has been repeatedly observed in the literature that Spark is slow when it comes to distributed machine learning (ML). One remedy is to switch to specialized systems such as parameter servers, which are claimed to offer better performance, but users then have to undergo the painful procedure of moving data into and out of Spark. In this paper, we investigate the performance bottlenecks of MLlib (an official Spark package for ML) in detail, focusing on its implementation of stochastic gradient descent (SGD), the workhorse behind the training of many ML models. We show that the performance inferiority of Spark is caused by implementation issues rather than fundamental flaws of the bulk synchronous parallel (BSP) model that governs Spark's execution: we can significantly improve Spark's performance by leveraging the well-known "model averaging" (MA) technique in distributed ML. Indeed, model averaging is not limited to SGD, and we further showcase an application of MA to training latent Dirichlet allocation (LDA) models within Spark. Our implementation is not intrusive and requires light development effort. Experimental evaluation results reveal that the MA-based versions of SGD and LDA can be orders of magnitude faster than their counterparts without MA.
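To make the model-averaging idea concrete, below is a minimal, hypothetical Scala/Spark sketch (not the authors' MLlib code) of MA-based SGD for a toy least-squares problem: each partition runs local SGD over its own data, and the driver averages the per-partition models once per round, so every BSP round involves only a single communication step. The object name, toy data, and learning rate are all illustrative assumptions.

// Hypothetical sketch of model averaging (MA) for distributed SGD on Spark.
// Each partition runs local SGD; the driver averages the local models once
// per BSP round (one communication step per round). Not the paper's code.
import org.apache.spark.sql.SparkSession

object ModelAveragingSgdSketch extends Serializable {

  // One local SGD pass over a partition, for a toy least-squares objective.
  def localSgd(points: Iterator[(Array[Double], Double)],
               init: Array[Double],
               lr: Double): Array[Double] = {
    val w = init.clone()
    points.foreach { case (x, y) =>
      val pred = w.zip(x).map { case (wi, xi) => wi * xi }.sum
      val err = pred - y
      for (i <- w.indices) w(i) -= lr * err * x(i)
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MA-SGD-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy data: y = 2*x1 + 3*x2 + 1 (last feature is a bias term).
    val rnd = new scala.util.Random(42)
    val data = sc.parallelize(Seq.fill(1000) {
      val x = Array(rnd.nextDouble(), rnd.nextDouble(), 1.0)
      (x, 2.0 * x(0) + 3.0 * x(1) + 1.0)
    }, 4).cache()

    var model = Array.fill(3)(0.0)
    val lr = 0.1

    for (_ <- 1 to 20) { // communication rounds
      val bcModel = sc.broadcast(model)
      // Local SGD per partition, then average the per-partition models.
      val (sum, cnt) = data
        .mapPartitions(it => Iterator((localSgd(it, bcModel.value, lr), 1L)))
        .reduce { case ((w1, c1), (w2, c2)) =>
          (w1.zip(w2).map { case (a, b) => a + b }, c1 + c2)
        }
      model = sum.map(_ / cnt)
    }

    println(s"Averaged model: ${model.mkString(", ")}")
    spark.stop()
  }
}

The key design point this sketch tries to convey is that MA communicates models once per round rather than gradients once per mini-batch, which is where the reduction in communication cost comes from; the exact scheduling and aggregation strategy in the paper may differ.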
Year
2021
DOI
10.1007/s00778-021-00664-7
Venue
The VLDB Journal
Keywords
Distributed machine learning, Apache Spark MLlib, Generalized linear models, Latent Dirichlet allocation
DocType
Journal
Volume
30
Issue
4
ISSN
1066-8888
Citations
0
PageRank
0.34
References
8
Authors
7
Name            Order   Citations   PageRank
Y Guo           1       0           0.34
Zhipeng Zhang   2       11          2.20
Jiawei Jiang    3       89          14.60
Wentao Wu       4       394         30.53
Ce Zhang        5       803         83.39
Bin Cui         6       1843        124.59
Jianzhong Li    7       63          24.23