Abstract
---
We study distributed machine learning in heterogeneous environments. We first conduct a systematic study of existing systems running distributed stochastic gradient descent and find that, although these systems work well in homogeneous environments, they can suffer performance degradation, sometimes up to 10x, in heterogeneous environments where stragglers are common, because their synchronization protocols are not designed for heterogeneous settings. Our first contribution is a heterogeneity-aware algorithm that applies a constant learning rate schedule to updates before adding them to the global parameter, which limits the harm stragglers inflict on convergence. As a further improvement, our second contribution is a more sophisticated learning rate schedule that accounts for the staleness of each update. We prove convergence for both approaches and implement a prototype system in the production cluster of our industrial partner Tencent Inc. We validate the performance of this prototype using a range of machine-learning workloads. Our prototype is 2-12x faster than other state-of-the-art systems, such as Spark, Petuum, and TensorFlow, and our proposed algorithm takes up to 6x fewer iterations to converge.
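The abstract's second contribution, a learning rate schedule that accounts for the staleness of each update, can be illustrated with a minimal sketch. The function names and the `1/(1 + staleness)` decay below are illustrative assumptions, not the paper's exact schedule: the idea is only that a gradient computed from old parameters gets a smaller step than a fresh one.

```python
import numpy as np

def apply_update(params, grad, base_lr, global_step, read_step):
    """Staleness-aware SGD update (illustrative sketch, not the paper's
    exact rule): scale the learning rate down by the number of global
    steps that elapsed since the worker read the parameters."""
    staleness = global_step - read_step       # delay of this update
    lr = base_lr / (1.0 + staleness)          # damp stale gradients
    return params - lr * grad

params = np.zeros(3)
grad = np.array([1.0, 1.0, 1.0])
# A fresh update (no delay) takes the full step of base_lr = 0.1.
fresh = apply_update(params, grad, base_lr=0.1, global_step=5, read_step=5)
# A straggler's update, 4 steps stale, is damped to 0.1 / 5 = 0.02.
stale = apply_update(params, grad, base_lr=0.1, global_step=5, read_step=1)
print(fresh)  # → [-0.1 -0.1 -0.1]
print(stale)  # → [-0.02 -0.02 -0.02]
```

Under such a schedule, a very delayed gradient from a straggler cannot drag the global parameter far in a stale direction, which is the intuition behind the convergence guarantee claimed in the abstract.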
Year | DOI | Venue
---|---|---
2017 | 10.1145/3035918.3035933 | SIGMOD Conference

Field | DocType | Citations
---|---|---
Convergence (routing), Data mining, Stochastic gradient descent, Synchronization, Spark (mathematics), Computer science, Homogeneous, Server, Database, Distributed computing | Conference | 37

PageRank | References | Authors
---|---|---
1.12 | 39 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Jiawei Jiang | 1 | 89 | 14.60 |
Bin Cui | 2 | 1843 | 124.59 |
Ce Zhang | 3 | 803 | 83.39 |
Lele Yu | 4 | 70 | 6.93 |