Abstract |
---|
Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, where each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is inversely proportional to the quantity $1-\beta$, which measures the network connectivity. On large and sparse networks where $1-\beta \to 0$, Gossip SGD requires more iterations to converge, which offsets its communication benefit. This paper introduces Gossip-PGA, which adds Periodic Global Averaging into Gossip SGD. Its transient stage, i.e., the number of iterations required to reach the asymptotic linear-speedup stage, improves from $\Omega(\beta^4 n^3/(1-\beta)^4)$ to $\Omega(\beta^4 n^3 H^4)$ for non-convex problems. The influence of the network topology in Gossip-PGA can be controlled by the averaging period $H$. Its transient-stage complexity is also superior to that of Local SGD, which is of order $\Omega(n^3 H^4)$. Empirical results of large-scale training on image classification (ResNet50) and language modeling (BERT) validate our theoretical findings. |
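The abstract describes Gossip-PGA at a high level: each worker runs gossip SGD, averaging its model only with its neighbors at every step, and every $H$ steps all workers perform one global average. The snippet below is a minimal NumPy sketch of that idea, not the authors' implementation; the ring topology, the synthetic quadratic objective, and all hyper-parameters are illustrative assumptions chosen only for demonstration.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly-stochastic mixing matrix W for a ring: each node averages with its two neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = 1 / 3
        W[i, (i + 1) % n] = 1 / 3
    return W

def gossip_pga(n=8, d=10, H=4, steps=200, lr=0.05, seed=0):
    """Sketch of Gossip-PGA: gossip SGD plus a global average every H iterations."""
    rng = np.random.default_rng(seed)
    # Each worker i holds a local quadratic loss 0.5 * ||x - a_i||^2 (heterogeneous data).
    targets = rng.normal(size=(n, d))
    x = np.zeros((n, d))                        # one model copy per worker
    W = ring_mixing_matrix(n)
    for t in range(1, steps + 1):
        grads = x - targets                     # local gradients (noise-free in this toy example)
        x = x - lr * grads                      # local SGD step on every worker
        if t % H == 0:
            x = np.tile(x.mean(axis=0), (n, 1)) # periodic global averaging (all-reduce)
        else:
            x = W @ x                           # gossip step: average only with ring neighbors
    return x

models = gossip_pga()
print("max deviation across workers:", np.abs(models - models.mean(axis=0)).max())
```

The periodic global average bounds how far the workers can drift apart, which is why, as the abstract states, the topology's influence is controlled by the period $H$ rather than by $1-\beta$ alone.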
Year | Venue | DocType
---|---|---|
2021 | INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139 | Conference

Volume | ISSN | Citations
---|---|---|
139 | 2640-3498 | 0

PageRank | References | Authors
---|---|---|
0.34 | 0 | 6
Name | Order | Citations | PageRank |
---|---|---|---|
Yiming Chen | 1 | 5 | 1.48 |
Kun Yuan | 2 | 0 | 0.68 |
Yingya Zhang | 3 | 44 | 1.97 |
Pan Pan | 4 | 0 | 0.34 |
Yinghui Xu | 5 | 172 | 20.23 |
Wotao Yin | 6 | 5038 | 243.92 |