Abstract
---
Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we introduce SKETCHED-SGD, an algorithm for carrying out distributed SGD by communicating sketches instead of full gradients. We show that SKETCHED-SGD has favorable convergence rates on several classes of functions. When considering all communication - both of gradients and of updated model weights - SKETCHED-SGD reduces the amount of communication required compared to other gradient compression methods from O(d) or O(W) to O(log d), where d is the number of model parameters and W is the number of workers participating in training. We run experiments on a transformer model, an LSTM, and a residual network, demonstrating up to a 40x reduction in total communication cost with no loss in final model performance. We also show experimentally that SKETCHED-SGD scales to at least 256 workers without increasing communication cost or degrading model performance.
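The core mechanism the abstract describes - each worker compresses its gradient into a Count Sketch whose size scales with O(log d) rather than d, and sketches from all workers can be summed because sketching is linear - can be illustrated with a minimal NumPy toy. This is a hedged sketch of the general Count Sketch technique, not the paper's implementation: the class name `CountSketch`, the `rows`/`cols` parameters, and the demo values below are all assumptions made for the example.

```python
# Minimal, illustrative Count Sketch in NumPy -- NOT the SKETCHED-SGD code.
# Each worker sketches its d-dimensional gradient into a small rows x cols
# table; because the sketch is linear, a server can sum workers' tables and
# query approximate heavy coordinates from the aggregate.
import numpy as np

class CountSketch:  # hypothetical helper for this example
    def __init__(self, d, rows=5, cols=256, seed=0):
        rng = np.random.default_rng(seed)
        # Shared hash functions: bucket index h_j(i) and sign s_j(i) per row j.
        self.bucket = rng.integers(0, cols, size=(rows, d))
        self.sign = rng.choice([-1.0, 1.0], size=(rows, d))
        self.table = np.zeros((rows, cols))

    def accumulate(self, g):
        # Add a gradient vector into the sketch (a linear operation, so
        # sketches from different workers can simply be added together).
        for j in range(self.table.shape[0]):
            np.add.at(self.table[j], self.bucket[j], self.sign[j] * g)

    def estimate(self, i):
        # Median over rows of the signed bucket values approximates g[i].
        r = np.arange(self.table.shape[0])
        return np.median(self.sign[r, i] * self.table[r, self.bucket[r, i]])

# Toy demo: a mostly sparse "gradient" with a few heavy coordinates.
d = 10_000
g = np.zeros(d)
g[[7, 42, 999]] = [5.0, -3.0, 8.0]

sketch = CountSketch(d)
sketch.accumulate(g)
print(sketch.estimate(42))   # close to -3.0 with high probability
```

Because same-shaped sketches add coordinate-wise, each of the W workers can transmit its small table instead of a d-dimensional gradient; choosing the table dimensions appropriately is what yields the O(log d) communication claimed in the abstract.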
Year | Venue | DocType
---|---|---
2019 | ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | Conference

Volume | ISSN | Citations
---|---|---
32 | 1049-5258 | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 6
Name | Order | Citations | PageRank |
---|---|---|---|
Nikita Ivkin | 1 | 26 | 3.90 |
Daniel Rothchild | 2 | 0 | 0.68 |
Enayat Ullah | 3 | 0 | 2.37 |
Vladimir Braverman | 4 | 357 | 34.36 |
Ion Stoica | 5 | 21406 | 1710.11
Raman Arora | 6 | 18 | 1.17