Abstract
---
Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, it can hurt the convergence of the algorithm and its generalization performance. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that, for a fixed number of parameters, wider networks are more amenable to fast large-batch training than deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slowdown, unlike their deeper variants.
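As a rough illustration of the trade-off the abstract describes, the minimal sketch below (a toy example under assumed settings, not the paper's experimental setup) shows how the batch size controls the number of gradient updates per epoch in mini-batch SGD; in a data-parallel deployment each update corresponds to one synchronization round, which is the communication cost that larger batches amortize.

```python
# Toy illustration (not from the paper): mini-batch SGD on a linear model.
# In a distributed data-parallel setting, every gradient update implies one
# synchronization round, so updates per epoch ~ communication cost per epoch.
import numpy as np

def sgd_epoch(X, y, w, batch_size, lr=0.1):
    """Run one epoch of mini-batch SGD; return updated weights and #updates."""
    n = X.shape[0]
    idx = np.random.permutation(n)
    updates = 0
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        # Least-squares gradient on the current mini-batch.
        grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
        w = w - lr * grad
        updates += 1
    return w, updates

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
w_true = rng.normal(size=10)
y = X @ w_true

for batch_size in (32, 1024):
    w, updates = sgd_epoch(X, y, np.zeros(10), batch_size)
    print(f"batch_size={batch_size:5d} -> {updates:3d} updates (sync rounds) per epoch")
```

With batch size 32 an epoch over 1024 examples takes 32 synchronized updates, while a single batch of 1024 takes one; the paper's question is when this reduction in communication can be exploited without the convergence slowdown the abstract mentions.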
Year | Venue | Keywords
---|---|---|
2018 | NeurIPS | neural networks, stochastic gradient descent, neural network, high frequency, first step

DocType | Volume | Citations
---|---|---|
Conference | abs/1806.03791 | 1

PageRank | References | Authors
---|---|---|
0.36 | 13 | 5
Name | Order | Citations | PageRank |
---|---|---|---|
Lingjiao Chen | 1 | 22 | 3.34 |
Hongyi Wang | 2 | 19 | 3.83 |
Jinman Zhao | 3 | 2 | 0.70
Dimitris S. Papailiopoulos | 4 | 797 | 40.11 |
Paraschos Koutris | 5 | 347 | 26.63 |