Abstract
---
Adaptive stochastic gradient methods such as ADAGRAD have gained popularity, in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal matrix approximation to second-order information by accumulating past squared gradients, which are used to tune the step size adaptively per coordinate. In certain situations the full-matrix variant of ADAGRAD is expected to attain better performance; however, in high dimensions it is computationally impractical. We present ADA-LR and RADAGRAD, two computationally efficient approximations to full-matrix ADAGRAD based on randomized dimensionality reduction. They are able to capture dependencies between features and achieve performance similar to full-matrix ADAGRAD, but at a much smaller computational cost. We show that the regret of ADA-LR is close to the regret of full-matrix ADAGRAD, which can have an up to exponentially smaller dependence on the dimension than the diagonal variant. Empirically, we show that ADA-LR and RADAGRAD perform similarly to full-matrix ADAGRAD. On the task of training convolutional as well as recurrent neural networks, RADAGRAD achieves faster convergence than diagonal ADAGRAD.
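To make the contrast in the abstract concrete, below is a minimal NumPy sketch of the three preconditioners it discusses: the per-coordinate diagonal ADAGRAD update, the full-matrix update whose cost grows with the dimension d, and an illustrative sketched variant that accumulates curvature in a random k-dimensional subspace. The function names, the plain Gaussian projection `Pi`, and the choice k = 20 are assumptions for illustration only; the paper's actual ADA-LR and RADAGRAD constructions are likewise based on randomized dimensionality reduction but differ in detail.

```python
import numpy as np

def adagrad_diagonal_step(w, grad, acc, lr=0.01, eps=1e-8):
    """Diagonal ADAGRAD: accumulate squared gradients per coordinate and
    scale each coordinate's step by the inverse root of that sum. O(d) cost."""
    acc += grad ** 2
    w -= lr * grad / (np.sqrt(acc) + eps)
    return w, acc

def inv_sqrt(M, eps=1e-8):
    """Inverse matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / (np.sqrt(np.maximum(vals, 0.0)) + eps)) @ vecs.T

def adagrad_full_step(w, grad, G, lr=0.01):
    """Full-matrix ADAGRAD: accumulate the outer-product matrix of past
    gradients and precondition with its inverse root. O(d^2) memory and an
    O(d^3) matrix root per step, hence impractical in high dimensions."""
    G += np.outer(grad, grad)
    w -= lr * inv_sqrt(G) @ grad
    return w, G

def adagrad_sketched_step(w, grad, S, Pi, lr=0.01):
    """Illustrative randomized variant (NOT the paper's exact algorithm):
    sketch gradients down to k << d dimensions with a fixed random matrix
    Pi (k x d), accumulate the small k x k curvature matrix, precondition
    there, and map the step back. This captures dependencies between
    features at O(k^2) storage instead of O(d^2)."""
    g_k = Pi @ grad             # k-dimensional sketch of the gradient
    S += np.outer(g_k, g_k)     # k x k accumulator, cheap to store and root
    w -= lr * Pi.T @ (inv_sqrt(S) @ g_k)
    return w, S

# Toy usage with a plain Gaussian sketch (illustrative choice).
rng = np.random.default_rng(0)
d, k = 1000, 20
Pi = rng.standard_normal((k, d)) / np.sqrt(k)
w, S = np.zeros(d), np.zeros((k, k))
for _ in range(5):
    grad = rng.standard_normal(d)
    w, S = adagrad_sketched_step(w, grad, S, Pi)
```

The sketched step illustrates the storage and per-step savings of working in a random subspace: only a k x k matrix is accumulated and rooted, while the preconditioned step is mapped back to the full d-dimensional space through the projection.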
Year | Venue | DocType
---|---|---
2016 | Advances in Neural Information Processing Systems 29 (NIPS 2016) | Conference

Volume | ISSN | Citations
---|---|---
29 | 1049-5258 | 1

PageRank | References | Authors
---|---|---
0.35 | 0 | 5
Name | Order | Citations | PageRank
---|---|---|---
Gabriel Krummenacher | 1 | 9 | 0.97
Brian McWilliams | 2 | 105 | 5.90
Yannic Kilcher | 3 | 8 | 4.28
Joachim M. Buhmann | 4 | 4363 | 730.34
Nicolai Meinshausen | 5 | 8 | 2.55