Abstract |
---|
Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we explore the low-overhead and easy-to-implement optimization technique of adding annealed Gaussian noise to the gradient, which we find surprisingly effective when training these very deep architectures. Unlike classical weight noise, gradient noise injection is complementary to advanced stochastic optimization algorithms such as Adam and AdaGrad. The technique not only helps to avoid overfitting, but also can result in lower training loss. We see consistent improvements in performance across an array of complex models, including state-of-the-art deep networks for question answering and algorithm learning. We observe that this optimization strategy allows a fully-connected 20-layer deep network to escape a bad initialization with standard stochastic gradient descent. We encourage further application of this technique to additional modern neural architectures. |
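
The technique named in the abstract, adding annealed Gaussian noise to the gradient, can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. The abstract does not state the annealing schedule, so the decay used here (variance `eta / (1 + step) ** gamma`), the default values of `eta`, `gamma`, and `lr`, and the helper names `noise_std`, `noisy_sgd_step`, and `minimize_quadratic` are all assumptions made for the example.

```python
import random


def noise_std(step, eta=0.3, gamma=0.55):
    """Standard deviation of the gradient noise at a given step.

    Assumed schedule (not given in the abstract):
    variance = eta / (1 + step) ** gamma, so the noise shrinks over time.
    """
    return (eta / (1.0 + step) ** gamma) ** 0.5


def noisy_sgd_step(params, grads, step, lr=0.1, eta=0.3, gamma=0.55, rng=random):
    """Apply one SGD update after adding zero-mean Gaussian noise to each gradient."""
    sigma = noise_std(step, eta, gamma)
    return [p - lr * (g + rng.gauss(0.0, sigma)) for p, g in zip(params, grads)]


def minimize_quadratic(steps=2000, seed=0):
    """Toy run (hypothetical): minimize f(x) = x**2 from x = 5.0 with noisy gradients."""
    rng = random.Random(seed)  # seeded generator so the run is reproducible
    x = [5.0]
    for t in range(steps):
        grads = [2.0 * x[0]]  # exact gradient of x**2
        x = noisy_sgd_step(x, grads, t, rng=rng)
    return x[0]
```

Calling `minimize_quadratic()` should return a value close to zero, because the injected noise is large early on and decays as the step count grows. The noise is applied to the gradient before the update rule, so the same idea could in principle be placed in front of an adaptive optimizer such as Adam or AdaGrad, which is the complementarity the abstract describes.
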

Year | Venue | Field |
---|---|---|
2015 | arXiv: Machine Learning | Stochastic optimization, Stochastic gradient descent, Computer science, Turing machine, Artificial intelligence, Overfitting, Initialization, Deep learning, Gaussian noise, Machine learning, Gradient noise |

DocType | Volume | Citations |
---|---|---|
Journal | abs/1511.06807 | 56 |

PageRank | References | Authors |
---|---|---|
2.61 | 19 | 7 |

Name | Order | Citations | PageRank |
---|---|---|---|
Arvind Neelakantan | 1 | 408 | 17.77 |
Luke Vilnis | 2 | 328 | 17.06 |
Quoc V. Le | 3 | 8501 | 366.59 |
Ilya Sutskever | 4 | 25814 | 1120.24 |
Łukasz Kaiser | 5 | 2307 | 89.08 |
Karol Kurach | 6 | 234 | 13.37 |
James Martens | 7 | 1239 | 142.60 |