Title
Adding Gradient Noise Improves Learning for Very Deep Networks
Abstract
Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we explore the low-overhead and easy-to-implement optimization technique of adding annealed Gaussian noise to the gradient, which we find surprisingly effective when training these very deep architectures. Unlike classical weight noise, gradient noise injection is complementary to advanced stochastic optimization algorithms such as Adam and AdaGrad. The technique not only helps to avoid overfitting, but also can result in lower training loss. We see consistent improvements in performance across an array of complex models, including state-of-the-art deep networks for question answering and algorithm learning. We observe that this optimization strategy allows a fully-connected 20-layer deep network to escape a bad initialization with standard stochastic gradient descent. We encourage further application of this technique to additional modern neural architectures.
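A minimal sketch of the technique the abstract describes, assuming the annealed variance schedule sigma_t^2 = eta / (1 + t)^gamma reported in the paper; the function name and the specific eta/gamma values below are illustrative, not the authors' reference implementation.

import numpy as np

def sgd_step_with_gradient_noise(params, grads, t, lr=0.01, eta=0.3, gamma=0.55, rng=None):
    """One SGD update in which zero-mean Gaussian noise with an annealed
    variance is added to every gradient before the parameter update."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(eta / (1.0 + t) ** gamma)  # noise std decays with training step t
    updated = []
    for p, g in zip(params, grads):
        noisy_g = g + rng.normal(0.0, sigma, size=g.shape)  # inject gradient noise
        updated.append(p - lr * noisy_g)                     # plain SGD step
    return updated

The same noisy gradient could instead be handed to an optimizer such as Adam or AdaGrad, which is the complementary use the abstract emphasizes.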
Year
2015
Venue
arXiv: Machine Learning
Field
Stochastic optimization, Stochastic gradient descent, Computer science, Turing machine, Artificial intelligence, Overfitting, Initialization, Deep learning, Gaussian noise, Machine learning, Gradient noise
DocType
Journal
Volume
abs/1511.06807
Citations
56
PageRank
2.61
References
19
Authors
7
Name                 Order   Citations   PageRank
Arvind Neelakantan   1       408         17.77
Luke Vilnis          2       328         17.06
Quoc V. Le           3       8501        366.59
Ilya Sutskever       4       25814       1120.24
Łukasz Kaiser        5       2307        89.08
Karol Kurach         6       234         13.37
James Martens        7       1239        142.60