Adding Gradient Noise Improves Learning for Very Deep Networks. - Citegraph

Paper Info

Title
Adding Gradient Noise Improves Learning for Very Deep Networks.

Abstract
Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we explore the low-overhead and easy-to-implement optimization technique of adding annealed Gaussian noise to the gradient, which we find surprisingly effective when training these very deep architectures. Unlike classical weight noise, gradient noise injection is complementary to advanced stochastic optimization algorithms such as Adam and AdaGrad. The technique not only helps to avoid overfitting, but also can result in lower training loss. We see consistent improvements in performance across an array of complex models, including state-of-the-art deep networks for question answering and algorithm learning. We observe that this optimization strategy allows a fully-connected 20-layer deep network to escape a bad initialization with standard stochastic gradient descent. We encourage further application of this technique to additional modern neural architectures.
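The technique in the abstract, adding annealed Gaussian noise to the gradient before each update, can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract gives no noise schedule, so the decay form `eta / (1 + t) ** gamma`, the default values `eta=0.3` and `gamma=0.55`, the plain SGD update, and the toy quadratic objective are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def noise_std(step, eta=0.3, gamma=0.55):
    """Standard deviation of the gradient noise at a given step.

    Assumed annealing schedule: variance = eta / (1 + step) ** gamma,
    so the noise shrinks as training proceeds.
    """
    return np.sqrt(eta / (1.0 + step) ** gamma)


def noisy_sgd_step(params, grad, step, lr=0.1, rng=None):
    """One SGD update with zero-mean Gaussian noise added to the gradient."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, noise_std(step), size=grad.shape)
    return params - lr * (grad + noise)


def minimize_quadratic(steps=2000, seed=0):
    """Toy run (hypothetical example): minimize f(w) = ||w||^2 with noisy SGD."""
    rng = np.random.default_rng(seed)
    w = np.array([5.0, -3.0])
    for t in range(steps):
        grad = 2.0 * w  # exact gradient of ||w||^2
        w = noisy_sgd_step(w, grad, t, lr=0.05, rng=rng)
    return w
```

Because the noise is added to the gradient rather than to the weights, the same perturbed gradient could instead be passed to an adaptive optimizer such as Adam or AdaGrad, which is the sense in which the abstract calls the technique complementary to those methods.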

Year	2015
Venue	arXiv: Machine Learning
Field	Stochastic optimization, Stochastic gradient descent, Computer science, Turing machine, Artificial intelligence, Overfitting, Initialization, Deep learning, Gaussian noise, Machine learning, Gradient noise
DocType	Journal
Volume	abs/1511.06807
Citations	56
PageRank	2.61
References	19
Authors	7

Authors (7 rows)

Name	Order	Citations	PageRank
Arvind Neelakantan	1	408	17.77
Luke Vilnis	2	328	17.06
Quoc V. Le	3	8501	366.59
Ilya Sutskever	4	25814	1120.24
Łukasz Kaiser	5	2307	89.08
Karol Kurach	6	234	13.37
James Martens	7	1239	142.60
