Title |
---|
A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks |
Abstract |
---|
Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. While a line of recent work shows in theory that, with over-parameterization and proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs, it does not explain the good generalization performance of gradient-based methods for learning over-parameterized DNNs. In this work, we take a step further and prove that, under a certain assumption on the data distribution that is milder than linear separability, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error (i.e., population error). This leads to an algorithm-dependent generalization error bound for deep learning. To the best of our knowledge, this is the first result of its kind that can explain the good generalization performance of over-parameterized deep neural networks learned by gradient descent. |
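The setting the abstract describes can be illustrated with a minimal sketch: full-batch gradient descent on an over-parameterized deep ReLU network with Gaussian random initialization. This is not the paper's exact construction or analysis; the width `m`, depth, step size, He-style initialization scale, and synthetic binary-classification data below are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# n training samples in d dimensions; hidden width m >> n (over-parameterization).
n, d, m, depth = 200, 10, 1024, 3
X = torch.randn(n, d)
y = (X[:, 0] > 0).float()  # a simple labeling for illustration only

# Deep ReLU network: `depth` hidden layers of width m, then a scalar output.
layers, width_in = [], d
for _ in range(depth):
    layers += [nn.Linear(width_in, m), nn.ReLU()]
    width_in = m
layers += [nn.Linear(width_in, 1)]
net = nn.Sequential(*layers)

# "Proper random initialization": i.i.d. Gaussian weights with He-style scaling
# sqrt(2 / fan_in), an assumed stand-in for the paper's initialization scheme.
for module in net.modules():
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, std=(2.0 / module.in_features) ** 0.5)
        nn.init.zeros_(module.bias)

loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(net.parameters(), lr=0.1)  # plain full-batch gradient descent

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(net(X).squeeze(1), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(f"step {step:4d}  training loss {loss.item():.4f}")
```

Under these assumed choices, the training loss on the random-initialized over-parameterized network decreases toward zero; the paper's contribution is a bound on the *population* error of such GD-trained networks, which the sketch does not measure.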
Year | Venue | DocType |
---|---|---|
2019 | arXiv: Learning | Journal |
Volume | Citations | PageRank
---|---|---
abs/1902.01384 | 3 | 0.38
References | Authors
---|---
0 | 2
Name | Order | Citations | PageRank |
---|---|---|---|
Yuan Cao | 1 | 548 | 35.60 |
Quanquan Gu | 2 | 1116 | 78.25 |