Title |
---|
A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks |
Abstract |
---|
Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. While a line of recent work shows in theory that, with over-parameterization and proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs, it does not explain the good generalization performance of gradient-based methods for learning over-parameterized DNNs. In this work, we take a step further and prove that, under a certain assumption on the data distribution that is milder than linear separability, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error (i.e., population error). This leads to an algorithm-dependent generalization error bound for deep learning. To the best of our knowledge, this is the first result of its kind that can explain the good generalization performance of over-parameterized deep neural networks learned by gradient descent. |
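The setting the abstract describes can be illustrated with a minimal sketch: full-batch gradient descent on an over-parameterized deep ReLU network with Gaussian random initialization. This is not the paper's exact construction or analysis; the width `m`, depth, step size, He-style initialization scale, and synthetic binary-classification data below are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# n training samples in d dimensions; hidden width m >> n (over-parameterization).
n, d, m, depth = 200, 10, 1024, 3
X = torch.randn(n, d)
y = (X[:, 0] > 0).float()  # a simple labeling for illustration only

# Deep ReLU network: `depth` hidden layers of width m, then a scalar output.
layers, width_in = [], d
for _ in range(depth):
    layers += [nn.Linear(width_in, m), nn.ReLU()]
    width_in = m
layers += [nn.Linear(width_in, 1)]
net = nn.Sequential(*layers)

# "Proper random initialization": i.i.d. Gaussian weights with He-style scaling
# sqrt(2 / fan_in), an assumed stand-in for the paper's initialization scheme.
for module in net.modules():
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, std=(2.0 / module.in_features) ** 0.5)
        nn.init.zeros_(module.bias)

loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(net.parameters(), lr=0.1)  # plain full-batch gradient descent

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(net(X).squeeze(1), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(f"step {step:4d}  training loss {loss.item():.4f}")
```

Under these assumed choices, the training loss on the random-initialized over-parameterized network decreases toward zero; the paper's contribution is a bound on the *population* error of such GD-trained networks, which the sketch does not measure.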
Year | Venue | DocType |
---|---|---|
2019 | arXiv: Learning | Journal |
Volume | Citations | PageRank
---|---|---
abs/1902.01384 | 3 | 0.38
References | Authors
---|---
0 | 2
Name | Order | Citations | PageRank |
---|---|---|---|
Yuan Cao | 1 | 548 | 35.60 |
Quanquan Gu | 2 | 1116 | 78.25 |