Abstract |
---|
Adam has been shown to fail to converge to the optimal solution in certain cases. Researchers have recently proposed several algorithms to avoid this non-convergence issue, but their efficiency turns out to be unsatisfactory in practice. In this paper, we provide a new insight into the non-convergence issue of Adam as well as other adaptive learning rate methods. We argue that there exists an inappropriate correlation between the gradient $g_t$ and the second moment term $v_t$ in Adam ($t$ is the timestep), which results in a large gradient being likely to have a small step size while a small gradient may have a large step size. We demonstrate that such unbalanced step sizes are the fundamental cause of the non-convergence of Adam, and we further prove that decorrelating $v_t$ and $g_t$ leads to an unbiased step size for each gradient, thus solving the non-convergence problem of Adam. Finally, we propose AdaShift, a novel adaptive learning rate method that decorrelates $v_t$ and $g_t$ by temporal shifting, i.e., using the temporally shifted gradient $g_{t-n}$ to calculate $v_t$. The experimental results demonstrate that AdaShift is able to address the non-convergence issue of Adam, while still maintaining performance competitive with Adam in terms of both training speed and generalization. |
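The temporal-shifting idea stated in the abstract (computing $v_t$ from the earlier gradient $g_{t-n}$ instead of the current $g_t$) can be illustrated with a minimal sketch. This is only an illustration of the decorrelation idea, not the paper's full AdaShift algorithm; the function name `adashift_sketch`, the hyperparameter values (`lr`, `beta2`, `n`, `eps`, `steps`), and the warm-up handling are assumptions made for the example.

```python
import numpy as np
from collections import deque

def adashift_sketch(grad_fn, theta, lr=0.01, beta2=0.999, n=10,
                    eps=1e-8, steps=1000):
    """Toy update loop illustrating the temporal shift from the abstract:
    v_t is computed from the gradient g_{t-n} of n steps ago, so it is
    (approximately) independent of the current gradient g_t.
    Hyperparameters and warm-up handling are illustrative assumptions."""
    v = np.zeros_like(theta)
    history = deque(maxlen=n)            # keeps the last n gradients
    for t in range(steps):
        g = grad_fn(theta)               # current gradient g_t
        if len(history) == n:
            g_shifted = history[0]       # g_{t-n}, decorrelated from g_t
            v = beta2 * v + (1 - beta2) * g_shifted ** 2
            theta = theta - lr * g / (np.sqrt(v) + eps)
        history.append(g)                # g_t only influences future v
    return theta

# Example usage on f(theta) = theta^2, whose gradient is 2*theta:
theta0 = np.array([5.0])
print(adashift_sketch(lambda th: 2 * th, theta0))
```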
Year | Venue | Field
---|---|---
2018 | International Conference on Learning Representations | Convergence (routing), Applied mathematics, Mathematical optimization, Decorrelation, Existential quantification, Adaptive learning rate, Mathematics, Second moment of area

DocType | Volume | Citations
---|---|---
Journal | abs/1810.00143 | 2

PageRank | References | Authors
---|---|---
0.38 | 10 | 6
Name | Order | Citations | PageRank |
---|---|---|---|
Zhiming Zhou | 1 | 19 | 3.66 |
Qingru Zhang | 2 | 2 | 0.72 |
Guansong Lu | 3 | 15 | 1.95 |
Hongwei Wang | 4 | 11 | 1.84 |
Weinan Zhang | 5 | 1228 | 97.24 |
Yong Yu | 6 | 7637 | 380.66 |