Abstract
---
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm, a sample-efficient policy gradient method based on stochastic mirror descent. VRMPO introduces a novel variance-reduced policy gradient estimator to improve sample efficiency. We prove that the proposed VRMPO needs only O(ε⁻³) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
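The paper's own implementation is not reproduced here, but the two ingredients the abstract names, a variance-reduced gradient estimator and a stochastic mirror descent update, can be illustrated on a toy problem. The sketch below is an assumption, not VRMPO itself: it pairs a SARAH/SPIDER-style recursive estimator (a standard route to O(ε⁻³)-type rates) with an exponentiated-gradient mirror step on a simple bandit objective. The names `r_true`, `lam`, `grad_batch`, and `mirror_step` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical toy problem (not from the paper) ----------------------
# Policy = point theta on the 5-simplex; objective
#   J(theta) = <theta, r_true> - (lam/2) * ||theta||^2,
# whose gradient r_true - lam * theta is only seen through noisy samples,
# a stand-in for trajectory-based policy gradient estimates.
r_true = np.array([0.1, 0.2, 0.5, 0.3, 0.9])
lam = 0.1

def grad_batch(theta, xi):
    """Stochastic gradient of J at theta, on a shared noise batch xi."""
    return (r_true + xi).mean(axis=0) - lam * theta

def mirror_step(theta, v, lr):
    """One stochastic mirror ascent step under the negative-entropy
    mirror map (exponentiated gradient); keeps theta on the simplex."""
    w = theta * np.exp(lr * v)
    return w / w.sum()

theta = np.full(5, 0.2)                            # uniform initial policy
v = grad_batch(theta, rng.normal(size=(1000, 5)))  # large-batch checkpoint
for t in range(1, 201):
    theta_prev, theta = theta, mirror_step(theta, v, lr=0.1)
    if t % 50 == 0:   # periodic large-batch refresh of the estimator
        v = grad_batch(theta, rng.normal(size=(1000, 5)))
    else:             # SARAH/SPIDER-style recursive correction:
        xi = rng.normal(size=(10, 5))   # SAME minibatch at both points
        v = grad_batch(theta, xi) - grad_batch(theta_prev, xi) + v

print("learned policy:", np.round(theta, 3))  # mass shifts to the best arm
```

The checkpoint/correction split is the variance-reduction idea: the recursive term reuses the accumulated estimate, so each step only re-estimates the small-batch difference of gradients at neighboring iterates rather than the full gradient.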
Year | Venue | Keywords
---|---|---|
2022 | AAAI Conference on Artificial Intelligence | Machine Learning (ML)

DocType | Citations | PageRank
---|---|---|
Conference | 0 | 0.34

References | Authors
---|---|
0 | 7
Name | Order | Citations | PageRank |
---|---|---|---|
Long Yang | 1 | 2 | 2.08 |
Yu Zhang | 2 | 0 | 1.01 |
Gang Zheng | 3 | 0 | 1.35 |
Qian Zheng | 4 | 0 | 0.68 |
Pengfei Li | 5 | 3 | 1.71 |
Jianhang Huang | 6 | 1 | 0.73 |
Gang Pan | 7 | 0 | 1.35 |