Title
Policy Optimization with Stochastic Mirror Descent.
Abstract
Stochastic mirror descent (SMD) offers the advantages of simple implementation, low memory requirements, and low computational complexity. However, the non-convexity of the objective function, together with the non-stationary sampling process, is the main bottleneck in applying SMD to reinforcement learning. To address this problem, we propose mirror policy optimization (MPO), which estimates the policy gradient using a dynamic batch size of gradient information. Compared with REINFORCE or VPG, the proposed MPO improves the convergence rate from $\mathcal{O}({1}/{\sqrt{N}})$ to $\mathcal{O}({\ln N}/{N})$. We also propose VRMPO, a variance-reduced implementation of MPO. We prove the convergence of VRMPO and analyze its computational complexity. We evaluate VRMPO on MuJoCo continuous control tasks; the results show that VRMPO outperforms or matches several state-of-the-art algorithms, including DDPG, TRPO, PPO, and TD3.
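As a rough illustration of the kind of update the abstract describes, the following is a minimal sketch of stochastic mirror ascent on a policy: exponentiated-gradient updates on the action simplex of a toy 3-armed bandit, with a growing gradient batch loosely mirroring the "dynamic batch-size" idea. The bandit, step size, batch schedule, and probability floor are all illustrative assumptions; this is not the paper's MPO/VRMPO algorithm.

```python
import numpy as np

# Hedged sketch: mirror ascent with the negative-entropy mirror map
# (exponentiated-gradient update) on a softmax-free tabular policy over
# a toy 3-armed Gaussian bandit. All constants below are assumptions.

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # assumed per-arm reward means
n_actions = len(true_means)

probs = np.ones(n_actions) / n_actions   # policy: a point on the simplex
step_size = 0.1

for t in range(1, 201):
    # Growing batch of sampled trajectories (here, single-step episodes).
    batch_size = min(8 * t, 256)
    actions = rng.choice(n_actions, size=batch_size, p=probs)
    rewards = rng.normal(true_means[actions], 1.0)

    # Score-function (REINFORCE-style) gradient estimate w.r.t. the
    # action probabilities, averaged over the batch.
    grad = np.zeros(n_actions)
    np.add.at(grad, actions, rewards / probs[actions])
    grad /= batch_size

    # Mirror ascent step with the negative-entropy mirror map:
    # multiplicative update followed by renormalization to the simplex.
    probs = probs * np.exp(step_size * grad)
    probs = np.clip(probs, 1e-6, None)    # numerical floor to avoid degeneracy
    probs /= probs.sum()

print("learned action probabilities:", np.round(probs, 3))
```

With the entropy mirror map the update is multiplicative rather than additive, which keeps the iterate on the probability simplex without an explicit projection; swapping in the Euclidean mirror map would recover ordinary stochastic gradient ascent.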
Year: 2019
Venue: CoRR
DocType: Journal
Volume: abs/1906.10462
ISSN: AAAI2022
Citations: 0
PageRank: 0.34
References: 0
Authors: 7
Name | Order | Citations | PageRank
Long Yang | 1 | 3 | 1.04
Yu Zhang | 2 | 0 | 1.01
Gang Zheng | 3 | 5 | 5.23
Qian Zheng | 4 | 44 | 13.91
Peng-Fei Li | 5 | 56 | 20.94
Jianhang Huang | 6 | 0 | 0.68
Gang Pan | 7 | 1501 | 123.57