Title
Policy Optimization with Stochastic Mirror Descent.
Abstract
Stochastic mirror descent (SMD) offers the advantages of simple implementation, low memory requirements, and low computational complexity. However, the non-convexity of the objective function, together with the non-stationary sampling process, is the main bottleneck in applying SMD to reinforcement learning. To address this problem, we propose mirror policy optimization (MPO), which estimates the policy gradient using a dynamic batch size of gradient information. Compared with REINFORCE or VPG, the proposed MPO improves the convergence rate from $\mathcal{O}({1}/{\sqrt{N}})$ to $\mathcal{O}({\ln N}/{N})$. We also propose VRMPO, a variance-reduced implementation of MPO. We prove the convergence of VRMPO and analyze its computational complexity. We evaluate VRMPO on MuJoCo continuous control tasks; the results show that VRMPO outperforms or matches several state-of-the-art algorithms, including DDPG, TRPO, PPO, and TD3.
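As a rough illustration of the kind of update the abstract describes, the following is a minimal sketch of stochastic mirror ascent on a policy: exponentiated-gradient updates on the action simplex of a toy 3-armed bandit, with a growing gradient batch loosely mirroring the "dynamic batch-size" idea. The bandit, step size, batch schedule, and probability floor are all illustrative assumptions; this is not the paper's MPO/VRMPO algorithm.

```python
import numpy as np

# Hedged sketch: mirror ascent with the negative-entropy mirror map
# (exponentiated-gradient update) on a softmax-free tabular policy over
# a toy 3-armed Gaussian bandit. All constants below are assumptions.

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # assumed per-arm reward means
n_actions = len(true_means)

probs = np.ones(n_actions) / n_actions   # policy: a point on the simplex
step_size = 0.1

for t in range(1, 201):
    # Growing batch of sampled trajectories (here, single-step episodes).
    batch_size = min(8 * t, 256)
    actions = rng.choice(n_actions, size=batch_size, p=probs)
    rewards = rng.normal(true_means[actions], 1.0)

    # Score-function (REINFORCE-style) gradient estimate w.r.t. the
    # action probabilities, averaged over the batch.
    grad = np.zeros(n_actions)
    np.add.at(grad, actions, rewards / probs[actions])
    grad /= batch_size

    # Mirror ascent step with the negative-entropy mirror map:
    # multiplicative update followed by renormalization to the simplex.
    probs = probs * np.exp(step_size * grad)
    probs = np.clip(probs, 1e-6, None)    # numerical floor to avoid degeneracy
    probs /= probs.sum()

print("learned action probabilities:", np.round(probs, 3))
```

With the entropy mirror map the update is multiplicative rather than additive, which keeps the iterate on the probability simplex without an explicit projection; swapping in the Euclidean mirror map would recover ordinary stochastic gradient ascent.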
Year: 2019
Venue: CoRR
DocType: Journal
Volume: abs/1906.10462
ISSN: AAAI2022
Citations: 0
PageRank: 0.34
References: 0
Authors: 7
Name | Order | Citations | PageRank
Long Yang | 1 | 3 | 1.04
Yu Zhang | 2 | 0 | 1.01
Gang Zheng | 3 | 5 | 5.23
Qian Zheng | 4 | 44 | 13.91
Peng-Fei Li | 5 | 56 | 20.94
Jianhang Huang | 6 | 0 | 0.68
Gang Pan | 7 | 1501 | 123.57