Abstract
We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement via the estimation of a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways, giving rise to several algorithm variants. Our algorithm draws on connections to the existing literature on black-box optimization and 'RL as an inference', and it can be seen either as an extension of the Maximum a Posteriori Policy Optimisation algorithm (MPO) [Abdolmaleki et al., 2018a], or as an extension of the Trust Region Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) [Abdolmaleki et al., 2017b; Hansen et al., 1997] to a policy iteration scheme. Our comparison on 31 continuous control tasks with diverse properties from the parkour suite [Heess et al., 2017], the DeepMind control suite [Tassa et al., 2018], and OpenAI Gym [Brockman et al., 2016], using a limited amount of compute and a single set of hyperparameters, demonstrates the effectiveness of our method and yields state-of-the-art results. Videos summarizing the results can be found at goo.gl/HtvJKR.
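To make the three-step loop concrete, below is a minimal, self-contained sketch on a toy one-dimensional problem. It is an illustration, not the paper's implementation: the fixed quadratic `q_value` stands in for the parametric critic fitted in step i), and the temperature `eta`, the sample sizes, and all names are assumptions. The exponential reweighting of sampled actions follows the MPO-style improvement step, and the weighted maximum-likelihood refit of the Gaussian corresponds to step iii).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the learned action-value function; in the paper this
# would be a parametric Q estimated by policy evaluation (step i).
def q_value(a):
    return -(a - 2.0) ** 2  # hypothetical critic with its optimum at a = 2

mu, sigma = 0.0, 1.0  # parametric Gaussian policy pi(a) = N(mu, sigma^2)
eta = 0.5             # assumed temperature of the exponential reweighting

for _ in range(50):
    # ii) local non-parametric policy: sample actions from the current
    # parametric policy and weight them by exp(Q / eta); subtracting the
    # max before exponentiating is only for numerical stability.
    actions = rng.normal(mu, sigma, size=64)
    q = q_value(actions)
    w = np.exp((q - q.max()) / eta)
    w /= w.sum()
    # iii) generalization: weighted maximum-likelihood fit of the
    # parametric Gaussian to the reweighted action samples.
    mu = np.sum(w * actions)
    sigma = np.sqrt(np.sum(w * (actions - mu) ** 2) + 1e-6)

print(mu, sigma)  # mu approaches 2.0, the maximizer of the toy critic
```

Under these assumptions the policy mean converges to the critic's maximizer while the standard deviation shrinks, mirroring the CMA-ES-like stochastic-search behavior the abstract alludes to; the full algorithm additionally constrains each update with a trust region.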
Year | Venue | DocType |
---|---|---|
2018 | arXiv: Learning | Journal |
Volume | Citations | PageRank
---|---|---
abs/1812.02256 | 2 | 0.36

References | Authors
---|---
19 | 8
Name | Order | Citations | PageRank |
---|---|---|---|
Abbas Abdolmaleki | 1 | 46 | 12.82 |
Jost Tobias Springenberg | 2 | 1126 | 62.86 |
Jonas Degrave | 3 | 26 | 2.39 |
Steven Bohez | 4 | 48 | 8.99 |
Yuval Tassa | 5 | 1097 | 52.33 |
Dan Belov | 6 | 21 | 1.73 |
Nicolas Heess | 7 | 1762 | 94.77 |
Martin Riedmiller | 8 | 5655 | 366.29 |