Abstract |
---|
Value-based reinforcement-learning algorithms are currently state-of-the-art in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are currently limited by their need for an on-policy critic, which severely constrains how the critic is learned. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free actor-critic reinforcement-learning algorithm for continuous states and discrete actions, with off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor slowly imitates the average greedy policy of the critics, which leads to high-quality, state-specific exploration that we show approximates Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable and, contrary to other state-of-the-art algorithms, unusually forgiving of poorly-configured hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, A3C and ACKTR on a variety of tasks. Source code: https://github.com/vub-ai-lab/bdpi. |
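The abstract describes the actor update only at a high level: the actor slowly imitates the average greedy policy of the off-policy critics. The sketch below illustrates what such an update could look like for a single state; the function name, the per-state probability vector, and the `learning_rate` value are illustrative assumptions, not the paper's exact formulation (see the linked source code for the actual implementation).

```python
import numpy as np

def actor_imitation_step(actor_probs, critic_qvalues, learning_rate=0.05):
    """Move the actor's action distribution for one state a small step
    towards the average greedy policy of the critics (a sketch of the
    'slow imitation' idea from the abstract, under assumed names)."""
    n_actions = actor_probs.shape[0]

    # Average greedy policy: each critic casts one vote for its argmax action.
    greedy = np.zeros(n_actions)
    for q in critic_qvalues:                 # one Q-value vector per critic
        greedy[np.argmax(q)] += 1.0
    greedy /= len(critic_qvalues)

    # Slow imitation: convex combination of current policy and greedy target,
    # so the actor only gradually tracks the critics.
    return (1.0 - learning_rate) * actor_probs + learning_rate * greedy

# Example: 2 critics over 3 actions, starting from a uniform actor.
actor = np.full(3, 1.0 / 3.0)
critics = [np.array([0.1, 0.9, 0.2]), np.array([0.3, 0.8, 0.1])]
actor = actor_imitation_step(actor, critics)
```

Because each critic votes with its own greedy action, sampling from the resulting actor distribution resembles sampling a critic and acting greedily with respect to it, which is the sense in which the abstract relates the update to Thompson sampling.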
Year | Venue | Keywords
---|---|---
2019 | BNAIC/BENELEARN | Source code, Bootstrapping, Thompson sampling, Artificial intelligence, Machine learning, Mathematics, Reinforcement learning
DocType | Volume | Citations
---|---|---
Journal | abs/1903.04193 | 0
PageRank | References | Authors
---|---|---
0.34 | 29 | 4
Name | Order | Citations | PageRank
---|---|---|---
Denis Steckelmacher | 1 | 0 | 2.03 |
Hélène Plisnier | 2 | 0 | 0.68 |
Diederik M. Roijers | 3 | 198 | 24.72 |
Ann Nowé | 4 | 0 | 0.68 |