Abstract |
---|
We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a $\gamma$-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result regarding previously proposed methods and showing how to train these models stably in deep RL settings. |
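
To make the two building blocks named in the abstract concrete, the following is a minimal tabular sketch, not the authors' implementation. It computes a GHM for a fixed policy by fixed-point iteration and then applies standard GPI over two base policies; it does not implement the paper's composition of GHMs for switching policies. The 3-state MDP, the reward-on-arrival convention, the normalisation $\mu = (1-\gamma)\sum_{t \ge 0} \gamma^t K^{t+1}$, and all function names are assumptions made for illustration.

```python
# Hypothetical 3-state, 2-action MDP; every number here is illustrative.
GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2

# P[a][s][t]: probability of moving from state s to state t under action a.
P = [
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.5, 0.0, 0.5]],  # action 1
]
REWARD = [0.0, 0.0, 1.0]  # assumed: reward is received on entering a state


def policy_kernel(policy):
    """State-to-state transition matrix K of a deterministic Markov policy."""
    return [P[policy[s]][s] for s in range(N_STATES)]


def tabular_ghm(kernel, iters=500):
    """Iterate mu <- (1 - gamma) K + gamma K mu until (numerically) fixed.

    The fixed point is the discounted state-visitation distribution
    mu(t | s) = (1 - gamma) * sum_k gamma^k * Pr(s_{k+1} = t | s_0 = s).
    """
    mu = [row[:] for row in kernel]
    for _ in range(iters):
        mu = [[(1 - GAMMA) * kernel[s][t]
               + GAMMA * sum(kernel[s][k] * mu[k][t] for k in range(N_STATES))
               for t in range(N_STATES)]
              for s in range(N_STATES)]
    return mu


def state_values(mu):
    """V(s) = (1 / (1 - gamma)) * sum_t mu(t | s) * r(t)."""
    return [sum(m * r for m, r in zip(row, REWARD)) / (1 - GAMMA) for row in mu]


def action_values(v):
    """Q(s, a) = sum_t P(t | s, a) * (r(t) + gamma * V(t))."""
    return [[sum(P[a][s][t] * (REWARD[t] + GAMMA * v[t]) for t in range(N_STATES))
             for a in range(N_ACTIONS)]
            for s in range(N_STATES)]


def gpi_policy(base_policies):
    """Standard GPI: act greedily with respect to max_i Q^{pi_i}(s, a)."""
    qs = [action_values(state_values(tabular_ghm(policy_kernel(p))))
          for p in base_policies]
    return [max(range(N_ACTIONS), key=lambda a: max(q[s][a] for q in qs))
            for s in range(N_STATES)]


base_policies = [[0, 0, 0], [1, 1, 1]]
improved = gpi_policy(base_policies)
```

In this sketch each row of the GHM is a probability distribution over future states, so a policy's value is read off as a reward-weighted average without rolling out trajectories. The GPI step then yields a policy whose value is at least that of every base policy in every state, which is the baseline the abstract's method is compared against.
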
Year | Venue | DocType |
---|---|---|
2022 | International Conference on Machine Learning | Conference |
Citations | PageRank | References |
---|---|---|
0 | 0.34 | 0 |
Authors |
---|
6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Shantanu Thakoor | 1 | 0 | 0.68 |
Mark Rowland | 2 | 49 | 7.39 |
Diana Borsa | 3 | 11 | 5.00 |
William Dabney | 4 | 270 | 17.86 |
Rémi Munos | 5 | 2240 | 157.06 |
André Barreto | 6 | 12 | 5.65 |