Abstract
---
In many environments only a tiny subset of all states yield high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state. We can train a model which, starting from a high value state (or one that is estimated to have high value), predicts and samples which (state, action)-tuples may have led to that high value state. These traces of (state, action) pairs, which we refer to as Recall Traces, sampled from this backtracking model starting from a high value state, are informative as they terminate in good states, and hence we can use these traces to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. Our method improves the sample efficiency of both on- and off-policy RL algorithms across several environments and tasks.
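As a rough illustration of the idea summarised above (and not the authors' reference implementation), the sketch below shows one way a backtracking model could be parameterised and used: it samples (state, action) pairs backwards from a high-value state to form a recall trace, and the policy is then trained to imitate those pairs. The class and function names (`BacktrackingModel`, `sample_recall_trace`, `imitate_recall_traces`), the diagonal-Gaussian parameterisation, and the `policy.log_prob` interface are all assumptions made for this example.

```python
import torch
import torch.nn as nn

class BacktrackingModel(nn.Module):
    """Backtracking model: given a state s_t, predict the previous action
    a_{t-1} and previous state s_{t-1} (diagonal Gaussians, by assumption)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        # q(a_{t-1} | s_t): outputs mean and log-std of the previous action
        self.action_head = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim))
        # q(s_{t-1} | s_t, a_{t-1}): outputs mean and log-std of the previous state
        self.state_head = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * state_dim))

    def forward(self, state):
        a_mu, a_logstd = self.action_head(state).chunk(2, dim=-1)
        prev_action = a_mu + a_logstd.exp() * torch.randn_like(a_mu)
        s_mu, s_logstd = self.state_head(
            torch.cat([state, prev_action], dim=-1)).chunk(2, dim=-1)
        prev_state = s_mu + s_logstd.exp() * torch.randn_like(s_mu)
        return prev_state, prev_action


def sample_recall_trace(backtrack_model, high_value_state, length=10):
    """Walk backwards from a high-value state, collecting (state, action) pairs."""
    trace, state = [], high_value_state
    for _ in range(length):
        prev_state, prev_action = backtrack_model(state)
        trace.append((prev_state, prev_action))
        state = prev_state
    return list(reversed(trace))  # chronological order, ending near the good state


def imitate_recall_traces(policy, optimizer, traces):
    """Nudge the policy towards the recall traces by maximising its
    log-likelihood of the (state, action) pairs they contain.
    Assumes policy.log_prob(state, action) returns per-sample log-likelihoods."""
    loss = torch.zeros(())
    for trace in traces:
        for state, action in trace:
            loss = loss - policy.log_prob(state.detach(), action.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a full training loop one would presumably interleave this imitation step with the base on- or off-policy RL updates, seeding `sample_recall_trace` with states drawn from a buffer of high-reward (or high estimated value) states, as the abstract describes.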
Year | Venue | Field
---|---|---
2019 | ICLR | Computer science, Artificial intelligence, Backtracking, Recall, Machine learning, Reinforcement learning

DocType | Citations | PageRank
---|---|---
Conference | 0 | 0.34

References | Authors
---|---
0 | 8

Name | Order | Citations | PageRank
---|---|---|---
Anirudh Goyal Alias Parth Goyal | 1 | 2 | 1.37 |
Philemon Brakel | 2 | 236 | 11.60 |
William Fedus | 3 | 49 | 5.01 |
Soumye Singhal | 4 | 0 | 0.68 |
Timothy P. Lillicrap | 5 | 4377 | 170.65 |
Sergey Levine | 6 | 3377 | 182.21 |
Hugo Larochelle | 7 | 7692 | 488.99 |
Yoshua Bengio | 8 | 42677 | 3039.83 |