Abstract |
---|
We address the problem of policy evaluation in discounted, tabular Markov decision processes and provide instance-dependent guarantees on the ℓ∞-error under a generative model. We establish both asymptotic and nonasymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a nonasymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve nonasymptotic, instance-dependent optimality up to logarithmic factors. |
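The abstract contrasts averaged TD with variance-reduced stochastic approximation for tabular policy evaluation under a generative model. The sketch below illustrates both ideas on a toy problem; it is not the paper's exact algorithm, and the random MDP (`P`, `r`, `gamma`), step-size schedules, and epoch/batch parameters are all assumptions chosen for the example.

```python
# Minimal sketch: synchronous TD(0) with Polyak-Ruppert averaging, and an
# epoch-based variance-reduced variant, for tabular policy evaluation under
# a generative model. All problem data and tuning constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)

n, gamma = 5, 0.9
P = rng.dirichlet(np.ones(n), size=n)   # row-stochastic transitions under the policy
r = rng.uniform(size=n)                 # expected one-step rewards
v_star = np.linalg.solve(np.eye(n) - gamma * P, r)  # exact value function

def sample_next_states():
    """Generative model: draw one next-state sample from every state."""
    return np.array([rng.choice(n, p=P[s]) for s in range(n)])

def td_polyak(num_iters=20000):
    theta, theta_bar = np.zeros(n), np.zeros(n)
    for k in range(1, num_iters + 1):
        s_next = sample_next_states()
        target = r + gamma * theta[s_next]      # empirical Bellman operator
        theta += (k ** -0.7) * (target - theta) # polynomial step size, typical with averaging
        theta_bar += (theta - theta_bar) / k    # running Polyak-Ruppert average of iterates
    return theta_bar

def td_variance_reduced(num_epochs=5, batch=5000, inner=2000, alpha=0.1):
    theta_ref = np.zeros(n)
    for _ in range(num_epochs):
        # Recentering step: Monte Carlo estimate of the Bellman operator at theta_ref.
        T_ref = np.zeros(n)
        for _ in range(batch):
            T_ref += r + gamma * theta_ref[sample_next_states()]
        T_ref /= batch
        theta = theta_ref.copy()
        for _ in range(inner):
            s_next = sample_next_states()
            # Recentered update: fresh-sample difference plus the batch estimate,
            # i.e. (T_hat(theta) - theta) - (T_hat(theta_ref) - theta_ref) + (T_ref - theta_ref).
            g = gamma * theta[s_next] - gamma * theta_ref[s_next] + T_ref - theta
            theta += alpha * g
        theta_ref = theta
    return theta_ref

print("TD + averaging   ||error||_inf:", np.max(np.abs(td_polyak() - v_star)))
print("variance-reduced ||error||_inf:", np.max(np.abs(td_variance_reduced() - v_star)))
```

The recentering step is what reduces variance: each inner update only pays the noise of a *difference* of Bellman evaluations at nearby points, plus a batch estimate that is refreshed once per epoch.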
Year | DOI | Venue
---|---|---
2021 | 10.1137/20M1331524 | SIAM Journal on Mathematics of Data Science

Keywords | DocType | Volume
---|---|---
temporal difference learning, Polyak-Ruppert averaging, variance reduction | Journal | 3

Issue | Citations | PageRank
---|---|---
4 | 0 | 0.34

References | Authors
---|---
0 | 5
Name | Order | Citations | PageRank |
---|---|---|---
Koulik Khamaru | 1 | 2 | 2.40 |
Ashwin Pananjady | 2 | 40 | 9.69 |
Feng Ruan | 3 | 0 | 0.34
Martin J. Wainwright | 4 | 7398 | 533.01 |
Michael I. Jordan | 5 | 31220 | 3640.80 |