Abstract |
---|
We address the problem of policy evaluation in discounted, tabular Markov decision processes and provide instance-dependent guarantees on the ℓ∞-error under a generative model. We establish both asymptotic and nonasymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a nonasymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve nonasymptotic, instance-dependent optimality up to logarithmic factors. |
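The abstract contrasts averaged TD with variance-reduced stochastic approximation for tabular policy evaluation under a generative model. The sketch below illustrates both ideas on a toy problem; it is not the paper's exact algorithm, and the random MDP (`P`, `r`, `gamma`), step-size schedules, and epoch/batch parameters are all assumptions chosen for the example.

```python
# Minimal sketch: synchronous TD(0) with Polyak-Ruppert averaging, and an
# epoch-based variance-reduced variant, for tabular policy evaluation under
# a generative model. All problem data and tuning constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)

n, gamma = 5, 0.9
P = rng.dirichlet(np.ones(n), size=n)   # row-stochastic transitions under the policy
r = rng.uniform(size=n)                 # expected one-step rewards
v_star = np.linalg.solve(np.eye(n) - gamma * P, r)  # exact value function

def sample_next_states():
    """Generative model: draw one next-state sample from every state."""
    return np.array([rng.choice(n, p=P[s]) for s in range(n)])

def td_polyak(num_iters=20000):
    theta, theta_bar = np.zeros(n), np.zeros(n)
    for k in range(1, num_iters + 1):
        s_next = sample_next_states()
        target = r + gamma * theta[s_next]      # empirical Bellman operator
        theta += (k ** -0.7) * (target - theta) # polynomial step size, typical with averaging
        theta_bar += (theta - theta_bar) / k    # running Polyak-Ruppert average of iterates
    return theta_bar

def td_variance_reduced(num_epochs=5, batch=5000, inner=2000, alpha=0.1):
    theta_ref = np.zeros(n)
    for _ in range(num_epochs):
        # Recentering step: Monte Carlo estimate of the Bellman operator at theta_ref.
        T_ref = np.zeros(n)
        for _ in range(batch):
            T_ref += r + gamma * theta_ref[sample_next_states()]
        T_ref /= batch
        theta = theta_ref.copy()
        for _ in range(inner):
            s_next = sample_next_states()
            # Recentered update: fresh-sample difference plus the batch estimate,
            # i.e. (T_hat(theta) - theta) - (T_hat(theta_ref) - theta_ref) + (T_ref - theta_ref).
            g = gamma * theta[s_next] - gamma * theta_ref[s_next] + T_ref - theta
            theta += alpha * g
        theta_ref = theta
    return theta_ref

print("TD + averaging   ||error||_inf:", np.max(np.abs(td_polyak() - v_star)))
print("variance-reduced ||error||_inf:", np.max(np.abs(td_variance_reduced() - v_star)))
```

The recentering step is what reduces variance: each inner update only pays the noise of a *difference* of Bellman evaluations at nearby points, plus a batch estimate that is refreshed once per epoch.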
Year | DOI | Venue
---|---|---
2021 | 10.1137/20M1331524 | SIAM Journal on Mathematics of Data Science

Keywords | DocType | Volume
---|---|---
temporal difference learning, Polyak-Ruppert averaging, variance reduction | Journal | 3

Issue | Citations | PageRank
---|---|---
4 | 0 | 0.34

References | Authors
---|---
0 | 5
Name | Order | Citations | PageRank |
---|---|---|---
Koulik Khamaru | 1 | 2 | 2.40 |
Ashwin Pananjady | 2 | 40 | 9.69 |
Feng Ruan | 3 | 0 | 0.34
Martin J. Wainwright | 4 | 7398 | 533.01 |
Michael I. Jordan | 5 | 31220 | 3640.80 |