Abstract |
---|
Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses a predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of a certain feature mapping. We propose a novel algorithm that makes use of the feature mapping and obtains an $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without access to a generative model or strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithm, the regret is lower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor. |
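The abstract's phrase "transition kernel can be parameterized as a linear function of a certain feature mapping" can be written out explicitly. A minimal sketch of this standard formalization (reconstructed here as an assumption, not quoted from the paper) is:

```latex
% Linear transition-kernel assumption: the unknown dynamics are
% linear in a known d-dimensional feature map \phi, with an unknown
% parameter vector \theta^* to be learned from interaction.
\mathbb{P}(s' \mid s, a)
  = \bigl\langle \phi(s' \mid s, a),\, \theta^* \bigr\rangle,
\qquad
\phi : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d,
\quad \theta^* \in \mathbb{R}^d .
```

Under such an assumption, estimating the $d$-dimensional $\theta^*$ replaces estimating the full transition kernel, which is what makes the $d\sqrt{T}$-type regret scaling plausible.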
Year | Venue | DocType
---|---|---
2021 | INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139 | Conference

Volume | ISSN | Citations
---|---|---
139 | 2640-3498 | 0

PageRank | References | Authors
---|---|---
0.34 | 5 | 3
Name | Order | Citations | PageRank |
---|---|---|---|
Dongruo Zhou | 1 | 50 | 9.91 |
Jiafan He | 2 | 0 | 2.03 |
Quanquan Gu | 3 | 1116 | 78.25 |