Abstract | ||
---|---|---|
We identify a class of stochastic control problems with highly random rewards and a high discount factor that induce high levels of statistical error in the estimated action-value function. This produces significant levels of max-operator bias in Q-learning, which can induce the algorithm to diverge for millions of iterations. We present a bias-corrected Q-learning algorithm with asymptotically unbiased resistance against the max-operator bias, and show that the algorithm asymptotically converges to the optimal policy, as Q-learning does. We show experimentally that bias-corrected Q-learning performs well in a domain with highly random rewards where Q-learning and other related algorithms suffer from the max-operator bias. |
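
The max-operator bias named in the abstract can be illustrated with a small simulation. The sketch below is not the paper's bias-corrected algorithm; it only shows the underlying effect under assumed conditions: every action has a true value of 0, rewards carry Gaussian noise, and each action value is estimated by a sample mean. The function name `max_operator_bias` and all of its parameters are hypothetical and chosen for illustration.

```python
import random


def max_operator_bias(num_actions=10, samples_per_action=5, noise_std=1.0,
                      trials=2000, seed=0):
    """Estimate E[max_a Qhat(a)] - max_a Q(a) when every true Q(a) is 0.

    Assumption (not from the source): rewards are i.i.d. Gaussian with
    mean 0, and Qhat(a) is the plain sample mean of the observed rewards.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = []
        for _ in range(num_actions):
            # Noisy rewards for one action; the true mean is 0.
            rewards = [rng.gauss(0.0, noise_std)
                       for _ in range(samples_per_action)]
            estimates.append(sum(rewards) / len(rewards))
        # Taking the max over noisy estimates favors lucky overestimates.
        total += max(estimates)
    # The true max_a Q(a) is 0, so the average maximum is the bias itself.
    return total / trials


if __name__ == "__main__":
    print("bias with 10 actions:", max_operator_bias())
    print("bias with 1 action:  ", max_operator_bias(num_actions=1))
```

With a single action the average stays near zero, while with several equally valued actions the average maximum is clearly positive, and it grows with the reward noise. This is consistent with the abstract's point that highly random rewards increase the statistical error that the max operator then amplifies.
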
Year | DOI | Venue |
---|---|---|
2013 | 10.1109/ADPRL.2013.6614994 | Adaptive Dynamic Programming And Reinforcement Learning |
Keywords | Field | DocType |
learning (artificial intelligence),stochastic systems,action-value function estimation,asymptotically unbiased resistance,bias-corrected Q-learning algorithm,discount factor,max-operator bias control,optimal policy,statistical error,stochastic control problems | Mathematical optimization,Discounting,Q-learning,Operator (computer programming),Mathematics,Stochastic control | Conference |
ISSN | Citations | PageRank |
2325-1824 | 4 | 0.51 |
References | Authors | |
8 | 3 | |
Name | Order | Citations | PageRank |
---|---|---|---|
Donghun Lee | 1 | 228 | 34.37 |
Boris Defourny | 2 | 25 | 6.26 |
Warren B. Powell | 3 | 1614 | 151.46 |