Title
Risk Aversion Operator For Addressing Maximization Bias In Q-Learning
Abstract
In Q-learning, the reduced chance of converging to the optimal policy is partly caused by bias in the estimated action values. Action-value estimation typically introduces overestimation or underestimation bias, which harms the current policy. The values produced by the maximization operator are overestimated, a phenomenon well known as maximization bias. To correct this bias, the double estimators operator shifts the values toward underestimation. However, according to the proposed analysis, the performance of the two operators (the maximization operator and the double estimators operator) depends on the unknown dynamics of the environment, in which the estimation bias results not only from the difference between the current policy and the optimal policy but also from the sampling error of the reward. The sampling error, which is amplified by these operators, creates a risk of converging to a non-optimal policy. To reduce this risk, this paper proposes a flexible operator, named the Risk Aversion operator, which uses the value of the most visited action instead of the greedy value and is inspired by humans' response to uncertainty. Based on this operator, Risk Aversion Q-learning is proposed, and bounds on the action values as well as convergence are proven. On three demonstration tasks whose optimal policy is known, the proposed algorithm increases the chance of converging to the optimal policy.
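The abstract only sketches the operator at a high level, so the following is a minimal, hypothetical sketch of the idea it describes: a tabular Q-learning update whose bootstrap target uses the value of the most visited action in the next state rather than the greedy maximum. The visit-count bookkeeping, tie-breaking, and hyperparameters below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))              # action-value estimates
N = np.zeros((n_states, n_actions), dtype=int)   # per (state, action) visit counts
alpha, gamma = 0.1, 0.99                         # assumed step size and discount

def risk_aversion_update(s, a, r, s_next):
    """One Q-learning step bootstrapping from the most visited next action."""
    N[s, a] += 1
    a_freq = int(np.argmax(N[s_next]))           # most visited action in s_next
    target = r + gamma * Q[s_next, a_freq]       # vs. the usual max_a Q[s_next, a]
    Q[s, a] += alpha * (target - Q[s, a])
```

The design intent, as described in the abstract, is that bootstrapping from a frequently visited (and therefore better-estimated) action value is less sensitive to reward sampling error than taking the maximum over possibly noisy estimates.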
Year
2020
DOI
10.1109/ACCESS.2020.2977400
Venue
IEEE ACCESS
Keywords
Uncertainty, Licenses, Convergence, Task analysis, Heuristic algorithms, Sociology, Statistics, Reinforcement learning, maximization bias, value iteration, Q-learning, risk aversion
DocType
Journal
Volume
8
ISSN
2169-3536
Citations
0
PageRank
0.34
References
0
Authors
4
Name            Order  Citations  PageRank
Bi Wang         1      0          0.34
Xuelian Li      2      0          0.34
Zhiqiang Gao    3      349        39.84
Yangjun Zhong   4      0          0.34