Abstract |
---|
We consider the exploration-exploitation trade-off in reinforcement learning and show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the 'temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice. |
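The abstract's implementation recipe — add a bonus to the reward at each state-action pair, solve the resulting Bellman equation, and act with a Boltzmann policy over the resulting values — can be sketched as follows. This is a minimal illustrative sketch, not the paper's algorithm: it assumes a known tabular MDP with a discounted (rather than finite-horizon) Bellman equation, and the exploration bonus `bonus`, temperature `tau`, and discount `gamma` are supplied externally; the paper derives the actual bonus and temperature schedule.

```python
import numpy as np

def k_learning_sketch(P, R, bonus, tau, gamma=0.9, num_iters=200):
    """Hedged sketch of the K-learning recipe from the abstract.

    P:     transition probabilities, shape (S, A, S) -- assumed known
    R:     expected rewards, shape (S, A)
    bonus: exploration bonus per state-action, shape (S, A) -- assumed given
    tau:   Boltzmann temperature (the risk-seeking parameter)
    """
    S, A, _ = P.shape
    K = np.zeros((S, A))  # optimistic "K-values"
    for _ in range(num_iters):
        # Soft (log-sum-exp) state value at temperature tau, consistent
        # with a Boltzmann policy over the K-values at the next state.
        V = tau * np.log(np.sum(np.exp(K / tau), axis=1))
        # Bellman update with the bonus-augmented reward.
        K = (R + bonus) + gamma * (P @ V)
    # Boltzmann exploration policy induced by the K-values.
    policy = np.exp(K / tau)
    policy /= policy.sum(axis=1, keepdims=True)
    return K, policy
```

As `tau` shrinks, the policy concentrates on the greedy action; larger `tau` (more risk-seeking) spreads probability mass and hence explores more, which is the trade-off the abstract's temperature parameter controls.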
Year | Venue | Field |
---|---|---|
2018 | arXiv: Learning | Discrete mathematics, Mathematical optimization, Time horizon, Regret, Upper and lower bounds, Bellman equation, Principle of maximum entropy, Boltzmann distribution, Mathematics, Reinforcement learning |
DocType | Volume | Citations |
---|---|---|
Journal | abs/1807.09647 | 1 |

PageRank | References | Authors |
---|---|---|
0.35 | 36 | 1 |
Name | Order | Citations | PageRank |
---|---|---|---|
Brendan O'Donoghue | 1 | 172 | 10.19 |