Title
Variational Bayesian Reinforcement Learning with Regret Bounds.
Abstract
We consider the exploration-exploitation trade-off in reinforcement learning and show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the 'temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde{O}(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
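The last sentences of the abstract describe the implementation recipe: add a bonus to the estimated reward at each state-action pair, solve a Bellman equation for the K-values, and act with a Boltzmann policy whose temperature equals the risk-seeking parameter. Below is a minimal, illustrative sketch of that recipe for a tabular finite-horizon MDP. The function name k_learning_planner, the count-based bonus of the form c/sqrt(n), and the fixed temperature tau are assumptions made for the sketch, not the paper's exact choices (the paper derives the precise bonus and shows the temperature can be optimized exactly or annealed).

# Illustrative sketch only: count-based bonus and fixed temperature are
# assumptions; the paper specifies the exact bonus and temperature schedule.
import numpy as np

def k_learning_planner(r_hat, p_hat, counts, horizon, tau=1.0, bonus_scale=1.0):
    """Compute K-values and a Boltzmann policy for a tabular finite-horizon MDP.

    r_hat:  (S, A) estimated mean rewards
    p_hat:  (S, A, S) estimated transition probabilities
    counts: (S, A) visit counts used for the exploration bonus
    """
    S, A = r_hat.shape
    bonus = bonus_scale / np.sqrt(np.maximum(counts, 1))  # assumed count-based bonus
    K = np.zeros((horizon + 1, S, A))
    policy = np.zeros((horizon, S, A))
    for h in reversed(range(horizon)):
        # soft (log-sum-exp) value of the next stage at temperature tau
        v_next = tau * np.log(np.sum(np.exp(K[h + 1] / tau), axis=1))  # shape (S,)
        # Bellman backup on the bonus-augmented reward
        K[h] = r_hat + bonus + p_hat @ v_next
        # Boltzmann exploration policy induced by the K-values
        logits = K[h] / tau
        logits -= logits.max(axis=1, keepdims=True)
        policy[h] = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return K, policy

In this sketch, at step h of an episode the agent would sample an action from policy[h][s], and r_hat, p_hat, and counts would be re-estimated from observed transitions between episodes.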
Year
2018
Venue
arXiv: Learning
Field
Discrete mathematics, Mathematical optimization, Time horizon, Regret, Upper and lower bounds, Bellman equation, Tilde, Principle of maximum entropy, Boltzmann constant, Mathematics, Reinforcement learning
DocType
Volume
abs/1807.09647
Citations
1
Journal
PageRank
0.35
References
36
Authors
1
Name
Brendan O'Donoghue
Order
1
Citations
172
PageRank
10.19