Title
Tuning continual exploration in reinforcement learning: An optimality property of the Boltzmann strategy.
Abstract
This paper presents a model that allows continual exploration to be tuned in an optimal way by integrating exploration and exploitation into a common framework. It first quantifies exploration by defining the degree of exploration of a state as the entropy of the probability distribution for choosing an admissible action in that state. The exploration/exploitation tradeoff is then formulated as a global optimization problem: find the exploration strategy that minimizes the expected cumulated cost while maintaining fixed degrees of exploration at the states; in other words, maximize exploitation for constant exploration. This formulation leads to a set of nonlinear iterative equations reminiscent of the value-iteration algorithm and demonstrates that the Boltzmann strategy based on the Q-value is optimal in this sense. Convergence of these equations to a local minimum is proved for a stationary environment. Interestingly, in the deterministic case, when there is no exploration, the equations reduce to the Bellman equations for finding the shortest path. Furthermore, if the graph of states is directed and acyclic, the nonlinear equations can easily be solved by a single backward pass from the destination state. Stochastic shortest-path problems and discounted problems are also studied, and links between our algorithm and the SARSA algorithm are examined. The theoretical results are confirmed by simple simulations showing that the proposed exploration strategy outperforms the ε-greedy strategy.
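For illustration, the short Python sketch below shows the kind of Boltzmann action-selection rule the abstract refers to: in a given state, each admissible action is drawn with probability proportional to exp(-theta * Q), where Q is the estimated cost-to-go of the action and the temperature parameter theta controls the entropy of the distribution, i.e., the degree of exploration. This is a minimal sketch under assumed cost values, not the authors' full algorithm (which additionally solves for the temperatures that achieve prescribed per-state entropy levels); all numbers are hypothetical.

import numpy as np

def boltzmann_policy(q_costs, theta):
    # Probability of each admissible action, proportional to exp(-theta * Q).
    # Small theta -> near-uniform distribution (high entropy, strong exploration);
    # large theta -> near-greedy choice of the lowest-cost action.
    z = np.exp(-theta * (q_costs - q_costs.min()))  # shift costs for numerical stability
    return z / z.sum()

def degree_of_exploration(p):
    # Shannon entropy of the action distribution in a state.
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical cost-to-go estimates (Q-values) for three admissible actions in one state.
q = np.array([1.0, 2.0, 4.0])
for theta in (0.1, 1.0, 10.0):
    p = boltzmann_policy(q, theta)
    print(f"theta={theta}: probabilities={p.round(3)}, entropy={degree_of_exploration(p):.3f}")

Raising theta lowers the entropy of the action distribution, so exploitation increases while exploration decreases, which is the tradeoff the optimization problem above controls explicitly.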
Year: 2008
DOI: 10.1016/j.neucom.2007.11.040
Venue: Neurocomputing
Keywords: boltzmann strategy, reinforcement learning markov decision processes exploration and exploitation maximum entropy shortest-path problems randomized strategy, reinforcement learning, exploitation tradeoff, e-greedy strategy, sarsa algorithm, destination state, exploration strategy, proposed exploration strategy, continual exploration, quantifies exploration, optimality property, constant exploration, markov decision processes, markov decision process, probability distribution, nonlinear equation, global optimization, maximum entropy, shortest path, value iteration, cumulant, shortest path problem, bellman equation
Field: Convergence (routing), Mathematical optimization, Nonlinear system, Shortest path problem, Markov decision process, Bellman equation, Probability distribution, Artificial intelligence, Principle of maximum entropy, Machine learning, Mathematics, Reinforcement learning
DocType: Journal
Volume: 71
Issue: 13-15
ISSN: Neurocomputing
Citations: 5
PageRank: 0.52
References: 17
Authors: 5
Name | Order | Citations | PageRank
Youssef Achbany | 1 | 73 | 5.77
François Fouss | 2 | 256 | 22.94
Luh Yen | 3 | 383 | 28.82
Alain Pirotte | 4 | 916 | 260.52
Marco Saerens | 5 | 1221 | 87.07