Abstract | ||
---|---|---|
We provide a bound on the first moment of the error in the Q-function estimate resulting from fixed step-size algorithms applied to finite state-space, discounted reward Markov decision problems. Motivated by Tsitsiklis’ proof for the decreasing step-size case, we decompose the Q-learning update equations into a dynamical system driven by a noise sequence and another dynamical system whose state variable is the Q-learning error, i.e., the difference between the true Q-function and the estimate. A natural persistence of excitation condition allows us to sample the system periodically and derive a simple scalar difference equation from which the convergence properties and bounds on the error of the Q-learning algorithm can be derived. |
Year | DOI | Venue |
---|---|---|
2012 | 10.1016/j.sysconle.2012.08.014 | Systems & Control Letters |
Keywords | Field | DocType |
Q-learning,Markov decision processes,Stochastic approximation | Differential equation,Mathematical optimization,Markov chain,Q-learning,Markov decision process,Moment (mathematics),State variable,Stochastic approximation,Dynamical system,Mathematics | Journal |
Volume | Issue | ISSN |
61 | 12 | 0167-6911 |
Citations | PageRank | References |
3 | 0.62 | 4 |
Authors | ||
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Carolyn L. Beck | 1 | 401 | 60.19 |
Srikant, R. | 2 | 6868 | 544.90 |