Title
Online Model-Free N-Step HDP With Stability Analysis.
Abstract
Building on the powerful temporal-difference (TD) learning method with $\lambda$ [TD($\lambda$)], this paper presents a novel $n$-step adaptive dynamic programming (ADP) architecture that combines TD($\lambda$) with regular TD learning to solve optimal control problems in fewer iterations. In contrast to backward view learning of TD($\lambda$), which requires an extra parameter, the eligibility trace, and updates at the end of each episode (offline training), the new design uses forward view learning, which updates at each time step (online training) and needs no eligibility trace parameter, in various applications without mathematical models.
Therefore, the new design is called the online model-free $n$-step action-dependent (AD) heuristic dynamic programming [NSHDP($\lambda$)]. NSHDP($\lambda$) has three neural networks: the critic network (CN) with regular one-step TD [TD(0)], the CN with $n$-step TD learning [or TD($\lambda$)], and the actor network (AN). Because forward view learning does not require extra eligibility traces associated with each state, the NSHDP($\lambda$) architecture has low computational cost and is memory efficient. Furthermore, stability is proven for NSHDP($\lambda$) under certain conditions by using Lyapunov analysis to obtain the uniformly ultimately bounded (UUB) property.
We compare the results with the performance of HDP and traditional action-dependent HDP($\lambda$) [ADHDP($\lambda$)] for different $\lambda$ values. Two simulation benchmarks are presented in this paper, a complex nonlinear system and a 2-D maze problem; a third, an inverted pendulum, is presented in the supplemental material. NSHDP($\lambda$) performance is examined and compared with other ADP methods.
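To make the contrast between one-step TD [TD(0)] and $n$-step TD concrete, the following is a minimal sketch in generic reinforcement-learning notation. It is not the NSHDP($\lambda$) algorithm from the paper: the function names, the discount factor `gamma`, and the use of plain lists in place of critic networks are all assumptions made for illustration.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], values: Sequence[float],
                  gamma: float, n: int) -> float:
    """Forward-view n-step TD target for the state at time t.

    rewards[k] is the reward observed after step t+k (hypothetical data).
    values[k] is the current value estimate of the state at time t+k.
    The target sums n discounted rewards, then bootstraps from values[n].
    """
    if n < 1 or n > len(rewards) or n >= len(values):
        raise ValueError("n must fit inside the recorded trajectory")
    discounted = sum((gamma ** k) * rewards[k] for k in range(n))
    return discounted + (gamma ** n) * values[n]


def td_error(rewards: Sequence[float], values: Sequence[float],
             gamma: float, n: int) -> float:
    """n-step TD error: target minus the current estimate values[0]."""
    return n_step_target(rewards, values, gamma, n) - values[0]


# Illustrative numbers only; they are not taken from the paper.
rewards = [1.0, 1.0, 1.0]
values = [0.0, 4.0, 2.0, 8.0]
one_step = n_step_target(rewards, values, gamma=0.5, n=1)    # TD(0) target
three_step = n_step_target(rewards, values, gamma=0.5, n=3)  # 3-step target
```

With `n=1` the target reduces to the regular TD(0) target. Larger `n` relies more on observed rewards and less on the bootstrapped estimate, and neither case stores an eligibility trace per state.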
Year: 2020
DOI: 10.1109/TNNLS.2019.2919614
Venue: IEEE Transactions on Neural Networks and Learning Systems
Keywords: Mathematical model, Stability analysis, Dynamic programming, Programming, Training, Computer architecture, Learning systems
DocType: Journal
Volume: 31
Issue: 4
ISSN: 2162-237X
Citations: 5
PageRank: 0.41
References: 14
Authors: 2
Name | Order | Citations | PageRank
Seaar Al-Dabooni | 1 | 10 | 1.82
Wunsch II Donald C. | 2 | 1354 | 91.73