Multi-step policy evaluation for adaptive-critic-based tracking control towards nonlinear systems
Correspondence to: Prof. Ding Wang, Faculty of Information Technology, Beijing University of Technology, No.100, Pingleyuan, Chaoyang District, Beijing 100124, China. E-mail:
Currently, there are a large number of tracking problems in the industry concerning nonlinear systems with unknown dynamics. In order to obtain the optimal control policy, a multi-step adaptive critic tracking control (MsACTC) algorithm is developed in this paper. By constructing a steady control law, the tracking problem is transformed into a regulation problem. The MsACTC algorithm has an adjustable convergence rate during the iterative process by incorporating a multi-step policy evaluation mechanism. The convergence proof of the algorithm is provided. In order to implement the algorithm, three neural networks are built, including the model network, the critic network, and the action network. Finally, two numerical simulation examples are given to verify the effectiveness of the algorithm. Simulation results show that the MsACTC algorithm has satisfactory performance in terms of the applicability, tracking accuracy, and convergence speed.
In practical engineering applications, controller design must not only meet basic performance requirements but also improve the control effect and reduce costs [1–4]. However, real industrial systems are often complex nonlinear systems [5, 6]. Solving the Hamilton-Jacobi-Bellman (HJB) equation is an unavoidable obstacle when designing optimal control policies for nonlinear systems. Because the analytical solution of the HJB equation is difficult to obtain, adaptive dynamic programming (ADP) based on the actor-critic framework approximates the solution iteratively [8, 9]. The ADP algorithm exhibits strong adaptive and optimization capabilities by combining the advantages of reinforcement learning, neural networks, and dynamic programming [10, 11], which makes it well suited to solving the HJB equation. Through continuous development, ADP has become one of the key methods for solving optimal control problems of nonlinear systems [12–14]. In terms of the iterative form, ADP algorithms can be categorized into value iteration (VI) and policy iteration (PI) [15, 16]. PI converges faster but requires an initial admissible control policy, which is difficult to obtain for nonlinear systems. VI does not require an initial admissible control policy but converges more slowly.
In industry, we often need to solve more complex tracking problems for nonlinear systems [19–24], including the consensus tracking problem for multi-agent systems and the output tracking control problem for single-agent systems. In fact, the regulation problem can be seen as a simplified special case of the tracking problem. To overcome these challenges, some scholars have started to develop optimal learning control algorithms based on the ADP framework for solving tracking problems. Currently, tracking control algorithms based on the ADP framework can be categorized into three classes. The first class constructs an augmented system from the original system and the reference trajectory and then builds the cost function from the state vector and the control input of the augmented system [27, 28]. Although this class has a well-established theory, it cannot completely eliminate tracking errors. The second class uses the square of the tracking error at the next time step to build a novel utility function. With this utility function, the tracking error can be completely eliminated. However, the second class is currently only applicable to affine nonlinear systems with known system models, which greatly reduces its application value. The third class transforms the tracking problem into a regulation problem by constructing a virtual steady control. Thanks to the steady control, the third class also completely eliminates tracking errors and is suitable for nonlinear systems. It should be emphasized that solving for the steady control imposes a large computational burden on the algorithm.
Therefore, we expected that a speed-up mechanism could be designed for the tracking control algorithm with steady control to improve the convergence speed. Inspired by the idea of eligibility traces, Al-Dabooni
In this context, a multi-step adaptive critic tracking control (MsACTC) algorithm is established by introducing a multi-step policy evaluation mechanism into the steady tracking control algorithm. Compared to the general tracking control algorithm with steady control, the MsACTC algorithm exhibits higher learning efficiency as the evaluation step increases. Furthermore, the convergence proof of the MsACTC algorithm is given. The MsACTC algorithm is implemented through neural networks and the least squares method. Finally, we verify the optimization ability and learning efficiency of the proposed algorithm through simulations of two nonlinear systems. Table 1 explains the meaning of the mathematical symbols used in this article.
Table 1. Summary of mathematical symbols
The set of all
The norm of vector
The set of all
The absolute value of
The transpose symbol
The set of all non-negative integers
The set of all positive integers
A compact subset of
2. PROBLEM DESCRIPTION
Consider such a class of discrete-time nonlinear systems
The reference trajectory can be denoted as
We assume that there exists a steady control
The tracking control policy
According to equations (1)-(5), we can obtain the following error system:
The cost function is defined as
So far, the tracking problem of the system (1) has been successfully transformed into the regulation problem of the system (6).
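For orientation, the transformation described in this section can be sketched in a generic form. The symbols below ($F$, $\psi$, $u(r_k)$, $e_k$, $u_{e,k}$, $\mathcal{U}$, $J$) are assumptions taken from the standard steady-control formulation; they may differ from the notation and numbering of equations (1)-(8).

```latex
\begin{align*}
x_{k+1} &= F(x_k, u_k)
  && \text{(controlled system)} \\
r_{k+1} &= \psi(r_k)
  && \text{(reference trajectory)} \\
r_{k+1} &= F\big(r_k, u(r_k)\big)
  && \text{(definition of the steady control } u(r_k)\text{)} \\
e_k &= x_k - r_k, \qquad u_{e,k} = u_k - u(r_k)
  && \text{(tracking error and tracking control)} \\
e_{k+1} &= F\big(e_k + r_k,\; u_{e,k} + u(r_k)\big) - \psi(r_k)
  && \text{(error system)} \\
J(e_k) &= \sum_{j=k}^{\infty} \mathcal{U}(e_j, u_{e,j})
  && \text{(cost function)} \\
J^{*}(e_k) &= \min_{u_{e,k}} \big\{ \mathcal{U}(e_k, u_{e,k}) + J^{*}(e_{k+1}) \big\}
  && \text{(HJB equation)}
\end{align*}
```

Under these assumptions, driving $e_k$ to zero with $u_{e,k}$ is an ordinary regulation problem, and the last line is the optimality condition that the iterative algorithm approximates.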
3. MULTI-STEP ADAPTIVE CRITIC TRACKING CONTROL ALGORITHM
It should be noted that equation (8) is the HJB equation, and its analytical solution is difficult to obtain directly. In addition, since the system function
3.1. Algorithm design
According to equation (6), we have
Construct the sequence of iterative cost functions
The policy evaluation is expressed as follows:
By iterating continuously between equations (11) and (12), the cost function and the control policy converge to their optimal values as
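The alternation between policy improvement and multi-step policy evaluation can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a hypothetical scalar linear error system e(k+1) = a·e(k) + b·u(k) with quadratic utility q·e² + r·u² and a quadratic cost function V(e) = p·e², so that both steps have closed forms. The names `steps`, `p0`, and all numerical values are illustrative choices.

```python
import math


def optimal_p(a, b, q, r):
    """Positive root of the scalar Riccati equation (the optimal cost weight)."""
    # b^2 p^2 + (r - q b^2 - a^2 r) p - q r = 0
    c2, c1, c0 = b * b, r - q * b * b - a * a * r, -q * r
    return (-c1 + math.sqrt(c1 * c1 - 4.0 * c2 * c0)) / (2.0 * c2)


def multi_step_iteration(a, b, q, r, p0, steps, tol=1e-8, max_iter=1000):
    """Iterate V_i(e) = p_i e^2 with a `steps`-step policy evaluation.

    Returns the final weight p and the number of iterations used.
    """
    p_star = optimal_p(a, b, q, r)
    p = p0
    for i in range(1, max_iter + 1):
        # Policy improvement: u = -k e minimizes q e^2 + r u^2 + p (a e + b u)^2.
        k = a * b * p / (r + b * b * p)
        closed_loop = a - b * k
        stage_cost = q + r * k * k
        # Multi-step policy evaluation: accumulate `steps` stage costs along
        # the closed-loop trajectory, then add the previous cost at the end.
        accumulated = sum(closed_loop ** (2 * j) for j in range(steps))
        p = stage_cost * accumulated + closed_loop ** (2 * steps) * p
        if abs(p - p_star) < tol:
            return p, i
    return p, max_iter


# Hypothetical parameters; p0 is chosen large so the initial cost function
# dominates its own one-step backup.
A_SYS, B_SYS, Q_W, R_W, P0 = 0.9, 0.5, 1.0, 1.0, 10.0
results = {n: multi_step_iteration(A_SYS, B_SYS, Q_W, R_W, P0, n) for n in (1, 3, 5)}
iterations = {n: results[n][1] for n in results}

if __name__ == "__main__":
    for n in sorted(results):
        print(f"steps={n}: p={results[n][0]:.6f}, iterations={results[n][1]}")
```

With `steps=1` the update reduces to ordinary value iteration. In this toy setting, larger values of `steps` reach the same limit in fewer iterations, which mirrors the adjustable convergence rate described above.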
3.2. Theoretical analysis
To ensure that the cost function
Theorem 1. If the sequences
then the following conclusions hold:
Proof. First, we prove conclusion 1). When (14) holds, we have
So far, we have shown that conclusion 1) holds for
By further derivation, we get
Based on inequalities (16) and (17), we have
By mathematical induction, the proof of conclusion 1) is completed.
Next, we prove conclusion 2). Since the sequence of cost functions
By observation, equations (20) and (8) are equivalent. This means that
Theorem 2. Based on equations (1)-(6),
Proof. According to the error system (6), we have
Besides, we have
The proof of Theorem 2 is completed. In this section, Theorem 1 is successfully proved by utilizing mathematical induction, and it provides a theoretical justification for the convergence of the MsACTC algorithm. Theorem 2 gives an equivalence between
Remark 1 To ensure that the condition (14) holds,
4. IMPLEMENTATION OF ALGORITHM
To implement the algorithm, three neural networks are constructed: a model network (MN), a critic network (CN), and an action network (AN). Figure 1 illustrates the overall structure of the MsACTC algorithm. The MN is trained independently, whereas the CN and the AN are trained cooperatively in an iterative manner. It should be noted that the implementation approach used in this paper is similar to that in .
Figure 1. The structure diagram of the MsACTC algorithm. MsACTC: Multi-step adaptive critic tracking control.
The MN is used to estimate the dynamics of the system (1) and to solve the steady control
To improve the identification accuracy of the MN, we train it with the neural network toolbox in MATLAB. The training parameter settings are given in Section 5.
After the MN is trained, according to equation (4), we can obtain the steady control
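Solving for the steady control amounts to finding, for each reference state, the input that makes the model output equal the next reference state. The sketch below illustrates this with Newton's method and a finite-difference slope. The functions `model` and `reference_next` are hypothetical scalar stand-ins for the trained MN and the reference dynamics, and the solver choice is an assumption, not the method used in the paper.

```python
import math


def model(x, u):
    """Hypothetical scalar stand-in for the trained model network."""
    return 0.8 * math.sin(x) + 0.5 * u + 0.1 * u ** 3


def reference_next(r):
    """Hypothetical reference dynamics r(k+1) = psi(r(k))."""
    return 0.9 * r


def solve_steady_control(r, u0=0.0, tol=1e-10, max_iter=50, h=1e-6):
    """Find u such that model(r, u) equals reference_next(r)."""
    target = reference_next(r)
    u = u0
    for _ in range(max_iter):
        residual = model(r, u) - target
        if abs(residual) < tol:
            break
        # Central finite difference approximates d(model)/du at the current u.
        slope = (model(r, u + h) - model(r, u - h)) / (2.0 * h)
        u -= residual / slope
    return u


if __name__ == "__main__":
    for r in (-1.0, 0.5, 1.0, 2.0):
        u_s = solve_steady_control(r)
        print(f"r={r:+.2f}: steady control={u_s:+.6f}")
```

Because this root-finding step is repeated for every reference state, it accounts for the computational burden of the steady control noted in the introduction.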
The CN is used to approximate the cost function, and its output can be expressed as
A sample set
The AN is used to approximate the tracking control policy, and its output can be expressed as
Considering the sample set
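A least-squares weight update for a critic that is linear in its weights can be sketched as follows. Everything here is an assumption made for illustration: the quadratic activation vector in `features`, the two-dimensional error samples, and the target weights `TRUE_W` used to generate synthetic targets. In the actual algorithm the targets would come from the multi-step policy evaluation rather than from known weights.

```python
import random


def features(e1, e2):
    """Hypothetical quadratic activation vector for a 2-D tracking error."""
    return [e1 * e1, e1 * e2, e2 * e2]


def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_critic(samples, targets):
    """Least-squares critic weights via the normal equations."""
    rows = [features(e1, e2) for e1, e2 in samples]
    n = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    moment = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    return solve_linear(gram, moment)


# Synthetic sample set and targets (hypothetical values).
rng = random.Random(0)
SAMPLES = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(200)]
TRUE_W = [2.0, -0.5, 1.5]
TARGETS = [sum(w * f for w, f in zip(TRUE_W, features(e1, e2))) for e1, e2 in SAMPLES]
critic_w = fit_critic(SAMPLES, TARGETS)

if __name__ == "__main__":
    print([round(w, 6) for w in critic_w])
```

The same fitting routine could in principle be reused for the AN by replacing the targets with the minimizing control inputs over the sample set.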
5. SIMULATION RESULT
In this section, we present two simulation examples. Example 1 illustrates the effectiveness of the MsACTC algorithm and the correctness of Theorem 1. Example 2 shows that condition (14) is sufficient but not necessary for the convergence of the cost function.
5.1. Example 1
Consider the modified Van der Pol's oscillator system
The reference trajectory is defined as
Before the neural networks are trained, we make a sample set of
We construct the activation functions for the CN and the AN as
In order to satisfy the condition (14), the weight vector
In order to compare the convergence speeds under different values of
Figure 3. Convergence processes of network weights with different values of
We apply the optimal control policy obtained by the MsACTC algorithm for
5.2. Example 2
Consider the modified torsional pendulum system
Due to the excellent adaptive ability, the MsACTC algorithm does not have strict requirements for the settings of the parameters. Therefore, the parameter values and the setting of activation functions are kept the same as those in Example 1. In Example 2, we set
Figure 8. Convergence processes of network weights with different values of
Even if the condition (14) does not hold,
6. CONCLUSIONS AND OUTLOOK
In this paper, the MsACTC algorithm is developed by introducing the multi-step policy evaluation mechanism into the adaptive critic tracking control algorithm. For nonlinear systems with unknown models, the MN is built based on data to estimate the state vector in the future time. By solving the steady control
Formal analysis, validation, writing - original draft: Li X
Methodology, Supervision, Writing - review and editing: Ren J
Investigation, Supervision, Writing - review and editing: Wang D
Availability of data and materials
Financial support and sponsorship
This work was supported in part by the National Key Research and Development Program of China under Grant 2021ZD0112302 and in part by the National Natural Science Foundation of China under Grant 62222301, Grant 61890930-5, and Grant 62021003.
Conflicts of interest
All authors declared that there are no conflicts of interest.
Ethical approval and consent to participate
Consent for publication
© The Author(s) 2023.
1. Maamoun KSA, Karimi HR. Reinforcement learning-based control for offshore crane load-landing operations. Complex Eng Syst 2022;2:12.
2. Wei Q, Liu D, Shi G, Liu Y. Multibattery optimal coordination control for home energy management systems via distributed iterative adaptive dynamic programming. IEEE Trans Ind Electron 2015;62:4203-14.
3. Firdausiyah N, Taniguchi E, Qureshi AG. Multi-agent simulation-adaptive dynamic programming based reinforcement learning for evaluating joint delivery systems in relation to the different locations of urban consolidation centres. Transp Res Proc 2020;46:125-32.
4. Li S, Ding L, Gao H, Liu YJ, Huang L, Deng Z. ADP-based online tracking control of partially uncertain time-delayed nonlinear system and application to wheeled mobile robots. IEEE Trans Cybern 2020;50:3182-94.
5. Sun T, Sun XM. An adaptive dynamic programming scheme for nonlinear optimal control with unknown dynamics and its application to turbofan engines. IEEE Trans Ind Inf 2021;17:367-76.
6. Davari M, Gao W, Jiang ZP, Lewis FL. An optimal primary frequency control based on adaptive dynamic programming for islanded modernized microgrids. IEEE Trans Automat Sci Eng 2021;18:1109-21.
7. Zhao M, Wang D, Qiao J, Ha M, Ren J. Advanced value iteration for discrete-time intelligent critic control: a survey. Artif Intell Rev 2023;56:12315-46.
8. Lewis FL, Vrabie D. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst Mag 2009;9:32-50.
9. Liu D, Xue S, Zhao B, Luo B, Wei Q. Adaptive dynamic programming for control: a survey and recent advances. IEEE Trans Syst Man Cybern Syst 2021;51:142-60.
10. Zhang H, Cui L, Zhang X, Luo Y. Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Trans Neural Netw 2011;22:2226-36.
11. Wang D, Wang J, Zhao M, Xin P, Qiao J. Adaptive multi-step evaluation design with stability guarantee for discrete-time optimal learning control. IEEE/CAA J Autom Sinica 2023;10:1797-809.
12. Wang D, Hu L, Zhao M, Qiao J. Dual event-triggered constrained control through adaptive critic for discrete-time zero-sum games. IEEE Trans Syst Man Cybern Syst 2023;53:1584-95.
13. Fairbank M, Alonso E, Prokhorov D. Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks. IEEE Trans Neural Netw Learn Syst 2012;23:1671-76.
14. Gao W, Deng C, Jiang Y, Jiang ZP. Resilient reinforcement learning and robust output regulation under denial-of-service attacks. Automatica 2022;142:110366.
15. Liu D, Wei Q. Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst 2013;25:621-34.
16. Wei Q, Liu D, Lin H. Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems. IEEE Trans Cybern 2015;46:840-53.
17. Luo B, Liu D, Huang T, Yang X, Ma H. Multi-step heuristic dynamic programming for optimal control of nonlinear discrete-time systems. Inf Sci 2017;411:66-83.
18. Heydari A. Stability analysis of optimal adaptive control using value iteration with approximation errors. IEEE Trans Automat Contr 2018;63:3119-26.
19. Li C, Ding J, Lewis FL, Chai T. A novel adaptive dynamic programming based on tracking error for nonlinear discrete-time systems. Automatica 2021;129:109687.
20. Kiumarsi B, Alqaudi B, Modares H, Lewis FL, Levine DS. Optimal control using adaptive resonance theory and Q-learning. Neurocomputing 2019;361:119-25.
21. Shang Y. Consensus tracking and containment in multiagent networks with state constraints. IEEE Trans Syst Man Cybern Syst 2023;53:1656-65.
22. Shang Y. Scaled consensus and reference tracking in multiagent networks with constraints. IEEE Trans Netw Sci Eng 2022;9:1620.
23. Zhang J, Yang D, Zhang H, Wang Y, Zhou B. Dynamic event-based tracking control of boiler turbine systems with guaranteed performance. IEEE Trans Automat Sci Eng 2023; doi: 10.1109/TASE.2023.3294187.
24. Gao W, Mynuddin M, Wunsch DC, Jiang ZP. Reinforcement learning-based cooperative optimal output regulation via distributed adaptive internal model. IEEE Trans Neural Netw Learn Syst 2022;33:5229-40.
25. Luo B, Liu D, Huang T, Liu J. Output tracking control based on adaptive dynamic programming with multistep policy evaluation. IEEE Trans Syst Man Cybern Syst 2017;49:2155-65.
26. Wang D, Li X, Zhao M, Qiao J. Adaptive critic control design with knowledge transfer for wastewater treatment applications. IEEE Trans Ind Inf 2023; doi: 10.1109/TII.2023.3278875.
27. Ming Z, Zhang H, Li W, Luo Y. Neurodynamic programming and tracking control for nonlinear stochastic systems by PI algorithm. IEEE Trans Circuits Syst II Express Briefs 2022;69:2892-96.
28. Lu J, Wei Q, Liu Y, Zhou T, Wang FY. Event-triggered optimal parallel tracking control for discrete-time nonlinear systems. IEEE Trans Syst Man Cybern Syst 2022;52:3772-84.
29. Zhao M, Wang D, Ha M, Qiao J. Evolving and incremental value iteration schemes for nonlinear discrete-time zero-sum games. IEEE Trans Cybern 2023;53:4487-99.
30. Al-Dabooni S, Wunsch DC. Online model-free n-step HDP with stability analysis. IEEE Trans Neural Netw Learn Syst 2020;31:1255-69.
31. Ha M, Wang D, Liu D. A novel value iteration scheme with adjustable convergence rate. IEEE Trans Neural Netw Learn Syst 2023;34:7430-42.
Cite This Article
Li X, Ren J, Wang D. Multi-step policy evaluation for adaptive-critic-based tracking control towards nonlinear systems. Complex Eng Syst 2023;3:20. http://dx.doi.org/10.20517/ces.2023.28