Multi-step policy evaluation for adaptive-critic-based tracking control towards nonlinear systems
Abstract
Tracking problems for nonlinear systems with unknown dynamics arise widely in industry. In order to obtain the optimal control policy, a multi-step adaptive critic tracking control (MsACTC) algorithm is developed in this paper. By constructing a steady control law, the tracking problem is transformed into a regulation problem. By incorporating a multi-step policy evaluation mechanism, the MsACTC algorithm has an adjustable convergence rate during the iterative process. The convergence proof of the algorithm is provided. To implement the algorithm, three neural networks are built: the model network, the critic network, and the action network. Finally, two numerical simulation examples are given to verify the effectiveness of the algorithm. Simulation results show that the MsACTC algorithm performs satisfactorily in terms of applicability, tracking accuracy, and convergence speed.
Keywords
1. INTRODUCTION
In practical engineering applications, controller design should not only meet basic performance requirements but also further improve control performance and reduce costs [1–4]. However, real industrial systems are often complex nonlinear systems [5, 6]. Solving the Hamilton-Jacobi-Bellman (HJB) equation is an unavoidable obstacle when designing optimal control policies for nonlinear systems [7]. Since the analytical solution of the HJB equation is difficult to obtain, adaptive dynamic programming (ADP) based on the actor-critic framework approximates the solution iteratively [8, 9]. The ADP algorithm exhibits strong adaptive and optimization capabilities by combining the advantages of reinforcement learning, neural networks, and dynamic programming [10, 11]. Therefore, ADP performs well in solving the HJB equation and, through continuous development, has become one of the key methods for solving optimal control problems of nonlinear systems [12–14]. In terms of the iterative form, ADP algorithms can be categorized into value iteration (VI) and policy iteration (PI) [15, 16]. PI converges faster but requires an initial admissible control policy [17], which is difficult to obtain for nonlinear systems. VI does not require an initial admissible control policy but converges more slowly [18].
In industry, we often need to solve more complex tracking problems for nonlinear systems [19–24], including the consensus tracking problem for multi-agent systems [22] and the output tracking control problem for single-agent systems [25]. In fact, the regulation problem can be seen as a simplified special case of the tracking problem. To address these problems, some scholars have developed optimal learning control algorithms based on the ADP framework for solving tracking problems [26]. Existing ADP-based tracking control algorithms can be categorized into three groups. The first class constructs an augmented system from the original system and the reference trajectory and then builds the cost function based on the state vector and the control input of the augmented system [27, 28]. Although this class has a well-established theory, it cannot completely eliminate the tracking error. The second class utilizes the square of the tracking error at the next moment to build a novel utility function [19]. With this utility function, the tracking error can be completely eliminated. However, the second class is currently only applicable to affine nonlinear systems with known system models, which greatly limits its application value. The third class transforms the tracking problem into a regulation problem by constructing a virtual steady control [29]. Thanks to the steady control, the third class also completely eliminates the tracking error and is suitable for general nonlinear systems. It should be emphasized, however, that solving the steady control imposes a considerable computational burden on the algorithm.
Therefore, it is desirable to design a speed-up mechanism for the tracking control algorithm with steady control to improve its convergence speed. Inspired by the idea of eligibility traces, Al-Dabooni et al. designed an online n-step ADP algorithm with higher learning efficiency [30]. Ha et al. introduced a relaxation factor into the framework of the offline ADP algorithm [31]; by adjusting the value of the relaxation factor, the convergence rate of the cost function can be increased or decreased. Zhao et al. designed an incremental ADP algorithm by constructing a new type of cost function, which showed higher learning efficiency in solving zero-sum game problems for nonlinear systems [29]. Luo et al. successfully combined the advantages of VI and PI by designing a multi-step policy evaluation mechanism [17, 25, 32]: the convergence speed of the algorithm was greatly improved by increasing the policy evaluation step, without requiring admissible control policies.
In this context, a multi-step adaptive critic tracking control (MsACTC) algorithm is established by introducing a multi-step policy evaluation mechanism into the steady tracking control algorithm. Compared to the general tracking control algorithm with steady control, the MsACTC algorithm exhibits higher learning efficiency as the evaluation step increases. Furthermore, the convergence proof of the MsACTC algorithm is given. The MsACTC algorithm is successfully implemented through neural networks and the least square method. Finally, we verify the optimization ability and learning efficiency of the proposed algorithm through simulations of two nonlinear systems. Table 1 explains the meaning of the mathematical symbols used in this article.
Table 1. Summary of mathematical symbols: the relevant sets of real numbers, vectors, and matrices; the norm of a vector; the absolute value; the transpose symbol; the set of all non-negative integers; the set of all positive integers; and a compact subset of the state space.
2. PROBLEM DESCRIPTION
Consider the following class of discrete-time nonlinear systems:
where
Assumption 1
Assumption 2
The reference trajectory can be denoted as
where
We assume that there exists a steady control
The tracking control policy
According to equations (1)-(5), we can obtain the following error system:
The cost function is defined as
where
and
So far, the tracking problem of the system (1) has been successfully transformed into the regulation problem of the system (6).
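Since the displayed equations (1)-(8) are not reproduced above, the following LaTeX block sketches the standard steady-control tracking formulation that the text describes. The specific function symbols, the form of the reference map, and the quadratic weighting matrices Q and R are assumptions made for illustration, not necessarily the authors' exact equations.

```latex
% A minimal sketch of the steady-control tracking formulation (assumed notation).
\begin{align}
  x_{k+1} &= F\bigl(x_k, u_k\bigr), && \text{(nonlinear plant)}\\
  r_{k+1} &= \psi\bigl(r_k\bigr), && \text{(reference trajectory)}\\
  r_{k+1} &= F\bigl(r_k, v_k\bigr), && \text{(steady control } v_k\text{)}\\
  e_k &= x_k - r_k, \qquad \mu_k = u_k - v_k, && \text{(tracking error, error control)}\\
  e_{k+1} &= F\bigl(e_k + r_k,\ \mu_k + v_k\bigr) - \psi\bigl(r_k\bigr), && \text{(error system)}\\
  J\bigl(e_k\bigr) &= \sum_{j=k}^{\infty} \bigl(e_j^{\mathsf T} Q\, e_j + \mu_j^{\mathsf T} R\, \mu_j\bigr). && \text{(cost function)}
\end{align}
```

Under a formulation of this kind, regulating the error system to the origin recovers exact tracking of the reference, which is the sense in which the tracking problem becomes a regulation problem.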
3. MULTI-STEP ADAPTIVE CRITIC TRACKING CONTROL ALGORITHM
It should be noted that equation (8) is the HJB equation, and its analytical solution is difficult to obtain directly. In addition, since the system function
3.1. Algorithm design
According to equation (6), we have
where
Construct the sequence of iterative cost functions
The policy evaluation is expressed as follows:
By iterating continuously between equations (11) and (12), the cost function and the control policy converge to their optimal values as
where
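Because the displayed forms of the iterative relations in equations (11) and (12) are not reproduced above, the following Python sketch illustrates the multi-step policy evaluation mechanism in the spirit of [17, 25]: the critic target accumulates the stage cost along a rollout of the current policy through the error dynamics for a fixed number of steps and then bootstraps on the previous cost estimate. All function and variable names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def multistep_evaluation(V_prev, policy, error_dynamics, utility, e_samples, n_steps):
    """One sweep of multi-step policy evaluation (illustrative sketch).

    V_prev          : callable, previous iterate of the cost function V_i(e)
    policy          : callable, current control policy mu_i(e)
    error_dynamics  : callable, error system e_{k+1} = F_e(e_k, mu_k)
    utility         : callable, stage cost U(e, mu)
    e_samples       : iterable of sampled error states
    n_steps         : evaluation step; n_steps = 1 recovers standard value iteration
    """
    targets = []
    for e in e_samples:
        cost, state = 0.0, e
        # Roll the current policy forward for n_steps, accumulating stage costs.
        for _ in range(n_steps):
            mu = policy(state)
            cost += utility(state, mu)
            state = error_dynamics(state, mu)
        # Bootstrap with the previous cost estimate at the n-step successor state.
        cost += V_prev(state)
        targets.append(cost)
    return np.asarray(targets)  # regression targets for the next critic fit
```

Setting the evaluation step to one reduces the scheme to value iteration, while larger steps move the update closer to policy iteration, which is how the algorithm's convergence rate becomes adjustable.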
3.2. Theoretical analysis
To ensure that the cost function
Theorem 1. If the sequences
then the following conclusions hold:
1)
2)
Proof. First, we prove conclusion 1). When (14) holds, we have
So far, we have shown that conclusion 1) holds for
By further derivation, we get
Based on inequalities (16) and (17), we have
By mathematical induction, the proof of conclusion 1) is complete.
Next, we prove conclusion 2). Since the sequence of cost functions
which means
By observation, equations (20) and (8) are equivalent. This means that
Theorem 2. Based on equations (1)-(6),
Proof. According to the error system (6), we have
Besides, we have
The proof of Theorem 2 is complete. In this section, Theorem 1 is proved by mathematical induction, providing a theoretical justification for the convergence of the MsACTC algorithm. Theorem 2 establishes an equivalence between
Remark 1. To ensure that the condition (14) holds,
4. IMPLEMENTATION OF ALGORITHM
To implement the algorithm, three neural networks are constructed: a model network (MN), a critic network (CN), and an action network (AN). Figure 1 illustrates the overall structure of the MsACTC algorithm. The MN is trained independently, whereas the CN and the AN are trained cooperatively in an iterative manner. It should be noted that the implementation approach used in this paper is similar to that in [22].
Figure 1. The structure diagram of the MsACTC algorithm. MsACTC: Multi-step adaptive critic tracking control.
The MN is used to estimate the dynamics of the system (1) and to solve the steady control
where
To obtain higher identification accuracy for the MN, we utilize the neural network toolbox in MATLAB to train it. The training parameter settings are given in Section 5.
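As an illustration of this identification step, the sketch below trains a small feedforward MN on sampled transitions (x_k, u_k) → x_{k+1}. It is written in Python with PyTorch as a stand-in for the MATLAB toolbox actually used in the paper; the network size, optimizer, and epoch count are assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

def train_model_network(X, U, X_next, hidden=20, epochs=2000, lr=1e-3):
    """Fit a small feedforward model network to sampled transitions.

    X, U, X_next : arrays of shape (N, state_dim), (N, control_dim), (N, state_dim)
    Returns a network mapping the concatenated (state, control) input to the next state.
    """
    X, U, X_next = map(lambda a: torch.as_tensor(a, dtype=torch.float32), (X, U, X_next))
    inputs = torch.cat([X, U], dim=1)                    # MN input: state and control
    mn = nn.Sequential(
        nn.Linear(inputs.shape[1], hidden),
        nn.Tanh(),
        nn.Linear(hidden, X_next.shape[1]),
    )
    opt = torch.optim.Adam(mn.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(mn(inputs), X_next)               # one-step prediction error
        loss.backward()
        opt.step()
    return mn
```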
After the MN is trained, according to equation (4), we can obtain the steady control
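Once the MN is available, the steady control can be computed numerically at each point of the reference trajectory. The sketch below poses this as a root-finding problem on the MN's one-step prediction, assuming the steady-control relation has the form r_{k+1} = F(r_k, v_k); since the paper's equation (4) is not reproduced here, this solver and all names in it are illustrative assumptions.

```python
import numpy as np

def solve_steady_control(model_net, r_k, r_next, v_init, lr=0.05, iters=500):
    """Numerically find v such that model_net(r_k, v) approximates r_next.

    model_net   : trained model network, callable mapping (state, control) -> next state
    r_k, r_next : consecutive points on the reference trajectory (1-D arrays)
    v_init      : initial guess for the steady control (1-D array)
    """
    v = np.atleast_1d(np.array(v_init, dtype=float))
    eps = 1e-5
    for _ in range(iters):
        pred = model_net(r_k, v)
        residual = pred - r_next
        if np.linalg.norm(residual) < 1e-8:
            break
        # Finite-difference Jacobian of the MN output with respect to the control input.
        jac = np.stack([
            (model_net(r_k, v + eps * np.eye(v.size)[j]) - pred) / eps
            for j in range(v.size)
        ], axis=1)
        v -= lr * jac.T @ residual   # gradient step on 0.5 * ||residual||^2
    return v
```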
The CN is used to approximate the cost function, and its output can be expressed as
where
A sample set
where
where
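Since the paper implements the critic update with the least squares method, and assuming the CN output is linear in its activation vector (the usual structure in least-squares ADP implementations, not taken verbatim from the paper's equations), the weight update over the sample set can be sketched as a ridge-regularized least-squares fit to the multi-step evaluation targets. Variable names are illustrative.

```python
import numpy as np

def update_critic_weights(phi, e_samples, targets, reg=1e-6):
    """Least-squares update of the critic weights (illustrative sketch).

    phi       : callable mapping an error state e to its activation vector
    e_samples : sampled error states
    targets   : multi-step policy evaluation targets for those samples
    reg       : small ridge term keeping the normal equations well conditioned
    """
    Phi = np.stack([phi(e) for e in e_samples])           # (N, n_phi)
    A = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
    b = Phi.T @ np.asarray(targets)
    return np.linalg.solve(A, b)                           # new critic weight vector
```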
The AN is used to approximate the tracking control policy, and its output can be expressed as
where
Considering the sample set
where
where
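Similarly, the AN weights can be fitted by least squares to target actions that minimize the sum of the stage cost and the critic's estimate at the next error state. The sketch below assumes a linear-in-activation AN and uses a generic numerical minimizer for the inner policy-improvement step; these are illustrative assumptions rather than the authors' exact scheme.

```python
import numpy as np
from scipy.optimize import minimize

def update_action_weights(sigma, critic_V, error_dynamics, utility, e_samples, reg=1e-6):
    """Least-squares update of the action network weights (illustrative sketch).

    sigma          : callable mapping an error state to the AN activation vector
    critic_V       : current critic estimate of the cost function
    error_dynamics : error system e_{k+1} = F_e(e, mu)
    utility        : stage cost U(e, mu)
    e_samples      : sampled error states
    """
    targets = []
    for e in e_samples:
        # Target action: greedy minimizer of stage cost plus cost-to-go at the successor.
        obj = lambda mu: utility(e, mu) + critic_V(error_dynamics(e, mu))
        mu0 = np.zeros(1)                                  # assumed scalar control for illustration
        targets.append(minimize(obj, mu0, method="Nelder-Mead").x)
    Sigma = np.stack([sigma(e) for e in e_samples])        # (N, n_sigma)
    T = np.stack(targets)                                  # (N, control_dim)
    A = Sigma.T @ Sigma + reg * np.eye(Sigma.shape[1])
    return np.linalg.solve(A, Sigma.T @ T)                 # new action weight matrix
```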
5. SIMULATION RESULTS
In this section, we complete two simulations. Example 1 illustrates the effectiveness of the MsACTC algorithm and the correctness of Theorem 1. Example 2 shows that the condition (14) is not necessary for the convergence of the cost function.
5.1. Example 1
Consider the modified Van der Pol's oscillator system
The reference trajectory is defined as
Before the neural networks are trained, we make a sample set of
We construct the activation functions for the CN and the AN as
In order to satisfy the condition (14), the weight vector
In order to compare the convergence speeds under different values of
Figure 3. Convergence processes of network weights with different values of
We apply the optimal control policy obtained by the MsACTC algorithm for
5.2. Example 2
Consider the modified torsional pendulum system
Owing to its excellent adaptive ability, the MsACTC algorithm does not impose strict requirements on the parameter settings. Therefore, the parameter values and the activation function settings are kept the same as those in Example 1. In Example 2, we set
Figure 8. Convergence processes of network weights with different values of
Even if the condition (14) does not hold,
6. CONCLUSIONS AND OUTLOOK
In this paper, the MsACTC algorithm is developed by introducing the multi-step policy evaluation mechanism into the adaptive critic tracking control algorithm. For nonlinear systems with unknown models, the MN is built from data to estimate future state vectors. By solving the steady control
DECLARATIONS
Authors' contributions
Formal analysis, validation, writing - original draft: Li X
Methodology, Supervision, Writing - review and editing: Ren J
Investigation, Supervision, Writing - review and editing: Wang D
Availability of data and materials
Not applicable.
Financial support and sponsorship
This work was supported in part by the National Key Research and Development Program of China under Grant 2021ZD0112302 and in part by the National Natural Science Foundation of China under Grant 62222301, Grant 61890930-5, and Grant 62021003.
Conflicts of interest
All authors declared that there are no conflicts of interest.
Ethical approval and consent to participate
Not applicable
Consent for publication
Not applicable
Copyright
© The Author(s) 2023.
REFERENCES
1. Maamoun KSA, Karimi HR. Reinforcement learning-based control for offshore crane load-landing operations. Complex Eng Syst 2022;2:13.
2. Wei Q, Liu D, Shi G, Liu Y. Multibattery optimal coordination control for home energy management systems via distributed iterative adaptive dynamic programming. IEEE Trans Ind Electron 2015;62:4203-14.
3. Firdausiyah N, Taniguchi E, Qureshi AG. Multi-agent simulation-adaptive dynamic programming based reinforcement learning for evaluating joint delivery systems in relation to the different locations of urban consolidation centres. Transp Res Proc 2020;46:125-32.
4. Li S, Ding L, Gao H, Liu YJ, Huang L, Deng Z. ADP-based online tracking control of partially uncertain time-delayed nonlinear system and application to wheeled mobile robots. IEEE Trans Cybern 2020;50:3182-94.
5. Sun T, Sun XM. An adaptive dynamic programming scheme for nonlinear optimal control with unknown dynamics and its application to turbofan engines. IEEE Trans Ind Inf 2021;17:367-76.
6. Davari M, Gao W, Jiang ZP, Lewis FL. An optimal primary frequency control based on adaptive dynamic programming for islanded modernized microgrids. IEEE Trans Automat Sci Eng 2021;18:1109-21.
7. Zhao M, Wang D, Qiao J, Ha M, Ren J. Advanced value iteration for discrete-time intelligent critic control: a survey. Artif Intell Rev 2023;56:12315-46.
8. Lewis FL, Vrabie D. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst Mag 2009;9:32-50.
9. Liu D, Xue S, Zhao B, Luo B, Wei Q. Adaptive dynamic programming for control: a survey and recent advances. IEEE Trans Syst Man Cybern Syst 2021;51:142-60.
10. Zhang H, Cui L, Zhang X, Luo Y. Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Trans Neural Netw 2011;22:2226-36.
11. Wang D, Wang J, Zhao M, Xin P, Qiao J. Adaptive multi-step evaluation design with stability guarantee for discrete-time optimal learning control. IEEE/CAA J Autom Sinica 2023;10:1797-809.
12. Wang D, Hu L, Zhao M, Qiao J. Dual event-triggered constrained control through adaptive critic for discrete-time zero-sum games. IEEE Trans Syst Man Cybern Syst 2023;53:1584-95.
13. Fairbank M, Alonso E, Prokhorov D. Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks. IEEE Trans Neural Netw Learn Syst 2012;23:1671-6.
14. Gao W, Deng C, Jiang Y, Jiang ZP. Resilient reinforcement learning and robust output regulation under denial-of-service attacks. Automatica 2022;142:110366.
15. Liu D, Wei Q. Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst 2013;25:621-34.
16. Wei Q, Liu D, Lin H. Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems. IEEE Trans Cybern 2015;46:840-53.
17. Luo B, Liu D, Huang T, Yang X, Ma H. Multi-step heuristic dynamic programming for optimal control of nonlinear discrete-time systems. Inf Sci 2017;411:66-83.
18. Heydari A. Stability analysis of optimal adaptive control using value iteration with approximation errors. IEEE Trans Automat Contr 2018;63:3119-26.
19. Li C, Ding J, Lewis FL, Chai T. A novel adaptive dynamic programming based on tracking error for nonlinear discrete-time systems. Automatica 2021;129:109687.
20. Kiumarsi B, Alqaudi B, Modares H, Lewis FL, Levine DS. Optimal control using adaptive resonance theory and Q-learning. Neurocomputing 2019;361:119-25.
21. Shang Y. Consensus tracking and containment in multiagent networks with state constraints. IEEE Trans Syst Man Cybern Syst 2023;53:1656-65.
22. Shang Y. Scaled consensus and reference tracking in multiagent networks with constraints. IEEE Trans Netw Sci Eng 2022;9:1620-9.
23. Zhang J, Yang D, Zhang H, Wang Y, Zhou B. Dynamic event-based tracking control of boiler turbine systems with guaranteed performance. IEEE Trans Automat Sci Eng 2023;in press.
24. Gao W, Mynuddin M, Wunsch DC, Jiang ZP. Reinforcement learning-based cooperative optimal output regulation via distributed adaptive internal model. IEEE Trans Neural Netw Learn Syst 2022;33:5229-40.
25. Luo B, Liu D, Huang T, Liu J. Output tracking control based on adaptive dynamic programming with multistep policy evaluation. IEEE Trans Syst Man Cybern Syst 2017;49:2155-65.
26. Wang D, Li X, Zhao M, Qiao J. Adaptive critic control design with knowledge transfer for wastewater treatment applications. IEEE Trans Ind Inf 2023;in press.
27. Ming Z, Zhang H, Li W, Luo Y. Neurodynamic programming and tracking control for nonlinear stochastic systems by PI algorithm. IEEE Trans Circuits Syst II Express Briefs 2022;69:2892-6.
28. Lu J, Wei Q, Liu Y, Zhou T, Wang FY. Event-triggered optimal parallel tracking control for discrete-time nonlinear systems. IEEE Trans Syst Man Cybern Syst 2022;52:3772-84.
29. Zhao M, Wang D, Ha M, Qiao J. Evolving and incremental value iteration schemes for nonlinear discrete-time zero-sum games. IEEE Trans Cybern 2023;53:4487-99.
30. Al-Dabooni S, Wunsch DC. Online model-free n-step HDP with stability analysis. IEEE Trans Neural Netw Learn Syst 2020;31:1255-69.
31. Ha M, Wang D, Liu D. A novel value iteration scheme with adjustable convergence rate. IEEE Trans Neural Netw Learn Syst 2023;34:7430-42.
How to Cite
Li, X.; Ren, J.; Wang, D. Multi-step policy evaluation for adaptive-critic-based tracking control towards nonlinear systems. Complex Eng. Syst. 2023, 3, 20. http://dx.doi.org/10.20517/ces.2023.28