# Multi-step policy evaluation for adaptive-critic-based tracking control towards nonlinear systems

*Complex Eng Syst* 2023;3:20.

## Abstract

Many industrial tracking problems involve nonlinear systems with unknown dynamics. To obtain the optimal control policy, a multi-step adaptive critic tracking control (MsACTC) algorithm is developed in this paper. By constructing a steady control law, the tracking problem is transformed into a regulation problem. By incorporating a multi-step policy evaluation mechanism, the MsACTC algorithm has an adjustable convergence rate during the iterative process. The convergence proof of the algorithm is provided. To implement the algorithm, three neural networks are built: the model network, the critic network, and the action network. Finally, two numerical simulation examples are given to verify the effectiveness of the algorithm. Simulation results show that the MsACTC algorithm performs satisfactorily in terms of applicability, tracking accuracy, and convergence speed.

## Keywords

Multi-step policy evaluation, nonlinear systems, optimal tracking control

## 1. INTRODUCTION

In practical engineering applications, the controller design should not only meet the basic performance requirements but also further improve the control effect and reduce costs ^{[1–4]}. However, real industrial systems are often complex nonlinear systems ^{[5, 6]}. Solving the Hamilton-Jacobi-Bellman (HJB) equation is an unavoidable obstacle when designing optimal control policies for nonlinear systems ^{[7]}. Because the analytical solution of the HJB equation is difficult to obtain, adaptive dynamic programming (ADP) based on the actor-critic framework approximates the solution iteratively ^{[8, 9]}. The ADP algorithm exhibits strong adaptive and optimization capabilities by combining the advantages of reinforcement learning, neural networks, and dynamic programming ^{[10, 11]}, which makes it well suited to solving the HJB equation. Through continuous development, the ADP algorithm has become one of the key methods for solving optimal control problems of nonlinear systems ^{[12–14]}. In terms of the iterative form, ADP algorithms can be categorized into value iteration (VI) and policy iteration (PI) ^{[15, 16]}. PI converges faster but requires an initial admissible control policy ^{[17]}, which is difficult to obtain for nonlinear systems. VI does not require an initial admissible control policy but converges more slowly ^{[18]}.

In industry, we often need to solve more complex tracking problems for nonlinear systems ^{[19–24]}, including the consensus tracking problem for multi-agent systems ^{[22]} and the output tracking control for single-agent systems ^{[25]}. In fact, the regulation problem can be seen as a simplified special case of the tracking problem. To address tracking problems, some scholars have started to develop optimal learning control algorithms based on the ADP framework ^{[26]}. Currently, tracking control algorithms based on the ADP framework can be categorized into three classes. The first class constructs an augmented system from the original system and the reference trajectory and then builds the cost function based on the state vector and the control input of the augmented system ^{[27, 28]}. Although this class has a well-established theory, it cannot completely eliminate tracking errors. The second class utilizes the square of the tracking error at the next moment to build a novel utility function ^{[19]}. With the novel utility function, the tracking error can be completely eliminated. However, the second class is currently only applicable to affine nonlinear systems with known system models, which greatly reduces its application value. The third class transforms the tracking problem into a regulation problem by constructing a virtual steady control ^{[29]}. Thanks to the steady control, the third class also completely eliminates tracking errors and is suitable for nonlinear systems. It should be emphasized that solving the steady control imposes a large computational burden on the algorithm.

Therefore, we expect that a speed-up mechanism can be designed for the tracking control algorithm with steady control to improve the convergence speed. Inspired by the idea of eligibility traces, Al-Dabooni *et al.* designed an online *n*-step ADP algorithm with higher learning efficiency^{[30]}. Ha *et al.* introduced a relaxation factor in the framework of the offline ADP algorithm^{[31]}. By adjusting the value of the relaxation factor, the convergence rate of the cost function could be increased or decreased. Zhao *et al.* designed an incremental ADP algorithm by constructing a new type of cost function, which showed higher learning efficiency in solving zero-sum game problems for nonlinear systems^{[29]}. Luo *et al.* successfully combined the advantages of VI and PI by designing a multi-step policy evaluation mechanism^{[17, 25, 32]}. Increasing the number of policy evaluation steps greatly improved the convergence speed of the algorithm, and no admissible control policy was required.

In this context, a multi-step adaptive critic tracking control (MsACTC) algorithm is established by introducing a multi-step policy evaluation mechanism into the steady tracking control algorithm. Compared to the general tracking control algorithm with steady control, the MsACTC algorithm exhibits higher learning efficiency as the number of evaluation steps increases. Furthermore, the convergence proof of the MsACTC algorithm is given. The MsACTC algorithm is implemented through neural networks and the least squares method. Finally, we verify the optimization ability and learning efficiency of the proposed algorithm through simulations of two nonlinear systems. Table 1 explains the meaning of the mathematical symbols used in this article.

Table 1. Summary of mathematical symbols

The set of all | The norm of vector | The |

The set of all | The absolute value of | The transpose symbol |

The set of all non-negative integers | The set of all positive integers | A compact subset of |

## 2. PROBLEM DESCRIPTION

Consider the following class of discrete-time nonlinear systems

where

**Assumption 1**

**Assumption 2**

The reference trajectory can be denoted as

where

We assume that there exists a steady control

The tracking control policy

According to equations (1)-(5), we can obtain the following error system:

The cost function is defined as

where

and

So far, the tracking problem of the system (1) has been successfully transformed into the regulation problem of the system (6).
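The transformation can be illustrated with a minimal sketch. It assumes the usual steady-control construction: a plant of the form x_{k+1} = F(x_k, u_k), a reference generated by r_{k+1} = D(r_k), a steady control u_s(r_k) that satisfies F(r_k, u_s(r_k)) = D(r_k), and a tracking error e_k = x_k - r_k. The functions `F` and `D`, the bisection bracket, and all numerical values are hypothetical and chosen only for illustration; they are not the systems studied in this paper.

```python
import math


def F(x, u):
    """Hypothetical non-affine plant x_{k+1} = F(x_k, u_k) (illustration only)."""
    return 0.8 * math.sin(x) + u + 0.1 * u ** 3


def D(r):
    """Hypothetical reference generator r_{k+1} = D(r_k)."""
    return math.sin(r + 1.0)


def steady_control(r, lo=-10.0, hi=10.0, iters=100):
    """Solve F(r, u) = D(r) for u by bisection.

    This works here because the toy F is strictly increasing in u and the
    bracket [lo, hi] is wide enough to contain the root.
    """
    target = D(r)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(r, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def error_step(e, r, u_e):
    """Error dynamics e_{k+1} = F(e_k + r_k, u_e + u_s(r_k)) - D(r_k)."""
    return F(e + r, u_e + steady_control(r)) - D(r)


# With zero error and zero tracking control, the error stays at zero,
# so the origin is an equilibrium of the error system.
r0 = 0.3
print("e_next at the origin:", error_step(0.0, r0, 0.0))

# One step of the original system gives the same error as one step
# of the error system driven by u_e = u - u_s(r).
x0, u0 = 1.0, 0.2
direct = F(x0, u0) - D(r0)
via_error_system = error_step(x0 - r0, r0, u0 - steady_control(r0))
print("direct:", direct, "via error system:", via_error_system)
```

The second check shows why regulating the error system to the origin is equivalent to tracking the reference with the original system, provided the steady control exists.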

## 3. MULTI-STEP ADAPTIVE CRITIC TRACKING CONTROL ALGORITHM

It should be noted that equation (8) is the HJB equation, and its analytical solution is difficult to obtain directly. In addition, since the system function

### 3.1. Algorithm design

According to equation (6), we have

where

Construct the sequence of iterative cost functions

The policy evaluation is expressed as follows:

By iterating continuously between equations (11) and (12), the cost function and the control policy converge to their optimal values as

where
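The interplay between policy improvement and multi-step policy evaluation can be sketched on a scalar linear-quadratic toy problem. The sketch assumes the evaluation takes the form used in multi-step heuristic dynamic programming ^{[17]}: the new cost function is the sum of the utilities over N steps under the current policy plus the previous cost function at the N-th state. The system e_{k+1} = a e_k + b u_k, the utility q e^2 + r u^2, the quadratic cost function V_i(e) = p_i e^2, and every numerical value below are assumptions made for illustration; the paper itself uses neural networks rather than this closed form.

```python
def greedy_gain(p, a, b, r):
    """Policy improvement for V(e) = p e^2: u = -K e minimizes U(e, u) + V(e_next)."""
    return a * b * p / (r + b * b * p)


def multi_step_evaluation(p, K, a, b, q, r, n_steps):
    """N-step policy evaluation under u = -K e.

    p_new e^2 = sum_{l=0}^{N-1} U(e_l, u_l) + p e_N^2, where e_{l+1} = (a - b K) e_l.
    """
    c2 = (a - b * K) ** 2          # squared closed-loop factor
    stage = q + r * K * K          # utility per unit of e^2 under the policy
    total, weight = 0.0, 1.0
    for _ in range(n_steps):
        total += stage * weight
        weight *= c2
    return total + weight * p


def run(n_steps, p0=10.0, a=1.2, b=1.0, q=1.0, r=1.0, tol=1e-10, max_iter=1000):
    """Alternate improvement and N-step evaluation until p stops changing."""
    p = p0
    for i in range(1, max_iter + 1):
        K = greedy_gain(p, a, b, r)
        p_new = multi_step_evaluation(p, K, a, b, q, r, n_steps)
        if abs(p_new - p) < tol:
            return p_new, i
        p = p_new
    return p, max_iter


results = {n: run(n) for n in (1, 3, 5)}
for n, (p, iters) in results.items():
    print(f"N={n}: p={p:.6f} after {iters} iterations")
```

With N = 1 the update reduces to ordinary value iteration. Larger N applies the current policy for more steps before the old cost function is used, so in this toy setting the same fixed point is reached in fewer iterations, at the price of more computation per iteration.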

### 3.2. Theoretical analysis

To ensure that the cost function

**Theorem 1** *If the sequences*

*then the following conclusions hold:*

*1) *

*2) *

**Proof.** First, we prove the conclusion 1). When (14) holds, we have

So far, we have shown that the conclusion 1) holds for

By further derivation, we get

Based on inequalities (16) and (17), we have

According to mathematical induction, the proof of the conclusion 1) is completed.

Next, we prove the conclusion 2). Since the sequence of cost functions

which means

By observation, equations (20) and (8) are equivalent. This means that

**Theorem 2** *Based on equations (1)-(6),*

**Proof.** According to the error system (6), we have

Besides, we have

The proof of Theorem 2 is completed. In this section, Theorem 1 is successfully proved by utilizing mathematical induction, and it provides a theoretical justification for the convergence of the MsACTC algorithm. Theorem 2 gives an equivalence between

**Remark 1** To ensure that the condition (14) holds,

## 4. IMPLEMENTATION OF ALGORITHM

To implement the algorithm, three neural networks are constructed, including a model network (MN), a critic network (CN), and an action network (AN). Figure 1 illustrates the overall structure of the MsACTC algorithm. The MN is trained independently, whereas the CN and the AN are trained together in an iterative manner. The implementation approach used in this paper is similar to that in ^{[22]}.

Figure 1. The structure diagram of the MsACTC algorithm. MsACTC: Multi-step adaptive critic tracking control.

The MN is used to estimate the dynamics of the system (1) and to solve the steady control

where

To improve the identification accuracy of the MN, we train it with the neural network toolbox in MATLAB. The training parameter settings are given in Section 5.

After the MN is trained, according to equation (4), we can obtain the steady control

The CN is used to approximate the cost function, and its output can be expressed as

where

A sample set

where

where
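A least squares weight update over a sample set can be sketched as follows. The sketch assumes a critic that is linear in its weights, with a hypothetical quadratic feature vector for a two-dimensional tracking error, and it solves the normal equations directly. The features, the synthetic targets, and the helper names are illustrative assumptions; in the algorithm the targets would come from the multi-step policy evaluation rather than from known weights.

```python
import random


def features(e):
    """Hypothetical activation vector for a two-dimensional tracking error."""
    e1, e2 = e
    return [e1 * e1, e1 * e2, e2 * e2]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(M[k][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for k in range(col + 1, n):
            factor = M[k][col] / M[col][col]
            for c in range(col, n + 1):
                M[k][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_critic(samples, targets):
    """Least squares weights w minimizing sum_s (w^T phi(e_s) - y_s)^2."""
    Phi = [features(e) for e in samples]
    m = len(Phi[0])
    A = [[sum(p[i] * p[j] for p in Phi) for j in range(m)] for i in range(m)]
    b = [sum(p[i] * y for p, y in zip(Phi, targets)) for i in range(m)]
    return solve_linear(A, b)


# Synthetic check: targets generated from known weights should be recovered.
random.seed(0)
samples = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
true_w = [2.0, -0.5, 1.5]
targets = [sum(w * f for w, f in zip(true_w, features(e))) for e in samples]
w_hat = fit_critic(samples, targets)
print("recovered weights:", w_hat)
```

Solving the normal equations is adequate for a handful of well-conditioned features; for larger or poorly conditioned feature sets, a QR- or SVD-based solver would be the safer choice.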

The AN is used to approximate the tracking control policy, and its output can be expressed as

where

Considering the sample set

where

where

## 5. SIMULATION RESULT

In this section, we present two simulations. Example 1 illustrates the effectiveness of the MsACTC algorithm and the correctness of Theorem 1. Example 2 shows that the condition (14) is not necessary for the convergence of the cost function.

### 5.1. Example 1

Consider the modified Van der Pol's oscillator system

The reference trajectory is defined as

Before the neural networks are trained, we make a sample set of

We construct the activation functions for the CN and the AN as

In order to satisfy the condition (14), the weight vector

In order to compare the convergence speeds under different values of

Figure 3. Convergence processes of network weights with different values of

We apply the optimal control policy obtained by the MsACTC algorithm for

### 5.2. Example 2

Consider the modified torsional pendulum system

Due to the excellent adaptive ability, the MsACTC algorithm does not have strict requirements for the settings of the parameters. Therefore, the parameter values and the setting of activation functions are kept the same as those in Example 1. In Example 2, we set

Figure 8. Convergence processes of network weights with different values of

Even if the condition (14) does not hold,

## 6. CONCLUSIONS AND OUTLOOK

In this paper, the MsACTC algorithm is developed by introducing the multi-step policy evaluation mechanism into the adaptive critic tracking control algorithm. For nonlinear systems with unknown models, the MN is built based on data to estimate the state vector in the future time. By solving the steady control

## DECLARATIONS

### Authors' contributions

Formal analysis, Validation, Writing - original draft: Li X

Methodology, Supervision, Writing - review and editing: Ren J

Investigation, Supervision, Writing - review and editing: Wang D

### Availability of data and materials

Not applicable.

### Financial support and sponsorship

This work was supported in part by the National Key Research and Development Program of China under Grant 2021ZD0112302 and in part by the National Natural Science Foundation of China under Grant 62222301, Grant 61890930-5, and Grant 62021003.

### Conflicts of interest

All authors declared that there are no conflicts of interest.

### Ethical approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Copyright

© The Author(s) 2023.

## REFERENCES

1. Maamoun KSA, Karimi HR. Reinforcement learning-based control for offshore crane load-landing operations. *Complex Eng Syst* 2022;2:13.

2. Wei Q, Liu D, Shi G, Liu Y. Multibattery optimal coordination control for home energy management systems via distributed iterative adaptive dynamic programming. *IEEE Trans Ind Electron* 2015;62:4203-14.

3. Firdausiyah N, Taniguchi E, Qureshi AG. Multi-agent simulation-adaptive dynamic programming based reinforcement learning for evaluating joint delivery systems in relation to the different locations of urban consolidation centres. *Transp Res Proc* 2020;46:125-32.

4. Li S, Ding L, Gao H, Liu YJ, Huang L, Deng Z. ADP-based online tracking control of partially uncertain time-delayed nonlinear system and application to wheeled mobile robots. *IEEE Trans Cybern* 2020;50:3182-94.

5. Sun T, Sun XM. An adaptive dynamic programming scheme for nonlinear optimal control with unknown dynamics and its application to turbofan engines. *IEEE Trans Ind Inf* 2021;17:367-76.

6. Davari M, Gao W, Jiang ZP, Lewis FL. An optimal primary frequency control based on adaptive dynamic programming for islanded modernized microgrids. *IEEE Trans Automat Sci Eng* 2021;18:1109-21.

7. Zhao M, Wang D, Qiao J, Ha M, Ren J. Advanced value iteration for discrete-time intelligent critic control: a survey. *Artif Intell Rev* 2023;56:12315-46.

8. Lewis FL, Vrabie D. Reinforcement learning and adaptive dynamic programming for feedback control. *IEEE Circuits Syst Mag* 2009;9:32-50.

9. Liu D, Xue S, Zhao B, Luo B, Wei Q. Adaptive dynamic programming for control: a survey and recent advances. *IEEE Trans Syst Man Cybern Syst* 2021;51:142-60.

10. Zhang H, Cui L, Zhang X, Luo Y. Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. *IEEE Trans Neural Netw* 2011;22:2226-36.

11. Wang D, Wang J, Zhao M, Xin P, Qiao J. Adaptive multi-step evaluation design with stability guarantee for discrete-time optimal learning control. *IEEE/CAA J Autom Sinica* 2023;10:1797-809.

12. Wang D, Hu L, Zhao M, Qiao J. Dual event-triggered constrained control through adaptive critic for discrete-time zero-sum games. *IEEE Trans Syst Man Cybern Syst* 2023;53:1584-95.

13. Fairbank M, Alonso E, Prokhorov D. Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks. *IEEE Trans Neural Netw Learn Syst* 2012;23:1671-6.

14. Gao W, Deng C, Jiang Y, Jiang ZP. Resilient reinforcement learning and robust output regulation under denial-of-service attacks. *Automatica* 2022;142:110366.

15. Liu D, Wei Q. Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. *IEEE Trans Neural Netw Learn Syst* 2013;25:621-34.

16. Wei Q, Liu D, Lin H. Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems. *IEEE Trans Cybern* 2015;46:840-53.

17. Luo B, Liu D, Huang T, Yang X, Ma H. Multi-step heuristic dynamic programming for optimal control of nonlinear discrete-time systems. *Inf Sci* 2017;411:66-83.

18. Heydari A. Stability analysis of optimal adaptive control using value iteration with approximation errors. *IEEE Trans Automat Contr* 2018;63:3119-26.

19. Li C, Ding J, Lewis FL, Chai T. A novel adaptive dynamic programming based on tracking error for nonlinear discrete-time systems. *Automatica* 2021;129:109687.

20. Kiumarsi B, Alqaudi B, Modares H, Lewis FL, Levine DS. Optimal control using adaptive resonance theory and Q-learning. *Neurocomputing* 2019;361:119-25.

21. Shang Y. Consensus tracking and containment in multiagent networks with state constraints. *IEEE Trans Syst Man Cybern Syst* 2023;53:1656-65.

22. Shang Y. Scaled consensus and reference tracking in multiagent networks with constraints. *IEEE Trans Netw Sci Eng* 2022;9:1620-9.

23. Zhang J, Yang D, Zhang H, Wang Y, Zhou B. Dynamic event-based tracking control of boiler turbine systems with guaranteed performance. *IEEE Trans Automat Sci Eng* 2023;in press.

24. Gao W, Mynuddin M, Wunsch DC, Jiang ZP. Reinforcement learning-based cooperative optimal output regulation via distributed adaptive internal model. *IEEE Trans Neural Netw Learn Syst* 2022;33:5229-40.

25. Luo B, Liu D, Huang T, Liu J. Output tracking control based on adaptive dynamic programming with multistep policy evaluation. *IEEE Trans Syst Man Cybern Syst* 2017;49:2155-65.

26. Wang D, Li X, Zhao M, Qiao J. Adaptive critic control design with knowledge transfer for wastewater treatment applications. *IEEE Trans Ind Inf* 2023;in press.

27. Ming Z, Zhang H, Li W, Luo Y. Neurodynamic programming and tracking control for nonlinear stochastic systems by PI algorithm. *IEEE Trans Circuits Syst II Express Briefs* 2022;69:2892-6.

28. Lu J, Wei Q, Liu Y, Zhou T, Wang FY. Event-triggered optimal parallel tracking control for discrete-time nonlinear systems. *IEEE Trans Syst Man Cybern Syst* 2022;52:3772-84.

29. Zhao M, Wang D, Ha M, Qiao J. Evolving and incremental value iteration schemes for nonlinear discrete-time zero-sum games. *IEEE Trans Cybern* 2023;53:4487-99.

30. Al-Dabooni S, Wunsch DC. Online model-free *n*-step HDP with stability analysis. *IEEE Trans Neural Netw Learn Syst* 2020;31:1255-69.

31. Ha M, Wang D, Liu D. A novel value iteration scheme with adjustable convergence rate. *IEEE Trans Neural Netw Learn Syst* 2023;34:7430-42.

## Cite This Article

**OAE Style**

Li X, Ren J, Wang D. Multi-step policy evaluation for adaptive-critic-based tracking control towards nonlinear systems. *Complex Eng Syst* 2023;3:20. http://dx.doi.org/10.20517/ces.2023.28

## About This Article

### Copyright

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
