Recent Progress in Reinforcement Learning and Adaptive Dynamic Programming for Advanced Control Applications

IEEE/CAA Journal of Automatica Sinica, 2024, No. 1

Ding Wang, Ning Gao, Derong Liu, Jinna Li, and Frank L. Lewis

Abstract—Reinforcement learning (RL) has roots in dynamic programming, and within the control community it is known as adaptive/approximate dynamic programming (ADP). This paper reviews recent developments in ADP along with RL and its applications to various advanced control fields. First, the background of the development of ADP is described, emphasizing the significance of regulation and tracking control problems. Some effective offline and online algorithms for ADP/adaptive critic control are presented, where the main results for discrete-time systems and continuous-time systems are surveyed, respectively. Then, the research progress on adaptive critic control under the event-triggered framework and under uncertain environments is discussed, where event-based design, robust stabilization, and game design are reviewed. Moreover, extensions of ADP for addressing control problems under complex environments have attracted enormous attention. The ADP architecture is revisited from the perspective of data-driven and RL frameworks, showing how they significantly advance the ADP formulation. Finally, several typical control applications of RL and ADP are summarized, particularly in the fields of wastewater treatment processes and power systems, followed by some general prospects for future research. Overall, this comprehensive survey on ADP and RL for advanced control applications demonstrates their remarkable potential in the artificial intelligence era. In addition, they also play a vital role in promoting environmental protection and industrial intelligence.

I. INTRODUCTION

ARTIFICIAL intelligence (AI) generally refers to intelligence exhibited by machines that humans build. The definition of AI is very broad, ranging from the legends of robots and androids in Greek mythology to the well-known Turing Test, and now to the development of various intelligent algorithms within the framework of machine learning [1]–[4]. AI technology is gradually changing our lives, from computer vision, big data processing, intelligent automation, and smart factories to many other areas.

As one of the most prominent technologies of the 21st century, AI cannot be developed without machine learning. Machine learning, as the core of AI, is the foundation of computer intelligence. Reinforcement learning (RL) [5]–[7] is one of the three main paradigms of machine learning, along with supervised learning and unsupervised learning. RL emphasizes the interaction between the environment and the agent, with a focus on long-term interaction to change its policies. Through its interactions with the environment, the agent can modify future actions or control policies based on the response to its stimulating actions.

It should be emphasized that RL does not necessarily require a perfect environment model or huge computing resources. RL is inseparable from dynamic programming [8], [9]. Traditional dynamic programming has been investigated considerably in theory, which provides a key foundation of RL. However, this technique requires the assumption of an exact system model, which is rarely available for large-scale complex nonlinear systems. Besides, such methods are severely limited in solving the Hamilton-Jacobi-Bellman (HJB) equations of nonlinear systems as the dimensionality of states and controls increases [8]. Therefore, adaptive/approximate dynamic programming (ADP) [10]–[13], a method combining RL, dynamic programming, and neural networks, was proposed.

ADP has been widely used to solve a range of optimal control problems for complex nonlinear systems in unknown environments. As the main algorithmic frameworks, value iteration (VI) and policy iteration (PI) have been intensively developed. The initialization requirements for VI and PI are different. Unlike PI, which must start from an initial admissible control law, VI imposes no strict requirement on the initial control law. From the iterative control point of view, however, PI provides a more stable mechanism. Both algorithms are attracting more and more attention from the control community [14], [15]. Due to their respective properties, VI has received more attention in the discrete-time domain, while PI is more commonly applied in the continuous-time domain.
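To make the contrast concrete, the following minimal tabular sketch (not taken from the surveyed works; the transition array P, stage cost c, and discount gamma are illustrative assumptions) implements VI and PI for a small finite Markov decision process. It illustrates the initialization difference noted above: VI starts from an arbitrary value function, whereas PI starts from an initial policy and alternates evaluation and improvement. In the ADP setting the initial PI policy must be admissible (stabilizing); in a discounted finite problem any policy suffices.

```python
import numpy as np

# Illustrative finite MDP with known dynamics (an assumption made only for this sketch).
n, m, gamma = 4, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(n, m))      # P[s, a, s'] transition probabilities
c = rng.uniform(0.0, 1.0, size=(n, m))          # stage cost U(s, a)

def value_iteration(tol=1e-8):
    V = np.zeros(n)                             # VI: no admissibility requirement on the start
    while True:
        Q = c + gamma * P @ V                   # Q[s, a] = c(s, a) + gamma * E[V(s')]
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)
        V = V_new

def policy_iteration(pi0):
    pi = np.array(pi0)                          # PI: starts from an initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = c_pi exactly
        P_pi = P[np.arange(n), pi]
        c_pi = c[np.arange(n), pi]
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)
        # Policy improvement: act greedily with respect to the evaluated V
        pi_new = (c + gamma * P @ V).argmin(axis=1)
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new

V_vi, pi_vi = value_iteration()
V_pi, pi_pi = policy_iteration(pi0=[0] * n)
print(np.allclose(V_vi, V_pi, atol=1e-6), pi_vi, pi_pi)
```

Both routines reach the same optimal value and policy; the difference lies in how they get there, which mirrors the offline iterative schemes surveyed below.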

There have been many classical reviews and monographs [13], [16]–[19] that summarize and discuss ADP/RL. They have had a profound influence on and provided much inspiration for subsequent research. However, it is rare to find a paper that integrates the regulator problem, the tracking control problem, the multi-agent problem, the robustness of uncertain systems, and the event-triggered mechanism, especially with discussions of both the discrete-time and continuous-time cases. In this paper, we aim to discuss recent research progress on these problems, primarily focusing on discrete-time systems while supplementing some excellent work on continuous-time systems. Fig. 1 illustrates some of the key technologies in the ADP field involved in this paper. In order to promote the further development of ADP, this paper provides a comprehensive overview of theoretical research, algorithm implementation, and related applications. It covers the latest research advances and also analyzes and predicts future trends of ADP. This paper consists of the following parts: 1) basic background, 2) recent progress of ADP in the field of optimal control, 3) development of ADP in the event-triggered framework, 4) development of ADP in complex environments and the combination of ADP with other advanced control methods, 5) the impact of data-driven methods and RL on ADP technology, 6) typical control applications of ADP and RL, and 7) discussion of possible future directions for ADP.

Fig. 1. Taxonomy diagram of related methods in this survey.

II. OPTIMAL REGULATION AND TRACKING WITH ADP

Generally speaking, ADP-based algorithms can be performed offline or online. In this section, we focus on the problem of optimal regulation and optimal tracking control, with a detailed overview.

A. Offline Optimal Regulation With ADP

1) Discrete-Time Systems: Consider the following affine discrete-time nonlinear system:

x(k+1) = f(x(k)) + g(x(k))u(k)                                    (1)

TABLE I BASIC TERMS OF THE GENERAL ADP STRUCTURE

Fig. 2. The general structure of ADP.

Besides, action-dependent and goal-representation versions of these structures are also used. Taking HDP as an example, action-dependent HDP (ADHDP) [21] consists of three parts: the controlled object, the critic network, and the action network. It is capable of achieving optimal control without using system information. Compared with ADHDP, goal-representation HDP (GrHDP) [22] adds a goal network that is connected with the critic network and the action network. The goal network generates an internal reinforcement signal that provides more informative guidance for control, evaluation, and planning, and it improves the learning ability of the control system. In Table II, we compare the inputs and outputs of ADHDP and GrHDP; the internal reinforcement signal produced by the goal network and a discount factor in (0, 1) appear therein. On the basis of [20], the optimal control problem was solved for nonlinear systems with control constraints by DHP [23], [24]. To handle symmetric input constraints, Wang et al. [23] introduced a DHP framework involving a new nonquadratic performance index and used data-based methods to derive efficient system models. Then, for a class of nonlinear systems with asymmetric constraints, Wang et al. [24] defined an innovative nonquadratic function and expanded the application scope of the DHP framework. Besides, Xu et al. [25] introduced a control barrier function into the utility function. Unlike [23], [24], HDP was used to solve state-constrained optimal control problems in [25].
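As a point of reference, one common way the constrained-input ADP literature builds a nonquadratic performance index is to replace the quadratic control penalty with an integral of an inverse hyperbolic tangent, which grows steeply as the input approaches its bound. The scalar sketch below is illustrative only (the bound u_bar, the weight r, and the numerical quadrature are assumptions, not the exact index of [23] or [24]).

```python
import numpy as np
from scipy.integrate import quad

# One common nonquadratic utility for a symmetric input constraint |u| <= u_bar (scalar case).
u_bar, r = 2.0, 1.0

def W(u):
    """W(u) = 2 * integral_0^u u_bar * arctanh(s / u_bar) * r ds, finite only for |u| < u_bar."""
    val, _ = quad(lambda s: 2.0 * u_bar * np.arctanh(s / u_bar) * r, 0.0, u)
    return val

print([round(W(u), 4) for u in (0.0, 1.0, 1.9)])   # penalty grows sharply near the bound
```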

TABLE II COMPARISON OF THE INPUTS AND THE OUTPUTS BETWEEN ADHDP AND GRHDP

Compared with VI [20], the monotonicity of the iterative cost function sequence varies under the more general initial condition used in general VI (GVI) [26]. The GVI algorithm can be initialized by a positive semi-definite function V0(x) = xTΦx, where Φ is a positive semi-definite matrix. Furthermore, the iterative cost function was proved to satisfy the inequality

where 0 ≤ α ≤ 1, 1 ≤ β < ∞, and 1 ≤ θ < ∞.

Then, by using a novel convergence analysis, Wei et al. [27] proved the convergence and optimality of GVI. In addition, it was shown that the termination criterion in [20] could not guarantee the admissibility of the near-optimal control policy. The admissibility termination criterion was proposed as

Following [27], Ha et al. [28] presented a new admissibility condition of GVI described by

For the PI algorithm, each iteration alternates between a policy evaluation step and a policy improvement step.

In [31], the convergence and stability of PI were analyzed for the first time, where the initial admissible control law was obtained by trial and error. The iteration process ensures that all iterative control laws are stabilizing. Compared with [31], the admissible control law can be obtained more conveniently in [27] and [28]. Based on [31], Liu et al. [15] proposed generalized PI (GPI) for the optimal control of system (1), where the convergence and optimality properties were guaranteed. In essence, VI [20] and PI [31] are both special cases of GPI. Since many systems can only be locally stabilized, an invariant PI method was proposed by Zhu et al. [32] to address this regional limitation in discrete-time optimal control, where the suitable region for the new policy is updated at each iteration.

In addition, there are some works on the combination of VI and PI. In [33], Luo et al. introduced an adaptive method to solve the Bellman equation, which balances VI and PI by adding a balance factor. It is noted that the algorithm in [33] can accelerate the iterative process and does not need an initial admissible control law. To obtain stable iterative control policies, Heydari [34] proposed stabilizing VI, where the initial admissible control u0 was evaluated to implement VI. Based on [34], Ha et al. [28] developed an integrated VI method based on GVI, which was used to generate the admissible control law. In Table III, we summarize the initial conditions and monotonicity of GVI (V0 ≤ V1), GVI (V0 ≥ V1), stabilizing VI, and integrated VI. Integrated VI consists of GVI (V0 ≤ V1) and stabilizing VI, where GVI (V0 ≤ V1) provides the initial admissible control policy for stabilizing VI. Therefore, in the following table, integrated VI only represents the monotonicity of its core component.

TABLE III CLASSIFICATION OF VI ALGORITHMS

2) Continuous-Time Systems: Compared with discrete-time systems, most of the literature on continuous-time systems focuses on the PI method, even though the VI strategy can still be used, as in [35]. We consider the continuous-time system

ẋ(t) = f(x(t)) + g(x(t))u(t)                                    (10)

where x(t) ∈ Rn, u(t) ∈ Rm, f(·) ∈ Rn, and g(·) ∈ Rn×m represent the state vector, the control vector, the drift dynamics, and the input dynamics, respectively. Assume that f(0) = 0 and that the system can be stabilized on the operation region. For systems in the strict-feedback form with uncertain dynamics, Zargarzadeh et al. [36] utilized neural networks to estimate the cost function by using state measurements. In [37], a data-based continuous-time PI algorithm was proposed, where a critic-identifier was introduced to estimate the cost function and the Hamiltonian of the admissible policy. Differently from [38], the algorithm in [37] was used for continuous-time systems and did not require samples of the input and output trajectories of the system. Compared with [36], the method proposed in [37] can be extended to multicontroller systems. In addition, to relieve the computational burden, a novel distributed PI algorithm was established in [39], in which the iterative control policies can be updated one by one. The above works [35]–[39] all focus on time-invariant nonlinear systems. Moreover, V(·) and u(·) both depend on the system state. In [40], for time-varying nonlinear systems, Wei et al. developed a novel PI algorithm, where optimality and stability were discussed. It is worth noting that a large body of literature concentrates on the progress of VI algorithms, whose structures are similar to those of discrete-time systems. Bian and Jiang [41] extended VI to continuous-time nonlinear systems.
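In the linear-quadratic special case, model-based continuous-time PI reduces to Kleinman's algorithm: policy evaluation becomes a Lyapunov equation and policy improvement a gain update. The sketch below is illustrative only (the system matrices are assumptions; a stable A is chosen so that K0 = 0 is admissible), and it checks the result against the algebraic Riccati equation.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Illustrative continuous-time LQR data (matrices are assumptions for this sketch).
A = np.array([[0.0, 1.0], [-2.0, -3.0]])   # open-loop stable, so K0 = 0 is admissible
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

def kleinman_pi(K0, iters=20):
    """Model-based PI for LQR: evaluate the current gain, then improve it."""
    K = K0
    for _ in range(iters):
        Ac = A - B @ K
        # Policy evaluation: Ac^T P + P Ac = -(Q + K^T R K)
        P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
        # Policy improvement: K <- R^{-1} B^T P
        K = np.linalg.solve(R, B.T @ P)
    return P, K

P_pi, K_pi = kleinman_pi(K0=np.zeros((1, 2)))
P_are = solve_continuous_are(A, B, Q, R)
print(np.allclose(P_pi, P_are, atol=1e-8))  # PI converges to the ARE solution
```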

B. Online Optimal Regulation With ADP

1) Discrete-Time Systems: As mentioned in [34], unlike offline ADP, online ADP is implemented by selecting an initial control policy and improving it according to certain criteria until it converges to the optimal one. Note that the key difference between offline ADP and online ADP is that the control policy generated by offline ADP remains unchanged while the system is being controlled, whereas the control policy is updated online in online ADP.

First, consider online ADP for discrete-time systems. Usually, the optimal cost function and the optimal control policy are approximated by neural networks as

V∗(x(k)) = ϕcTσ(x(k)) + εc                                    (11)

and

u∗(x(k)) = θaTδ(x(k)) + εa                                    (12)

respectively, where ϕc and θa are the weight vectors of the target neural networks, εc and εa are the bias terms, and σ(·) and δ(·) are the activation function vectors. The optimal cost function is estimated by the critic network

V̂(x(k)) = ϕ̂cTσ(x(k))                                    (13)

and the optimal control policy is estimated by the action network

û(x(k)) = θ̂aTδ(x(k))                                    (14)

where ϕ̂c and θ̂a are the estimated values of ϕc and θa, respectively. Since V∗(x(k)) and u∗(x(k)) satisfy the HJB equation, we get

Substituting (13) and (14) into the HJB equation, we obtain

In the above research on online ADP, the approximate optimal control policy is updated by tuning the weights of the neural networks. Besides, the improved control law can be acquired by PI or VI. Since the iterative control policies obtained by PI are stabilizing, PI is widely used in online control. However, there are also some works on updating the control policy by VI. For example, in [34], Heydari proposed an online algorithm based on stabilizing VI, where the system was controlled under different iterative policies. In [14], combining the stability condition of GVI and the concept of the attraction domain, a novel online algorithm was introduced by Ha et al., where the current control law is chosen according to the location of the current state.
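The weight-tuning loop described above can be sketched with linear-in-parameter approximators in place of full neural networks. Everything below is an illustrative assumption (the dynamics f, the features sigma and delta, the discount factor, the learning rates, and the grid search standing in for the policy-improvement minimization); it only shows the shape of a critic update driven by the Bellman residual and an actor update regressing toward the improved control.

```python
import numpy as np

# Minimal online adaptive-critic sketch with linear-in-parameter approximators.
rng = np.random.default_rng(0)
gamma = 0.95                                    # discount factor

def sigma(x):                                   # critic features: V_hat(x) = wc @ sigma(x)
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x2])

def delta(x):                                   # actor features: u_hat(x) = wa @ delta(x)
    return np.array(x)

def f(x, u):                                    # assumed known affine dynamics (open-loop stable)
    x1, x2 = x
    return np.array([0.9 * x1 + 0.1 * x2, -0.2 * x1 + 0.8 * x2 + 0.5 * u])

def U(x, u):                                    # utility x^T x + u^2 (Q = I, R = 1)
    return float(x @ x + u * u)

wc, wa = np.zeros(3), np.zeros(2)               # critic and actor weights
lr_c, lr_a = 0.05, 0.05
u_grid = np.linspace(-2.0, 2.0, 81)             # crude stand-in for the minimization over u

for episode in range(300):
    x = rng.uniform(-1.0, 1.0, size=2)          # restarts provide persistent excitation
    for k in range(20):
        u = float(wa @ delta(x))
        x_next = f(x, u)
        # Critic: semi-gradient step on the Bellman residual e_c = V_hat(x) - (U + gamma*V_hat(x'))
        e_c = wc @ sigma(x) - (U(x, u) + gamma * wc @ sigma(x_next))
        wc -= lr_c * e_c * sigma(x)
        # Actor: regress toward the control minimizing U(x, v) + gamma * V_hat(f(x, v))
        u_star = u_grid[np.argmin([U(x, v) + gamma * wc @ sigma(f(x, v)) for v in u_grid])]
        wa -= lr_a * (u - u_star) * delta(x)
        x = x_next

print("critic weights:", wc, "actor weights:", wa)
```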

2) Continuous-Time Systems: For continuous-time systems, the principle of online ADP is similar to that of discrete-time systems. Here, we present some main progress on online methods with PI. For weakly coupled nonlinear systems, a data-based online learning algorithm was established by Li et al. [43], where the original optimal control problem of the weakly coupled system was transformed into three reduced-order optimal control problems. In [38], He et al. introduced a novel online PI method, where the technique of neural-network-based online linear differential inclusion was used for the first time. In addition, to solve the optimal synchronization of multi-agent systems, an off-policy RL algorithm was presented in [44], where dynamic models of the agents were not required.

C. Optimal Tracking Design With ADP

1) Discrete-Time Systems: With the development of aviation, navigation, and other fields in recent years, research interest in optimal tracking design has gradually increased within the control community. Here, we concentrate on the optimal tracking control problem. Denote the desired tracking trajectory by xd(k).

Considering the original system (1), the tracking error e(k) is described as

e(k) = x(k) − xd(k).

Assume that there exists a steady control ud(k) such that the following equation holds:

xd(k+1) = f(xd(k)) + g(xd(k))ud(k).

The objective of the optimal tracking control problem is to find the optimal control law ux(k), which forces the system output to track the reference trajectory. This can be achieved by minimizing the performance index or the cost function. Hence, the choice of the cost function is undoubtedly important. Generally, we choose the form of the cost function according to the control objective. Wang et al. [23] applied DHP to implement the tracking control design for nonaffine discrete-time systems, where a discount factor was considered. After that, actuator saturation was also considered in [24]. It is noted that the form of the utility function in [45]–[47] is given by

U(e(k), ue(k)) = eT(k)Qe(k) + ueT(k)Rue(k)                                    (20)

where ue(k) = ux(k) − ud(k). Since it is not convenient to calculate the reference control policy ud(k), some scholars choose other forms of the utility function. For example, Kiumarsi and Lewis [48] introduced a partially model-free ADP method. In this work, optimal tracking control of nonlinear systems with input constraints is achieved by using a discounted performance function based on the augmented system. In [49], Lin et al. proposed a policy gradient algorithm and used experience replay for optimal tracking design. They used Lyapunov's direct method to prove the uniform ultimate boundedness (UUB) of the closed-loop system. The utility function in [48] and [49] is described as

U(e(k), ux(k)) = eT(k)Qe(k) + uxT(k)Rux(k).                                    (21)

Even though the steady control is avoided in (21), the tracking error cannot eventually be eliminated. To deal with this problem, Li et al. [50] developed a novel utility function given by

The optimality of VI and PI under this utility function was analyzed. In addition, Ha et al. [51] also analyzed the system stability of the VI algorithm for the novel utility function with a discount factor.

2) Continuous-Time Systems: There are also a few works on continuous-time systems. In [52], Gao and Jiang solved the optimal output regulation problem by ADP and RL, where ADP was for the first time combined with the output regulation problem for adaptive optimal tracking control with disturbance attenuation. However, this approach requires partial knowledge of the system dynamics. To overcome this difficulty, the integral RL algorithm was introduced in [53] to achieve optimal online control, where off-policy integral RL was employed for the first time to obtain the optimal control feedback gain. In addition, differently from [52], the algorithm in [53] relieves the computational burden. Then, in [54], Fu et al. proposed a robust approximate optimal tracking method. In order to relax the assumption that the reference signal must be continuous in continuous-time systems, a new Lyapunov function was proposed that does not require the derivative information of the tracking error.

In particular, ADP also plays a pivotal role in the optimal control of linear systems, such as the linear quadratic regulation (LQR) and tracking problems [55]–[60]. Generally speaking, for the optimal control of nonlinear systems, the HJB equation is usually solved to acquire the optimal control policy. However, the linear system is a special case with good properties: the solution of the HJB equation can be transformed into the solution of the algebraic Riccati equation, so that the exact optimal control law can be obtained. In [56], Rizvi and Lin proposed an online Q-learning method based on output feedback to tackle the LQR problem. Wang et al. [57] developed an optimal LQR design based on the discounted VI algorithm and provided a series of criteria to judge the stability of the systems. In [58], the LQR problem was solved for continuous-time systems with unknown system dynamics and without an initial stabilizing strategy. The proposed controller was updated continuously by utilizing measurable input-output data to avoid instability. For the same class of uncertain systems, Rizvi and Lin [59] proposed a model-free static output feedback controller based on RL, which avoids the influence of the exploration bias problem. In addition, researchers also pay much attention to optimal tracking design for linear systems. For networked control systems with uncertain dynamics, Jiang et al. [60] developed a Q-learning algorithm to obtain the online optimal control policy based on measurable data with network-induced dropouts.
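As a concrete reference point for the linear special case, the discrete-time HJB equation collapses to a Riccati recursion, so VI can be run directly on the value matrix starting from P = 0. The sketch below uses illustrative matrices (not taken from the cited works) and checks the iterate against the algebraic Riccati solution.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative discrete-time LQR data (the matrices are assumptions for this sketch).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

def lqr_value_iteration(iters=500):
    """VI for LQR: the HJB reduces to a Riccati recursion; P = 0 is a valid start."""
    P = np.zeros((2, 2))
    for _ in range(iters):
        S = R + B.T @ P @ B
        P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(S, B.T @ P @ A)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal feedback u = -K x
    return P, K

P_vi, K_vi = lqr_value_iteration()
P_are = solve_discrete_are(A, B, Q, R)
print(np.allclose(P_vi, P_are, atol=1e-6))
```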

III. EVENT-TRIGGERED CONTROL WITH ADP

In this section, we mainly introduce the application of event-triggered technology under the ADP framework. It is discussed for discrete-time systems and continuous-time systems, respectively.

As an advanced aperiodic control method, event-triggered control plays a vital role in decreasing the computational burden and enhancing the resource utilization rate. In short, the purpose of introducing the event-triggered mechanism is to reduce the number of controller updates by decreasing the number of samples of the system state. Unlike the time-triggered control method, event-triggered control is designed with a triggering condition that must guarantee the stability of the controlled system. The control input is updated only when this triggering condition is violated. Otherwise, the zero-order hold keeps the control input unchanged until the next event is triggered.
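The mechanism can be summarized in a few lines of Python. The sketch below is purely illustrative (the linear dynamics, the feedback gain K, and the relative-threshold rule with parameter alpha are assumptions): the control is recomputed only when the gap between the current state and the last sampled state exceeds a threshold, and the zero-order hold keeps it constant otherwise.

```python
import numpy as np

# Schematic event-triggered loop; all numbers are illustrative assumptions.
A = np.array([[0.98, 0.1], [0.0, 0.95]])
B = np.array([[0.0], [0.1]])
K = np.array([[1.2, 2.0]])               # a stabilizing feedback gain (assumed given)

def triggered(x, x_held, alpha=0.3):
    """Trigger when the gap between the current and last-sampled state grows too large."""
    return np.linalg.norm(x - x_held) > alpha * np.linalg.norm(x)

x = np.array([1.0, -0.5])
x_held = x.copy()                        # state sample kept by the zero-order hold
u = -K @ x_held                          # control computed only at triggering instants
events = 0
for k in range(200):
    if triggered(x, x_held):
        x_held = x.copy()                # sample the state ...
        u = -K @ x_held                  # ... and update the control input
        events += 1
    x = A @ x + B @ u                    # between events the input is held constant
print(f"{events} controller updates over 200 steps")
```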

A. Event-Triggered Control for Discrete-Time Systems

For discrete-time systems, event-triggered technology has been widely used in the adaptive critic framework. In [61], Dong et al. used the event-triggered method to solve the optimal control problem under the HDP framework, and proved that the controlled system is asymptotically stable. An event-triggered near-optimal control algorithm was proposed for affine nonlinear dynamics with constrained inputs in [62]. In addition, a special cost function was introduced and the system stability was analyzed. In [63], a novel adaptive control approach with disturbance rejection was designed for linear discrete-time systems. In [64], Zhao et al. proposed a new event-driven method via direct HDP. Then, the UUB of the system states and of the weights in the control policy networks was proven. In [65] and [66], a novel event-triggered optimal tracking method was developed to control affine systems. It is worth noting that the triggering condition in these two works only acts on the time steps, and the weight-updating stage is not involved in the iterative process. For systems whose models are known, by using the event-triggered control approach, not only can the reference trajectory be tracked, but the computational burden can also be effectively reduced. In [67], Wang et al. proposed an event-based DHP method, where three kinds of neural networks were used to identify the nonlinear system, estimate the gradient of the cost function, and approximate the tracking control law. In addition, the stability of the event-based controlled system was proved by the theorem of input-to-state stability, and the control scheme was applied to a wastewater treatment simulation platform.

Then, the event-triggered error vector is defined as

ϑ(k) = x(kj) − x(k), kj ≤ k < kj+1

where kj denotes the jth triggering instant and the control input is held at u∗(x(kj)) between triggering instants.

The corresponding optimal control can be obtained by

Next, we introduce several triggering conditions commonly used in combination with adaptive critic methods.

1) Suppose there exists a positive number I satisfying

In addition, the inequality ‖ϑ(k+1)‖ ≤ ‖x(k+1)‖ holds. By referring to [61], [62], [67], a triggering condition was designed as

We can obtain different levels of triggering effect by appropriately adjusting the parameter I.

2) According to the updating method of the neural networks in [64], we assume that the activation function σa in the action network satisfies

‖σa(x1) − σa(x2)‖ ≤ P‖x1 − x2‖                                    (29)

for all x1, x2 ∈ X, where P is a positive constant and X is the domain of the system dynamics.

Lemma 1 [64]: Let (29) hold for the nonlinear system (1). Assume the triggering condition is defined as follows:

where 0 ≤ β < 1 and ŵa is the weight of the action network. λmin(·) and λmax(·) represent the minimal and maximal eigenvalues of a matrix, respectively. In addition, we make the action network learning rate satisfy

and the critic network learning rate satisfy

Then, we can conclude that the event-based control input guarantees the UUB of the controlled system.

3) The triggering condition described below can only be applied to the time-based case. For the iterative process, the traditional time-triggered method is adopted. In [65], [66], a triggering condition was defined as follows:

where

According to the results in [65], the adjustable parameter γ plays an essential role in event-triggered optimal control. If the main emphasis is on optimizing the cost function, γ should be chosen as small as possible. On the contrary, when considering resource utilization, γ should be chosen as large as possible. Therefore, the selection of γ should be determined according to actual needs.

B. Event-Triggered Control for Continuous-Time Systems

There are extensive studies of event-triggered control methods within the framework of ADP for continuous-time systems. In [68], Luo et al. designed an event-triggered optimal control method directly based on the solution of the HJB equation. In addition, the stability of the system and the lower bound on the inter-execution times were proved theoretically. In [69], for a class of nonlinear multi-agent systems, novel event-triggered and asynchronous edge-event-triggered mechanisms were designed for the leader and all edges, respectively. In [70], Huo et al. developed a decentralized event-triggered control method to aperiodically update each auxiliary subsystem. In [71], a different event-based decentralized control scheme was proposed, in which co-design strategies trade off control policies and triggering thresholds to simultaneously optimize subsystem performance and reduce the computational burden.

Considering the continuous-time nonlinear system (10), we assume f + gu to be Lipschitz continuous on a set Ω that contains the origin. We assume that there exists an admissible control ux(t), and the cost function is defined as

Next, the optimal control law under the time-triggered mechanism is defined as

The event-triggered mechanism is similar to that of discrete-time systems. Therefore, we define the state as

for all j ∈ N. The optimal control law under the event-triggered mechanism can be expressed as

For conventional event-triggered control, the design of triggering conditions is inevitable. Next, we introduce two triggering conditions under continuous-time environments.

1) This triggering condition is established based on a reasonable Lipschitz condition.

with α > 0 being a constant. It was then proved that the controlled system is asymptotically stable under this triggering condition.

The main purpose of the event-triggered technology is to reduce the waste of communication resources and improve computational efficiency. In recent years, networked control systems have attracted extensive attention. There is also an increasing amount of work aimed at reducing the energy consumption of network interfaces and ensuring the sustainability of networked control systems. Some related studies can be found in [73], [74].

IV. ROBUST CONTROL AND GAME DESIGN WITH ADP

In modern engineering systems, real control plants are always affected by variations arising from the system model, the external environment, and other factors. Hence, it is of great importance to obtain robust control strategies that counteract the influence of uncertainties. The robust control problem can be transformed into an optimal control problem, which is a useful route to obtaining the robust controller. However, for complex nonlinear systems, it is difficult to solve the optimal control problem, and the ADP method is utilized to deal with this dilemma. In this section, recent research progress of ADP is described, including the use of ADP to solve robust control, H∞ control, and multi-player game design problems. In addition, some other advanced control methods combined with ADP are summarized at the end of this section.

A. Robust Control Design With ADP

By utilizing ADP, robust controllers can be designed based on the obtained optimal control strategy. Compared to traditional methods, controllers guided by ADP can not only stabilize the system, but also optimize the performance of systems. The recent work on robust control is analyzed from both discrete-time and continuous-time aspects in this section.

1) Discrete-Time Systems: We consider a class of discrete-time nonlinear systems with uncertain terms as

Then, the corresponding discrete-time HJB equation (49) becomes

By choosing an appropriate utility function, robust stabilization was transformed into an optimal control problem for the nominal system [75]–[77]. In [76], the idea of solving the generalized HJB equation was employed to derive a robust control policy for discrete-time nonlinear systems subject to matched uncertainties, with a neural network used as the function approximator. In addition, Li et al. [77] proposed an adaptive interleaved RL algorithm to find the robust controller of discrete-time nonlinear systems subject to matched or mismatched uncertainties. An action-critic structure was constructed to implement the algorithm, and the convergence of the proposed algorithm and the UUB of the system were proved. An appropriate utility function was chosen as

Note that there is a new term βx(k) in the utility function compared with the traditional expression xT(k)Qx(k) + uxT(k)Rux(k). Tripathy et al. [78] introduced a virtual input to compensate for the effect of uncertainties. By defining a sufficient condition, the stabilizing control law of the mismatched system was derived, and the stability of the uncertain system was proved. The uncertainty can be decomposed into matched and mismatched components as

2) Continuous-Time Systems: For continuous-time nonlinear systems, the principle of robust control with ADP is similar to that of discrete-time systems. Considering uncertainties, the continuous-time nonlinear system is defined as

and the corresponding nominal system is defined as in (10). In order to obtain the optimal feedback control law, we need to minimize the cost function

where ρ > 0 and the utility function r(x, u) ≥ 0. Compared with the normal form, it is worth noting that the cost function (55) is modified to reflect matched uncertainties. We assume the control input u ∈ Ψ(Ω), where Ψ(Ω) is the set of admissible control laws on Ω. Then, the nonlinear Lyapunov equation can be expressed as

According to (56), we define the Hamiltonian as

Considering (56)–(58), the optimal cost function satisfies the HJB equation

ADP-based robust control schemes can be divided into the following categories: least-squares-based transformation methods [79], adaptive-critic-based transformation methods [80], data-based transformation methods [81], robust ADP methods [82], [83], and so on. In [84], Wang proposed an adaptive method based on the recurrent neural network to solve the robust control problem. A cost function with an additional utility term was defined to counteract the effect of perturbations on the system, and the stability of the relevant nominal system was proved, which further expanded the application scope of the ADP method. In [85], the robust control problem was transformed into an optimal tracking control problem by introducing an auxiliary system including a steady-state part and a transient part, and the stability of the transient tracking error was analyzed. Pang et al. [86] studied the robustness of PI for addressing the continuous-time infinite-horizon LQR problem.

B. H∞ Control Design With ADP

In H∞ control design, a control law is constructed for dynamical systems containing external disturbances and uncertainties. According to the principle of minimax optimality, the H∞ control problem is usually described as a two-player zero-sum differential game. In order to obtain the controller that minimizes the cost function in the worst case, we need to find the Nash equilibrium solution corresponding to the Hamilton-Jacobi-Isaacs (HJI) equation. However, for general nonlinear systems, it is hard to obtain the analytical solution of the HJI equation, which is similar to the difficulty encountered in solving the nonlinear optimal control problem. In recent years, ADP has been widely used for solving H∞ control problems.

1) Discrete-Time Systems: Consider the following discrete-time nonlinear system with external disturbances:

We define the cost function as follows:

In [87], [88], the H∞ tracking control problem was studied by using data-based ADP algorithms. Hou et al. [87] proposed an action-disturbance-critic structure to ensure that the minimum cost function and the optimal control policy were obtained. Liu et al. [88] transformed the time-delay optimal tracking control problem with disturbances into a zero-sum game problem, and an ADP-based H∞ tracking control method was proposed. A dual event-triggered constrained control scheme based on DHP [89] was used to solve the zero-sum game problem and was eventually applied to the F-16 aircraft system. A disturbance-based neural network was added to the action-critic structure by Zhong et al. [90]. They relaxed the requirement for system information by defining a new type of performance index. This approach extended the applicability of the ADP algorithm and was the first implementation of model-free globalized dual heuristic programming (GDHP).
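For intuition, in the linear-quadratic case the HJI equation reduces to a game algebraic Riccati equation, which can be iterated in VI fashion to obtain both the control gain and the worst-case disturbance gain. The following sketch is illustrative only: the system matrices, the disturbance channel E, and the attenuation level gamma are assumptions chosen so that the recursion converges, not values from the cited works.

```python
import numpy as np

# Linear-quadratic zero-sum sketch: the HJI equation collapses to a game Riccati recursion.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])             # control input channel
E = np.array([[0.1], [0.1]])             # disturbance input channel
Q, R, gamma = np.eye(2), np.array([[1.0]]), 5.0

def game_blocks(P):
    """Saddle-point linear system M [u; w] = -N x implied by the quadratic value x^T P x."""
    M = np.block([[R + B.T @ P @ B, B.T @ P @ E],
                  [E.T @ P @ B, E.T @ P @ E - gamma**2 * np.eye(1)]])
    N = np.vstack([B.T @ P @ A, E.T @ P @ A])
    return M, N

P = np.zeros((2, 2))
for _ in range(400):                     # VI-style iteration on the game Riccati equation
    M, N = game_blocks(P)
    P = Q + A.T @ P @ A - N.T @ np.linalg.solve(M, N)

M, N = game_blocks(P)
gains = np.linalg.solve(M, N)            # saddle point: [u; w] = -gains @ x
print("control gain:", gains[:1])
print("worst-case disturbance gain:", gains[1:])
```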

2) Continuous-Time Systems: Consider a class of continuous-time nonlinear systems with external disturbances

In practical applications, the exact system dynamics are often difficult to obtain, and identification methods can also produce unpredictable errors. For continuous-time unknown nonlinear zero-sum game problems, Zhu et al. [91] proposed an iterative ADP method that efficiently uses online data to train the neural networks. In [92], a novel distributed H∞ optimal tracking control scheme was designed for a class of physically interconnected large-scale nonlinear systems with strict-feedback form, external disturbances, and saturating actuators.

C. Game Design With ADP

Modern control systems are becoming more and more complex, with many decision makers who compete and cooperate with each other. As an essential theory for multiple participants to find optimal solutions, game theory is also increasingly studied in the field of control. According to the cooperation pattern among the players, games can be divided into zero-sum and nonzero-sum games, or non-cooperative and cooperative games. In a zero-sum game, the players do not cooperate. In a nonzero-sum game, however, there is a possibility of cooperation among the players so that each of them achieves high performance. Similarly, game theory can be combined with ADP techniques to solve optimal control problems. With the rapid development of iterative ADP, many new methods have emerged to deal with N-player games [21], [93]–[99].

1) Discrete-Time Systems: Consider a class of discrete-time systems with N players

The optimal cost functions are given as

which is known as the discrete-time HJB equation. Then, we can obtain the optimal control law

Zhang et al. [21] combined game theory and the PI algorithm to solve the multiplayer zero-sum game problem based on ADHDP. This method not only ensured system stability but also minimized the performance index function for each player. Song et al. [93] divided the off-policy N-coupled Hamilton-Jacobi (HJ) equations into an unknown-parameter part and a system operating data part. In this way, the HJ equations can be solved without the system dynamics. Therefore, this approach is very effective for solving multiplayer nonzero-sum game problems with unknown system dynamics. For the domain shift problem, Raghavan et al. [94] compensated for the optimal desired shift by constructing a zero-sum game and proposed a direct error-driven learning scheme.

2) Continuous-Time Systems: Consider the following continuous-time systems with N players:

where uj with j = 0, 1, ..., N represents the control input. Then, we define the cost function as

where Qk(x) is a positive definite function and Rkj represents a positive definite matrix with appropriate dimensions.

Assuming that the cost function is continuously differentiable, the Hamiltonian associated with the kth player is defined as

and the optimal control law can be obtained by

Inspired by zero-sum and nonzero-sum game theory, Lv and Ren [98] proposed a solution for multiplayer mixed zero-sum nonlinear games. They defined two value functions containing performance indicators for the zero-sum game and the nonzero-sum game, respectively. The optimal strategy of each player was obtained without using the action network, and the stability of the system was proved. In addition, Zhang et al. [99] developed a novel near-optimal control scheme for unknown nonlinear nonzero-sum differential games via the event-based ADP algorithm.

D. Other Advanced Control Methods With ADP

With the development of ADP technology, more and more advanced control methods have been improved. This section shows the application of ADP techniques in decentralized, distributed, and multi-agent systems. Meanwhile, research progress related to ADP/RL techniques in the field of model predictive control (MPC) is presented.

Modern control systems usually consist of several subsystems with essential interconnections. It is difficult to analyze such large-scale systems by using classical centralized control techniques. Therefore, decentralized or distributed control strategies are usually preferred, which decompose the task into optimal control problems for the individual subsystems. Yang et al. [100], [101] not only studied the decentralized stabilization problem subject to asymmetric constraints, but also transformed the decentralized control problem into a set of optimal control problems by introducing discounted cost functions in the auxiliary subsystems. Tong et al. [102] developed an adaptive fuzzy decentralized control method for optimal control problems of large-scale nonlinear systems in strict-feedback form. They proposed two controllers, i.e., a feedforward controller and a feedback controller, to ensure that the tracking error of the closed-loop system converges to a small range. Without using the dynamic matrices of all subsystems, Song et al. [103] developed a novel parallel PI algorithm to implement the decentralized sliding mode control scheme.

In [104], taking unknown discrete-time system dynamics into account, a local Q-function-based ADP method was introduced to address the optimal consensus control problem. Besides, a distributed PI technique was developed based on the defined local Q-function, which was proved to converge to the solutions of the coupled HJB equations. Fu et al. [105] developed a distributed optimal observer for a discrete-time nonlinear active leader with unknown dynamics. It is worth mentioning that the design of the distributed optimal observer based on ADP was developed via the action-critic framework. For continuous-time distributed systems, due to the limited transmission rate of communication channels and the limited bandwidth of some shared communication networks, time delay is an inescapable factor when dealing with the consensus problem. Therefore, in [106], for high-order integrator systems with matched external disturbances, the fixed-time leader-follower consensus problem was addressed by constructing a distributed observer.

Jiang et al. [107] estimated the leader's state and dynamics through an adaptive distributed observer, and used a model-state-input structure to solve the regulator equations of each follower. In addition, the stability of the system was analyzed independently. In [108], Sargolzaei et al. introduced a Lyapunov-based method, which mitigates false-data-injection attacks in real time for a centralized multi-agent system with additive disturbances and input delays. Besides, the condition of persistence of excitation is hard to verify. Huang et al. [109] redesigned the updating laws of the action and critic components to ensure the stability of the system by introducing persistence of excitation and additional constraints. In addition, the study of tracking control of multi-agent systems has attracted significant attention due to its broad range of applications. For example, Gao et al. [110] first integrated ADP with the internal model principle to investigate the problem of cooperative adaptive optimal tracking control. A distributed control policy based on the data-driven technique was put forward for the leader model with external disturbances, and the stability of the closed-loop system was also demonstrated.

MPC methods mainly solve optimal control problems with constraints [111]–[118]. There is a very similar theoretical scheme between ADP and MPC: the core of both methods is to solve the optimal control problem and obtain the corresponding control policy, and the control policy should be able to ensure stability. Therefore, the combination of MPC and ADP is a promising and important direction. In [112], Bertsekas pointed out the relationship between MPC and ADP, showing that the core ideas and mathematical essence of both are rooted in PI. Dual-mode MPC has been combined with the action-critic structure to improve performance and guarantee stability [113]. Based on these results, Hu et al. [114] introduced a model predictive ADP method for path planning of unmanned ground vehicles at road intersections. RL has been widely used in feedback control problems [115]. In general, closed-loop stability with MPC is guaranteed and various MPC strategies have been proposed. However, the performance of MPC and its stability guarantees are limited by the need for an accurate model of the system, which is difficult to obtain in real control systems. Generally, states and actions are continuous and it is almost impossible to represent them exactly, so function approximation tools must be used [116]. Several studies have combined the advantages of RL with MPC to solve optimal control problems, generating a new field [117]. Zanon and Gros [118] proposed a combination of RL and MPC to exploit the advantages of both methods and obtained an optimal and safe controller, while ensuring the robustness of MPC based on RL. Subsequently, data-driven MPC using RL has become an effective approach [119].

V. BOOSTING ADP VIA DATA UTILIZATION AND RL

The concept of RL appeared earlier than that of ADP. The work of the psychologist Skinner and his followers studied how animals learn to change their behaviors according to rewards and punishments. The latest work in the field of RL still uses the traditional reward “r” instead of the utility function “U”. RL emphasizes the immediate reward over a known utility function. Although the focus of ADP is different from that of RL and the two lines of work are relatively independent, the ideas behind many methods show that they have common roots. Werbos first combined RL with dynamic programming to build a framework that approximates the Bellman equation and proposed HDP in the 1970s. The original proposition of this approach was essentially the same as the formulation of temporal difference (TD) learning in RL [6]. Similarly, ADHDP and Q-learning both employ the state-action function to evaluate the current policy [10]. Overall, ADP/RL is a class of algorithms obtained by solving optimal control problems with approximation methods.

Markov decision process (MDP) is a mathematical framework for obtaining optimal decisions in stochastic dynamic systems. As a key theory of RL, almost all RL problems can be modeled as MDPs. In this paper, an MDP is denoted as the tuple

(S, A, P, R, γ)

where S is the state set of the environment, A is the action set, P is the state transition probability, R is the reward set, and γ ∈ (0, 1] is the discount factor. The agent (often called the controller in control theory) chooses actions to generate a trajectory sequence τ = {s0, a0, r0, s1, a1, r1, ...}. In RL, the goal is to find the optimal policy that maximizes the rewards (or minimizes the penalties) obtained by the agent while interacting with the environment.
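The following few lines show the same objects programmatically: a small MDP given by transition and reward arrays, a policy, and the generation of a trajectory τ together with its discounted return. The sizes, the random arrays, and the random policy are illustrative assumptions.

```python
import numpy as np

# Minimal MDP rollout matching the tuple (S, A, P, R, gamma); the numbers are illustrative.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
R = rng.uniform(0, 1, size=(n_states, n_actions))                  # reward R(s, a)

def rollout(policy, s0=0, horizon=5):
    """Generate tau = {s0, a0, r0, s1, a1, r1, ...} and its discounted return."""
    tau, ret, s = [], 0.0, s0
    for t in range(horizon):
        a = policy(s)
        r = R[s, a]
        tau += [s, a, r]
        ret += gamma**t * r
        s = rng.choice(n_states, p=P[s, a])     # sample s' ~ P(. | s, a)
    return tau, ret

tau, ret = rollout(policy=lambda s: rng.integers(n_actions))
print(tau, ret)
```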

In the early stages of ADP/RL, theoretical and algorithmic progress was slow due to the limitations of hardware facilities and system information. The development of system identification techniques has made it possible to model nonlinear systems using data-driven methods, thereby opening up a new era of research [6], [120]–[128]. In [6], Lewis and Liu illustrated the contribution of the stochastic encoder-decoder-predictor and principal component analysis to modeling the world through a brain-like approach, and emphasized the importance of neural networks. Some model-based approaches have shown promising results. Lee and Lee [122] referred to this type of method as J-learning (based on the value function). The Bellman optimality equation can be expressed as

Pang and Jiang [123] used a model-based method to discuss the robustness of PI for the LQR problem, and proposed an off-policy optimistic least-squares PI algorithm. They exploited the dynamical information of the system in the derivation process and incorporated stochastic perturbations. Lu et al. [124] demonstrated the stability of closed-loop systems using optimal parallel controllers with augmented performance index functions for tracking control. They extended practical problems to the virtual space through parallel system theory, and used methods such as neural networks to model systems and achieve optimal control.

However, this model-based learning approach can only be effective in the part of the state space covered by empirical information, and the calculated control actions and performance predictions are constrained by the amount of available information. Different from the model-based learning method, Q-learning, proposed by Watkins and Dayan [125], uses the Q function to represent the value of an action in the current state. This type of function already contains information about the system and the utility function. Compared with J-learning, it is easier to obtain control policies by using Q-learning, especially for unknown nonlinear systems. The Bellman optimality equation can be expressed as

Note that the above formula is described for deterministic systems. Li et al. [126] solved the optimal switching problem of autonomous subsystems and analyzed the boundedness of the approximation error in the iterative process. Jiang et al. [127] used Q-learning to improve the convergence speed of optimal policies for path planning and obstacle avoidance problems. In [95], a new off-policy model-free approach was used to study networked multi-player games; optimal control was achieved for systems with network-induced delays and the convergence of the algorithm was demonstrated. In addition, Peng et al. [96] proposed an internal reinforce Q-learning scheme, and analyzed the convergence and system stability of the iterative algorithm. Based on local information from neighbors, they designed a special internal reward signal to enhance the agent's ability to receive long-term information. The model-free idea applied in the field of control is only the tip of the iceberg.
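To make the Q-learning update concrete, the sketch below runs tabular Q-learning on a finite MDP such as the one sketched earlier (the transition array P and reward array R are passed in; the exploration rate, step size, and episode counts are illustrative assumptions). The target maximizes over the next action, which is what makes the method off-policy.

```python
import numpy as np

# Tabular Q-learning sketch; P[s, a, s'] and R[s, a] can be the arrays from the MDP sketch above.
def q_learning(P, R, gamma=0.9, episodes=2000, horizon=20, alpha=0.1, eps=0.1, seed=1):
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = rng.integers(n_states)
        for _ in range(horizon):
            # epsilon-greedy behaviour policy; the target below maximizes over a' (off-policy)
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next = rng.choice(n_states, p=P[s, a])
            td_target = R[s, a] + gamma * Q[s_next].max()
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q

# Example usage (with P, R from the earlier MDP sketch): Q_star = q_learning(P, R)
```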

TD learning is an RL approach that can learn directly from interaction with the environment without requiring complete trajectory sequences. Sarsa and Q-learning are two classic TD algorithms. Sarsa evaluates and improves the same policy that generates the data (on-policy), whereas Q-learning uses data sampled from other policies to improve the target policy (off-policy). Next, we introduce two acceleration methods that can be applied to TD algorithms.

The first method is experience replay, which is mainly used to overcome the problems of correlated data and non-stationary distributions and to improve data utilization efficiency [129], [130]. Pieters and Wiering [131] proposed an algorithm combining experience replay with Q-learning. The simulation results showed that the performance of the algorithm was significantly improved over the traditional Q-learning algorithm. The experience replay technique is not only used in Q-learning but can also be combined with other deep RL algorithms, achieving good performance in improving convergence speed and data utilization efficiency [132]. Many scholars in the control field have been inspired to combine ADP algorithms with experience replay to improve performance. For discrete-time nonlinear systems, Luo et al. [133] designed a model-free optimal tracking controller using policy gradient ADP with experience replay. It was realized with an action-critic structure, which was applied to approximate the iterative Q function and the iterative control policy. The convergence of the iterative algorithm was established through theoretical analysis.
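A minimal replay buffer is enough to convey the idea: transitions are stored once and later sampled in random minibatches, which breaks temporal correlation and lets each sample be reused many times. The capacity, batch size, and the tabular Q update below are illustrative assumptions, not the designs of the cited works.

```python
import random
from collections import deque

# A minimal replay buffer: store transitions once, reuse them in random minibatches.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def replay_q_update(Q, buffer, gamma=0.9, alpha=0.1):
    """One replayed update pass: apply the Q-learning target to a sampled minibatch."""
    for s, a, r, s_next in buffer.sample():
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```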

The second method is called eligibility traces. Traditional Q-learning uses only a one-step estimate; if more trace information is considered, updating the policy becomes more effective [134]. The eligibility traces method combines multi-step information to update unknown parameters. Eligibility traces were first introduced into the TD learning process to form an efficient learning algorithm named TD(λ) in [135]. Depending on the direction of the trace, there are forward-view and backward-view formulations. Although the expressions of the two formulations are different, they are intrinsically equivalent. In engineering, the backward view is generally adopted for convenience of calculation. Inspired by the RL field, many scholars have combined ADP with both forward-view and backward-view eligibility traces; compared with traditional ADP algorithms, the performance of these algorithms is significantly improved [136]. Al-Dabooni and Wunsch [137] proposed a forward-view ADHDP(λ) algorithm by combining ADHDP with eligibility traces and proved UUB under certain conditions. Ye et al. [138] proposed a more accurate and faster algorithm by introducing backward-view eligibility traces into GDHP. Meanwhile, the superiority in computational efficiency was verified by simulation analysis.
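The backward view can be stated compactly: every visited state keeps a decaying eligibility trace, and each one-step TD error updates all recently visited states in proportion to their traces. The sketch below evaluates a fixed policy with a tabular value function; the trace decay lam, the step size alpha, and the accumulating-trace choice are illustrative assumptions (lam = 0 recovers one-step TD).

```python
import numpy as np

# Backward-view TD(lambda) for policy evaluation with a tabular value function V.
def td_lambda_episode(V, transitions, gamma=0.9, lam=0.8, alpha=0.1):
    """transitions is a list of (s, r, s_next) tuples produced by the evaluated policy."""
    z = np.zeros_like(V)                     # eligibility trace, one entry per state
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s] # one-step TD error
        z *= gamma * lam                     # decay all traces ...
        z[s] += 1.0                          # ... and mark the visited state (accumulating trace)
        V += alpha * delta * z               # credit propagates backward along the trace
    return V

# Example usage: V = td_lambda_episode(np.zeros(3), [(0, 1.0, 1), (1, 0.5, 2), (2, 0.0, 0)])
```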

In addition, inverse RL [139], [140] has received extensive attention in academia in recent years. This theory is able to solve inverse problems in control systems, machine learning, and optimization. Unlike methods that directly map states to control inputs or use system identification to learn control policies, inverse RL methods attempt to reconstruct a more adaptive reward function. Such a reward function prevents small changes in the environment from making the policy unusable. Lian et al. [141] used the inverse RL method to solve the two-player zero-sum game problem, and established two algorithms according to whether the model is used or not. Overall, RL has achieved remarkable success on some complex problems [2]. RL has also attracted a lot of attention from the control point of view due to its model-free property and its interaction with real-world scenarios. With the application of RL algorithms in the control field, advanced methods based on learning and environmental interaction will demonstrate even more powerful capabilities in future work.

VI. TYPICAL APPLICATIONS OF ADP AND RL

Compared with other optimal control methods, ADP has significant advantages in dealing with complex nonlinear systems. Due to this strong capability, ADP is widely used in many fields such as wastewater treatment, smart power grids, intelligent transportation, aerospace, aircraft, robotics, and logistics.

A. Wastewater Treatment Applications

The control of the wastewater treatment process is a typical complex nonlinear control problem, and it is also one of the major challenges in the field of process control. The biochemical reaction mechanisms are very complex and accompanied by a large number of disturbances. Many factors can influence the effect of wastewater treatment, such as the dissolved oxygen concentration and the nitrate concentration. A large part of the research is verified on the Benchmark Simulation Model No. 1 (BSM1) platform. The goal of designing the controller is to reduce energy consumption and cost as much as possible while ensuring that the effluent quality meets the national discharge standard and that the plant operates stably. The design framework for control of wastewater treatment plants is shown in Fig. 3.

Fig. 3. The design framework for control of wastewater treatment plants.

Control of a single variable has been considered. For example, an online ADP scheme was proposed in [142] by using the echo state network as the function approximation tool, and high-performance control of the dissolved oxygen variable in the wastewater treatment plant was realized.

To improve the efficiency of wastewater treatment, many scholars consider both the dissolved oxygen and nitrogen concentrations. For example, Wang et al. [67] combined the DHP algorithm and the event-triggered mechanism to improve resource utilization and applied it to multi-variable tracking control of wastewater treatment. By using PI and the experience replay mechanism, Yang et al. [129] proposed a dynamic priority policy gradient ADP method and applied it to multi-variable control of wastewater treatment without the system model.

In the process of wastewater treatment, the setpoints of operating variables are generally determined by manual experience. Considering uncertain environments and disturbances, manual experience often struggles to adapt to different industrial conditions and to balance energy consumption and water quality during operation. Many scholars have therefore studied the optimization of the wastewater treatment process. Qiao et al. [143] developed an online optimization control method, which not only met the requirements on effluent water quality, but also reduced the operating cost of the system. For the dissolved oxygen setpoint, a model-free RL algorithm [144] was proposed that could learn autonomously and actively adjust the setpoint.

B. Power System Applications

The power system is a complex nonlinear plant with multiple variables. The emergence of the smart grid has opened up a new direction for power systems. Smart grid design covers renewable energy generation, transmission, storage, distribution, optimization of household appliances, and so on.

Recently, ADP/RL algorithms have been widely used in the field of the smart grid due to their advantages. An ADHDP method was applied to solve the residential energy scheduling problem [145], which effectively improved power consumption efficiency. For multi-battery energy storage systems with time-varying characteristics, a new ADP-based algorithm was proposed in [146]. The robust stabilization of mismatched nonlinear systems was achieved by combining auxiliary systems and policy learning techniques under dynamic uncertainties [83], and experimental verification was carried out on a power system. An adaptive optimal data-driven control method based on ADP/RL was presented for the three-phase grid-connected inverter of a virtual synchronous generator [147]. To ensure the stable operation of smart grids with load variations and multiple renewable generations, a robust intelligent algorithm was proposed in [148]. It utilized a neural identifier to reconstruct the unknown dynamical system and derived approximate optimal control and worst-case disturbance laws. Wang et al. [22] proposed an ADP method with augmented terms based on the GrHDP framework. They constructed new weight updating rules by adding adjustable parameters and successfully applied them to a large power system.

C. Other Applications

The ADP method has also been applied to other fields such as intelligent transportation [149], [150], robotics [7], [51], [127], [151], aerospace [152], [153], smart homes [154], [155], and cyber security [156]–[160], among others. Liu et al. [149] proposed a distributed computing method to implement switch-based ADP and verified the effectiveness of the method on two cases of urban traffic and architecture. The method divides the system into multiple agents, and to avoid switching policy conflicts, a heuristic algorithm was proposed based on consensus dynamics and Nash equilibrium. Wen et al. [151] combined ADP with RL to propose a direct online HDP approach for knee robot control and clinical application in human subjects. For the optimal attitude-tracking problem of hypersonic vehicles, Han et al. [152] and Zhao et al. [153] developed a novel PI algorithm and an observation-based RL framework, respectively, which ensure system stability in the presence of random disturbances. Wei et al. [154] proposed a deep RL method to control the air conditioning system by recognizing facial expression information to improve the work efficiency of employees. Hosseinloo et al. [155] established an event-based microclimate control algorithm to achieve an optimal balance between energy consumption and occupant comfort. With the widespread application of cyber-physical systems, their security issues have received wide attention. Nguyen and Reddi [156] provided a very comprehensive survey of RL technology routes for cyber security and discussed future research directions. For nonlinear discrete-time systems with event-triggered [157] and stochastic communication protocols [158], Wang et al. constructed different action-critic frameworks and discussed the boundedness of the error and the stability of the system based on Lyapunov theory, respectively. More and more ADP-based methods [159], [160] are focusing on improving cyber security. With the rapid development of ADP/RL, its applications will become even more extensive.

VII. SUMMARY AND PROSPECT

ADP and RL have made significant progress in theoretical research and practical applications, showing great potential for future tasks. This paper has explored the theoretical work and application scenarios by analyzing discrete-time and continuous-time systems, focusing on developing advanced intelligent learning and control. Given the complexity of current system environments and tasks, there are still many theoretical and algorithmic problems that have not yet been solved. Based on the present analysis of ADP, this paper concludes with several essential directions.

1) Most current ADP schemes assume that the function approximation process is exact. However, as the number of network layers and iterations increases, the approximation error caused by the function approximator cannot be ignored. In the actual iterative process, each step of the function approximator introduces an approximation error that propagates to the next iteration. In other words, these approximation errors may change in future iterations, leading to the emergence of a “resonance”-type phenomenon and affecting the reliability of the solution. Therefore, both theoretical studies and practical applications of ADP need to consider the convergence of ADP algorithms in the presence of approximation errors in policy evaluation and policy improvement.

2) The ADP approach currently addresses mostly systems with low-dimensional states and controls. There is still no effective solution for the high-dimensional, continuous state and control spaces of real complex systems. With the development of RL and even deep RL, optimal regulation and trajectory tracking for high-dimensional systems become possible using big data technology. It is important to propose ADP methods with fast convergence and low computational complexity by introducing different forms of relaxation factors.

3) It is of great importance to utilize advanced network technologies to decrease communication traffic and prolong device lifespan. The round-robin protocol, the try-once-discard protocol, the stochastic communication protocol, and the event-triggered protocol are essential in improving performance and saving resources. Based on these protocols, the combination of the ADP technology with decentralized control, robust control, and MPC is crucial in achieving optimal control while minimizing resource consumption.

4) In recent years, the study of brain science and brain-like intelligence has attracted significant interest from researchers worldwide. Optimality theory is closely related to the study of understanding brain intelligence. Most organisms in nature tend to conserve limited resources while achieving their goals optimally and in parallel. It is important to draw on brain-like intelligence to extend ADP and attain optimal decision-making and intelligent control of complex systems in an online manner. Ensuring the stability, convergence, optimality, and robustness of brain-like intelligence algorithms for ADP still requires the efforts of a large number of scholars.

5) The field of ADP has a wealth of results that can guide many systems, in a theoretical sense, toward achieving optimal objectives. In practice, however, for a large number of nonlinear systems, handling abrupt changes in control inputs and constructing dynamical models are extremely challenging. Parallel control can be seen as a virtual-reality interactive control method, which reconstructs the actual system based on real input and output data. By combining ADP with parallel control, control strategies for real physical systems will be greatly improved in the future.