A guidance method for coplanar orbital interception based on reinforcement learning


ZENG Xin,ZHU Yanwei,YANG Leping,and ZHANG Chengming

College of Aeronautics and Astronautics,National University of Defense Technology,Changsha 410073,China

Abstract: This paper investigates a guidance method based on reinforcement learning (RL) for coplanar orbital interception in a continuous low-thrust scenario. The problem is formulated as a Markov decision process (MDP) model, and a well-designed RL algorithm, experience based deep deterministic policy gradient (EBDDPG), is proposed to solve it. By taking advantage of prior information generated through the optimal control model, the proposed algorithm not only resolves the convergence problem of common RL algorithms, but also successfully trains an efficient deep neural network (DNN) controller for the chaser spacecraft to generate the control sequence. Numerical simulation results show that the proposed algorithm is feasible and that the trained DNN controller improves efficiency over traditional optimization methods by roughly two orders of magnitude.

Keywords: orbital interception, reinforcement learning (RL), Markov decision process (MDP), deep neural network (DNN).

1. Introduction

Recently, a growing number of countries and private companies have become involved in space, resulting in more congestion and competition. The technology of orbital rendezvous and interception has become an important symbol of the development level of space technology [1]. Generally, the optimal control sequence for interception is obtained by solving the corresponding optimal control problem, which is time consuming and therefore not efficient enough for online application. Hence, techniques to improve the efficiency should be further investigated. Fortunately, as artificial intelligence matures, an increasing number of works address the use of reinforcement learning (RL) in space, such as trajectory guidance [2−6], radar control [7,8], motion control [9], attitude control [10], maneuver detection [11] and landing problems [12]. The deep neural network (DNN) is expected to be an ideal way to improve the efficiency of obtaining the control sequence. This paper concerns the coplanar orbital interception problem and presents a well-designed algorithm, experience based deep deterministic policy gradient (EBDDPG). It can successfully train a DNN controller which improves efficiency by roughly two orders of magnitude compared with traditional optimization methods.

On the one hand, as a special orbital pursuit-evasion game, the orbital interception problem is attracting increasing attention from scholars in various fields [13−16]. Zeng et al. [17] considered long-distance orbital pursuit-evasion games and introduced a mixed global-local optimization strategy to improve efficiency. Ye et al. [18] investigated the proximity satellite pursuit-evasion game in which the pursuer carried three orthogonal thrusters while the evader had a single thruster, and introduced a heuristic searching method and the Newton method to solve the open-loop control. Then, the incomplete-information game was addressed in [19], and a currently optimal escape strategy based on estimation was proposed for the evader. Moreover, instead of the near-circular orbit, the elliptical orbital proximity operations differential game was investigated in [20].

On the other hand, as the most common form of orbital interception, coplanar orbital interception is essentially a trajectory planning problem involving guidance and control, and the original problem is usually converted into a two-point boundary value problem (TPBVP) that employs an optimization algorithm to obtain an optimal solution [21−23]. In [24], the hp-adaptive pseudospectral method was introduced to investigate optimal rendezvous and interception with elliptic initial and target orbits, which reduced the time consumption. Moreover, in [25], multiple targets were considered, and a design method for orbit interception based on traversing points simplified the interception and saved time and fuel. In [26], a homotopic strategy was applied based on optimal control theory to gain a speed advantage over traditional methods. By solving a one-sided optimal control problem to rapidly estimate the costates, Carr et al. [27] removed the global optimization from the traditional method and significantly improved the efficiency. However, optimal guidance with time-critical autonomous performance, which involves planning the trajectory online, is hardly developed. This limitation is due to the fact that obtaining an iterative solution for a given initial state through traditional optimization methods takes a long time.

RL involves learning from an environment in which the learner is not told which action to take but instead must discover which action yields the highest reward by trying those actions [28]. Recently, an increasing amount of RL work has involved space applications. The biggest advantage of RL is that the trained DNN can significantly improve the efficiency of the complex iterative processes of traditional optimization methods. A DNN was used to improve the efficiency of identification in [29]. The actor-critic RL algorithm was used to create a closed-loop guidance algorithm for planetary landing in [30,31]; it improves on classical feedback guidance in terms of time and can even be implemented directly online. Similarly, optimal integrated guidance for pinpoint Mars landing based on RL was investigated in [32]. By converting complex three-dimensional terrains into a reward matrix, path planning for hopping rovers on asteroid surfaces has been investigated with RL, and a DNN controller was developed to determine the optimal actions. Besides guidance, RL has also been used to design a flexible heuristic method that produces initial trajectory guesses for cis-lunar transportation missions [33]. However, little attention has been given to the orbital interception domain.

This paper explores the application of an RL algorithm to obtain a DNN controller for coplanar orbital interception. The rest of this paper is divided into four sections. In Section 2, the coplanar orbital interception is formulated as a Markov decision process (MDP) model, and an optimal control model is also introduced to generate the training set. Then, the EBDDPG algorithm consisting of three phases is presented in Section 3. Moreover, the numerical simulation is given in Section 4. Finally, the conclusion is drawn in Section 5.

2. Problem formulation

In this section, to apply the RL algorithm to this problem, the coplanar orbital interception is formulated as an MDP model. Moreover, to take advantage of the prior information, the optimal control model is also introduced to generate the training set.

The thrust-to-mass ratio is treated as a constant for the chaser spacecraft, whereas the target spacecraft has no maneuver capability. Both spacecraft are initially in circular orbits. The chaser spacecraft tries to approach the target spacecraft as quickly as possible. Hence, the objective function is simply

where t_f denotes the terminal time.
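The displayed objective was lost in extraction; a minimal reconstruction consistent with the minimum-time statement above, assuming a pure time-optimal cost with no fuel term, is:

```latex
% Reconstructed minimum-time objective (assumption: no fuel term in the cost)
J = \int_{0}^{t_f} \mathrm{d}t = t_f
```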

The Earth-centred inertial coplanar coordinate frame is shown in Fig. 1, in which the X axis points to the initial position of the chaser spacecraft and the Y axis lies in the coplanar plane, oriented so that the gradient of the azimuth of the target spacecraft is positive. The state variables for the chaser spacecraft are the radius r_c, velocity v_c, azimuth θ_c, and velocity azimuth A_c, while the state variables for the target spacecraft are the radius r_t and azimuth θ_t. The control variable for the chaser spacecraft is the thrust direction angle φ_c. The thrust magnitude and mass of the chaser spacecraft are T and m, respectively. The subscript c denotes the chaser spacecraft while t denotes the target spacecraft.

Fig.1 Coplanar coordinates,state variables,and control variables

The chaser spacecraft drives the state x of the dynamic system with the control variable u, and its objective is to arrive in the vicinity of the target spacecraft as quickly as possible. Hence, the dynamic system can be described as follows:

where

Then,the equations of motion for the dynamic system can be derived as follows:

The boundary condition Ψ should be

where r_cf and r_tf denote the terminal radius vectors of the chaser spacecraft and the target spacecraft, respectively. It means that the mission ultimately ends up with r_cf = r_tf. Hence, the boundary condition can be further written as

2.1 MDP model

Consider an MDP (X, U, P, R), where X is a set of states, U is a set of actions, P represents the transition probabilities, and R is the reward function. The set of states X matches the space of the state vector x, while the set of actions U matches the space of the control variable u.

Since the coplanar orbital interception follows the dynamic system, by introducing the time interval τ, the transition probabilities can be written as follows:

where

x ∈ X and x′ ∈ X denote the state at time t and t + τ, respectively, and u ∈ U is the control variable. P(x′ | x, u) gives the conditional probability of transitioning from state x to state x′ by taking action u.
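As a concrete illustration (not the authors' code), the deterministic transition kernel can be realized by propagating the dynamics over one interval τ. The sketch below assumes a hypothetical `dynamics(x, u)` returning the state derivative of the equations of motion, and uses a fixed-step Runge-Kutta integrator:

```python
import numpy as np

def rk4_step(dynamics, x, u, dt):
    """One classical Runge-Kutta step of the state equation x_dot = f(x, u)."""
    k1 = dynamics(x, u)
    k2 = dynamics(x + 0.5 * dt * k1, u)
    k3 = dynamics(x + 0.5 * dt * k2, u)
    k4 = dynamics(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def transition(dynamics, x, u, tau, substeps=10):
    """Deterministic MDP transition x -> x' obtained by integrating over tau.

    Because the dynamics are deterministic, P(x' | x, u) equals 1 for this x'
    and 0 elsewhere, which matches the transition kernel described above.
    """
    dt = tau / substeps
    for _ in range(substeps):
        x = rk4_step(dynamics, x, u, dt)
    return x
```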

Given the initial state x_0, an episode of the coplanar orbital interception ends when the distance d between the chaser spacecraft and the target spacecraft exceeds the presupposed maximum distance d_max or when the number of transitions exceeds the presupposed value m. An episode consisting of n transitions is shown as follows:

The coplanar orbital interception decision-making problem involves finding a strategy mapping from state x to action u that minimizes the sum of future rewards J as follows:

where 0 ≤ γ ≤ 1 is a discount factor that reduces the weight of rewards incurred further in the future, and r_i denotes the reward of transition i.
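For clarity, the discounted sum of future rewards can be accumulated backward over an episode. A minimal sketch is given below; the per-transition reward of −1 in the usage example is only illustrative, since the actual reward values come from the reward function defined next:

```python
def discounted_returns(rewards, gamma):
    """Return J_i = sum_{k >= i} gamma**(k - i) * r_k for every transition i."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

# Example: three transitions with reward -1 each and gamma = 0.99
print(discounted_returns([-1.0, -1.0, -1.0], 0.99))  # [-2.9701, -1.99, -1.0]
```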

Since the chaser spacecraft aims to approach the target spacecraft in minimum time, the reward function is defined as follows:

2.2 Optimal control model

The Hamiltonian H and the function of the terminal conditions Φ are introduced as follows:

where λ is the costate vector conjugate to the state function and ν is the costate vector conjugate to the boundary conditions.

For an open-loop representation of an optimal feedback strategy, the necessary conditions for the costate variables can be written as follows:

subject to

where i = 1, 2, ···, 6.

The costate function and the corresponding boundary of the costate variables for this mission can be derived as follows:

subject to

Moreover, the necessary boundary condition for the Hamiltonian H can be written as follows:

Furthermore,

For the chaser spacecraft, the optimal control variable u* must satisfy the following equations:

Hence, for the chaser spacecraft, the optimal control variable u* satisfies the following equations:

where (15) can be used to choose the correct control variable from (18) and (19).

Finally, (2), (3), (10), (11), (8), (13), (18) and (19) describe the coplanar orbital interception and constitute a TPBVP.

In conclusion, unlike the traditional optimal control model, there are no costate variables or Hamiltonian in the MDP model, which makes it simpler. Moreover, in the MDP model, the coplanar orbital interception is not transformed into a TPBVP that requires a time-consuming iterative optimization algorithm.

3. Algorithm

Generally, there is no requirement for prior information in RL. However, the environment of the coplanar orbital interception is more complex than traditional environments for RL (for example, the cart-pole, pendulum, or Atari environments). Applying common RL algorithms such as the deep deterministic policy gradient (DDPG) [34], advantage actor critic (A2C) [35], or trust region policy optimization (TRPO) [36] directly to this environment results in divergence.

To resolve the convergence problem, the EBDDPG algorithm, designed based on the DDPG algorithm [34], is introduced in this section. There are three phases in the EBDDPG: the preparation phase, the experience initialization phase and the gradient descent phase.

3.1 Preparation phase

The purpose of the preparation phase is to prepare the training set T and the validation set V for the following phases. Both sets consist of series of record sequences, each labeled with an initial state. A record consists of a state, an action and the sum of future rewards as follows:

where the record sequence is generated by solving the TPBVP of the coplanar orbital interception, and the sum of future rewards is calculated according to (6).
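A minimal sketch of how a solved trajectory might be turned into records is given below, using the `discounted_returns` helper sketched earlier; the field names of `Record` are illustrative rather than the paper's notation:

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class Record:
    state: Sequence[float]   # x at the start of the transition
    action: float            # thrust direction angle applied over the interval
    value: float             # discounted sum of future rewards from this state

def build_records(states, actions, rewards, gamma):
    """Convert one solved TPBVP trajectory into a sequence of records."""
    values = discounted_returns(rewards, gamma)
    return [Record(s, a, v) for s, a, v in zip(states, actions, values)]
```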

Consider a general nonlinear optimization problem of the following form:

where h is the objective function and p represents the optimization parameters. The initial costates and the terminal time are treated as optimization parameters as follows:

By introducing the weight coefficient vector k, the objective function h(p) can be constructed from the extended boundary conditions Ψ_ext as follows:

where

To solve the optimization problem stated above, the improved stochastic ranking evolution strategy (ISRES) [37] and the principal-axis method (PAM) [38] are introduced in this paper as the global and local optimization algorithms of the traditional optimization method, respectively.

The termination conditions for the global and local optimizations can be bounds on the number of function evaluations, the function value tolerance, and the parameter tolerance. Hence, by specifying the fractional tolerances δ_hr, δ_pr and the absolute tolerances δ_ha, δ_pa on the value h and the parameters p, respectively, the global and local optimizations halt when the number of function evaluations reaches the maximum N_e or when the optimum is found to be within the desired tolerance. The maximum number of function evaluations N_e, the fractional tolerances δ_hr, δ_pr and the absolute tolerances δ_ha, δ_pa are set as follows:
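A sketch of how such a global/local chain might be wired with the NLopt Python bindings, which expose ISRES as `GN_ISRES` and the principal-axis method as `LN_PRAXIS`; the objective `h`, the bounds, and the numeric tolerances below are placeholders rather than the paper's actual settings:

```python
import nlopt
import numpy as np

def solve_tpbvp(h, lower, upper, x0, maxeval=20000,
                ftol_rel=1e-8, ftol_abs=1e-10, xtol_rel=1e-8, xtol_abs=1e-10):
    """Global search with ISRES followed by a local principal-axis polish."""
    dim = len(x0)

    def objective(p, grad):
        # NLopt passes a gradient array even for derivative-free algorithms;
        # it is simply ignored here.
        return float(h(p))

    # Global stage: improved stochastic ranking evolution strategy.
    g = nlopt.opt(nlopt.GN_ISRES, dim)
    g.set_min_objective(objective)
    g.set_lower_bounds(lower)
    g.set_upper_bounds(upper)
    g.set_maxeval(maxeval)
    g.set_ftol_rel(ftol_rel)
    g.set_ftol_abs(ftol_abs)
    p_global = g.optimize(np.asarray(x0, dtype=float))

    # Local stage: principal-axis method started from the global result.
    l = nlopt.opt(nlopt.LN_PRAXIS, dim)
    l.set_min_objective(objective)
    l.set_lower_bounds(lower)
    l.set_upper_bounds(upper)
    l.set_maxeval(maxeval)
    l.set_xtol_rel(xtol_rel)
    l.set_xtol_abs([xtol_abs] * dim)
    return l.optimize(p_global)
```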

3.2 Experience initialization phase

The purpose of the experience initialization phase is to train and validate the DNNs with the training set T and the validation set V, respectively. As in the common DDPG algorithm, there are four DNNs: the policy critic network Q, the policy actor network μ, the target critic network Q′ and the target actor network μ′. The policy critic network and the target critic network are used to approximate the Q-function, which outputs the sum of future rewards obtained by taking an action in a given state, while the policy actor network and the target actor network are used to approximate the strategy μ, which is a function of the current state and outputs the action. By initializing these networks with the training set T and the validation set V in the experience initialization phase, the convergence of the algorithm is improved effectively.

Actually, the idea of the experience initialization phase is taken from the fuzzy inference system of fuzzy control [39], which provides the agent with prior information by fuzzy rules before training. Similarly, the training set T is treated as the rules to initialize the networks, while the validation set V is used to validate the initialization.

In this paper, the structures of the actor and critic networks are shown in Fig. 2. The actor network consists of seven linear layers with the ReLU activation function and one linear head layer with the tanh activation function. In contrast, the critic network consists of five linear layers with the ReLU activation function and one linear head layer without any activation function.

Fig.2 Structure of the actor and the critic networks
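A minimal PyTorch sketch of networks with this layer pattern is given below; the hidden width, the state dimension of 6 and the single-angle action are assumptions for illustration, not values reported in the paper:

```python
import torch
import torch.nn as nn

HIDDEN = 128              # hidden width: not reported in the paper, illustrative
STATE_DIM, ACTION_DIM = 6, 1

class Actor(nn.Module):
    """Seven ReLU linear layers plus a tanh head, following the Fig. 2 description."""
    def __init__(self):
        super().__init__()
        layers, in_dim = [], STATE_DIM
        for _ in range(7):
            layers += [nn.Linear(in_dim, HIDDEN), nn.ReLU()]
            in_dim = HIDDEN
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(HIDDEN, ACTION_DIM)

    def forward(self, x):
        # tanh keeps the output in [-1, 1]; it would be rescaled to the
        # admissible thrust-direction range outside the network.
        return torch.tanh(self.head(self.body(x)))

class Critic(nn.Module):
    """Five ReLU linear layers plus a linear head approximating Q(x, u)."""
    def __init__(self):
        super().__init__()
        layers, in_dim = [], STATE_DIM + ACTION_DIM
        for _ in range(5):
            layers += [nn.Linear(in_dim, HIDDEN), nn.ReLU()]
            in_dim = HIDDEN
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, x, u):
        return self.head(self.body(torch.cat([x, u], dim=-1)))
```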

Besides the DNNs, as shown in Fig. 3, the replay memory M is also an important part of the common DDPG algorithm. Similarly, a record replay memory M′ is introduced in this phase. Unlike the replay memory M, the record replay memory M′ is used to store records rather than transitions. The record replay memory M′ also provides "push" and "sample" interfaces. By randomly sampling records from the memory, the policy critic network can be initialized with the expected sum of future rewards, and the policy actor network can be initialized with the expected action.

Fig.3 Replay memory

During this phase, to evaluate the initialization, the loss of the policy critic network δ_c and the loss of the policy actor network δ_a are calculated frequently over all records in the validation set as follows:

At the end of the experience initialization phase, the parameters of the policy critic network and the policy actor network are copied to the target critic network and the target actor network, respectively.
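A sketch of this supervised pre-training step is shown below, assuming mean-squared-error losses against the recorded returns and actions (the exact definitions of δ_c and δ_a were lost in extraction) and a hypothetical `record_memory.sample` returning batched tensors:

```python
import torch
import torch.nn.functional as F

def initialize_from_records(actor, critic, record_memory, actor_opt, critic_opt,
                            batch_size=64, steps=10000):
    """Fit the policy networks to the optimal-control records before RL training."""
    for _ in range(steps):
        states, actions, values = record_memory.sample(batch_size)
        # Critic learns the recorded discounted sum of future rewards.
        critic_loss = F.mse_loss(critic(states, actions), values)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
        # Actor learns to reproduce the recorded (near-optimal) action.
        actor_loss = F.mse_loss(actor(states), actions)
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```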

3.3 Gradient descent phase

The purpose of the gradient descent phase is to train the DNNs. To stabilize and improve the training procedure, as shown in Fig. 3, a replay memory is introduced to store the transitions for reuse. By sampling randomly from the replay memory, a batch of transitions is decorrelated, which greatly stabilizes and improves the training procedure [34].

The replay memory is, in essence, a cyclic buffer of bounded size M which holds the transitions. The replay memory can also provide a random batch of N transitions for training.
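A minimal cyclic buffer with the push/sample interfaces described above might look as follows; the capacity M and batch size N are whatever the training loop chooses:

```python
import random
from collections import deque

class ReplayMemory:
    """Cyclic buffer of bounded size holding transitions (or records)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest items are overwritten

    def push(self, item):
        self.buffer.append(item)

    def sample(self, batch_size):
        # Uniform random sampling decorrelates the batch.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```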

In the common DDPG algorithm, the parameters of the DNNs are all randomly initialized and the soft-replacement rate is fixed, whereas in the gradient descent phase, the parameters of the networks have already been initialized in the experience initialization phase and the soft-replacement rate is updated according to the success rate.

During the gradient descent phase, a noise process N is introduced to construct an exploration policy

where

where j represents the number of elapsed steps and ϵ_d is the decay rate. As a result of the ϵ-greedy strategy, the influence of the noise decays as the steps proceed.
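A sketch of the exploration step is shown below; the exact decay law was lost in extraction, so the snippet assumes an exponential decay ϵ_j = ϵ_0 · ϵ_d^j, uniform noise (as stated in the simulation section), and illustrative numeric values and action bounds:

```python
import numpy as np

def exploratory_action(actor_output, step, eps0=1.0, eps_decay=0.999,
                       noise_scale=0.5, u_min=-np.pi, u_max=np.pi):
    """Add decaying exploration noise to the deterministic actor output.

    Only the idea that the noise influence shrinks with the elapsed steps j
    comes from the paper; the decay form and constants are assumptions.
    """
    eps = eps0 * (eps_decay ** step)
    noise = np.random.uniform(-noise_scale, noise_scale)
    return float(np.clip(actor_output + eps * noise, u_min, u_max))
```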

Since any Q-function obeys the Bellman equation for any strategy, define the temporal difference error δ as follows:

where

Define the Huber loss L as follows:

where

The Huber loss acts like the mean squared error when the error is small but like the mean absolute error when the error is large.It is used to train the policy critic networks.
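A sketch of one critic update built on the Bellman target and the Huber (smooth L1) loss is given below; masking terminal transitions with a `done` flag is an addition for completeness, not something stated in the text:

```python
import torch
import torch.nn.functional as F

def critic_update(batch, actor_t, critic, critic_t, critic_opt, gamma):
    """One gradient step on the policy critic using the Huber (smooth L1) loss."""
    x, u, r, x_next, done = batch
    with torch.no_grad():
        # Bellman target built from the target actor and target critic networks.
        u_next = actor_t(x_next)
        y = r + gamma * (1.0 - done) * critic_t(x_next, u_next)
    q = critic(x, u)
    loss = F.smooth_l1_loss(q, y)   # quadratic for small errors, linear for large
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```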

For the policy actor network, the output of the policy critic network is treated as the loss to be minimized, so the sampled gradient ∇, which is used to update the policy actor network, is defined as follows:

For the target critic network and the target actor network, their parameters are updated during the training by soft replacement rather than gradient descent as follows:

where α is the replacement rate.

The success rate is calculated from the final distance between the chaser spacecraft and the target spacecraft for each episode. By introducing the success rate, the replacement rate α is frequently assigned the value of the success rate during this phase.
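A sketch of the actor update and the soft replacement is shown below; tying α to the recent success rate is an interpretation of the paragraph above, and the batch layout matches the earlier critic-update sketch:

```python
import torch

def actor_update(batch, actor, critic, actor_opt):
    """Sampled policy-gradient step: maximize Q(x, mu(x)) over the batch."""
    x = batch[0]
    loss = -critic(x, actor(x)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()

def soft_replace(target_net, policy_net, alpha):
    """theta' <- alpha * theta + (1 - alpha) * theta'.

    Here alpha would follow the recent success rate, per the text above."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), policy_net.parameters()):
            p_t.mul_(1.0 - alpha).add_(alpha * p)
```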

4. Numerical simulation

The distance unit is U_d = R_e and the time unit is U_t = sqrt(R_e^3/μ_e), where R_e is the Earth radius and μ_e is the Earth planetary constant.
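For reference, these canonical units can be evaluated numerically, assuming the standard values of the Earth radius and gravitational parameter:

```python
import math

R_E = 6378.137e3          # Earth radius, m
MU_E = 3.986004418e14     # Earth gravitational parameter, m^3/s^2

U_D = R_E                          # distance unit
U_T = math.sqrt(R_E**3 / MU_E)     # time unit, approximately 806.8 s

print(f"U_d = {U_D:.0f} m, U_t = {U_T:.1f} s")
```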

The initial radii of the chaser spacecraft r_c0 and the target spacecraft r_t0 are as follows:

Their initial azimuths θ_c0 and θ_t0 are as follows:

The constant thrust-to-mass ratio for the chaser spacecraft is set as follows:

where g is the gravitational acceleration at the Earth's surface.

4.1 Training by EBDDPG

4.1.1 Preparation phase

Consider the following three cases, in which the initial states are all the same except for the initial azimuths, which differ from each other by the same amount. The initial difference of the azimuth Δθ_0 = θ_t0 − θ_c0 is the same in all three cases, i.e.,

Solve the three cases above using the algorithm stated in Section 2.2. The trajectory results are shown in Fig. 4, and a plot of the control variable versus time is shown in Fig. 5.

Fig.4 Trajectory results of three cases with equidifferent azimuths

Fig.5 Control variable vs.time for three cases with equidifferent azimuths

The results show that the shapes of the trajectories and the values of the control variables are almost the same. Hence, the result exhibits azimuth independence, which means that the control result depends only on the initial difference of the azimuth between the chaser spacecraft and the target spacecraft rather than on the azimuth itself.

As shown in Fig. 6, there is an optimized initial difference of the azimuth Δθ_0: the farther the initial difference of the azimuth is from this optimized value, the greater the time required for the chaser spacecraft to approach the target spacecraft.

Fig. 6 Normalized terminal time vs. initial difference of the azimuth

As a result, since the chaser spacecraft aims to approach the target spacecraft as quickly as possible in the coplanar orbital interception, by constraining the initial difference of the azimuth as follows:

the terminal time can be guaranteed as follows:

Apply the algorithm described in Section 3.1. By generating 2 701 linearly spaced points over this interval (endpoints included), the set V_i is obtained by sampling every 100th point from the 2 701 points and assigning the initial azimuth as follows:

Similarly, the set T_i consists of the remaining 2 673 points. Then, the validation set and the training set can be obtained by setting the discount factor as follows:
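A sketch of this split into V_i and T_i is shown below; the interval bounds are placeholders only, since the actual endpoints of the constrained azimuth-difference range are not recoverable from the extracted text:

```python
import numpy as np

# Interval bounds for the initial azimuth difference; placeholders only.
LO, HI = 0.0, np.pi

points = np.linspace(LO, HI, 2701)      # 2 701 linearly spaced points
val_idx = np.arange(0, 2701, 100)       # every 100th point -> 28 samples
val_set = points[val_idx]               # V_i
train_set = np.delete(points, val_idx)  # T_i: remaining 2 673 points

print(len(val_set), len(train_set))     # 28 2673
```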

4.1.2 Experience initialization phase

The sizes of the memory and the mini-batch are set as follows:

and the validation frequency is set to 200.

Apply the algorithm described in Section 3.2. The training process is shown in Fig. 7 and Fig. 8. As shown in the figures, the loss of the critic network and the loss of the actor network decay gradually as the episodes elapse and nearly converge to zero (the actor loss is not as smooth at the end, but it is sufficient for the following training), which means that the prior information has been injected into the DNNs successfully.

Fig.7 Critic loss vs.episode

Fig.8 Actor loss vs.episode

4.1.3 Gradient descent phase

The noise process N is defined as follows:

where N represents a uniform distribution.

The parameters of the ϵ-strategy are set as follows:

The maximum distance between the chaser spacecraft and the target spacecraft is set as follows:

and the game is successful when d ≤ d_max.

For the environment, the state of the environment will be set as "done" when d ≤ d_max or when the number of steps exceeds 250. In addition, for d ≤ d_max, the state of the environment will also be treated as "successful".

4.2 Comparison with traditional methods

4.2.1 Time comparison

To compare the efficiency of the trained DNN controller with that of the traditional optimization method, 100 cases are randomly generated and solved by each. In this simulation, the traditional optimization method is as stated in Subsection 3.2.

As shown in Fig. 9 and Table 1, regarding time consumption, the mean value μ and the standard deviation σ for the trained DNN controller are both smaller than those for the traditional optimization method, which means that the trained DNN controller is indeed much more efficient than the traditional optimization method. More precisely, the trained DNN controller improves the efficiency over the traditional optimization method by roughly two orders of magnitude in solving the coplanar orbital interception.

Table 1 Comparison on time consumption

Fig.9 Comparison on time consumption

4.2.2 Comparison in specific case

Consider the following case:

Solve this case with the trained DNN controller and with the traditional optimization method. The trajectory results and the control variable results are shown in Fig. 10 and Fig. 11, respectively.

Fig.10 Trajectory result

Fig.11 Control variable vs.time

As shown in Fig. 10, the trajectory obtained with the trained DNN controller is nearly the same as that obtained with the traditional method. However, as shown in Fig. 11, the control from the traditional method is smoother than that from the trained DNN controller, and it completes the interception in less time, although the difference is less than one percent.

4.3 Comparison with common RL methods

Apply the EBDDPG algorithm described in Section 3, the DDPG algorithm [34] and the actor-critic (AC) algorithm [35] to this environment. The optimizer is the Adam optimizer [40], and the learning rates for the actor and critic networks are 10^-4 and 10^-3, respectively. The success rate of the policy actor network, calculated every 200 episodes, is plotted against the episode in Fig. 12.

Fig.12 Success rate vs.episode

As shown in Fig. 12, with EBDDPG the success rate is approximately 20% at the beginning and increases to nearly 100% as the training proceeds. However, without prior information, the success rate remains around 20% as training proceeds with DDPG or AC. It is clear that, by taking advantage of the prior information, the convergence problem of common RL algorithms in the coplanar orbital interception can be successfully resolved.

Actually, as stated in Subsection 3.1, since the orbital interception environment is much more complex than the traditional environments designed for comparing RL algorithms (for example, the cart-pole environment), directly applying a common reinforcement learning algorithm such as DDPG or AC to this environment always encounters the convergence problem.

4.4 Monte Carlo simulation

To test the robustness of the proposed model, a Monte Carlo simulation is performed. The data of this scenario are the same as in Subsection 4.2.2, while the initial condition is dispersed as shown in Table 2 according to the measurement accuracy.

Table 2 Dispersion used in Monte Carlo simulation

Fig. 13 to Fig. 16 show N = 100 histories of the state elements of the chaser spacecraft in the Monte Carlo simulation. Moreover, Fig. 17 shows the dispersion of the final distance between the chaser spacecraft and the target spacecraft. The mean value μ and standard deviation σ of the final distance d_f are introduced to quantify the Monte Carlo simulation results as follows:

Fig.13 Radius of chaser vs.time t

Fig.14 Velocity of chaser vs.time t

Fig.15 Azimuth of chaser vs.time t

Fig.16 Velocity azimuth of chaser vs.time t

Fig.17 Dispersion of the final distance

As shown in Fig. 13, Fig. 14, and Fig. 16, the accumulated error grows significantly as time progresses for the radius, velocity and velocity azimuth, while there is little accumulated error for the azimuth, as shown in Fig. 15. Moreover, as shown in Fig. 17, the mean value of the final distance is almost the same for the trained DNN controller and the traditional optimization method, while the standard deviation for the trained DNN controller is much smaller than that for the traditional optimization method. This indicates that the trained DNN controller is more stable than the traditional optimization method. In other words, when noise is added, the final distance between the chaser and the target obtained with the trained DNN controller will be much smaller than that obtained with the traditional optimization method, which is better for the terminal guidance to guarantee the interception.

In conclusion, the trained DNN controller is much more efficient and more stable than the traditional optimization method, although it can only guarantee a near-optimal control. However, compared with the improvement in efficiency of roughly two orders of magnitude, the difference between the numerical results, less than one percent, is acceptable.

5. Conclusions

Concentrating on the coplanar orbital interception, this paper presents a well-designed RL algorithm. Some useful conclusions are drawn as follows:

(i) The application of the RL to the coplanar orbital interception is feasible;

(ii) By taking advantage of prior information, the convergence problem of common RL algorithms can be successfully resolved;

(iii) The control sequence generated by the trained DNN controller is obtained much more efficiently and is more stable than that obtained with traditional optimization methods. Though it is a near-optimal result, the difference is acceptable.

However,there are still some issues worthy of research:

(i) The method to smooth the control sequence requires further research;

(ii) The technology of adjusting the DNN controller online is worthy of exploration;

(iii) The method to handle the convergence problem of three-dimensional orbital interception needs to be investigated.

These issues will be addressed in subsequent research.