Fei Huang, Guangxia Li, Shiwei Tian, Jin Chen, Guangteng Fan, Jinghui Chang
1 College of Communications Engineering, Army Engineering University, Nanjing 210007, China
2 The National Innovation Institute of Defense Technology, Chinese Academy of Military Sciences, Beijing 100071, China
3 The Satellite Communication Center, Beijing 102300, China
Abstract: Unmanned aerial vehicles (UAVs) are increasingly considered for safe autonomous navigation systems that explore unknown environments, where the UAVs are equipped with multiple sensors to perceive their surroundings. However, achieving UAV-enabled data dissemination while simultaneously ensuring safe navigation is a new challenge. In this paper, our goal is to minimize the weighted sum of the UAV's task completion time while satisfying the data transmission task requirement and the UAV's feasible flight region constraints. This problem cannot be solved via standard optimization methods, mainly because a tractable and accurate system model is unavailable in practice. To overcome this difficulty, we propose a new solution approach that utilizes the state-of-the-art dueling double deep Q network (dueling DDQN) with multi-step learning. Specifically, to improve the algorithm, extra labels are added to the primitive states. Simulation results demonstrate the validity and performance superiority of the proposed algorithm under different data thresholds compared with two other benchmarks.
Keywords: Unmanned aerial vehicles (UAVs); safe autonomous navigation; unknown environments; data dissemination; dueling double deep Q network (dueling DDQN)
An unmanned aerial vehicle (UAV) is a type of powered aircraft that operates autonomously, without any pilot on board. Compared with manned aircraft, UAVs offer many advantages, such as smaller crews, lower cost, and higher maneuverability. These advantages give UAVs enormous potential in both civilian and military fields [1-3], such as geographic mapping, target surveillance [4], and rescue. Typical applications of the UAV include target tracking and instant communication support in emergency situations.
Wireless communication using UAVs acting as aerial base stations (BSs) is a promising technology that supports many important applications [5-8], for instance, ubiquitous coverage and on-demand data collection/transmission from/to randomly distributed devices in areas that lack adequate terrestrial communication infrastructure [9]. Besides, owing to their high altitude, UAVs have a greater probability of establishing Line-of-Sight (LoS) links to users on the ground than ground BSs do. Moreover, UAVs can be deployed quickly to provide reliable and cost-effective network access [10]. However, the environment in which UAVs perform emergency data transmission and reception tasks is usually unknown. In this case, safe navigation is the prerequisite for the UAV to carry out its data tasks successfully. Recently, highly autonomous trajectory planning and design of the UAV for radiation source tracking with obstacle avoidance (OA), namely safe navigation of the UAV in unknown environments, has attracted increasing attention [11-13].
Traditional trajectory planning methods, for instance, dynamic programming [14] and the artificial potential field method [15], formulate trajectory design as optimization problems under several constraints and rely heavily on prior knowledge of the environment, including map information and potential obstacle fields [16]. However, due to environmental noise and imperfect data, the resulting models still lack accuracy even when prior knowledge is available. Moreover, even minor changes in the planning setup, such as different starting and ending positions, require the models to be modified, which can incur huge overhead. In a completely unknown environment without any prior information, the ability to generalize from previous scenes to entirely new ones is therefore of great importance. An existing method such as simultaneous localization and mapping (SLAM) [17] is a reasonable choice, but SLAM is inefficient and time-consuming. Therefore, safe and autonomous navigation of the UAV in unknown environments remains a promising research field.
As can be seen from existing studies, researchers have extensively explored the navigation problem in unknown environments. However, there are few studies on safe navigation in emergency communication support scenarios. In fact, as mentioned above, safe navigation of the UAV itself is the prerequisite for successfully carrying out data transceiving tasks. At the same time, optimizing the UAV's navigation in unknown environments can further improve the efficiency of its data transmission and reception tasks. Therefore, safe navigation for UAV-enabled data dissemination in a completely unknown environment holds significant promise, which motivates this paper.
Research on safe navigation for UAV-enabled data dissemination is still in its infancy. On one hand, UAVs are increasingly used in engineering and safety applications to explore unknown environments, and over the past few years there have been many studies on autonomous UAV navigation in unknown flight environments [16, 18-22]. Specifically, the authors in [16] focused on an autonomous navigation system based on multi-sensor data fusion for unknown flight environments, where path planning and collision avoidance are considered. The authors in [18] investigated autonomous navigation of UAVs in forest environments, paying special attention to capabilities such as the detection and evasion of trees. The work [19] proposed a robust and efficient navigation solution, applied to autonomous UAV flight in a constrained but partially known indoor environment. The research in [20] presented a trajectory planning architecture in which fast replanning for reactive obstacle avoidance can be performed. The work in [21] proposed a novel planning and mapping framework applicable to agile flights in unknown environments. The literature [22] studied UAV obstacle avoidance in GPS-denied environments. It is worth mentioning that when UAVs need to perform certain tasks in completely or partially unknown environments, autonomous tracking of specific targets is critical. Thanks to the miniaturization of multi-functional sensors and communication hardware/software modules, UAVs are increasingly supported by various means of detection and tracking [23-30]. Thus, to achieve safe navigation and mission execution simultaneously in completely unknown circumstances, the UAV needs to approach the target region before executing the mission by perceiving the approximate positions of the target areas with these detection methods.
Technologies in which the UAV serves as an innovative tool to assist the ground cellular network in providing more and faster information services have been popular for a long time. In particular, much of the existing literature has investigated data dissemination in UAV-supported networks [9, 31-35]. Combining all of the above, the problem of safe navigation for UAV-enabled data dissemination in unknown environments arises. To solve this problem without prior information, we turn to deep reinforcement learning (DRL) [36-40]. The researchers in [36-38] proposed DRL-based methods that use UAVs to perform navigation missions in large-scale complex environments. The study in [39] leveraged the dueling double deep Q network (dueling DDQN) with multi-step learning for simultaneous navigation and radio mapping. The difference between [39] and our work is that we study an unknown environment in which the target devices need to be detected and approached safely by the UAV before it carries out the communication task. The work in [40] developed a multi-step dueling-DDQN learning algorithm for radiation source tracking, but it did not address communication tasks. Unlike all of the above, in this paper we investigate UAV-enabled data dissemination that also ensures safe navigation simultaneously.
In this paper, we investigate safe navigation of UAV-enabled data dissemination by deep reinforcement learning in unknown environments. The major contributions of this paper are highlighted as follows:
• We formulate a problem framework for a safe navigation system with UAV-enabled data dissemination, in which our aim is to minimize the weighted sum of the UAV's task completion time while guaranteeing that the data transmission task requirement and the UAV's feasible flight region constraints are met. In contrast to prior studies, this framework not only lets the UAV navigate safely to the destination area, including OA, but also lets it complete the data distribution mission to all target ground devices.
• We propose a multi-step dueling-DDQN learning algorithm for autonomous navigation and collision avoidance. Since the original problem couples the flight time with the data required by the ground devices, we first transform it into an MDP in order to apply the learning algorithm. Next, under the premise that the UAV's ultimate goal is to reach the destination area, the UAV first needs to complete data dissemination to all target ground devices. The entire mission process is thus divided into two phases, yielding two subproblems: first, the UAV closes in on the radiation sources and disseminates data; then, it explores and reaches the destination area. According to these stages of the entire task, we reformulate the optimization objectives and constraints of the subproblems.
• We further improve the algorithm: the original states, which contain only the possible positions of the UAV, are extended to states in which each possible position is labeled with the serial number of the target ground device. In-depth simulation experiments are provided under different data thresholds. Numerical results show the convergence behavior of the proposed algorithm, and the simulation results further demonstrate that its performance is superior to the comparison algorithms.
The remainder of this paper is organized as follows. We introduce the system model and the problem formulation in the next section. Then, the algorithm design is described in Section III. Next, simulation results are presented in detail in Section IV. Finally, conclusions are drawn in Section V.
As shown in Figure 1, we consider a safe navigation system in which a UAV is assigned to complete a data dissemination task simultaneously. The UAV, equipped with sensors, aims to approach several randomly distributed ground devices based on the perceived signal strength transmitted by those devices.
Figure 1. Scenario illustration of UAV-enabled data dissemination in air-ground network.
Figure 2. DDQN with the dueling network framework.
Without loss of generality, the range where the UAV is allowed to fly is a three-dimensional cube, which can be specified by $\mathcal{R}_U = [x_L, x_U] \times [y_L, y_U] \times [z_L, z_U]$, where the subscripts $L$ and $U$ represent the lower and upper bounds, respectively. For the sake of simplicity, the starting and destination locations of the UAV are $\mathbf{q}_I \in \mathbb{R}^{3\times 1}$ and $\mathbf{q}_F \in \mathbb{R}^{3\times 1}$, which are known in advance. Because frequently ascending or descending to avoid various buildings and complex terrain would consume a great deal of energy, the UAV is fixed at a constant altitude $H$. Due to the limited energy of the UAV, we consider a finite maximum flight time $T$. The navigation trajectory of the UAV at time $t$ is defined by $\mathbf{q}_U(t) \in \mathbb{R}^{3\times 1}$, $t \in [0, T]$. Then, the flight range constraint can be written as
$$\mathbf{q}_L \preceq \mathbf{q}_U(t) \preceq \mathbf{q}_U, \quad \forall t \in [0, T],$$
where $\mathbf{q}_L \triangleq [x_L; y_L; z_L]$ and $\mathbf{q}_U \triangleq [x_U; y_U; z_U]$.
The UAV disseminates data to $K$ fixed ground devices, which are randomly distributed on the ground and cannot communicate with each other. They are contained in the set $\mathcal{K} = \{1, 2, \ldots, k, \ldots, K\}$, and the 3D location of the $k$-th device is $\mathbf{w}_k = [x_k, y_k, 0]$, $\forall k \in \mathcal{K}$. It is worth noting that the locations of all ground devices are unknown to the UAV during the entire flight. The distance between the UAV and the $k$-th device at time $t$ is expressed as
$$d_k(t) = \|\mathbf{q}_U(t) - \mathbf{w}_k\|, \quad \forall k \in \mathcal{K}.$$
The channel propagation model adopted in this paper follows the literature [33, 35]. The channel gain between the UAV and the $k$-th device at time $t$ is modeled by $h_k(t)$. Furthermore, for the above-mentioned UAV-ground links, $h_k(t)$ is determined by the occurrence probabilities of LoS and Non-LoS (NLoS) links. Specifically, $h_k(t)$ can be expressed by [32, 33]
$$h_k(t) = \begin{cases} \mu_0 d_k^{-\alpha}(t), & \text{LoS link}, \\ \xi\,\mu_0 d_k^{-\alpha}(t), & \text{NLoS link}, \end{cases}$$
where $\mu_0$ represents the path loss at the unit distance $d_0 = 1$ m, $\alpha$ represents the path loss exponent with $\alpha \ge 2$, and $\xi$ is the additional attenuation parameter under the NLoS link condition. Generally, the probability of an LoS link depends on the UAV's flight environment and the distribution of green plants and surrounding buildings. One widely-used approach is to model the LoS probability between the UAV and the $k$-th ground device at time $t$ as a logistic function of the elevation angle, denoted by
$$P_{k,\mathrm{LoS}}(t) = \frac{1}{1 + b\,\exp\!\big(-c\,(\beta_k(t) - b)\big)}.$$
The parameters in the above expression are as follows: $\beta_k(t) = \frac{180}{\pi}\sin^{-1}\!\big(H/d_k(t)\big)$ is the elevation angle in degrees, and $b$ and $c$ are fixed parameters determined by the propagation environment. Obviously, in this paper, the probability of an LoS link depends on the UAV trajectory $\mathbf{q}_U(t)$.
As a consequence, the channel gain $h_k(t)$ is a random variable with two sources of randomness, namely the occurrence probabilities of the LoS and NLoS links. The expected channel power gain, obtained by averaging over these two probabilities, between the UAV and the $k$-th ground device at time $t$ is written as
$$\bar{h}_k(t) = P_{k,\mathrm{LoS}}(t)\,\mu_0 d_k^{-\alpha}(t) + P_{k,\mathrm{NLoS}}(t)\,\xi\,\mu_0 d_k^{-\alpha}(t),$$
where $P_{k,\mathrm{NLoS}}(t) = 1 - P_{k,\mathrm{LoS}}(t)$ is the occurrence probability of an NLoS link between the UAV and the $k$-th device at time $t$.
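To make the channel model concrete, the following Python sketch evaluates the expected channel power gain under the logistic LoS probability above. The function name and the default parameter values are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def expected_channel_gain(q_uav, w_k, H, mu0=1e-3, alpha=2.5, xi=0.2, b=10.0, c=0.6):
    """Sketch of the expected channel power gain; default values are
    placeholders for illustration, not the paper's tuned parameters."""
    d_k = np.linalg.norm(q_uav - w_k)                    # UAV-to-device distance (>= H by geometry)
    beta_k = np.degrees(np.arcsin(H / d_k))              # elevation angle in degrees
    p_los = 1.0 / (1.0 + b * np.exp(-c * (beta_k - b)))  # logistic LoS probability
    p_nlos = 1.0 - p_los
    g_los = mu0 * d_k ** (-alpha)                        # LoS path gain
    g_nlos = xi * g_los                                  # NLoS gain with extra attenuation xi
    return p_los * g_los + p_nlos * g_nlos
```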
Because the locations of the ground devices are unknown to the whole system, the UAV, functioning as a sensor, first aims to close in on those ground devices according to the respective received signal strengths. To distinguish the signals transmitted by the different ground devices, the frequency-division multiple access protocol is applied in this paper. We consider the total bandwidth of the system, denoted as $B$, which is equally divided into $K$ subcarriers allocated to the $K$ devices separately. The UAV cycles through the full spectrum to identify the different ground devices. Following the signal transmission model of [40, 41], the signal received by the UAV from the $k$-th device at time $t$ is denoted by
$$S_k(t) = \frac{p_0\,\bar{h}_k(t)}{n_0},$$
where $p_0$ is the transmission power of each ground device and $n_0$ is the noise received by the UAV on each ground device's subcarrier. The noise $n_0$ is written by
$$n_0 = \sigma^2 \frac{B}{K},$$
where $\sigma^2$ is the noise power spectral density.
Once the UAV senses that the signal strength of a ground device is greater than a certain threshold $S_\eta$, a relatively good communication link between the ground device and the UAV is established. Then the UAV begins to transmit information to that specific ground device.
Therefore, the expected information rate in bits per second (bps) received by the $k$-th device from the UAV at time $t$ is expressed by [33]
$$R_k(t) = \frac{B}{K}\log_2\!\left(1 + \frac{P_U\,\bar{h}_k(t)}{\Gamma\, n_0}\right),$$
where $\Gamma$ is the gap to the channel capacity caused by the actual modulation and coding condition, and $P_U$ is the transmission power of the UAV.
Hence, the total amount of information data received by each ground device over the whole period $T$ is expressed by
$$Q_k = \int_0^T R_k(t)\,\mathrm{d}t, \quad \forall k \in \mathcal{K}.$$
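As an illustration of the rate and data accumulation above, the sketch below computes per-step rates and the discretely accumulated data for a sequence of expected channel gains. The sampling interval dt and the helper name are assumptions for illustration, not values given by the paper.

```python
import numpy as np

def rate_and_data(gains, B=2e5, K=3, P_U=3.0, sigma2_dbm_hz=-169.0, Gamma=2.5, dt=0.5):
    """Sketch: per-device rate (bps) and total received data (bits) for a
    sequence of expected channel gains sampled every dt seconds."""
    n0 = 10 ** (sigma2_dbm_hz / 10) / 1000 * (B / K)     # subcarrier noise power in watts
    gains = np.asarray(gains)
    rates = (B / K) * np.log2(1.0 + P_U * gains / (Gamma * n0))  # rate at each time step
    total_bits = np.sum(rates) * dt                      # discrete approximation of the integral
    return rates, total_bits
```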
To complete the data dissemination mission in the least time, the UAV flies at its maximum speed $V_{\max}$. Its motion is written as
$$\dot{\mathbf{q}}_U(t) = V_{\max}\,\vec{v}(t), \quad \forall t \in [0, T],$$
in which $\vec{v}(t)$ is the unit vector of the UAV's flight direction, satisfying $\|\vec{v}(t)\| = 1$, $\forall t \in [0, T]$.
Because the elements in this environment are completely unknown, we cannot acquire the exact positions of obstacles such as various buildings in advance. For the purpose of formulation, we denote the positions of all obstacles by $\Psi$. To this end, a minimum safe distance $d_{\mathrm{safe}}$ between the UAV and all obstacles is imposed to avoid collisions while flying, denoted as
$$\min_{\mathbf{o} \in \Psi} \|\mathbf{q}_U(t) - \mathbf{o}\| \ge d_{\mathrm{safe}}, \quad \forall t \in [0, T].$$
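The safety constraint can be checked pointwise, as in the following sketch, where Psi is approximated by a finite set of sampled obstacle points; the default d_safe value is an assumption.

```python
import numpy as np

def is_safe(q_uav, obstacles, d_safe=5.0):
    """Sketch of the collision-avoidance constraint: the UAV position is
    feasible only if it keeps at least d_safe from every obstacle point.
    d_safe = 5 m is an assumed value for illustration."""
    obstacles = np.asarray(obstacles)                    # shape (num_points, 3), sampled from Psi
    distances = np.linalg.norm(obstacles - q_uav, axis=1)
    return np.min(distances) >= d_safe
```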
The goal of the UAV-enabled data dissemination system is for the UAV to fly to the target area in the least amount of time during safe navigation while satisfying the amount of information data each ground device needs to receive. The information threshold is represented as $\gamma_{th}$, and its value is the same for every ground device.
The investigated problem can be formulated mathematically as
$$\begin{aligned}
(\mathrm{P1}):\ & \min_{\mathbf{q}_U(t),\,\vec{v}(t),\,T}\ T & \text{(14a)}\\
\text{s.t.}\ & Q_k \ge \gamma_{th},\ \forall k \in \mathcal{K}, & \text{(14b)}\\
& \mathbf{q}_U(0) = \mathbf{q}_I,\ \mathbf{q}_U(T) = \mathbf{q}_F, & \text{(14c)}\\
& \mathbf{q}_L \preceq \mathbf{q}_U(t) \preceq \mathbf{q}_U,\ \forall t \in [0,T], & \text{(14d)}\\
& \dot{\mathbf{q}}_U(t) = V_{\max}\vec{v}(t),\ \forall t \in [0,T], & \text{(14e)}\\
& \|\vec{v}(t)\| = 1,\ \forall t \in [0,T], & \text{(14f)}\\
& \min_{\mathbf{o}\in\Psi}\|\mathbf{q}_U(t) - \mathbf{o}\| \ge d_{\mathrm{safe}},\ \forall t \in [0,T], & \text{(14g)}
\end{aligned}$$
where constraint (14b) guarantees that the minimum demand of received information data for each ground device is met. Constraint (14c) specifies the starting and ending positions of the UAV. Constraint (14d) ensures that the UAV's flight range does not exceed its feasible scope. Constraints (14e)-(14f) describe the flight direction and speed of the UAV. Constraint (14g) ensures safe navigation by preventing collisions. Since this problem is posed in an unknown environment, it is difficult to solve with traditional optimization methods.
2.2.1 Brief Introduction of Reinforcement Learning
Reinforcement learning (RL) has become increasingly popular and has recently been used in many applications. It is the most common machine learning method for handling the Markov decision process (MDP), which involves two crucial elements, namely the agent and the environment, which interact with each other iteratively in real time. Mathematically, a fully observable MDP can be expressed by a 4-element tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, in which $\mathcal{S}$ represents the state space including all the states that an agent might be in, and $\mathcal{A}$ represents the action space containing all possible actions that an agent can perform. $\mathcal{P}$ represents the state transition probability space, which indicates the probability of the agent transiting to the next state $s' \in \mathcal{S}$ from the current state $s \in \mathcal{S}$ after taking the action $a \in \mathcal{A}$. $R \in \mathcal{R}$ represents the immediate reward received by the agent once the action is taken. Thus, at each discrete time step $n$, when the agent in the state $S_n$ takes an action $A_n$, the agent receives an immediate reward $R_n$. Finally, the agent's state transits to the next state $S_{n+1}$.
The actions taken by the agent are determined by its policy $\pi$, in which $\pi(s,a) \in [0,1]$ gives the probability of an action being performed by the agent in the current state. The target of the agent is to progressively improve the effectiveness of the policy $\pi$ according to the prior experience gained by interacting with the environment. Furthermore, the system aims to maximize its long-term expected return $\upsilon_\pi$, named the value function, which is denoted as follows
$$\upsilon_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{n=0}^{\infty}\gamma^n R_n \,\Big|\, S_0 = s\right],$$
where $0 \le \gamma \le 1$ represents the discount factor of the total rewards.
Another important measure in reinforcement learning is the action value function, expressed by $Q_\pi(s,a)$. It evaluates the agent taking an action $a \in \mathcal{A}$ in a state $s \in \mathcal{S}$ and thereafter complying with the policy $\pi$, defined as follows
$$Q_\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{n=0}^{\infty}\gamma^n R_n \,\Big|\, S_0 = s, A_0 = a\right].$$
Moreover, at time step $n$, the optimal action value function $Q^*$ can be obtained directly in terms of
$$Q^*(S_n, A_n) = \mathbb{E}\!\left[R_n + \gamma \max_{a'} Q^*(S_{n+1}, a')\right].$$
The update of the $Q$-value is denoted by
$$Q(S_n, A_n) \leftarrow Q(S_n, A_n) + \phi\left[R_n + \gamma\max_{a'} Q(S_{n+1}, a') - Q(S_n, A_n)\right],$$
in which $\phi$ represents the learning rate.
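The tabular update above can be written compactly as in the following sketch; the learning rate and discount values are illustrative defaults.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, phi=0.1, gamma=0.95):
    """One tabular Q-learning step; phi is the learning rate and gamma
    the discount factor (values here are illustrative)."""
    td_target = r + gamma * np.max(Q[s_next])            # bootstrapped return estimate
    Q[s, a] += phi * (td_target - Q[s, a])               # move Q toward the TD target
    return Q
```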
2.2.2 Brief Introduction of Deep Reinforcement Learning
Since $Q$-learning solves only tabular problems, where each state-action pair has a corresponding $Q$-value, it is very limited when facing continuous states and actions, or when the sets of discretized states/actions become too large. To deal with such intractable problems, we can resort to function approximation techniques. One efficient non-linear function approximation is achieved through artificial neural networks (ANNs). The powerful framework of deep reinforcement learning (DRL) is derived from RL combined with function approximation based on deep ANNs. The action value function is then approximated as $Q(s,a) \approx \hat{Q}(s,a;\theta)$, where the parameter $\theta$ collects the weight coefficients of the ANN. The loss function with regard to the transition $\langle S_n, A_n, R_n, S_{n+1}\rangle$ at step $n$ is expressed by
$$L(\theta) = \left(R_n + \gamma \max_{a'} \hat{Q}(S_{n+1}, a'; \theta) - \hat{Q}(S_n, A_n; \theta)\right)^2. \tag{19}$$
In order to converge better and faster, a separate target network with parameters $\theta^{\dagger}$ is introduced, and the loss function (19) is modified into
$$L(\theta) = \left(R_n + \gamma \max_{a'} \hat{Q}(S_{n+1}, a'; \theta^{\dagger}) - \hat{Q}(S_n, A_n; \theta)\right)^2. \tag{20}$$
Another vital technique is experience replay, in which previously obtained transitions are stored in a buffer and multiple experiences are then sampled randomly for updating. Without it, the correlation of adjacent data would be too high due to continuous updates, so this technique helps to reduce the variance caused by successive correlated updates.
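A minimal sketch of such a replay buffer, assuming uniform sampling, is given below; the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal sketch of the experience replay buffer described above;
    the capacity and batch size are chosen by the training loop."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)             # old transitions are dropped automatically

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation of consecutive transitions
        return random.sample(self.buffer, batch_size)
```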
A big drawback of the DQN technique is the over-estimation bias caused by the maximum-based action selection of $Q$-learning. Many improved techniques have emerged to deal with this problem, such as double DQN (DDQN); in this paper, DDQN is applied together with the dueling network architecture.
Because the elements in this environment are unknown, for instance, the locations of the ground devices and obstacles, we turn to the DQN method. The premise of using the DQN technique to solve a practical issue is to reformulate the original problem as an MDP. With regard to the autonomous navigation and collision avoidance problem, since an MDP evolves over discrete time steps, the first thing we need to do is discretize the total time period $T$ into $N$ time steps with a certain interval $\delta_t$, in which $T = N\delta_t$, so that the UAV can be regarded as approximately static within each time step. As a result, the original problem (P1) can be transformed into problem (P2) in the following
$$\begin{aligned}
(\mathrm{P2}):\ & \min_{\{\mathbf{q}_U[n],\,\vec{v}[n]\},\,N}\ N \\
\text{s.t.}\ & \mathrm{C2.1}: \textstyle\sum_{n=0}^{N} R_k[n]\,\delta_t \ge \gamma_{th},\ \forall k \in \mathcal{K}, \\
& \mathrm{C2.2}: \mathbf{q}_U[0] = \mathbf{q}_I,\ \mathbf{q}_U[N] = \mathbf{q}_F, \\
& \mathrm{C2.3}: \mathbf{q}_L \preceq \mathbf{q}_U[n] \preceq \mathbf{q}_U,\ \forall n \in [0,N], \\
& \mathrm{C2.4}: \mathbf{q}_U[n+1] = \mathbf{q}_U[n] + \delta_t V_{\max}\vec{v}[n],\ \forall n \in [0,N], \\
& \mathrm{C2.5}: \min_{\mathbf{o}\in\Psi}\|\mathbf{q}_U[n] - \mathbf{o}\| \ge d_{\mathrm{safe}},\ \forall n \in [0,N].
\end{aligned}$$
Consequently, problem (P2) above can be handled by the DQN technique.
Then, we formulate problem (P2) as an MDP, so the 4-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$ can be specified concretely as below.
• State $\mathcal{S}$:
Traditionally, the state space is defined as all possible UAV positions within the feasible flight zone. However, the UAV needs to close in on each ground device to disseminate data under better channel quality, and the system performance differs even when the UAV is at the same position but tracking different ground devices. Thus, each possible position of the UAV must be labeled with the sequence number of the ground device. As a result, the state is defined by
$$\mathcal{S}: \{s_{n,k} = [\mathbf{q}_U[n], k],\ \forall k \in \mathcal{K},\ \forall n \in [0, N]\}.$$
• Action $\mathcal{A}$:
Since the UAV flies at maximum speed, the only quantity the UAV can control is the flying direction $\vec{v}[n]$. Therefore, the action space is denoted as $\mathcal{A}: \{\vec{v}_m,\ \forall m \in \mathcal{M}\}$, where the candidate directions are indexed by the set $\mathcal{M}$ introduced later.
• Transition probability $\mathcal{P}$:
The transition probability is determined by constraint C2.4: $\mathbf{q}_U[n+1] = \mathbf{q}_U[n] + \delta_t V_{\max}\vec{v}[n]$, $\forall n \in [0, N]$ in problem (P2).
• Reward $\mathcal{R}$:
The entire mission process is divided into two phases. Firstly, the UAV needs to close in on the radiation sources and disseminate data. Then, the UAV should explore and reach the destination area. When both phases are completed, the whole mission is accomplished. Therefore, the rewards need to be set according to the stage of the task.
For this problem, due to the change in our design of the state, problem (P2) is recast with the labeled states $s_{n,k}$ replacing the raw positions $\mathbf{q}_U[n]$, yielding problem (P3).
The most important difference between problem (P3) and problem (P2) is that all the states in problem (P3) are labeled with the sequence numbers of the target ground devices.
Because the locations of the ground devices are unknown, the UAV tracks and approaches them according to the strength of the signals it receives from each device. When the UAV is at the starting location, it cyclically searches the entire band, and then tracks toward the radiation source whose frequency band yields the strongest received signal. Hence, we consider the received signal strength as the main reward driving the UAV to approach the current target ground device. The purpose of approaching a ground device is to transmit information to it, and the channel quality between the UAV and the ground device is required to be good enough. That is, only if the distance between the UAV's ground-projected position and the ground device is less than $d_{\mathrm{trans}}$, namely, the signal strength received by the UAV is greater than a certain threshold $S_\eta$, does the UAV begin to disseminate data to the target ground device. Once the UAV has reached a location whose distance from the ground device is less than $d_{\mathrm{trans}}$, we give a reward $r_{\mathrm{goodtraj}}$. When the information data received by the ground device exceeds the threshold $\gamma_{th}$, the ground device stops transmitting its signal. To encourage the UAV to reach the destination area in the least amount of time, we assume that every time step taken by the UAV incurs a penalty, represented by $p_{\mathrm{step}}$. Likewise, to discourage the UAV from flying back to previously visited positions, the penalty $p_{\mathrm{again}}$ is introduced. As a result, we introduce an additional weight $\lambda$ that combines the time step with the signal strength received by the UAV. Besides, if the UAV flies outside the feasible area, we give a penalty $p_{\mathrm{outb}}$. Similarly, since the UAV is equipped with radar sensors, if the UAV flies to a position whose distance from an obstacle equals the minimum safe distance $d_{\mathrm{safe}}$, the UAV will not approach the obstacle any further, and a penalty $p_{\mathrm{obst}}$ is given.
In short, the reward function design is summarized as
$$R_n = \lambda S_k[n] - p_{\mathrm{step}} + r_{\mathrm{goodtraj}}\,\mathbb{1}\{d_k[n] \le d_{\mathrm{trans}}\} - p_{\mathrm{again}}\,\mathbb{1}\{\text{revisit}\} - p_{\mathrm{outb}}\,\mathbb{1}\{\text{outbound}\} - p_{\mathrm{obst}}\,\mathbb{1}\{\text{obstacle}\}. \tag{23}$$
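A possible sketch of this phase-1 reward shaping is given below; all numerical values and the helper name are hypothetical and merely illustrate the structure described above, not the paper's tuned settings.

```python
def phase1_reward(signal_strength, reached_device, revisited, out_of_bounds, near_obstacle,
                  lam=0.01, p_step=1.0, r_goodtraj=100.0, p_again=10.0, p_outb=50.0, p_obst=50.0):
    """Illustrative sketch of the phase-1 reward; all numeric values
    are assumptions, not the paper's settings."""
    r = lam * signal_strength - p_step                   # weighted signal reward minus per-step penalty
    if reached_device:
        r += r_goodtraj                                  # bonus for closing within d_trans
    if revisited:
        r -= p_again                                     # discourage revisiting old positions
    if out_of_bounds:
        r -= p_outb                                      # penalty for leaving the feasible region
    if near_obstacle:
        r -= p_obst                                      # penalty for hitting the safety margin
    return r
```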
Based on the above discussion, the subproblem in this stage can be decomposed as
$$\begin{aligned}
(\mathrm{P3.1}):\ & \max_{\{\vec{v}[n]\}}\ \textstyle\sum_{n} R_n \\
\text{s.t.}\ & \mathrm{C2.1}\text{-}\mathrm{C2.5},
\end{aligned}$$
where the rewards $R_n$ follow the phase-1 design in Eq. (23).
When the UAV has traced every device on the ground and completed all the data dissemination tasks, the UAV's first-phase mission is accomplished.
The ultimate goal of the UAV is to fly to the destination area, so in the second stage of the whole task the UAV needs to steer itself toward the destination area according to the distance between its current position and the final location. This distance is denoted as
$$d_{UF}[n] = \|\mathbf{q}_U[n] - \mathbf{q}_F\|.$$
When the UAV arrives near the destination, it is considered to have reached the target area, satisfying $d_{UF}[n] \le d_{\mathrm{dest}}$, $\forall n \in [0, N]$, where $d_{\mathrm{dest}}$ represents the radius of the target area. Once the UAV reaches the target area, a reward $r_{\mathrm{dest}}$ is given:
$$R_n = r_{\mathrm{dest}},\ \text{if } d_{UF}[n] \le d_{\mathrm{dest}}. \tag{26}$$
In accordance with the above analysis, the subproblem in the second stage can be decomposed as
$$\begin{aligned}
(\mathrm{P3.2}):\ & \max_{\{\vec{v}[n]\}}\ \textstyle\sum_{n} R_n \\
\text{s.t.}\ & \mathrm{C2.2}\text{-}\mathrm{C2.5},
\end{aligned}$$
with the rewards following the phase-2 design in Eq. (26).
If the UAV has reached the target area successfully, the whole task is completed and the problem is finally solved.
In response to this problem of autonomous navigation and collision avoidance in unknown environments, we divide the action space equally into $M$ values while keeping the state space continuous, where the $M$ values are contained in the set $\mathcal{M}$.
Algorithm 1. Dueling DDQN multi-step learning for safe navigation of UAV-enabled data dissemination.
1: Initialize: the maximum number of episodes $N_{\mathrm{epi}}^{\max}$, the maximum number of steps per episode $N_{\mathrm{step}}^{\max}$, the initial exploration $\varepsilon_0$, the decaying rate $\gamma$, the proximity-to-ground-device reward $r_{\mathrm{goodtraj}}$, the reach-destination-area reward $r_{\mathrm{dest}}$, the outbound penalty $p_{\mathrm{outb}}$, the repeated-position penalty $p_{\mathrm{again}}$, and the obstacle penalty $p_{\mathrm{obst}}$; the minimum safe distance $d_{\mathrm{safe}}$ and the radius of the target area $d_{\mathrm{dest}}$; the replay memory queue $D$ with capacity $C$, the initial state $[\mathbf{q}_U[0], 1]$ and the final state $[\mathbf{q}_U[N], K+1]$, $S_\eta$, $\gamma_{th}$, $\theta^{\dagger} \leftarrow \theta$, $\varepsilon \leftarrow \varepsilon_0$.
2: For $n_{\mathrm{epi}} = 1, \ldots, N_{\mathrm{epi}}^{\max}$ do
3: Initialize a slide window queue $W$ with capacity $N_0$, and initialize the time step $n \leftarrow 0$.
4: Repeat
5: Select action $\vec{v}_{m^*}$ from $\mathcal{A}$ according to the $\varepsilon$-greedy policy: a random $m \in \mathcal{M}$ with probability $\varepsilon$, and $m^* = \arg\max \hat{Q}(s_n, \vec{v}; \theta)$ with probability $1-\varepsilon$.
6: Perform action $\vec{v}_n$, obtain the next state $s_{n+1}$ and the reward $R_n$ according to Eq. (23) and Eq. (26).
7: Store the transition $(s_n, \vec{v}_n, R_n, s_{n+1})$ in the slide window queue $W$.
8: If $n \ge N_0$, calculate the $N_0$-step accumulated return $R_{(n-N_0):n}$ based on Eq. (29) using the transitions stored in $W$, and store the $N_0$-step transition $(s_{n-N_0}, \vec{v}_{n-N_0}, R_{(n-N_0):n}, s_n)$ in the replay memory $D$.
9: Sample a random mini-batch of $N_0$-step transitions $(s_j, \vec{v}_j, R_{j:j+N_0}, s_{j+N_0})$ from $D$.
10: In the phase of closing in on the radiation sources and data dissemination, let $y_j = R_{j:j+N_0} + r_{\mathrm{goodtraj}}$ if $s \le d_{\mathrm{trans}}$, and $y_j = R_{j:j+N_0} + \gamma^{N_0} Q(s_{j+N_0}, \vec{v}^*; \theta^{\dagger})$ otherwise. In the phase of exploring to reach the destination area, let $y_j = R_{j:j+N_0} + r_{\mathrm{dest}}$ if $s \le d_{\mathrm{dest}}$, and $y_j = R_{j:j+N_0} + \gamma^{N_0} Q(s_{j+N_0}, \vec{v}^*; \theta^{\dagger})$ otherwise, where $\vec{v}^* = \arg\max_{\vec{v}' \in \mathcal{A}} Q(s_{j+N_0}, \vec{v}'; \theta)$.
11: Execute a gradient descent step on $(y_j - \hat{Q}(s_j, \vec{v}_j; \theta))^2$ with regard to the network parameters $\theta$.
12: Until $s[n] = [\mathbf{q}_U[N], K+1]$ or $n = N_{\mathrm{step}}^{\max}$.
13: After every $G$ episodes, update the target network parameters $\theta^{\dagger} \leftarrow \theta$.
14: end for
To estimate the Q-values more accurately, the dueling network architecture splits the last hidden layer into two streams, combined as
$$\hat{Q}(s, a; \theta, \varphi, \chi) = V(s; \theta, \varphi) + \left(A(s, a; \theta, \chi) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \chi)\right),$$
where $V$ represents the value function and $A$ represents the advantage function. Then, $\varphi$ and $\chi$ separately represent the parameters of the two streams of the hidden layers.
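A minimal sketch of this aggregation, assuming the mean-subtracted form above, is given below.

```python
import numpy as np

def dueling_q(v_value, advantages):
    """Sketch of the dueling aggregation: combine the scalar state value V
    and the per-action advantages A into Q-values, with the mean-advantage
    subtraction that makes the decomposition identifiable."""
    advantages = np.asarray(advantages)
    return v_value + (advantages - advantages.mean())
```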
Based on the state $s_{n,k} = [\mathbf{q}_U[n], k]$, $\forall k \in \mathcal{K}$ at time step $n$, the next $N_0$-step rewards are accumulated as
$$R_{n:n+N_0} = \sum_{i=0}^{N_0-1} \gamma^i R_{n+i}. \tag{29}$$
Then, the loss of an $N_0$-step DQN learning can be defined as
$$L(\theta) = \left(R_{n:n+N_0} + \gamma^{N_0}\max_{a'} \hat{Q}(s_{n+N_0}, a'; \theta^{\dagger}) - \hat{Q}(s_{n,k}, a_n; \theta)\right)^2.$$
Besides, double DQN learning is adopted in this paper, as applied in [43]; it settles the problem of overestimating action values in DRL. Finally, the loss of an $N_0$-step DDQN learning can be redefined as
$$L(\theta) = \left(R_{n:n+N_0} + \gamma^{N_0}\hat{Q}(s_{n+N_0}, a^*; \theta^{\dagger}) - \hat{Q}(s_{n,k}, a_n; \theta)\right)^2,$$
where
$$a^* = \arg\max_{a'} \hat{Q}(s_{n+N_0}, a'; \theta).$$
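Combining the multi-step return and the double-Q evaluation above, the N0-step DDQN target can be sketched as follows; the function is illustrative and assumes the per-step rewards from the slide window are already available.

```python
import numpy as np

def n_step_ddqn_target(rewards, q_main_next, q_target_next, gamma=0.95):
    """Sketch of the N0-step DDQN target: the main network selects the
    best next action, the target network evaluates it."""
    n0 = len(rewards)
    n_step_return = sum(gamma ** i * r for i, r in enumerate(rewards))   # N0-step accumulated return
    a_star = int(np.argmax(q_main_next))                 # action selection by the main network
    return n_step_return + gamma ** n0 * q_target_next[a_star]           # evaluation by the target network
```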
The proposed dueling-DDQN multi-step learning algorithm for safe navigation of UAV-enabled data dissemination is summarized in Algorithm 1. Steps 2 to 12 make up an episode, in which step 5 indicates the action selection strategy according to the $\varepsilon$-greedy policy, step 8 performs the calculation of the $N_0$-step accumulated return, step 10 obtains the target $Q$-value of the main network based on the returns of the two different stages, and step 13 updates the target network after every $G$ episodes.
The difference between the proposed algorithm and previously proposed dueling-DDQN multi-step learning algorithms is that we add a label to each state of the UAV, denoted as $[\mathbf{q}_U[n], k]$. The states studied in previous work did not need to be labeled during the whole task completion process, but such states cannot be directly applied to the scenario investigated in this paper. Because the task in this paper is divided into several stages, the rewards may differ for the same state (the same UAV position) in different stages. Only if each state generates a unique reward will the system operation avoid becoming chaotic. To make each state generate a unique reward, we improve the algorithm by tagging the state of the UAV. The simulation results shown in the next section also indicate that our improved algorithm converges faster and achieves better performance.
Table 1. Key notations and symbols used in this paper.
In this section, we conduct several sets of simulation experiments to evaluate the performance of our proposed algorithm.
Numerical simulations are performed to evaluate the performance of the proposed algorithms. In this paper, we consider an urban area of size 100×100 m² on the two-dimensional plane, where some tall buildings are distributed at random, as shown in Figure 3.
Figure 3. The 3D view of the environment.
This scope can be extended to three-dimensional space, which is challenging to a certain extent and left for later work. Furthermore, $K = 3$ ground devices are randomly scattered on the ground. We assume the UAV flies at the fixed altitude $H = 50$ m, which is below the heights of the buildings. The starting and destination locations of the UAV are designated as $(8, 15, H)$ and $(90, 95, H)$, respectively. Because the environment is unknown, the locations of the ground devices that the UAV needs to approach and the locations of the buildings are all unknown. The transmission power of each ground device is 30 mW, and the transmission power of the UAV is 3 W. The channel-related parameters are set as follows: $B = 2\times 10^5$ Hz, $\sigma^2 = -169$ dBm/Hz, $b = 10$, $c = 0.6$, $\xi = 0.2$, $\mu_0 = 2.5$, and $\Gamma = 2.5$. Following the parameter settings in [39], the dueling DQN network is a fully-connected feedforward ANN composed of 5 hidden layers, divided into two parts. The first part consists of the first 4 hidden layers, containing 512, 256, 128, and 128 neurons, respectively. The second part is the last hidden layer, namely the dueling layer, which includes $M+1$ neurons, where one neuron is assigned to the value function and the others are assigned to the action advantage function. Other relevant parameters are shown in Table 1. Our simulations are performed in Python 3.6 with TensorFlow 1.14.0 and Keras 2.2.5. Beyond that, we compare the performance of the proposed algorithm with that of two benchmark schemes, which are introduced in detail in the following. To verify the reliability of the proposed algorithm's performance, three scenarios are investigated with different information data thresholds: 1 Mbits, 5 Mbits, and 10 Mbits, respectively.
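Based on this description, a sketch of the network in the tf.keras API (matching the TensorFlow/Keras versions used here) could look as follows; the ReLU activations are an assumption, since the paper does not state the activation functions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, backend as kb

def build_dueling_network(state_dim, M):
    """Sketch of the described dueling network: four fully-connected hidden
    layers (512/256/128/128) followed by a dueling layer with M+1 neurons
    (one for the value V, M for the action advantages)."""
    inputs = layers.Input(shape=(state_dim,))
    x = layers.Dense(512, activation='relu')(inputs)
    x = layers.Dense(256, activation='relu')(x)
    x = layers.Dense(128, activation='relu')(x)
    x = layers.Dense(128, activation='relu')(x)
    v = layers.Dense(1)(x)                               # state-value stream (1 neuron)
    a = layers.Dense(M)(x)                               # advantage stream (M neurons, one per action)
    # Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
    q = layers.Lambda(lambda va: va[0] + va[1] - kb.mean(va[1], axis=1, keepdims=True))([v, a])
    return Model(inputs=inputs, outputs=q)
```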
Firstly, to offer an intuitive view of the UAV's flight path, we plot the UAV's trajectory on the horizontal plane seen from directly above, as shown in Figures 4a, 5a, and 6a. To facilitate identification, the red star represents the starting position, the red triangle represents the ending position, and the black pentagons represent the ground devices. To visualize the positions of the obstacles, we use blue squares on the plane to represent the obstacles in the environment. Obviously, it can be observed that, no matter which threshold is set, the UAV first approaches each ground device and then successfully reaches the target area. Furthermore, the UAV successfully avoids all obstacles and completes the safe navigation process. When avoiding obstacles, the UAV does not circle around them; instead, it quickly bypasses them and flies directly to the next target ground device. Every time the UAV approaches a target ground device, it flies along a path as short as possible on the premise of safe navigation. Once the information data required by the target ground device is satisfied, the UAV approaches the next target ground device immediately. In the simulations with three different data thresholds, the trajectories shown are obtained from the last 10 episodes, which proves that they are not acquired by chance. Some of the 10 trajectories overlap, indicating that the UAV flies along the same path for part of the entire flight. In the 10 trajectories under each information threshold, the UAV first arrives in turn at the vicinity of each ground device, i.e., positions where the quality of the air-ground channel meets the requirement. Next, the UAV disseminates data, and then flies to the destination point and lands.
Next, to verify the superiority of the proposed algorithm, we present two benchmarks for comparison. In the first compared algorithm, the states are still the locations labeled with the target ground device, but the UAV's actions are chosen randomly rather than by the $\varepsilon$-greedy strategy. In the second compared algorithm, the states contain only the locations of the UAV, without the target ground device label. The average return of the UAV over episodes is obtained by a sliding window of length 30 over the original returns, as shown in Figures 4b, 5b, and 6b. It can be observed that over 5000 episodes the proposed algorithm obtains a higher and more stable return than the other two algorithms. Moreover, the reason why the algorithm that combines the UAV's location and the serial number of the target ground device as the input outperforms the one without labels is that the UAV receives different rewards when approaching different ground devices, even from the same location. To distinguish these different rewards at the same position, the locations of the UAV are labeled with the sequence number of the target ground device. As a result, taking the serial number of the target ground device as one part of the state is more appropriate for the kind of scenario studied in this paper. Finally, the simulation results confirm that the proposed algorithm, in which the UAV's locations labeled with the serial number of the target ground device serve as the input, indeed performs better than the other two benchmarks.
Figure 4. UAV navigation trajectory and average return under threshold $\gamma_{th} = 1$ Mbits.
Figure 5. UAV navigation trajectory and average return under threshold $\gamma_{th} = 5$ Mbits.
Figure 6. UAV navigation trajectory and average return under threshold $\gamma_{th} = 10$ Mbits.
Figure 7. Step over episode under different thresholds.
Since the overall goal of the UAV is to fly to the destination region in the shortest time on the premise of completing the task, the flight time of the proposed algorithm is compared with that of the other two algorithms, as shown in Figure 7. It is observed that the UAV's flight time becomes shorter and shorter as the episodes increase until convergence, where the length of the sliding window is 30. With the two comparison algorithms, the flight time of the UAV stays at the finite maximum flight time, indicating that the maximum operating time is reached while the overall task has still not been completed; their flight-time curves therefore serve as baselines. Furthermore, as can be seen from Figure 7, the more data the ground target devices require, the longer the flight of the UAV takes.
In this paper, we have investigated safe navigation for UAV-enabled data dissemination in unknown environments. To overcome the challenges of unknown circumstances, we proposed an algorithm for autonomous navigation in unknown flight environments, which leverages the dueling DDQN with multi-step learning. Specially, we considered minimizing the weighted sum of the UAV's task completion time while satisfying the data transmission task requirement at the same time. Numerical results have validated the superiority of the proposed algorithm over other benchmark methods. Further research on related scenarios in which multiple UAV clusters cooperate to accomplish more complicated missions is left for the future. Moreover, the 3D flight path of the UAV is also left to future research work.
This work was supported by the National Natural Science Foundation of China (No. 61931011).