Multi-Agent Few-Shot Meta Reinforcement Learning for Trajectory Design and Channel Selection in UAV-Assisted Networks

China Communications, 2022, Issue 4

Shiyang Zhou, Yufan Cheng, Xia Lei, Huanhuan Duan

National Key Laboratory of Science and Technology on Communications, University of Electronic Science and Technology of China, Chengdu 611731, China

Abstract: Unmanned aerial vehicle (UAV)-assisted communications have been considered a solution for aerial networking in future wireless networks due to their low cost, high mobility, and swift deployment. This paper considers a UAV-assisted downlink transmission, where UAVs are deployed as aerial base stations to serve ground users. To maximize the average transmission rate among the ground users, this paper formulates a joint optimization problem of UAV trajectory design and channel selection, which is NP-hard and non-convex. To solve the problem, we propose a multi-agent deep Q-network (MADQN) scheme. Specifically, the UAVs act as agents that perform actions from their observations distributively and share the same reward. To tackle tasks where experience is insufficient, we propose a multi-agent meta reinforcement learning algorithm that adapts quickly to new tasks. By pretraining on tasks with a similar distribution, the learning model acquires general knowledge. Simulation results indicate that the MADQN scheme can achieve higher throughput than fixed allocation. Furthermore, our proposed multi-agent meta reinforcement learning algorithm learns new tasks much faster than the MADQN scheme.

Keywords: UAV; trajectory design; channel selection; MADQN; meta reinforcement learning

I. INTRODUCTION

1.1 Motivation

Due to their flexible and swift features, unmanned aerial vehicles (UAVs) have been seen as a solution for aerial networking and a complement to terrestrial communication infrastructure [1]. In fact, UAVs have been widely deployed in a variety of civil and military applications, such as communication relaying [2], weather monitoring, surveillance [3], cargo transport [4], and rescue [5, 6]. UAVs offer high-speed transmissions for these applications. UAV-assisted wireless communications can provide wireless connectivity in areas not covered by communication infrastructure [7]. Compared with fixed terrestrial wireless communications, UAV-assisted communications are significantly less affected by small-scale fading and shadowing, and UAVs are much cheaper for establishing a temporary communication link if properly deployed. Furthermore, UAVs can provide quick-response and on-demand services for ground users in emergency situations [8].

In UAV-assisted wireless networks, jamming can significantly degrade the channel capacity due to its high power and variability [9]. Therefore, selecting a proper channel for transmission becomes important. Conventional channel selection algorithms require channel state information (CSI), which is easily affected by estimation error. Moreover, conventional channel selection is insufficient against dynamic jamming. Hence, the model-free reinforcement learning (RL) algorithm is a more suitable method for channel selection [10, 11]. In addition, properly designing the UAV trajectory can reduce the effect of fading and avoid shadowing and jamming, which achieves higher throughput, especially in situations where jamming exists. However, RL algorithms need plenty of experience, which is hard to collect fully due to the variability of jamming. Meta learning, also known as learning to learn, is commonly used for few-shot learning [12]. To adapt quickly to new situations where the jamming patterns have never appeared, this paper proposes a multi-agent meta reinforcement learning algorithm for trajectory design and channel selection in a UAV-assisted wireless network.

1.2 Related Works

Recently, many studies have concentrated on trajectory design and channel selection for UAV-assisted wireless networks. Among them, [13] studies a resource management problem from a game-theoretic perspective in UAV communication networks. To maximize the minimum-rate ratio over the ground users under a delay constraint, the authors of [14] propose a joint optimization algorithm for resource allocation including trajectory design and bandwidth allocation. By contrast, [15] considers the constraints of the relay power and the interference power, which protect the communication of the primary user (PU). Similarly, [16] uses an iterative algorithm that optimizes the UAV trajectory and bandwidth to minimize the power consumption of the UAVs and the users under certain constraints. UAV 3D placement and resource allocation in a multi-UAV assisted network are studied in [17] to minimize the average transmit power. However, these conventional algorithms based on optimization theory have high computational complexity and require prior and perfect channel state information (CSI).

Reinforcement learning (RL) based algorithms, which adjust their policy from the feedback of the environment to improve performance without full CSI, have shown great potential to solve the problem at low complexity. Q-learning [18], which searches for the maximal long-term reward in a state-action table, is mainly used to solve control problems. [19] investigates dynamic resource allocation in multi-UAV-assisted wireless networks to maximize the long-term reward using multi-agent Q-learning. The authors of [20] investigate a multi-agent Q-learning placement scheme to maximize the instantaneous transmission rate. In [21], to maximize the average secrecy rate, Q-learning is applied to find the optimal UAV trajectory and proper resource allocation. To overcome Q-learning's limited capacity for representing continuous and large-scale states, the deep Q-network (DQN) [22, 23] represents the Q-table by a neural network. The trajectory design of multiple UAVs applying the DQN algorithm is studied in [24] to improve the throughput of the wireless communication system. However, RL-based algorithms need to learn from a large amount of experience, which is not suitable for making quick decisions in a new environment.

To this end, meta reinforcement learning algorithms have been investigated for few-shot learning. [25] designs a recurrent neural network (RNN) whose input includes past actions and rewards, and reuses a series of interrelated tasks to train the RNN. [26] proposes an RL² algorithm consisting of a "fast" RL algorithm and a "slow" RL algorithm: the "fast" RL is a computation whose state is stored in the RNN activations, and the weights of the RNN are learned by the "slow" RL. [27] proposes an algorithm for fast adaptation that is compatible with any model trained by gradient descent. Meta reinforcement learning algorithms have mainly been applied to virtual games and simple control problems; their application to trajectory design and channel selection in UAV-assisted wireless networks has not been studied so far.

1.3 Contributions

Motivated by these opportunities and challenges, this paper proposes a multi-agent meta reinforcement learning strategy for trajectory design and channel selection in situations where jamming exists, which achieves much better performance with insufficient training compared with conventional reinforcement learning. According to the different types of jamming, the problem to be solved is divided into multiple tasks. Our goal is to find a learning model that can adapt quickly to new tasks. The proposed meta reinforcement learning algorithm involves an inner algorithm and an outer algorithm. The inner algorithm is designed as a deep Q-network (DQN), which learns from the experience in a task. The outer algorithm is a learning model, which learns how to learn by gradient descent from the experience in multiple tasks. In the optimization problem, the objective is to maximize the average achievable channel capacity among the ground users in the UAV-assisted wireless network. The main contributions of this paper are summarized as follows:

1. First, we provide a joint optimization scheme for trajectory design and channel selection based on the multi-agent DQN (MADQN).

2. Then, we propose a meta reinforcement learning scheme to adapt quickly to new tasks. The learning model acquires knowledge via the learned tasks with a similar distribution. Hence, it trains much faster on new tasks by reusing part of this knowledge.

3. Finally, simulations are provided to validate the performance of our proposed meta reinforcement learning scheme. The results demonstrate the effectiveness of the MADQN scheme. Furthermore, with only a few episodes, our proposed meta reinforcement learning algorithm achieves significant performance gains compared with the benchmarks.

1.4 Organization

The rest of the paper is organized as follows. In Section II, we introduce the UAV-assisted downlink wireless communication system model and formulate the optimization problem of trajectory design and channel selection. We present the multi-agent meta reinforcement learning algorithm for trajectory design and channel selection in Section III. Simulation results are provided in Section IV, and conclusions are drawn in Section V.

II. SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we first introduce the considered UAV-assisted wireless communication system model and the jamming model, and then formulate a joint optimization problem of trajectory design and channel selection for the UAV-assisted wireless communication.

2.1 Signal Model

We consider a downlink transmission scenario where M UAVs are deployed as aerial base stations to serve N ground users, and a jamming car moves around the users and tries to block the communication, as shown in Figure 1. We assume the UAVs have been pre-placed in their service areas at a certain altitude, and the ground users are stationary compared with the UAVs' movement. The UAVs serve their associated ground users on one of the L channels, and each ground user is served by only one UAV via time-division multiple access (TDMA). In practice, the UAVs make decisions in every time slot, whose length δt should be well designed to balance accuracy and complexity.
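To make the objective concrete, the following is a minimal sketch of how the average achievable rate among the ground users could be evaluated for a given set of UAV positions and channel selections. It assumes a line-of-sight channel with a reference power gain at unit distance, Shannon capacity, and jamming treated as interference on the selected channel; all parameter values and function names here are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

def average_rate(uav_pos, user_pos, assoc, chan_sel, jam_power, bandwidth=1e6,
                 tx_power=0.1, noise_power=1e-13, beta0=1e-4):
    """Average achievable rate over ground users (illustrative assumptions:
    LoS channel with reference gain beta0 at 1 m, Shannon capacity, and
    jamming on the same channel treated as interference)."""
    rates = []
    for n, u_pos in enumerate(user_pos):
        m = assoc[n]                                   # serving UAV of user n
        d2 = np.sum((uav_pos[m] - u_pos) ** 2)         # squared UAV-user distance
        gain = beta0 / d2                              # free-space channel power gain
        interference = jam_power[chan_sel[m]]          # jamming power on selected channel
        sinr = tx_power * gain / (interference + noise_power)
        rates.append(bandwidth * np.log2(1.0 + sinr))
    return np.mean(rates)
```

This is the quantity that the UAV trajectories (positions over all time slots) and channel selections jointly determine in the optimization problem of Section 2.3.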

Figure 1. The downlink transmission scenario: the coordinates of the UAVs and the selected channels directly influence the channel capacity.

Figure 2. The structure of meta learning.

2.2 Jamming Model

2.3 Problem Formulation

in which (12b) restricts the trajectory range of the UAVs.

Problem (12) is challenging to solve due to its non-convex and NP-hard nature as a function of the variables Q and C. Hence, we propose an MADQN scheme for trajectory design and channel selection, which finds the optimal solution by searching only a small fraction of the whole space. Furthermore, a multi-agent meta reinforcement learning algorithm for trajectory design and channel selection is proposed to adapt quickly to situations that have never been encountered.

III. PROPOSED ALGORITHM

In this section, we first present an MADQN scheme for trajectory design and channel selection, and then propose a multi-agent few-shot meta reinforcement learning algorithm to adapt quickly to new tasks.

3.1 Multi-Agent Deep Q-Network

The interaction between the UAVs and the environment can be formulated as a Markov decision process (MDP). For each UAV, which acts as an agent, the MDP is defined as a 4-tuple (S, A, P_a, R), where (a minimal encoding sketch follows the list):

· S is the set of states, which consists of the sensed power on each channel and the UAV's 3D coordinate.

· A is the set of actions, which consists of channel selection and trajectory design.

· P_a(s, s′) denotes the probability of transitioning to the next state s′ after taking action a in state s.

· R denotes the immediate reward after performing action a in state s and transitioning to state s′.
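As a minimal sketch of the state and action sets listed above, the snippet below encodes the observation as the sensed per-channel power concatenated with the UAV's 3D coordinate, and decodes a flat action index into a (channel, movement) pair. The movement step size and the seventh "hover" action are our assumptions; the channel count and action layout follow the settings reported later in Section IV.

```python
import numpy as np

N_CHANNELS = 8           # L channels sensed by each UAV (value used in Section IV)
MOVES = np.array([       # 7 trajectory actions; "hover" is our assumption for the 7th
    [ 1, 0, 0], [-1, 0, 0],   # right / left
    [ 0, 1, 0], [ 0, -1, 0],  # front / back
    [ 0, 0, 1], [ 0, 0, -1],  # up / down
    [ 0, 0, 0],               # hover (assumed)
])

def build_state(sensed_power, coord):
    """State s: sensed power on each channel plus the UAV's 3D coordinate (11-dim)."""
    return np.concatenate([sensed_power, coord]).astype(np.float32)

def decode_action(a, step=5.0):
    """Flat action index -> (channel, position increment); 8 x 7 = 56 joint actions."""
    channel, move = divmod(a, len(MOVES))
    return channel, step * MOVES[move]
```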

The MDP is commonly used to choose a policy π(s) that maximizes the long-term reward, which is defined as

$R[t]=\sum_{k=0}^{\infty}\gamma^{k}\,r[t+k],$

in which γ denotes the discount factor, 0 ≤ γ ≤ 1. We define the expected return of taking action a in state s under policy π(s) as:

$Q^{\pi}(s,a)=\mathbb{E}_{\pi}\big[R[t]\,\big|\,s[t]=s,\,a[t]=a\big].$

The optimal Q-value can be found by the Bellman equation, which can be written as:

$Q^{*}(s,a)=\mathbb{E}\big[r+\gamma\max_{a'}Q^{*}(s',a')\,\big|\,s,a\big],\qquad(15)$

in which a′ denotes the next action.
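The Bellman optimality equation in (15) underlies the tabular Q-learning update; the sketch below shows one such update step, where the learning rate α and the function name are illustrative assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, gamma=0.95, alpha=0.1):
    """One tabular update toward the Bellman optimality target
    r + gamma * max_a' Q(s', a'), cf. (15)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```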

When the state space is combinatorial, enormous, or continuous, the Q-learning algorithm referred to in (15) is not sufficient to find the optimal policy, because a Q-table cannot store so much data. To represent such a state space compactly, a multi-layer fully-connected artificial neural network with weights w can be used as a parameterized function to approximate the Q-value, which is denoted as Q(s, a; w) ≈ Q*(s, a).

To reduce the correlation between the new and old Q(s, a) in (15), a target network whose weights are updated with a delay is introduced to calculate the target action value. To remove the correlations within the sequence of one episode and smooth the state distribution, each UAV maintains a memory pool D_m to store its experience tuple e_m[t] = (s[t], a_m[t], r[t], s[t+1]) at each time step t. During training, a mini-batch of experience (s, a_m, r, s′) ~ U(D_m) is sampled randomly from the memory pool. The loss function of the weights w_m is designed as the squared error:

$L(\mathbf{w}_m)=\mathbb{E}_{(s,a_m,r,s')\sim U(D_m)}\Big[\big(r+\gamma\max_{a'}Q(s',a';\mathbf{w}'_m)-Q(s,a_m;\mathbf{w}_m)\big)^{2}\Big].\qquad(16)$

For the DQN algorithm, the weights are updated by gradient descent as

$\mathbf{w}_m\leftarrow\mathbf{w}_m-\eta_{w}\nabla_{\mathbf{w}_m}L(\mathbf{w}_m),$

where η_w is the learning rate of w. And the target network weights w′_m are soft-replaced as

$\mathbf{w}'_m\leftarrow\tau\,\mathbf{w}_m+(1-\tau)\,\mathbf{w}'_m,$

where τ is the replacement factor, which controls the update period.
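Putting these pieces together, the following is a sketch of one DQN training step for a single UAV agent: sample a mini-batch from the memory pool D_m, compute the squared-error loss in (16) against the delayed target network, take a gradient step with learning rate η_w, and soft-replace the target weights with factor τ. It is written in PyTorch under the assumption that experience tuples are stored as tensors; the helper names are ours.

```python
import random
import torch
import torch.nn.functional as F

def dqn_train_step(q_net, target_net, optimizer, memory, batch_size=20,
                   gamma=0.95, tau=0.01):
    """One training step for a single UAV agent (sketch; tensor shapes assumed)."""
    batch = random.sample(memory, batch_size)                   # (s, a, r, s') tuples from D_m
    s, a, r, s_next = map(torch.stack, zip(*batch))

    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)         # Q(s, a; w)
    with torch.no_grad():                                       # target uses delayed weights w'
        target = r + gamma * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)                             # squared error, cf. (16)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                            # w <- w - eta_w * grad

    # Soft replacement of the target network: w' <- tau * w + (1 - tau) * w'
    for wp, w in zip(target_net.parameters(), q_net.parameters()):
        wp.data.mul_(1.0 - tau).add_(tau * w.data)
    return loss.item()
```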

3.2 Multi-Agent Few-Shot Meta Reinforcement Learning Algorithm

Few-shot meta reinforcement learning is mainly used to solve the few-shot learning problem. More specifically, it trains a learning model, denoted by f, which can adapt quickly to new tasks using only a small amount of experience. In the meta-training phase, the model is trained on a set of tasks (also known as the support set) whose distribution is similar to that of the new tasks. In the meta-testing phase, the model is applied to new tasks (denoted as the query set) without retraining or with only a few episodes. The basic structure of meta learning is shown in Figure 2.

For the application of the meta reinforcement learning algorithm to trajectory design and channel selection in the UAV-assisted network, we denote the distribution over tasks, to which we want the learning model f to adapt, as p(T). As in general machine learning, meta reinforcement learning divides the tasks into a support set and a query set, and the two sets have the same probability distribution. For K-shot learning, the learning model f is trained to learn a new task T_i from K samples of experience. During meta-training, K samples of experience on each task are randomly drawn to calculate the corresponding loss L_{T_i} (see (16)), and then other samples of experience from the same task are used for testing. The learning model f is trained to minimize the test error on the tasks. In particular, the test error on the support set is regarded as the training loss of meta-training. After meta-training, the tasks in the query set are sampled from p(T), and the performance of meta reinforcement learning is evaluated by the test error on the query set after learning from K samples of experience.

Algorithm 1. Meta reinforcement learning algorithm for trajectory design and channel selection on UAV m.
Input: distribution p(T) over tasks, inner learning rate η_i, outer learning rate η_o, and environment simulator.
Output: learning model f_θ.
1: Initialize θ.
2: while not done do
3:   Sample a batch of tasks T_i ~ p(T)
4:   for all T_i do
5:     Copy the parameters θ to the task.
6:     for t = 1, 2, ..., terminal do
7:       Perform the action a_m[t] = π(s[t]; u_m) + n_m with probability 1 − ε; otherwise perform a random action.
8:       Observe the next state s[t+1] and calculate the immediate reward r[t].
9:       Store the experience tuple e_m in memory pool D_m.
10:    end for
11:    Randomly sample a mini-batch of size K of experience e_i from T_i.
12:    Compute local parameters by gradient descent using e_i: θ_i = θ − η_i ∇_θ L_{T_i}(f_θ).
13:    Sample other experience e′_i from T_i.
14:  end for
15:  Update θ ← θ − η_o ∇_θ Σ_{T_i ~ p(T)} L_{T_i}(f_{θ_i}) using each e′_i.
16: end while

We consider the learning model to be parameterized by θ, denoted as f_θ. When adapting to a specific task T_i, the global parameters θ become local parameters θ_i. In the meta learning algorithm, the local parameters θ_i are computed by gradient descent updates on task T_i, given by

$\theta_{i}=\theta-\eta_{i}\nabla_{\theta}L_{T_i}(f_{\theta}),\qquad(20)$

in which η_i denotes the inner learning rate.

The global parameters are trained to optimize the performance of f_{θ_i} with respect to θ over tasks sampled from p(T). We assume that the local parameters have been learned from the training set of the support set as in (20). The meta-objective is to minimize the loss function with respect to θ over the testing set of the support set, which is presented as:

$\min_{\theta}\sum_{T_i\sim p(T)}L_{T_i}(f_{\theta_i})=\min_{\theta}\sum_{T_i\sim p(T)}L_{T_i}\big(f_{\theta-\eta_{i}\nabla_{\theta}L_{T_i}(f_{\theta})}\big).$

Note that the meta reinforcement learning update is performed on the global parameters θ, while the objective is computed with the local parameters θ_i. In fact, the goal of meta reinforcement learning is to optimize the global parameters so that only a few gradient updates on a new task bring a remarkable improvement on that task.

Meta-training is performed over all the tasks in the support set by stochastic gradient descent; thus the global parameters θ are updated as:

$\theta\leftarrow\theta-\eta_{o}\nabla_{\theta}\sum_{T_i\sim p(T)}L_{T_i}(f_{\theta_i}),$

in which η_o denotes the outer learning rate.
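The inner update in (20), the meta-objective, and the outer update above can be combined into one meta-training step, sketched below in PyTorch. The gradient of the test loss is taken through the inner adaptation step (create_graph=True), matching the second-order form of the meta-gradient; the loss_fn interface and task-batch structure are placeholders for the paper's task-specific DQN losses L_{T_i}.

```python
import torch

def maml_outer_step(model, task_batch, loss_fn, eta_i=1e-3, eta_o=1e-3):
    """One meta-update over a batch of tasks (sketch of the inner/outer loop).

    Each task supplies (train_data, test_data); loss_fn(params, data) evaluates
    the DQN loss of (16) with the given parameters -- both are placeholders."""
    meta_params = list(model.parameters())
    outer_grads = [torch.zeros_like(p) for p in meta_params]

    for train_data, test_data in task_batch:
        # Inner step: theta_i = theta - eta_i * grad_theta L_{T_i}(f_theta)
        train_loss = loss_fn(meta_params, train_data)
        grads = torch.autograd.grad(train_loss, meta_params, create_graph=True)
        adapted = [p - eta_i * g for p, g in zip(meta_params, grads)]

        # Outer objective: L_{T_i}(f_{theta_i}) evaluated on held-out experience e'_i
        test_loss = loss_fn(adapted, test_data)
        task_grads = torch.autograd.grad(test_loss, meta_params)
        for og, tg in zip(outer_grads, task_grads):
            og += tg

    # Outer step: theta <- theta - eta_o * sum_i grad_theta L_{T_i}(f_{theta_i})
    with torch.no_grad():
        for p, g in zip(meta_params, outer_grads):
            p -= eta_o * g
```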

The multi-agent few-shot meta reinforcement learning algorithm for trajectory design and channel selection is summarized in Algorithm 1.

IV. SIMULATION RESULTS

In this section, simulation results are provided to demonstrate the performance of our proposed multi-agent meta reinforcement learning algorithm for trajectory design and channel selection. A UAV-assisted downlink communication system with 7 UAVs is considered, and the locations of the ground users are shown in Figure 3. The simulation parameters are listed in Table 1 unless otherwise specified. In our simulations, the jamming car moves around these cells, and we test the average throughput of the ground users under comb jamming. For clarity, the support set consists of the tasks where the jamming is randomly distributed over the channels, excluding the task in the query set. The query set is designed as the task where the jamming car transmits jamming over channels 0, 2, 4, and 6.

Figure 3. The users' placement and the UAVs' initial locations.

Figure 4. The performance of the MADQN scheme for trajectory design and channel selection, the MADQN scheme for trajectory design, the MADQN scheme for channel selection, and fixed location and channel.

Figure 5. The comparison of one-shot learning performance between the MADQN scheme and the multi-agent meta reinforcement learning algorithm.

4.1 The Performance of the MADQN Scheme

A simulation of the MADQN scheme for trajectory design and channel selection is established. For each agent, the Q-network is designed as a fully-connected network with 3 layers. The hidden layer consists of 256 neurons. The input layer contains 11 neurons, 8 of which represent the sensed power over each channel and 3 of which represent the UAV's 3D coordinate. The output layer contains 56 neurons, which represent the combinations of 8 channels and 7 trajectory actions. More specifically, the trajectory actions include left, right, front, back, up, and down. Besides, the UAVs are restricted to a cube with a radius of 100 m around their initial locations. During training, τ, γ, the learning rate, the batch size, and the capacity of the memory pool are set to 0.01, 0.95, 0.001, 20, and 1000, respectively. Each episode has a length of 40. In addition, the initial exploration rate is set to 0.2 and linearly descends to 0.1.
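For reference, a sketch of the per-agent Q-network with the layer sizes described above is given below; the ReLU activation is our assumption, since only the layer widths are specified.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Per-UAV Q-network: 11 inputs -> 256 hidden -> 56 joint actions
    (ReLU activation is our assumption; only the layer sizes are given)."""
    def __init__(self, state_dim=11, hidden_dim=256, n_actions=56):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state):
        return self.net(state)   # Q(s, a; w) for all 56 channel/move combinations
```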

Figure 4 shows the performance of the MADQN scheme for trajectory design and channel selection, the MADQN scheme for trajectory design, the MADQN scheme for channel selection, and fixed location and channel. It can be seen that the MADQN scheme for trajectory design and channel selection performs the best in the simulation. Specifically, compared with fixed location and channel, the MADQN scheme improves the average reward by 51.7%. By learning from experience, the UAVs adjust their policy from the feedback of the environment to improve the channel capacity. Hence, the UAVs select the unjammed channels to obtain a higher reward. Furthermore, the UAVs can adjust their locations to move closer to the ground users, so that the channel power gain is increased. Therefore, trajectory design further improves the throughput. Note that the performance of the MADQN scheme declines and rises sharply because the neural networks are overcoming the problem of local optima. Moreover, through cooperation and competition, the agents select proper channels to avoid jamming and suffer only little co-channel interference.

4.2 The Improvement of Meta Reinforcement Learning

In the multi-agent meta reinforcement learning algorithm, the inner loop is the MADQN algorithm, which uses the same simulation parameters as in the last subsection. The inner and outer learning rates are both set to 0.001. The learning model is meta-trained on the experience from the support set, with each type of jamming deployed as a task. During meta-training, the learning model is trained with only one episode of samples per task. The one-shot learning performance of the multi-agent meta reinforcement learning algorithm is shown in Figure 5, with the MADQN scheme set as the benchmark.

Figure 5 shows the comparison between the MADQN scheme and the multi-agent meta reinforcement learning algorithm. The simulation results demonstrate that meta learning can achieve a high average transmission rate among the ground users in the case of insufficient training, while the MADQN scheme needs around 1400 training episodes. This is mainly because the meta reinforcement learning model has been pretrained on the support set, which provides part of the knowledge for the model. Hence, meta reinforcement learning does not need as much data as the MADQN. These results suggest that meta learning has obtained general knowledge from other tasks with a similar distribution, so that the learning model can perform well on new tasks with only a few episodes.

Figure 6. The channels occupied by the UAVs.

In detail, the complete K-shot learning performance is shown in Table 2. It is observed that in the few-shot learning setting, the performance of the multi-agent meta reinforcement learning (MA-meta-RL) algorithm gradually improves with the number of training samples K. In addition, the multi-agent meta reinforcement learning algorithm can significantly improve the average transmission rate compared with the MADQN scheme. More specifically, the multi-agent meta reinforcement learning algorithm achieves at least a 38% higher reward when the learning model is trained for fewer than 20 episodes.

Table 1. Simulation parameters.

Table 2. Few-shot learning performance of the MADQN scheme and the multi-agent meta reinforcement learning algorithm.

Since channel selection is the more dominant factor in improving the transmission rate, the final channel selection by the multi-agent meta reinforcement learning algorithm is shown in Figure 6. It can be seen that the 7 UAVs occupy 4 channels, namely channels 1, 3, 5, and 7. Through multi-agent reinforcement learning, the UAVs learn to select channels by trial and error and adjust their selection from the feedback of the environment to improve the performance. Specifically, the UAVs select the unjammed channels to obtain a higher reward, so they avoid the jammed channels. Additionally, the UAVs that are farthest apart occupy the same channel, which causes minimum co-channel interference. The results suggest that the multiple UAVs can intelligently choose unjammed channels and reach a dynamic equilibrium through competition and cooperation.

V. CONCLUSION

In this paper, we have investigated the joint optimization problem of UAV trajectory design and channel selection to maximize the average achievable channel capacity among the ground users in a UAV-assisted network. First, we proposed an MADQN scheme, where the UAVs perform actions from their observations distributively and share the same reward. Second, we proposed a multi-agent meta reinforcement learning algorithm for tasks where the jamming pattern has never appeared. By meta-training on tasks that have the same distribution, the learning model obtains general knowledge, which allows it to adapt quickly to new tasks. Simulation results have demonstrated that, using the MADQN scheme, the UAVs can design their trajectories and select their occupied channels properly. In particular, the MADQN scheme improves the average achievable rate among the ground users by 51.7% compared with fixed allocation. Moreover, our proposed multi-agent meta reinforcement learning algorithm learns new tasks much faster than the MADQN scheme with only a few episodes. In the few-shot learning setting, multi-agent meta reinforcement learning improves the transmission rate by at least 38%.

ACKNOWLEDGEMENT

This work was supported in part by the National Natural Science Foundation of China under Grants 62131005 and U19B2014, and in part by the National Key Research and Development Program of China under Grant 254.