Multi-Objective Deep Reinforcement Learning Based Time-Frequency Resource Allocation for Multi-Beam Satellite Communications

2022-02-16 05:50
China Communications, 2022, Issue 1

Yuanzhi He, Biao Sheng, Hao Yin, Di Yan, Yingchao Zhang

1 School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou 100876, China

2 Institute of Systems Engineering, AMS, PLA, Beijing 100141, China

Abstract: Resource allocation is an important problem influencing the service quality of multi-beam satellite communications. In multi-beam satellite communications, the available frequency bandwidth is limited, user requirements vary rapidly, high service quality is expected, and multi-dimensional resources such as time and frequency must be allocated jointly. An urgent and difficult problem for multi-beam satellite communications is therefore how to design an efficient and fast resource allocation algorithm that achieves a high comprehensive utilization rate of multi-dimensional resources, maximizes the number of served users and the system throughput, and adapts rapidly to a dynamically changing number of users under the condition of limited resources. To solve the multi-dimensional resource allocation problem of multi-beam satellite communications, this paper establishes a multi-objective optimization model with the joint goal of maximizing the number of served users and the system throughput, and proposes a multi-objective deep reinforcement learning based time-frequency two-dimensional resource allocation (MODRL-TF) algorithm to adapt to a dynamically changing number of users and to timeliness requirements. Simulation results show that the proposed algorithm provides a higher comprehensive utilization rate of multi-dimensional resources, achieves multi-objective joint optimization, and obtains better timeliness than traditional heuristic algorithms such as the genetic algorithm (GA) and ant colony optimization (ACO).

Keywords: multi-beam satellite communications; time-frequency resource allocation; multi-objective optimization; deep reinforcement learning

I. INTRODUCTION

Multi-beam satellite communication systems have the advantages of extensive coverage, high beam gain, small user terminals and high frequency utilization. They play an irreplaceable role in wide-area mobile and emergency communication scenarios in complex environments such as floods and earthquakes, and are widely researched all over the world. In multi-beam satellite communications, however, the available frequency bandwidth is limited and channel resources are precious. An important problem influencing the performance of multi-beam satellite communications is therefore how to complete resource allocation quickly and efficiently, maximizing the number of served users and the system throughput within the constraints of limited resources [1-4].

For the past few years, researchers have expended considerable effort on resource allocation methods for different satellite communication resources, including time slots, spectrum, power and others [5-17]. In view of inter-beam interference, flexible payloads and the non-uniform distribution of user requirements, Ref. [5] constructed a power-frequency joint optimization based resource allocation model for high-throughput multi-beam satellite communications, and proposed a genetic algorithm with an optimal control strategy. Ref. [6] proposed a traffic-aware multi-beam channel resource allocation method which can adapt to dynamically changing traffic in multi-beam satellite communications. Considering that the gain of each beam can be adjusted separately, Ref. [7] proposed a power-gain joint optimization algorithm for multi-beam satellite communications using duality theory and the steepest descent method. Ref. [8] established a power-spectrum joint allocation model to address limited resources and low energy efficiency in distributed satellite clusters, proposed an energy-efficient resource allocation algorithm based on convex optimization theory, and discussed the trade-off between energy efficiency and spectrum efficiency. Ref. [9] proposed an improved simulated annealing algorithm, which maximizes the throughput of the satellite system while taking fairness between beams into account. Refs. [10, 11] studied the allocation of on-board power and limited bandwidth resources of low-orbit satellites. Ref. [10] proposed a bandwidth-power joint allocation method based on the swallow swarm algorithm to improve the fairness of inter-satellite resource allocation and the service carrying capacity. Constrained by time windows, satellite energy consumption, the number of channels, user priority and task burstiness, an improved ant colony optimization centered on initial solution set construction and the deposition of extra pheromone was proposed in Ref. [11], which overcomes shortcomings of the traditional ant colony optimization algorithm such as slow initial search speed, weak local search ability and a tendency to fall into local optima. Ref. [12] proposed a satellite resource allocation algorithm combining the Pareto frontier and particle swarm optimization for multi-objective satellite resource allocation, which can obtain a set of differentiated solutions with advantages in different indicators, so as to facilitate the selection of the optimal solution according to real-time requirements and user preferences. Ref. [13] proposed a multi-layer many-to-many matching game resource allocation scheme for satellite networks under multi-user scenarios, which can effectively improve resource utilization and overall user experience. Aiming at the joint optimization of power and channel resources in distributed clusters, an improved two-dimensional immune clone algorithm was proposed in Ref. [15], where the Pareto solution for power-timeslot joint optimization was given by simulation. Ref. [16] considered multi-beam power allocation under co-frequency interference with different inter-beam channel conditions in the case of rain fade, and proposed a scheme based on particle swarm optimization. The above studies are mainly based on traditional algorithms and intelligent optimization algorithms, which can improve resource utilization and system service quality. However, their long computation times and high complexity make it difficult to reallocate communication resources rapidly when requirements are constantly changing.

With the rapid development of artificial intelligence, deep reinforcement learning (DRL) and other artificial intelligence technologies have been widely applied in the field of information technology, and also provide a new methodology for resource allocation in satellite communications [18-29]. Ref. [23] used a DRL algorithm to allocate resources in multi-beam satellite communications, with lower complexity compared with traditional algorithms. Considering the continuous nature of the power resource, Ref. [24] proposed a DRL based dynamic power allocation structure for high throughput satellites, which used continuous state and action spaces to define the allocation strategy and a proximal policy optimization gradient algorithm to improve it, in order to minimize the power consumption of the system. According to the non-uniformity of multi-beam traffic distribution in satellite communications, Ref. [25] designed a dynamic resource allocation algorithm based on reinforcement learning, which can avoid frequency interference and reduce the system blocking probability. Ref. [26] proposed a DRL based resource allocation algorithm for multi-objective optimization of multi-beam satellite communications, which used a Q network to optimize energy efficiency, spectrum efficiency and user satisfaction, and achieved better overall performance compared with the traditional genetic algorithm (GA) and simulated annealing (SA) algorithm. Ref. [27] proposed a cooperative multi-agent deep reinforcement learning (CMDRL) framework, and proved its effectiveness for resource allocation in multi-beam satellite communication. To cope with wireless channels that change constantly over time, Ref. [28] proposed a method combining multi-objective reinforcement learning with deep neural networks, which can be used for performance monitoring and resource allocation prediction. Considering that heuristic optimization based power allocation for multi-beam satellites is not suited to dynamically changing demand and channel conditions, Ref. [29] proposed an online dynamic power allocation algorithm based on DRL, which can improve the throughput of multi-beam satellite communications and reduce power consumption. The above research applied DRL algorithms to the dynamic resource allocation problem of satellite communications. Compared with traditional algorithms and intelligent optimization algorithms, DRL algorithms can significantly reduce computational complexity, and have better environmental awareness and multi-stage decision-making ability. However, these studies mainly focused on the allocation of a single kind of communication resource, without considering the joint allocation of two-dimensional resources such as power-frequency or time-frequency. Moreover, the resources studied are mainly coarse-grained beam resources, which makes it difficult to allocate resources well at the user level. Therefore, realizing multi-dimensional and fine-grained allocation of communication resources under a dynamically changing number of users and user communication requirements remains one of the difficult problems in DRL based satellite communication resource allocation.

Aiming at the time-frequency resource allocation problem of multi-frequency time division multiple access (MF-TDMA) multi-beam satellite communications, this paper establishes a multi-objective optimization model based on the joint goal of maximizing the number of served users and the system throughput, and proposes a time-frequency joint resource allocation algorithm based on deep reinforcement learning. Through a number of adaptive improvements to the basic architecture and algorithms of deep reinforcement learning, the proposed algorithm can meet the requirements of a dynamically changing number of users and of allocation timeliness. Finally, simulations are carried out to verify the accuracy and effectiveness of the proposed model and algorithm.

II. SYSTEM MODEL AND RESOURCE ALLOCATION

2.1 System Model

As shown in Figure 1, the multi-beam satellite communication system is composed of a multi-beam communication satellite located in geosynchronous orbit, a ground gateway station and a large number of user terminals. The user terminals are distributed unevenly in different beams. The number of satellite beams is B. The total bandwidth of the system is Wtot. The spectral efficiency is improved via seven-color frequency reuse, and the available bandwidth of each beam is equal, i.e., w = Wtot/7. The system uses MF-TDMA to provide services for users. MF-TDMA is a two-dimensional multiple access technique combining frequency division and time division. It combines timeslot carrier hopping, rate change and virtual routing addressing, etc., to realize flexible networking of ground terminals and multi-point communications for integrated services such as voice, data and video [30].

Figure 1. Satellite communication system model.

The uplink of the MF-TDMA system communicates by hopping in both timeslot and frequency simultaneously. Each beam is divided into multiple carriers in the frequency domain, and each carrier is further divided into multiple timeslots of equal length in the time domain. To improve the utilization rate of the frequency band, the MF-TDMA system supports two carrier bandwidths, large and small, for adapting to user requirements with different information rates.

The structure of the MF-TDMA superframe is shown in Figure 2. In the time dimension, there is a total of t timeslots of equal length, and the length of each timeslot is τ. The size of the timeslot protection interval is δ. In the frequency dimension, there are two kinds of carriers with different sizes in the superframe, where the bandwidth of the large carrier is w1 and the bandwidth of the small carrier is w2. The protection bandwidth between adjacent carriers is wg. The number of large or small carriers is determined by user requirements, and the sum of the bandwidths of all carriers and all protection intervals must not exceed the beam bandwidth w. Thus, an MF-TDMA superframe can be regarded as being composed of multiple time-frequency resource blocks, including large resource blocks and small resource blocks. A large resource block represents the information capacity carried by a large carrier in one timeslot, and a small resource block represents the information capacity carried by a small carrier in one timeslot. During communication, each user uses resource blocks in different timeslots to carry information, and different users use blocks of different sizes in the same timeslot, according to their requirements.
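The carrier-count constraint described above can be sketched in a few lines. This is an illustrative check only; the function name and the bandwidth values in the example are hypothetical, not taken from the paper.

```python
def carriers_fit(n_large, n_small, w1, w2, wg, w):
    """Check whether a mix of large and small carriers, each followed by a
    guard band wg, fits within the beam bandwidth w."""
    occupied = n_large * (w1 + wg) + n_small * (w2 + wg)
    return occupied <= w

# Hypothetical bandwidths in MHz: beam 62.5, large carrier 5, small 2.5, guard 0.5.
print(carriers_fit(8, 2, 5.0, 2.5, 0.5, 62.5))   # 8*5.5 + 2*3.0 = 50.0 <= 62.5 -> True
print(carriers_fit(12, 0, 5.0, 2.5, 0.5, 62.5))  # 12*5.5 = 66.0 > 62.5 -> False
```

The check makes explicit that every carrier, large or small, pays the same fixed guard-band overhead wg, which is what later makes large carriers more band-efficient.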

Figure 2. The schematic diagram of superframe structure.

The MF-TDMA multi-beam satellite communication system can support multiple types of user terminals, such as handheld terminals, portable terminals and vehicle-mounted terminals, and can provide various communication services such as voice, data and video. Therefore, user terminals have various information rate requirements. Assuming that user terminals have q types of services, and each type of service has p types of information rates, then there is a total of M = p · q information rate gears, and the information rate of each gear is expressed as Rm, m = 1, 2, ..., M.

When a user terminal has a service requirement, the terminal selects an appropriate information rate from the M gears according to the transmission capacity of its device, the channel quality and the type of communication service. Then it sends the service transmission request to the system. In the system, the ground station informs the user terminal of the resource utilization of the beam in which it is located by satellite broadcasting. If a user terminal is in a multi-beam overlapping region, it selects, among the visible beams, the beam with the most available resources for communication. After receiving the transmission requests of the user terminals in the beam, the ground gateway station calls the resource allocation algorithm and allocates communication resources to users based on the feedback results.
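The beam-selection rule for terminals in an overlapping region reduces to a single greedy choice. A minimal sketch, assuming the broadcast resource information can be summarized as a mapping from beam id to the number of free resource blocks (a hypothetical data structure, not specified in the paper):

```python
def select_beam(visible_beams):
    """Pick the visible beam with the most available time-frequency
    resource blocks, as described in the text."""
    return max(visible_beams, key=visible_beams.get)

# A terminal in the overlap of beams 2, 3 and 5:
print(select_beam({2: 14, 3: 40, 5: 27}))  # beam 3 has the most free blocks
```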

2.2 Resource Allocation Problem

The communication performance and service quality of satellite communications are usually determined by the system throughput and the number of served users. Therefore, maximizing the system throughput and the number of served users are taken as the objectives of the resource allocation problem in this paper.

In general, when the frequency bandwidth of each carrier in MF-TDMA multi-beam satellite communications is equal, the throughput of the system is directly proportional to the number of served users: the more users served, the greater the system throughput. In an actual MF-TDMA system, however, each beam can be flexibly divided into different numbers of large and small carriers. Since both large and small carriers need the same protection bandwidth wg in the frequency band, the frequency band utilization rate of a large carrier, η1 = w1/(w1 + wg), is larger than that of a small carrier, η2 = w2/(w2 + wg). Therefore, if the ground gateway station mainly uses the time-frequency resource blocks of large carriers to serve user terminals with high information rates, the beam can obtain greater throughput and spectrum utilization, but the number of served users will be reduced. If the time-frequency resource blocks of small carriers are mainly used to serve users with small information rates, the beam can serve more users, but the system throughput and spectrum utilization rate will be reduced. Thus, in an MF-TDMA multi-beam satellite communication system that supports the division into large and small carriers, the direct proportionality between the system throughput and the number of users no longer holds, and the two objectives are mutually restrictive. To improve the performance of such a system, a compromise between the number of users and the system throughput is needed.

The multi-objective joint optimization model of MF-TDMA multi-beam satellite communication system resource allocation is given as follows:

where N is the total number of served users of the multi-beam satellite communication system, T is the throughput of the multi-beam satellite communication system, B is the total number of beams, and Ub is the number of users that request service within the b-th beam. Let xb,u denote the number of large resource blocks allocated to the u-th user terminal of the b-th beam, and yb,u denote the number of small resource blocks allocated to the u-th user terminal of the b-th beam. v1 is the information rate of a large resource block, and v2 is the information rate of a small resource block. ⌈·⌉ represents the ceiling operation. Let λb,u denote the service decision factor of the u-th user terminal of the b-th beam: if the system provides service for this user, then λb,u = 1, otherwise λb,u = 0.

In the above model, the objective function Φ1 represents the total number of served users in the system. The objective function Φ2 represents the system throughput, which is the sum over all users of the number of resource blocks of each kind allocated to a user multiplied by the information rate of the corresponding block, where different numbers of large and small resource blocks are allocated according to user requirements.

In the frequency dimension, constraint C1 restricts the bandwidth occupied by the resource blocks allocated to users in each beam to not exceed the beam bandwidth, where the total occupied bandwidth is computed from the number of large carriers occupied by the allocated large resource blocks and the number of small carriers occupied by the allocated small resource blocks, each multiplied by its corresponding occupied bandwidth. In the time dimension, constraint C2 restricts the number of resource blocks allocated to the same user terminal to not exceed the number of time slots t of one MF-TDMA superframe. In actual engineering applications, one user terminal can only occupy one carrier in the same time slot; if the number of resource blocks allocated to the same user terminal exceeded t, a user terminal would occupy more than one carrier of different frequency bands in the same time slot.
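The equation block of model (1) did not survive extraction. A reconstruction consistent with the symbol definitions above would read roughly as follows; it is a sketch, and the carrier counts inside the ceiling operators are inferred from the descriptions of C1 and the ceiling operation, not copied from the original:

```latex
\begin{aligned}
\max\ \Phi_1 &= N = \sum_{b=1}^{B}\sum_{u=1}^{U_b}\lambda_{b,u},\qquad
\max\ \Phi_2 = T = \sum_{b=1}^{B}\sum_{u=1}^{U_b}\lambda_{b,u}\left(v_1 x_{b,u}+v_2 y_{b,u}\right)\\
\text{s.t. } C1:\ & \left\lceil \tfrac{1}{t}\textstyle\sum_{u=1}^{U_b}\lambda_{b,u}x_{b,u} \right\rceil (w_1+w_g)
  + \left\lceil \tfrac{1}{t}\textstyle\sum_{u=1}^{U_b}\lambda_{b,u}y_{b,u} \right\rceil (w_2+w_g) \le w,\quad \forall b,\\
C2:\ & x_{b,u}+y_{b,u}\le t,\quad \forall b,u.
\end{aligned}
```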

The two objective functions are normalized, and according to the importance of N and T we define the weights of the two objective functions as θ1 and θ2, respectively; the multi-objective optimization model is thus transformed into max(θ1N + θ2T).

2.3 Problem Analysis

The complexity of the above multi-objective joint optimization problem is analyzed in the following part.

Optimization problems can be divided into two classes according to whether they can be solved in polynomial time: polynomial (P) problems and non-deterministic polynomial (NP) problems [6, 31]. In the multi-objective joint optimization model of Eq. (1), both objective functions and constraint C1 contain decision factors, which represent whether to provide services to users. Users can be regarded as objects to be taken, and decision factors represent whether to put an object into the knapsack. The information rate allocated to a user can be regarded as the value of the object, and the bandwidth occupied by the user as the size of the object. The problem can then be understood as maximizing the number and value of the objects taken under the constraint that the objects taken do not exceed the size of the knapsack. Therefore, this model is a typical 0/1 multi-objective knapsack problem (KP), which has been proved to be a typical NP problem, and its computational complexity is O(a^n) [32].

In the optimization problem of this paper, the algorithm complexity is significantly related to the size of the solution space. In objective function Φ1, the decision factor λb,u has two states, 0 and 1, for each user; that is, in a beam with Ub users there are 2^Ub possibilities for the allocation result. In addition, since users in different beams are uncorrelated, the size of the solution space of objective function Φ1 is Θ1 = 2^(B·Ub). In objective function Φ2, each user is allocated xb,u large resource blocks and yb,u small resource blocks. The combinatorial space of (v1·xb,u + v2·yb,u) is xb,u · yb,u, so the size of the solution space of Φ2 is Θ2 = (2·xb,u·yb,u)^(B·Ub). From the above analysis, the computational complexity of the optimization problem in this paper is non-polynomial, so this optimization problem is an NP problem.

If we used a traditional exact algorithm from operations research to solve this problem, the time complexity would grow exponentially with the number of recursions. That is to say, when the scale of the 0/1 knapsack problem is very large, solving it by enumeration, dynamic programming or other exact mathematical methods remains feasible only in theory [32].
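The exponential blow-up of exact enumeration is easy to see on a toy 0/1 knapsack. The sketch below is an illustration of the 2^n search space only, not the paper's allocation model; the function name and numbers are made up:

```python
from itertools import product

def knapsack_bruteforce(values, sizes, capacity):
    """Enumerate all 2^n take/leave vectors (the decision factors) and keep
    the best feasible one -- exactly the exponential search the text warns
    about."""
    n = len(values)
    best_value, best_choice = 0, (0,) * n
    for choice in product((0, 1), repeat=n):          # 2^n combinations
        size = sum(c * s for c, s in zip(choice, sizes))
        if size <= capacity:
            value = sum(c * v for c, v in zip(choice, values))
            if value > best_value:
                best_value, best_choice = value, choice
    return best_value, best_choice

# 4 users -> only 16 candidate allocations; 40 users -> already ~10^12.
print(knapsack_bruteforce([6, 10, 12, 7], [1, 2, 3, 2], 5))  # (23, (1, 1, 0, 1))
```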

III. THE MODRL-TF ALGORITHM

To solve the multi-objective problem derived in Section II, a multi-objective deep reinforcement learning based time-frequency (MODRL-TF) resource allocation algorithm is proposed. The process of resource allocation based on MODRL-TF is shown in Figure 3. Users of the satellite communication system generate service requirements independently. A user terminal first checks whether it is located in an overlapped region. If so, it obtains the resource utilization of all visible beams and selects the beam with the most available time-frequency resources to send its service requirement; otherwise, the user terminal sends its requirement directly through the only visible beam. After the system receives the requirements, it sets the service strategy according to the system state and calls the resource allocation algorithm to obtain the optimized allocation result. The system then allocates resources to users according to the result and updates the resource pool. In this manner, the process of time-frequency resource allocation is completed.

Figure 3.The process of time-frequency resource allocation based on MODRL-TF for multi-beam satellite.

Note that the resource pool is established to record the use of resources and provide the constraint of available resources to the allocation algorithm.

3.1 MODRL-TF Algorithm Architecture

The MODRL-TF algorithm is based on the Double Deep Q Network (DDQN) algorithm [33]. To maximize the number of served users and the system throughput, we model the ground gateway station as the agent, the numbers of time-frequency resource blocks of the satellite communication system and the user requirements as the environment, the values of the environment as the state, and the process of resource allocation as the action of the agent. The cumulative benefit is maximized through the agent's learning and training. The algorithm architecture is shown in Figure 4.

Figure 4. Architecture of MODRL-TF algorithm.

After the state si is input into the predictive Q network, the predictive Q network outputs action ai, which decides whether to allocate resources for the user. After the agent executes the action, it interacts with the environment, obtains the reward ri+1 of the action, and transfers the environment to the next state. Meanwhile, the obtained experience data {si, ai, ri+1, si+1} is stored in the experience buffer.

When the amount of experience data reaches the training threshold, a batch of data is extracted from the experience buffer to train and update the predictive Q network and the target Q network.

When training the Q network, the error function is calculated, the error is back-propagated through the gradient descent method, and the network parameters are updated. The error calculation formula is given as follows:

where ω is the network parameter of the deep Q network, and yi is the value of the target Q network, calculated as follows:
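The bodies of Eqs. (2) and (3) were lost in extraction. In standard DDQN notation, consistent with the symbols used here (ω the predictive network parameters, ϖ the target network parameters, γ the discount factor defined in Algorithm 1), they would take the form (a reconstruction, not the original typesetting):

```latex
L(\omega) = \mathbb{E}\!\left[\big(y_i - Q(s_i, a_i; \omega)\big)^2\right],
\qquad
y_i = r_{i+1} + \gamma\, Q\!\Big(s_{i+1},\, \arg\max_{a} Q(s_{i+1}, a; \omega);\, \varpi\Big).
```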

3.2 MODRL-TF Algorithm Design

Based on the MODRL-TF algorithm architecture, and considering the satellite communication resources and user service requests, the state space, action space and reward in deep reinforcement learning are designed.

Since the input and output dimensions of a neural network are fixed, it is difficult to adapt to a dynamically changing number of users and the arrival of user requirements. The MODRL-TF algorithm therefore takes the user as the unit in designing the action space and reward: the user requirements are treated as a sequence of per-user decisions on whether to allocate resources, which overcomes the problem that the input and output dimensions of neural networks cannot change. The state space, action space and reward of the MODRL-TF algorithm are designed as follows:

State space S: the state information includes the number of available resource blocks and the requirements, which is expressed as:

where c is the number of available resource blocks, and Rb,u is the information rate requested by the u-th user of the b-th beam.

Action space A: the MODRL-TF algorithm has two kinds of actions, A = {0, 1}. Action 1 means providing service and resources to the user; action 0 means refusing to do so.

Reward r: the increment of the optimization target is used as the basis of the reward. The optimization target value consists of the number of served users N and the system throughput T.
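The text describes the reward only verbally. One plausible form, consistent with the weighted objective max(θ1N + θ2T) of Section 2.2, would be the weighted sum of the normalized increments after each action (this is an assumption, not a formula stated in the paper):

```latex
r_{i+1} = \theta_1\,\Delta \hat{N} + \theta_2\,\Delta \hat{T},
```

where ΔN̂ and ΔT̂ denote the normalized increments of the number of served users and of the throughput produced by action ai.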

After completing the design of state space, action space and reward, the neural network is further constructed.

Network structure: the input of the MODRL-TF algorithm is a state of the state space S, and each state can be represented by the vector [c, Rb,u]. Since the inputs are vectors, the neural network can be composed of fully connected layers. The activation function of the network is the tanh function, and the output consists of the Q values of actions 0 and 1.
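A minimal pure-Python sketch of such a fully connected tanh network (the paper's implementation uses PyTorch and three fully connected layers; the layer sizes and weight initialization here are assumptions for illustration):

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """One fully connected layer: a weight matrix and a bias vector,
    initialized with small random values."""
    return ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def forward(layers, x):
    """Forward pass: tanh activation on hidden layers, linear output layer
    producing the two Q values."""
    for i, (W, b) in enumerate(layers):
        x = [sum(wj * xj for wj, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

rng = random.Random(0)
# State [c, Rb,u] -> hidden(16) -> hidden(16) -> Q(a=0), Q(a=1); sizes are hypothetical.
net = [make_layer(2, 16, rng), make_layer(16, 16, rng), make_layer(16, 2, rng)]
q = forward(net, [120.0, 4.8])   # c = 120 free blocks, requested rate 4.8 (toy numbers)
action = q.index(max(q))         # greedy action: allocate (1) or refuse (0)
print(len(q), action)
```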

3.3 Steps and Flows of MODRL-TF Algorithm

The MODRL-TF algorithm mainly consists of the following four parts:

1) Parameter initialization: initialize the relevant parameters of the satellite communication system and the neural networks.

2) Agent-environment interaction: the agent selects the optimal action according to the current state of the environment, executes the action, obtains the reward, and transfers the environment to the next state.

3) Storage according to the importance of experience: the DDQN algorithm does not consider the importance of different experiences when using the training experiences in the experience buffer. The Prioritized Experience Replay DQN (PER-DQN) algorithm [34], by contrast, takes the importance of different experiences into account and uses the temporal-difference (TD) error of each experience to measure its importance. The TD error is expressed as:
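The TD error formula itself was lost in extraction; in the standard form used by prioritized replay, consistent with the symbols above (a reconstruction), it is:

```latex
\delta_i = r_{i+1} + \gamma \max_{a} Q(s_{i+1}, a; \varpi) - Q(s_i, a_i; \omega).
```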

The larger the TD error, the lower the accuracy of the sample prediction, which means the experience still needs to be learned. PER-DQN is trained by sampling highly important experiences with greater probability, constantly calculating and updating the importance of the experience data, and using random priority sampling. However, priority sampling of important experiences reduces data diversity to a certain extent, and the algorithm needs to constantly calculate, update and rank the importance of experiences, which increases the complexity of training.

To make full use of the importance of experience data while avoiding TD error updates of stored experiences, an importance judgment procedure is introduced in the experience storage stage. The algorithm sets the size of the experience buffer to Nexp, and records the count of experience storage operations as count; count increases by 1 whether or not the experience is stored in the experience buffer. During the experience storage process, for each experience data item, the algorithm first checks whether the experience buffer is full. When the experience buffer is not full, the experience data is stored directly. Otherwise, the algorithm calculates the position at which the experience data would be stored, given as j = count % Nexp, and then calculates the TD error of the experience. If this TD error is greater than the TD error of the experience at position j, the experience at position j is replaced; otherwise, the new experience is not stored.

By evaluating the importance of experiences at the storage stage, it is ensured that the stored experiences are relatively important and have a higher learning priority, without the need to constantly update the TD errors of stored experiences.
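The storage rule above can be sketched as a small buffer class. This is an illustrative sketch under the assumption that TD errors are computed elsewhere and passed in; the class name and interface are hypothetical:

```python
class ImportanceBuffer:
    """Experience buffer with the replacement rule described in the text:
    once full, a new experience targets slot j = count % capacity and
    replaces the resident experience only if its TD error is larger."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []      # list of (experience, td_error) pairs
        self.count = 0

    def store(self, experience, td_error):
        self.count += 1     # counts every storage attempt, stored or not
        if len(self.data) < self.capacity:
            self.data.append((experience, td_error))
        else:
            j = self.count % self.capacity
            if td_error > self.data[j][1]:
                self.data[j] = (experience, td_error)

buf = ImportanceBuffer(2)
buf.store("e1", 0.5); buf.store("e2", 0.1)   # fills the buffer
buf.store("e3", 0.9)                          # count=3, j=1: 0.9 > 0.1, replaces e2
buf.store("e4", 0.2)                          # count=4, j=0: 0.2 < 0.5, discarded
print([e for e, _ in buf.data])               # ['e1', 'e3']
```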

4) Update of the neural network: sample experience data uniformly from the experience buffer, calculate the loss through Eq. (2) and Eq. (3), and update the network through the gradient descent method.

The process and steps of the algorithm are shown in Figure 5 and Algorithm 1.

Figure 5. Process of the MODRL-TF algorithm.

Next, the time complexity of the proposed algorithm is analyzed. Once the training of the MODRL-TF algorithm is completed, the time complexity is very small, i.e., O(N · L · X · Y), where L is the number of fully connected layers of the neural network, X and Y are the sizes of the layers of the neural network, and N is the number of served users.

IV. SIMULATION AND RESULT ANALYSIS

In order to verify the performance of the proposed MODRL-TF algorithm, the resource allocation process of MF-TDMA multi-beam satellite communications is simulated.The simulation parameters are shown in Table 1.

Table 1. Parameters of the satellite communication.

In the simulation, the arrival of user requirements follows a Poisson distribution with a parameter of 100 times/minute. Each requesting user randomly selects one gear from the six information rate gears as its required information rate, and each gear is selected with equal probability.

The algorithm is implemented and simulated using the PyTorch 1.7.1 framework with Python 3.8, and the neural network of the proposed MODRL-TF algorithm is implemented with three fully connected layers. The capacity of the experience buffer used by the experience replay mechanism is set to 20000, the experience replay threshold is set to 2000, and the activation function is the tanh function. Related parameters of the neural network and algorithm are shown in Table 2.

Table 2. Parameters of neural network.

(1) Evaluation of algorithm convergence

Figure 6 shows the convergence trend of the objective function value during training. The horizontal axis is the training episode, and the vertical axis is the weighted objective function value. As can be seen, the weighted objective function value fluctuates sharply in the first 50 episodes, because the algorithm at this stage is still in early exploration and has not yet stabilized. In this period, the algorithm often explores wrong allocation strategies and is thus punished with large negative rewards. Between episodes 50 and 170, the fluctuation of the weighted objective function value clearly flattens, but there is still a small-probability drop around the 150th episode, which indicates that the algorithm is still exploring with a small probability. After the 170th episode, the weighted objective function value becomes stable, indicating that exploration is over and the performance has stabilized. It can therefore be judged that the proposed MODRL-TF algorithm converges. The remaining volatility of the weighted objective function value is acceptable, because the inputs to the algorithm, i.e., the number of users with communication requirements and the information rate required by each user, are random in each episode.

Figure 6. Convergence of the MODRL-TF algorithm.

Algorithm 1. Algorithm of MODRL-TF.
1: Initialize the parameters of the multi-beam satellite communication system, including the number of available resource blocks
2: Initialize the experience buffer size, the experience learning threshold Nth, the initial exploration probability εstart, the decay rate εdecay, the final exploration probability εend, and the discount factor γ
3: Initialize the network parameters of each agent as ω, set the target network parameters to ϖ = ω, and set the target network update frequency to Nstep
4: Obtain the business requests of satellite users, and obtain the current resource availability
5: for episode = 1 to M do
6:   for reuse = 1 to N do
7:     Shuffle the order of user requests
8:     Get the initial state s0
9:     for all user requests do
10:      Count steps
11:      In state si, select action ai randomly if a uniform random number is less than ε = εend + (εstart − εend)e^(−steps/εdecay); otherwise, select ai = arg max_a Q(si, a; ω)
12:      Perform resource allocation action ai and calculate reward ri+1; update the number of available resource blocks ci+1, obtain the information rate demand Rb,u of the next user, and construct the next state si+1
13:      if the experience buffer is full then
14:        Calculate the position j where the current experience will be stored
15:        if the current experience is more important than the j-th experience then
16:          Put {si, ai, ri+1, si+1} into the experience buffer
17:        else
18:          continue
19:        end if
20:      else
21:        Put the experience into the buffer
22:      end if
23:      When the number of samples in the experience buffer is not less than Nth, randomly sample a batch of data {si, ai, ri+1, si+1}
24:      Calculate the loss function Eq. (2)
25:      Back-propagate the error by stochastic gradient descent and update the Q network parameters ω
26:      Update the target Q network parameters ϖ = ω with the Q network parameters ω every Nstep steps
27:      si = si+1
28:      if si is the final state then
29:        break
30:      end if
31:    end for
32:  end for
33: end for
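The ε-greedy rule in line 11 of Algorithm 1 can be sketched as follows; the values of eps_start, eps_end and eps_decay are illustrative placeholders, not the settings from Table 2.

```python
import math
import random

def epsilon(steps, eps_start=0.9, eps_end=0.05, eps_decay=200.0):
    """Exponentially decayed exploration probability (line 11 of Algorithm 1):
    eps = eps_end + (eps_start - eps_end) * exp(-steps / eps_decay)."""
    return eps_end + (eps_start - eps_end) * math.exp(-steps / eps_decay)

def select_action(q_values, steps, **eps_kwargs):
    """Explore a random action with probability epsilon; otherwise pick
    the greedy action arg max_a Q(s, a)."""
    if random.random() < epsilon(steps, **eps_kwargs):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Early in training (small `steps`) the agent explores almost every decision; as `steps` grows, the probability decays toward `eps_end` and the agent mostly follows the learned Q-values.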

(2) Algorithm performance analysis

After completing the training of the MODRL-TF algorithm, we compare the optimization results of the MODRL-TF algorithm, the genetic algorithm (GA) and the ant colony optimization (ACO) algorithm, to verify the optimization effect and time complexity of MODRL-TF. In the GA, the number of individuals in the initial population is set to 2000, the selection operator combines roulette-wheel selection with elite selection, single-point crossover is used with a crossover probability of 0.1, and the mutation probability of the basic mutation operator is 0.001. In the ACO algorithm, the ant population size is set to 50, the information heuristic factor is set to 2, the greedy heuristic factor is set to 1, and the pheromone forgetting factor is set to 0.4; the Ant Density model is adopted as the pheromone updating strategy [32].
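The GA baseline's combination of elite selection with roulette-wheel (fitness-proportional) selection can be sketched as below; the population size, fitness values and number of elites here are illustrative, not the baseline's actual configuration.

```python
import random

def select_parents(population, fitness, n_elite=2):
    """Elite selection plus roulette-wheel selection, as configured for the
    GA baseline. Fitness values are assumed positive."""
    # Keep the n_elite fittest individuals unchanged (elite selection).
    ranked = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    parents = [population[i] for i in ranked[:n_elite]]
    # Fill the rest of the parent pool proportionally to fitness (roulette wheel).
    total = sum(fitness)
    while len(parents) < len(population):
        r = random.uniform(0, total)
        acc = 0.0
        for individual, f in zip(population, fitness):
            acc += f
            if acc >= r:
                parents.append(individual)
                break
    return parents
```

The elites guarantee the best solutions survive each generation, while the roulette wheel keeps selection pressure proportional to fitness.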

Next, we compare the MODRL-TF algorithm with the GA and ACO algorithms in terms of optimization ability and time complexity. To evaluate the optimization effect of the MODRL-TF algorithm, 100 groups of user demands are randomly generated based on the system parameter settings, and the three algorithms are used to allocate resources for each group. The resulting weighted objective function value of the number of serviced users and the throughput is shown in Figure 7, where the X-axis represents the 100 groups of different user demands and the Y-axis is the weighted objective function value. Over the 100 allocation results, the GA achieves an average objective value of 125.82 with a variance of 60.10; the ACO algorithm achieves an average of 126.96 with a variance of 63.54; and the MODRL-TF algorithm achieves an average of 129.25 with a variance of 58.00. It can be seen that the allocation effect of MODRL-TF is better than those of the GA and ACO algorithms.

Figure 7. Comparison of the optimization ability of MODRL-TF, GA and ACO.

For the time complexity of the MODRL-TF algorithm, the resource allocation optimization times of the three algorithms are measured when the number of serviced user requests is 70, 90, 110 and 130. The histogram of resource allocation time versus the number of serviced user requests is shown in Figure 8. It can be seen that, across the different user request scales, the MODRL-TF algorithm completes resource allocation on the order of milliseconds, far faster than the GA and ACO. According to the last paragraph of Section III, the time complexity of the MODRL-TF algorithm is linear in the number of user requests. The time complexity of the GA is O(TN0^2), where N0 is the number of parent individuals and T is the total number of iterations. The time complexity of the ACO algorithm is O(N(N-1)MT/2), where N is the number of serviced users, M is the number of ants, and T is the total number of iterations. Based on this analysis, it can be concluded that the time complexity of the MODRL-TF algorithm is much smaller than those of the GA and ACO, so it can realize real-time resource allocation and meet the dynamic resource allocation requirements of multi-beam satellite communication systems.
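A back-of-the-envelope comparison of these complexity expressions illustrates the gap. The iteration count T = 100 is an assumed value (the paper does not state it here); N0 = 2000 and M = 50 are the baseline parameters given above.

```python
def modrl_tf_ops(n):
    # Linear in the number of user requests: one allocation decision per request.
    return n

def ga_ops(t=100, n0=2000):
    # O(T * N0^2): independent of N, dominated by the population size N0 = 2000.
    return t * n0 ** 2

def aco_ops(n, m=50, t=100):
    # O(N * (N - 1) * M * T / 2) with M = 50 ants.
    return n * (n - 1) * m * t // 2

# Operation counts for the four request scales tested in Figure 8.
results = {n: (modrl_tf_ops(n), aco_ops(n), ga_ops()) for n in (70, 90, 110, 130)}
```

Even with these modest assumed iteration counts, the heuristic baselines require millions to hundreds of millions of elementary operations where MODRL-TF needs only one forward pass per request, consistent with the millisecond-scale allocation times in Figure 8.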

The above simulations have verified the convergence and comprehensive optimization effect of the MODRL-TF algorithm. Considering that the service strategy of the system changes dynamically in practical applications, it is necessary to further verify the allocation results under different multi-objective weight values, so as to test the applicability of the MODRL-TF algorithm to different system service strategies.

The weights of the two optimization targets are set to (0.1, 0.9) and (0.9, 0.1) respectively, where the two weights in brackets correspond to θ1 and θ2. The variation of the number of serviced users and the system throughput with the number of available resource blocks is depicted in Figure 9 and Figure 10, respectively. When the target weights change from (0.1, 0.9) to (0.9, 0.1), the number of serviced users increases while the system throughput decreases. The average difference in the number of serviced users between the two system service strategies is 7.1, and the average difference in system throughput is 419.4 KB/s. It can be seen that the MODRL-TF algorithm can obtain allocation results that meet the requirements of different system service strategies.
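The way the weight pair (θ1, θ2) steers the trade-off can be sketched as a simple scalarization. The two candidate allocations and the normalization constants below are hypothetical, used only to illustrate how a user-oriented weighting favors serving more users while a throughput-oriented weighting favors fewer, higher-rate users.

```python
def weighted_objective(n_users, throughput, theta1, theta2,
                       users_scale=130.0, throughput_scale=5000.0):
    """Scalarize the two objectives; the scales normalize them to comparable
    ranges and are hypothetical values, not the paper's."""
    return theta1 * n_users / users_scale + theta2 * throughput / throughput_scale

# Two hypothetical allocations: (number of serviced users, throughput in KB/s).
many_users = (120, 3000.0)  # serve many low-rate users
high_rate = (90, 4500.0)    # serve fewer users at higher rates

def best(theta1, theta2):
    """Return the allocation with the higher weighted objective."""
    return max([many_users, high_rate],
               key=lambda x: weighted_objective(x[0], x[1], theta1, theta2))
```

With (θ1, θ2) = (0.9, 0.1) the user-heavy allocation wins; with (0.1, 0.9) the throughput-heavy one does, mirroring the trend observed in Figures 9 and 10.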

Figure 9. The relationship between the number of serviced users and the number of resource blocks available in the system.

Figure 10. The relationship between the system throughput and the number of resource blocks available in the system.

V.CONCLUSION

Aiming at the time-frequency resource allocation problem of MF-TDMA multi-beam satellite communications with large and small carriers, a multi-objective optimization model with the number of serviced users and the system throughput as objectives was established. On this basis, the multi-objective deep reinforcement learning based time-frequency resource allocation (MODRL-TF) algorithm was proposed. An experience buffer organized by experience importance was established, which improves the utilization rate of experience. By serializing user requirements, the limitation of fixed input and output dimensions of the neural network was overcome. Moreover, to improve sample utilization, the order of user requests was shuffled during training. Simulation results showed that the proposed MODRL-TF algorithm achieves a comprehensive optimization of the number of serviced users and the system throughput. Furthermore, it allocates resources with linear time complexity, which significantly improves allocation efficiency and makes it suitable for dynamic resource allocation in MF-TDMA satellite communication systems. In addition, the applicability of the MODRL-TF algorithm under different system service strategies was verified by setting different weight values among the multiple targets.

ACKNOWLEDGEMENT

This work was supported by the National Key Research and Development Program of China under No.2019YFB1803200.