Dynamic user-centric multi-dimensional resource allocation for a wide-area coverage signaling cell based on DQN*

2023-02-06 09:44

Zhou TONG, Na LI, Huimin ZHANG, Quan ZHAO, Yun ZHAO, Junshuai SUN, Guangyi LIU

Future Research Lab, China Mobile Research Institute, Beijing 100053, China

Abstract: The rapid development of the communications industry has spawned many new services and applications, and the sixth-generation wireless communication system (6G) network faces more stringent and diverse requirements. Besides meeting performance requirements such as high data rate and low latency, 6G must also address the high energy consumption of the fifth-generation wireless communication system (5G) network. The wide-area coverage signaling cell technology conforms to the future development trend of radio access networks, and has the advantages of reducing network energy consumption and improving resource utilization. In wide-area coverage signaling cells, on-demand multi-dimensional resource allocation is an important technical means of meeting the extreme performance requirements of users, and it directly affects the efficiency of network resource utilization. This paper constructs a user-centric dynamic allocation model of wireless resources and proposes a deep Q-network based dynamic resource allocation algorithm. The algorithm realizes dynamic and flexible admission control and multi-dimensional resource allocation in wide-area coverage signaling cells according to the data rate and latency demands of users. According to the simulation results, the proposed algorithm can effectively improve the average user experience on a long time scale, and ensure a high data rate for users and low energy consumption for the network.

Key words: 6G; Wide-area coverage signaling cell; Multi-dimensional resource allocation; Deep Q-network (DQN)

1 Introduction

With the global commercialization of the fifth-generation wireless communication system (5G) network, mobile communication has risen to a new level, from the realization of “connection of people” to the establishment of “connection of things” between terminals in thousands of industries. Driven by the 5G network, user requirements are becoming more differentiated, and the data rate and latency performance required by new services and applications are becoming more extreme. Because of the limited coverage of mainstream 5G frequency bands (such as 3.5 GHz), meeting these extreme performance requirements demands a much higher deployment density of base stations (BSs), which increases both the construction cost and the energy consumption of the 5G network.

The high energy consumption of the 5G network has also become a key issue for the sixth-generation wireless communication system (6G) network. To reduce the network power consumption caused by the dense deployment of high-frequency BSs while ensuring wide-area coverage performance, Liu et al. (2022b) proposed a wide-area coverage signaling cell technical scheme. As shown in Fig. 1, in this scheme the low-frequency (such as 700 MHz) control BSs/cells provide unified signaling coverage for a large geographical area and are responsible for the transmission of radio resource control (RRC) messages and physical layer control signaling, thereby reducing the impact of the high path loss of high-frequency bands and ensuring continuous and reliable connectivity and mobility. High-frequency (such as 62.5 GHz and above) data BSs/cells provide data transmission and a small amount of necessary signaling. These high-frequency data BSs offer high capacity and are activated on demand, which reduces both the interference between data cells and the energy consumption of the entire network.

Fig. 1 Wide-area coverage signaling cell (BS: base station)

Resource allocation is also a key problem to be solved in wide-area coverage signaling cells, because it affects both user experience and network efficiency. The application of artificial intelligence (AI) in 5G networks promotes the development of the mobile communication network and its application in vertical industries (Liu et al., 2022a). With the improvement of network automation and intelligence, AI has become one of the effective means of solving the resource allocation problem in dynamic radio environments (Lin and Zhao, 2020). Ji et al. (2021) proposed an online bandwidth resource allocation algorithm based on deep reinforcement learning (DRL) to solve the resource allocation problem that arises when operators share network resources, which effectively improves bandwidth resource utilization. Gang and Friderikos (2019) studied the bandwidth allocation and power allocation problems in 5G virtual network slicing and proposed an optimization framework for flexible inter-tenant resource sharing based on transmission power control. Luo et al. (2014) took the maximization of the average signal to interference plus noise ratio (SINR) as the goal of resource allocation, and used Q-learning to perform channel assignment and power allocation at the same time. To overcome the excessive energy consumption problem in indoor wireless networks, Lü et al. (2021) proposed a deep Q-network (DQN) based transmission power allocation algorithm for home BSs. Ren et al. (2021) proposed a DRL-based approach to minimize long-term system energy consumption in a computation offloading scenario with multiple Industrial Internet of Things (IIoT) devices and multiple fog access points. In Zhao et al. (2015), a method combining K-means clustering and Q-learning was proposed to jointly optimize spectrum allocation, load balancing, and energy saving in mobile broadband networks. All of the above works were designed for a traditional network architecture.

Different from traditional cells, which are responsible for the transmission of both signaling and data, the wide-area coverage signaling cell is primarily in charge of the transmission of signaling messages and the management of all data cell resources. For future wide-area signaling coverage scenarios, in this paper the network side uses intelligent capabilities to summarize user characteristics, and uses AI tools to realize on-demand and dynamic resource allocation according to the differentiated requirements of users, which can improve the overall resource utilization of the network and greatly improve the user experience. In this paper, the user experience is defined as the difference between the data rate revenue and the total delay loss.

The main contributions of this paper are summarized as follows:

1. To address multi-dimensional resource allocation in wide-area coverage signaling cells, a user-centric dynamic allocation model is constructed for multi-dimensional wireless resources. The model considers the increasingly differentiated requirements of future users, such as rate and latency, together with the practical limits on network power and bandwidth.

2. Considering the dynamic changes at the BSs in the data queue, wireless channel state, and user service requirements, a user admission control scheme is formulated to enable the on-demand on/off switching of data BSs.

3. A DQN-based dynamic allocation algorithm for wireless resources is proposed to realize user admission control and the dynamic and flexible allocation of physical resource blocks (PRBs) and power. According to the simulation results, the proposed algorithm can improve the average user experience on a long time scale, ensure a high data rate for users and low energy consumption of the network, and achieve real-time optimization of the overall network utility.

2 System model and problem formulation

2.1 System model

In this paper, we consider the wide-area coverage signaling cell scenario. The dynamic user-centric allocation model of multi-dimensional wireless resources is shown in Fig. 2. In this model, we assume that the network perceives each user that it serves, and that users regularly report their requirements to the network. Users in different industries have different quality of service (QoS) requirements, including the rate and latency. The network performs big data calculation on users through the data collection module, summarizes user characteristics, and customizes flexible and dynamic wireless resource allocation strategies according to user requirements. The resource allocation involved when the BS provides services to users includes user admission control, PRB allocation, and power allocation.

Fig. 2 Dynamic user-centric resource allocation model in a wide-area coverage signaling cell (BS: base station; PRB: physical resource block)

In this model, we assume that there is a control BS and multiple data BSs in a specific area, $\mathcal{J}=\{1,2,\cdots,J\}$. The total bandwidth of $W$ Hz is divided into multiple PRBs, $\mathcal{B}=\{1,2,\cdots,B\}$, which are shared by all BSs. Suppose that there are $N$ users in the area, $\mathcal{N}=\{1,2,\cdots,N\}$. Due to the limitation of orthogonal frequency division multiple access (OFDMA), a user can access only one BS. Let $a_{j,n}(t)$ and $\varphi^{b}_{j,n}(t)$ be binary variables representing the user admission control of BS $j$ and the allocation strategy of PRB $b$ in time slot $t$, respectively. When user $n$ accesses BS $j$ in time slot $t$, $a_{j,n}(t)=1$; otherwise, $a_{j,n}(t)=0$. When BS $j$ allocates PRB $b$ to user $n$ in time slot $t$, $\varphi^{b}_{j,n}(t)=1$; otherwise, $\varphi^{b}_{j,n}(t)=0$. $\varphi^{b}_{j,n}(t)$ satisfies

The channel state in each time slot is assumed to be fixed when a user requests access to each BS. The channel states in different time slots change randomly and are independent of each other. The transmission rate provided by BS $j$ to user $n$ on PRB $b$ in time slot $t$ can be expressed as

where $w^{b}_{j,n}$ is the bandwidth allocated by BS $j$ to user $n$ on PRB $b$, and $\sigma^2$ is the noise power. The noise power is the same on all PRBs of all BSs for all users. $p^{b}_{j,n}(t)$ represents the power allocated by BS $j$ to user $n$ on PRB $b$ in time slot $t$. Let $\mathcal{H}$ be a finite set of channel states. When user $n$ accesses BS $j$ in time slot $t$, $h_{j,n}(t)$ is the channel gain, where $h_{j,n}(t)\in\mathcal{H}=\{h_1,h_2,\cdots,h_H\}$ (here, $H$ is the number of different channel states in this model).
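The rate expression itself is not reproduced in this copy of the text. As a rough illustration, the quantities defined above fit a Shannon-type per-PRB rate, $w\log_2(1+ph/\sigma^2)$; this form is an assumption and not necessarily the authors' exact equation. The function and argument names below are hypothetical.

```python
import math


def prb_rate(bandwidth_hz, power_mw, channel_gain, noise_mw):
    """Shannon-type rate (bit/s) on one PRB.

    Assumed form w * log2(1 + p * h / sigma^2); the paper's own
    expression is not available in this copy.
    """
    snr = power_mw * channel_gain / noise_mw
    return bandwidth_hz * math.log2(1.0 + snr)


def bs_rate(admitted, prb_alloc, bandwidth_hz, power_mw, gain, noise_mw):
    """Sum rate of one BS over admitted users and their allocated PRBs.

    admitted[n] and prb_alloc[n][b] are 0/1 indicators (a and phi in the
    text); power_mw[n][b] is the per-PRB power and gain[n] the channel gain.
    """
    total = 0.0
    for n, a in enumerate(admitted):
        for b, phi in enumerate(prb_alloc[n]):
            if a and phi:
                total += prb_rate(bandwidth_hz, power_mw[n][b], gain[n], noise_mw)
    return total
```

Summing `bs_rate` over all BSs would give the network-wide rate for one time slot, and averaging that over many slots would give the long-term average.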

Therefore, the total transmission rate provided by BS $j$ for all users accessing the BS in time slot $t$ is

The total rate of all BSs in time slot $t$ in the whole network is

The long-term average total rate of the whole network is

Consider a discrete-time queuing system in which the length of each time slot is fixed. Denote the number of data packets arriving at BS $j$ for user $n$ in time slot $t$ as $X_{j,n}(t)$. The number of arriving data packets follows the Poisson distribution with parameter $\lambda_{j,n}$ and is independent and identically distributed across time slots. This model constructs a queue at each BS for the data packets of the services to be processed. At the beginning of time slot $t$, the queue length of BS $j$ is $Q_j(t)$, which aggregates the queue lengths $Q_{j,n}(t)$ of the users $n$ accessing BS $j$.

The dynamic update process of $Q_j(t)$ is described as follows:

where $D_j(t)=\varepsilon_j(t)wA_j(t)/S$ represents the number of data packets leaving the queue of BS $j$ in time slot $t$, $\varepsilon_j(t)$ represents the spectral efficiency in time slot $t$, $w$ is the bandwidth of each PRB, $A_j(t)$ is the number of PRBs allocated by BS $j$ to users in time slot $t$, $S$ is the size of each data packet in the BS queue, and the arrival term is the number of data packets arriving at BS $j$ in time slot $t$. Let $\mathbf{Q}(t)=\{Q_1(t),Q_2(t),\cdots,Q_J(t)\}$ represent the global queue state information of the network in time slot $t$. The global channel state information in time slot $t$ is denoted $\mathbf{H}(t)$; its $j$-th element ($j=1,2,\cdots,J$) represents the average channel gain of users accessing BS $j$ in time slot $t$.
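The queue recursion is not shown in this copy, so the sketch below uses a common form as an assumption: packets served in the slot leave first (the queue never goes below zero), then new arrivals join. The departure count follows the definition of $D_j(t)$ given in the text; the function names and the unit-length time slot are assumptions.

```python
def departures(spectral_eff, prb_bandwidth_hz, num_prbs, packet_bits):
    """D_j(t) = eps_j(t) * w * A_j(t) / S, as defined in the text.

    With eps in bit/s/Hz, w in Hz, and S in bits, this is packets per
    unit time; a unit-length time slot is assumed here.
    """
    return spectral_eff * prb_bandwidth_hz * num_prbs / packet_bits


def next_queue(queue_len, departed, arrived):
    """Assumed update: Q(t+1) = max(Q(t) - D(t), 0) + X(t)."""
    return max(queue_len - departed, 0.0) + arrived
```

For example, with 5 PRBs of 180 kHz, a spectral efficiency of 2 bit/s/Hz, and 1.8 Mbit packets, one packet leaves per slot; these numbers are illustrative and not taken from the paper.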

2.2 Optimization problem

The objective of this study is to maximize the overall user experience on a long time scale, that is, the difference between the data rate revenue and the total delay loss.

The total radio interface delay considered in this study includes the processing delay $d_n^{\mathrm{proc}}$ and the transmission delay $d_n^{\mathrm{tran}}$ of user $n$. After the BS receives the data request from the corresponding user, the time required to process the data packets is defined as the processing delay. The data processing delay of user $n$ accessing BS $j$ is expressed as

where $R_{j,n}$ is the rate at which BS $j$ processes the data packets of user $n$, and $S_{j,n}$ is the data packet size of user $n$ accessing BS $j$.

The time required to transmit data packets over the air interface between the BS and the user is defined as the transmission delay. The data transmission delay of user $n$ is expressed as

The total radio interface delay of user $n$ is

The total radio interface delay of the whole network is

The long-term average total radio interface delay of the whole network is

The average network benefit and the average network cost of the system can be expressed as

where $\delta_r$ and $\delta_d$ refer to the unit prices of the data rate and delay, respectively.
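To make the benefit and cost terms concrete, the following sketch computes a per-user delay and the resulting user experience. It assumes that each delay component is a packet size divided by a rate, which matches the symbols defined above but is not confirmed by the (missing) equations. The default unit prices are the ones quoted in the simulation setup; the function names are hypothetical.

```python
def total_delay(packet_bits, processing_rate_bps, transmission_rate_bps):
    """Processing delay plus transmission delay, in seconds.

    Each component is assumed to be size / rate.
    """
    return packet_bits / processing_rate_bps + packet_bits / transmission_rate_bps


def user_experience(rate_mbps, delay_ms, price_rate=5.0, price_delay=1.0):
    """Data rate revenue minus delay loss.

    price_rate is per Mb/s and price_delay is per ms (delta_r, delta_d).
    """
    return price_rate * rate_mbps - price_delay * delay_ms
```

A user served at 10 Mb/s with a 20 ms total delay would score 5 × 10 − 1 × 20 = 30 under these prices.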

The overall average user experience is

Therefore,the optimization problem is

C1 indicates that user admission control and resource allocation should meet the minimum data rate requirements of users. C2 indicates that user admission control and resource allocation should meet the user delay limit. C3 means that the total power allocated to users by each BS should not exceed its maximum transmission power limit $p_j^{\max}$. C4 means that each PRB can be assigned to only one user. C5 indicates that each user can be associated with only one BS. C6 means that the data processing rate required by each user on any BS should not exceed the total data processing rate of the BS, where $R_j$ represents the total data processing rate of BS $j$. C7 means that the total allocated bandwidth of BS $j$ is not greater than the upper limit of the available bandwidth $W_j$ of BS $j$.
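A feasibility check for three of these constraints (C3, C4, and C5) can be sketched as follows. The array layout and the function name are assumptions. C4 is read here as "at most one user per PRB across all BSs", since the PRBs are shared by all BSs, and C5 as "at most one BS per user", since admission control may reject a user.

```python
def feasible(admit, prb, power, p_max):
    """Check C3-C5 for one time slot.

    admit[j][n] and prb[j][n][b] are 0/1 indicators, power[j][n][b] is the
    power on PRB b, and p_max[j] is the power budget of BS j.
    """
    num_bs, num_users = len(admit), len(admit[0])
    num_prbs = len(prb[0][0])
    # C3: total power allocated by each BS stays within its budget
    for j in range(num_bs):
        used = sum(prb[j][n][b] * power[j][n][b]
                   for n in range(num_users) for b in range(num_prbs))
        if used > p_max[j]:
            return False
    # C4: each PRB is assigned to at most one user
    for b in range(num_prbs):
        if sum(prb[j][n][b] for j in range(num_bs) for n in range(num_users)) > 1:
            return False
    # C5: each user is associated with at most one BS
    for n in range(num_users):
        if sum(admit[j][n] for j in range(num_bs)) > 1:
            return False
    return True
```

The rate, delay, processing, and bandwidth constraints (C1, C2, C6, C7) would need the rate and delay expressions and are left out of this sketch.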

3 Dynamic resource allocation algorithm based on DQN

In traditional resource allocation problems, the Q-learning algorithm is often used. Q-learning works well when the state space and action space are discrete and low-dimensional, because a Q-table can store the Q value of each state-action pair. However, when the state space and action space are high-dimensional and continuous, the table becomes too large to store and update, which is the main limitation of Q-learning. As a value-iteration-based algorithm similar to Q-learning, DQN combines a deep multi-layer convolutional neural network (CNN) with Q-learning. When the state space and action space are high-dimensional and continuous, DQN transforms the update of the Q-table into a function-fitting problem. Because a fitted function rather than a Q-table generates the Q value, similar states produce similar output actions. Therefore, we propose a DQN-based dynamic allocation algorithm for wireless resources to solve our optimization problem and dynamically allocate wireless resources in the access network.
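The scaling problem of a Q-table can be illustrated with a quick count. The sketch assumes that the state is one (queue level, channel state) pair per BS and that the joint action is one choice per BS; the level and action counts in the example are made up for illustration, and only the four BSs come from the simulation setup.

```python
def q_table_entries(num_bs, queue_levels, channel_states, actions_per_bs):
    """Number of (state, action) entries a tabular Q function would need."""
    states = (queue_levels * channel_states) ** num_bs
    actions = actions_per_bs ** num_bs
    return states * actions
```

With 4 BSs, 10 queue levels, 5 channel states, and 8 actions per BS, `q_table_entries(4, 10, 5, 8)` is already about 2.56 × 10^10 entries, which suggests why a fitted function is preferred over a table.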

3.1 Reconstruction of constrained Markov decision process (CMDP) based on DQN

The optimization problem in this study can be formulated as a CMDP problem (Xu et al., 2021). CMDP is closely related to reinforcement learning. CMDP uses a time-varying random variable to model the state of the system, and its state transition depends on the current state and the action vector applied to the system. A Markov decision process is used to calculate the action strategy that maximizes the utility related to the expected reward. In this model, user admission control, PRB allocation, and power allocation are formulated as a CMDP problem, which can be denoted as a quadruple $\langle\mathcal{C},\mathcal{A},p_a(c'|c),R_a(c'|c)\rangle$, where $\mathcal{C}$ represents the finite set of states in the network and $\mathcal{A}$ represents the finite set of possible actions. When action $a$ is taken in state $c$ during the current time slot $t$, $p_a(c'|c)$ is the probability that the state will transition from $c$ to $c'$. When the system transitions to state $c'$ after performing action $a$ in state $c$, $R_a(c'|c)$ is the reward function, indicating the immediate cost/reward, which reflects the learning objective. The basic elements thus include the system state, resource allocation behavior, state transition probability, and cost function.

Take state $c$ as the input to the DQN algorithm. After the neural network analysis, the DQN algorithm outputs the corresponding action. The main idea behind the algorithm is to approximate the distribution of Q values using the neural network training function $f_{ap}$. The Q value can be denoted as

where $\theta$ denotes the main network’s weight, and $Q(c,a)=[Q(c,a_1),Q(c,a_2),\cdots,Q(c,a_K)]$ (here, $K$ is the maximum number of actions that can be taken in $\mathcal{A}$).

The target Q-network is updated only once in a period, while the main network is updated after each iteration. The target Q value can be denoted as

where the discount factor $\gamma\in[0,1)$ represents the decay degree of the reward function value, indicating the impact of the future reward on the current behavior choice, and $\theta^{-}$ is the target Q-network’s weight. To improve the prediction performance of the network, the weights must be trained repeatedly to fit complicated environmental data.

Fig. 3 depicts the DQN training procedure. In this training model, the optimization of weight $\theta$ is achieved by minimizing the loss function between the main network and the target Q-network, which can be described as

Fig. 3 Deep Q-learning network training model
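The target value and the loss are not reproduced in this copy. The sketch below uses the standard DQN forms, $y=r+\gamma\max_{a'}Q(c',a';\theta^{-})$ and a mean squared error between $y$ and $Q(c,a;\theta)$, as an assumption about what the missing equations contain. The default discount factor is a placeholder, not a value from the paper.

```python
def td_target(reward, next_q_target, gamma=0.9, terminal=False):
    """Standard DQN target y = r + gamma * max_a' Q(c', a'; theta^-).

    next_q_target holds the target network's Q values for the next state.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_target)


def mse_loss(targets, predictions):
    """Mean squared error between targets y and main-network values Q(c, a; theta)."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)
```

In a full implementation the gradient of this loss would update $\theta$ only, while $\theta^{-}$ stays fixed between periodic copies.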

Once the main network of the DQN algorithm has been trained, it can be used to find the optimal allocation strategy for wireless resources. The dynamic wireless resource allocation algorithm proceeds as follows: in time slot $t$, the system state is specified as $c_t=(\mathbf{Q}(t),\mathbf{H}(t))\in\mathcal{C}$, and the action is defined as $a_t=(a(t),\varphi(t),p(t))\in\mathcal{A}$. The policy $\pi:\mathcal{C}\to\mathcal{A}$, a stationary policy that can be expressed as $a=\pi(c)$, maps the state space to the action space. According to the initial state $c$ and the strategy $\pi\in\Pi$, where $\Pi$ represents the set of all possible strategies, the expected cumulative network sum rate in time slot $t$ can be denoted as

The expected cumulative sum delay of the total network radio interface is

3.2 Algorithm implementation

The state, action, and reward of the proposed algorithm are defined as follows:

State: Define the state of the access network as $c_t=(\mathbf{Q}(t),\mathbf{H}(t))\in\mathcal{C}$, including the global queue state information $\mathbf{Q}(t)$ and the global channel state information $\mathbf{H}(t)$.

Action: The action $a_t^*$ is defined as a series of vectors. Each vector represents the user admission control, PRB allocation, and power allocation on all BSs, satisfying $[a^*(t),\varphi^*(t),p^*(t)]=\arg\max_{a\in\mathcal{A}}Q(c_t,a,\theta)$ (as in step 3 of Algorithm 1), where $a^*(t)$, $\varphi^*(t)$, and $p^*(t)$ represent the user admission control scheme, PRB allocation strategy, and power allocation strategy that maximize the user experience in time slot $t$, respectively.

Reward: Considering that the objective of this algorithm is to maximize the overall average user experience, the reward function is defined as the sum of the user experience gained after all users are associated with BSs and allocated PRBs and power, provided that constraints C1-C7 are satisfied. Otherwise, it is defined as a negative feedback:
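The reward expression is missing from this copy. A minimal sketch of the rule stated above, assuming a fixed penalty value (the text does not give the magnitude of the negative feedback), is:

```python
def reward(user_experiences, constraints_ok, penalty=-1.0):
    """Sum of per-user experience if C1-C7 hold, otherwise a negative feedback.

    penalty is a placeholder value; the paper's value is not stated here.
    """
    return sum(user_experiences) if constraints_ok else penalty
```

Any constraint checker can supply `constraints_ok`; the per-user experience values are the rate revenue minus the delay loss defined in Section 2.2.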

The specific flow of the algorithm is shown in Algorithm 1. At step 3, the optimal action $a_t$ under state $c_t$ is obtained according to the output of the latest main network. At step 4, the PRB and power allocation of the access network are jointly adjusted according to $a_t$, to ensure the QoS in real time and obtain the final user admission control scheme $a_{j,n}(t)$, power allocation strategy $p^{b}_{j,n}(t)$, and PRB allocation strategy $\varphi^{b}_{j,n}(t)$. Then the resource allocation process ends.

Algorithm 1 DQN-based dynamic allocation
Input: system initial state $c$ and the corresponding reward $r(c,a)$
1: for $t=1,2,\cdots,T$ do
2:   In current time slot $t$, monitor the global state $c_t$ of the access network, including the global channel state information $\mathbf{H}(t)$ and the global queue state information $\mathbf{Q}(t)$
3:   Calculate the optimal power and PRB allocation actions, $a_t=\arg\max_{a\in\mathcal{A}}Q(c_t,a,\theta)$
4:   Adjust the power and PRB allocation depending on the optimal action $a_t$
5:   $t=t+1$
6: end for
Output: user admission control scheme $a_{j,n}(t)$, power allocation strategy $p^{b}_{j,n}(t)$, and PRB allocation strategy $\varphi^{b}_{j,n}(t)$
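The online loop of Algorithm 1 can be sketched in a few lines. The three callables passed in (`observe_state`, `q_network`, `apply_action`) are hypothetical hooks standing in for the network monitoring, the trained main network, and the resource configuration; they are not part of the paper.

```python
def greedy_action(q_values):
    """Index of the largest Q value (step 3 of Algorithm 1)."""
    return max(range(len(q_values)), key=lambda k: q_values[k])


def run_allocation(observe_state, q_network, apply_action, num_slots):
    """Run the per-slot allocation loop and return the chosen action indices."""
    chosen = []
    for t in range(1, num_slots + 1):
        state = observe_state(t)                  # step 2: queue and channel state
        action = greedy_action(q_network(state))  # step 3: arg max over Q values
        apply_action(t, action)                   # step 4: adjust PRBs and power
        chosen.append(action)
    return chosen
```

Each action index would map to one joint admission, PRB, and power configuration; how that mapping is enumerated is left open here.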

4 Simulation results and analysis

In this section, the overall user experience of the system and the average user experience of a single user are used as the performance evaluation indices to evaluate the feasibility of the built model and the effectiveness of the proposed algorithm. The algorithm proposed in this study is compared with the heuristic algorithm (Kalil et al., 2017) and the minimum distance allocation (MDA) algorithm (Zhang et al., 2021). In the heuristic algorithm, the weight of each user is calculated according to the queue state and channel state of each BS in the current time slot and the minimum resource requirement of each user. Network resources are then allocated to users according to these weights in each discrete resource scheduling time slot. In the MDA algorithm, each BS associates users according to the shortest distance, and each PRB allocates the same amount of power for users.

4.1 Simulation environment

In the simulations, we assume that four BSs are distributed uniformly in a 2 km × 2 km area. The coordinates are (0.5, 0.5), (0.5, 1.5), (1.5, 0.5), and (1.5, 1.5) km, and users are randomly distributed in the area. There are three types of services required by users; the minimum rate requirements and the total radio interface delay requirements differ between users, and the arrival process of user data packets follows an independent and identically distributed Poisson distribution. In addition, the noise power is set to $\sigma^2=10^{-7}$ mW. The optional power level on the PRB is {0, 0.5, 1} dBm. The service rate unit price and the delay unit price are 5 per Mb/s and 1 per ms, respectively.
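For reference, the MDA baseline described at the start of this section can be sketched with the BS layout above: each user attaches to the nearest BS, and a BS splits its power evenly over its PRBs. The helper names are hypothetical, and this is an illustration of the stated rule rather than the implementation of Zhang et al. (2021).

```python
import math

# BS coordinates in km, taken from the simulation setup
BS_COORDS_KM = [(0.5, 0.5), (0.5, 1.5), (1.5, 0.5), (1.5, 1.5)]


def nearest_bs(user_xy, bs_coords=BS_COORDS_KM):
    """Index of the closest BS by Euclidean distance."""
    return min(range(len(bs_coords)), key=lambda j: math.dist(user_xy, bs_coords[j]))


def equal_power_per_prb(p_max_mw, num_prbs):
    """Split a BS power budget evenly over its PRBs."""
    return [p_max_mw / num_prbs] * num_prbs


def dbm_to_mw(p_dbm):
    """Convert a power level from dBm to mW."""
    return 10 ** (p_dbm / 10)
```

A user at (0.2, 0.3) km would attach to the first BS. A 39 dBm budget corresponds to roughly 7.9 W, which `dbm_to_mw(39)` returns in mW.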

In the DQN-based dynamic allocation algorithm, a multi-layer CNN with three convolution layers and two fully connected layers is used in both the main network and the target Q-network. The relevant information of each layer includes the size of the convolution kernel, the convolution stride, and the number of convolution kernels. The queue length of each BS is discretized into a finite number of equally spaced intervals, and each interval represents the current queue state. Therefore, the system state space in the constrained Markov decision problem is a finite state set. The parameters of the target Q-network are updated every 200 iterations. In the training process, the capacity of the DQN experience replay pool is set to 10 000, and the ε-greedy strategy uses a probability value of ε=0.7. The remaining parameters are shown in Table 1.
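The three training settings quoted above (replay capacity 10 000, target update every 200 iterations, ε=0.7) map onto small, standard components. The sketch below is an assumption about how they could be wired together; in particular, the text does not say whether 0.7 is the probability of exploring or of acting greedily, and exploration is assumed here.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-capacity experience replay; the oldest transitions are dropped first."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def epsilon_greedy(q_values, epsilon=0.7, rng=random):
    """Explore uniformly with probability epsilon (assumed), else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda k: q_values[k])


def should_sync_target(iteration, period=200):
    """True when the target Q-network should copy the main network's weights."""
    return iteration > 0 and iteration % period == 0
```

During training, each transition would be added to the pool, a mini-batch sampled for the loss, and the target weights refreshed whenever `should_sync_target` returns true.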

Table 1 Simulation parameters

4.2 Performance evaluation

Fig. 4 shows how the user experience of the system under the three resource allocation algorithms changes over time when the number of users is 30 and the maximum transmission power of the BS is 39 dBm. The figure shows that the user experience of the proposed algorithm and the heuristic algorithm tends to stabilize over time, while the user experience obtained by the MDA algorithm, which is a static resource allocation algorithm, does not change with time. Compared with the heuristic and MDA algorithms, the proposed algorithm obtains a superior user experience on a long time scale.

Fig. 4 Changes of the user experience of the system over time when the number of users is 30 and the maximum transmission power of the base station is 39 dBm

Fig. 5 illustrates the relationship between the average user experience and the number of users on a long time scale when the maximum transmission power of the BS is 39 dBm. Fig. 5a shows the average user experience of all the users in the system, and Fig. 5b shows the average user experience of a single user. The simulation results show that, compared with the heuristic and MDA algorithms, the proposed algorithm achieves the highest average user experience. Because the heuristic algorithm considers only the minimum resource demand of each user, it can guarantee the service rate but cannot achieve the optimal user experience. In the MDA algorithm, each PRB allocates the same amount of power for users, so resources cannot be flexibly and dynamically allocated according to user needs.

Fig. 5 Average user experience varying with the number of users when the maximum transmission power of the base station is 39 dBm: (a) average user experience of the system; (b) average user experience of a single user

In Fig. 5a, when the number of users is small, the average user experience obtained by the heuristic algorithm is similar to that obtained by the proposed algorithm, because the network resources are relatively sufficient. With the increase in the number of users, the increase of the data rate revenue is greater than the total delay loss in the whole network, so the average user experience of all the users in the system increases. In addition, it can be seen from Fig. 5b that the average user experience of a single user decreases as the number of users increases, due to the limitation of radio resources in the network. When the number of users in the system is small, the network resources are relatively sufficient, and a single user can obtain a high data rate and a low delay. As the number of users grows, the available resources become limited. When the number of users reaches a certain scale, the proposed algorithm can maintain only the minimum rate and delay requirements of each user. Therefore, the average user experience of a single user gradually decreases as the number of users increases. From the simulation results, it can be concluded that the proposed algorithm performs best among the three algorithms in terms of both the overall user experience of the system and the average user experience of a single user.

Fig. 6 shows the relationship between the average user experience of the system and the maximum transmission power of the BSs when the number of users is 30. It can be seen from Fig. 6 that the user experience of all three algorithms increases with the maximum transmission power of the BSs. An increase in the transmission power of the BSs boosts the data rate revenue and improves the overall user experience of the system. When the maximum transmission power of the BSs is small, the average user experience of the MDA algorithm is negative, because the transmission power of the BSs is too small to guarantee the service rate and latency requirements of the surrounding users. Comparing the three algorithms, the proposed algorithm achieves the highest average user experience and the best performance.

Fig. 6 Average user experience of the system varying with the maximum transmission power of the base stations (BSs) when the number of users is 30

5 Conclusions and future work

Considering the future wide-area coverage signaling cell scenario, we proposed a dynamic user-centric multi-dimensional resource allocation method. Considering the different QoS requirements of users in different industries, we constructed a dynamic allocation model for wireless resources. A DQN-based dynamic allocation algorithm for wireless resources was proposed to maximize the overall user experience. In the model, the network fully perceived its state through various measurements reported by the terminal. The proposed algorithm realized on-demand user admission control and dynamic resource allocation according to the rate and latency requirements reported by users. The simulation results showed that the proposed algorithm can effectively improve the average user experience on a long time scale, while meeting the minimum data rate requirements and latency constraints of users and keeping the energy consumption of the network low during resource allocation, thus optimizing the overall network utility in real time and realizing on-demand wireless resource allocation.

In future work, more types of resources can be incorporated into the model of this paper, including communication resources, computing resources, and cache resources, to enable deeper integration of data, information, and communication technologies.

Contributors

Zhou TONG, Na LI, Junshuai SUN, and Guangyi LIU designed the research. Zhou TONG and Huimin ZHANG conducted the simulations. Zhou TONG drafted the paper. Na LI helped organize the paper. Zhou TONG, Quan ZHAO, and Yun ZHAO revised and finalized the paper.

Compliance with ethics guidelines

Zhou TONG, Na LI, Huimin ZHANG, Quan ZHAO, Yun ZHAO, Junshuai SUN, and Guangyi LIU declare that they have no conflict of interest.

Data availability

Data not available due to commercial restrictions. Due to the nature of this research, participants of this study did not agree for their data to be shared publicly, so supporting data is not available.