Kai Chen, Qinglei Kong, Yijue Dai, Yue Xu, Feng Yin*, Lexi Xu, Shuguang Cui
1 School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China
2 Future Network of Intelligence Institute (FNii), The Chinese University of Hong Kong, Shenzhen 518172, China
3 School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China
4 Research Institute, China United Network Communications Corporation, Beijing 100048, China
5 Alibaba Group, Hangzhou 310052, China
Abstract: Data-driven paradigms are a well-known and salient demand of future wireless communication. Empowered by big data and machine learning techniques, next-generation data-driven communication systems will be intelligent, with the characteristics of expressiveness, scalability, interpretability, and uncertainty awareness, and can thus confidently accommodate diversified latent demands and personalized services in the foreseeable future. In this paper, we review a promising family of nonparametric Bayesian machine learning models, i.e., Gaussian processes (GPs), and their applications in wireless communication. Since GP models demonstrate outstanding expressive and interpretable learning ability with uncertainty, they are particularly suitable for wireless communication. Moreover, they provide a natural framework for the collaboration of data and empirical models (DEM). Specifically, we first envision three levels of motivation for data-driven wireless communication using GP models. Then, we present the background of GPs in terms of covariance structure and model inference. The expressiveness of the GP model using various interpretable kernels, including stationary, non-stationary, deep, and multi-task kernels, is showcased. Furthermore, we review distributed GP models with promising scalability, which are suitable for applications in wireless networks with a large number of distributed edge devices. Finally, we list representative solutions and promising techniques that adopt GP models in various wireless communication applications.
Keywords: wireless communication; Gaussian process; machine learning; kernel; interpretability; uncertainty
Recently, there has been an explosion of work on artificial intelligence (AI) for wireless communications [1-6]. At the same time, traditional paradigms (TPs) based on mathematical modeling have greatly hindered the progress of future wireless communications and negatively affected emerging applications, such as the Internet of Vehicles (IoV) [7-9], the Internet of Things (IoT) [10-14], augmented/virtual reality (AR/VR) [15, 16], and energy-efficient 5G [17-21]. Increasingly, many new breeds of smart connected sensors and AI-enabled applications depend heavily on intelligent real-time response and explainable decision making, e.g., emergency braking in self-driving vehicles, obstruction warning for drones, fault diagnosis for intelligent manufacturing, environmental perception for cooperative multirobot systems, and predictable human-computer interaction for AR/VR, to reduce response times and human intervention. These application-driven requirements demand that next-generation communication systems [22-24] be intelligent, with the following desirable features: flexibility, scalability, interpretability, and especially uncertainty modeling, so that they can confidently accommodate latent demands and personalized services in the future.
Compared with traditional paradigms in wireless communication, a significant advantage of machine learning is its capability of gaining knowledge and automatically extracting information without specific rules [25]. However, because their state predictions are insufficiently interpretable, machine learning methods with black-box decision making [26-28] often complicate the diagnosis and analysis of complex communication systems and lead to only a passive understanding of their functioning mechanisms. To promote interpretable machine learning for data-driven wireless communications, in this paper we review the Gaussian process (GP) model and present its applications in wireless communications, motivated by its interpretable learning ability with uncertainty.
A GP is a generalization of the Gaussian probability distribution: it is any distribution over functions $f(\mathbf{x})$ such that any finite set of function values has a joint Gaussian distribution [25,27]. The GP provides a model in which a posterior distribution over the unknown function is maintained as evidence is accumulated. This allows GPs to learn the underlying functions of wireless communication systems when a large number of observations are collected. In contrast to the popular deep neural network (DNN) [31] and other learning models, the GP model has the unique property of uncertainty quantification with a closed-form mathematical expression, which is of great value to data-driven wireless systems that demand controllable and understandable state prediction. In Table 1, we compare GPs with DNN, reinforcement learning (RL), and TPs in terms of model expressiveness, interpretability, scalability, uncertainty modeling, sample efficiency, and collaboration of data and empirical models (DEM).
A GP can model a large and complex wireless communication system through the design of its covariance function (also called the kernel function), which encodes one's assumptions about the auto-covariance of an unknown function. Therefore, the kernel is crucial in a GP model, as it determines the characteristics of the distribution over functions. In addition, scalable inference is another core aspect of GPs because the computational complexity of exact GP inference is cubic, $O(n^3)$, in the number of training points $n$, which usually prevents the GP from being applied to big data problems in large-scale wireless communication systems. Thus, the most important advances in wireless communication using GPs are related to both kernel function design and scalable inference, which have been extensively studied [2, 34-36]. There are a number of tutorials and survey works reviewing GPs in wireless communication, such as [2,37]. In a recent magazine paper [2], the GP model was introduced as a replacement for the traditional deep learning model to address critical uncertainty issues of next-generation data-driven wireless communication systems; however, technical details of the GP models were kept brief. In the magazine paper [37], the GP model was introduced for nonlinear signal processing applications. Therein, a number of illustrative applications of signal processing for wireless communication, such as probabilistic channel equalization and tracking of nonlinear channel variations, were presented. The main focus of [37] lies in demonstrating the superior nonlinear mapping ability of GP models rather than in model expressiveness and interpretability. In contrast, our survey gives both a complete overview of the mathematical advances developed in recent years and a comprehensive survey of wireless communication applications that benefit from GP models.
Generally, for kernel function design, there are broadly four categories of covariance design in the existing works on GPs: (1) compositional kernel design [38,39], where kernels are constructed compositionally from several existing base kernels; (2) spectral kernel learning, where kernels are derived by modeling the kernel spectral density as a mixture of distributions [40-43]; (3) deep kernel representation [44,45], where a DNN provides the nonlinear mapping between input space and feature space; and (4) multi-task kernels [46,47], where adjacent devices (tasks) share knowledge and interact with each other to obtain collective intelligence. In the following sections, we review related works in detail.
To overcome the computational complexity issue of GPs [2,27,34,48] in large-scale wireless communication systems, scalable inference can be achieved by exploiting (1) low-rank covariance matrix approximation [49,50], (2) special structures of the kernel matrix [51,52], (3) the Bayesian committee machine (BCM), which distributes computations to a large number of computing units [53,54], (4) variational Bayesian inference [55,56], and (5) special optimization [43,57]. Notably, these scalable methods are not mutually exclusive, and some of them can be combined to obtain a better method; for instance, stochastic variational inference (SVI) [55,56] combines the strengths of inducing points for low-rank approximation and variational inference.
The main contributions of this survey are summarized below:
• We extensively discuss the generally desired AI features of next-generation data-driven wireless communication systems, namely, expressiveness, scalability, interpretability, and uncertainty modeling. Regarding these aspects, we compare GP with other machine learning methods and conclude that GP covers these qualities better (as shown in Table 1).
• We broadly analyze and explain four categories of covariance design in terms of mathematical theorems and GP kernel expressions, including (1) stationary kernels, (2) non-stationary kernels, (3) deep kernels, and (4) multi-task kernels. These kernels leverage both the expressiveness and interpretability of the GP model in wireless communication.
• Due to the scalability demand and distributed deployment of wireless communication systems, we review and evaluate the advances of distributed GPs with scalable inference for big data in cloud intelligence as well as on AI-enabled edge devices.
• We show an exemplary case by extrapolating the number of online 5G users collected from a real-world 5G wireless base station. The results show the expressiveness, interpretability, and uncertainty modeling of GPs for data-driven wireless communication.
• We exhibit some representative wireless communication scenarios for applying GP models and further envision some open issues and challenges of using GPs for future data-driven wireless communication.
Table 1. Comparisons between GP and other popular methods in terms of AI characteristics for wireless communication.
In the rest of this paper, we begin by introducing the motivation for using GPs in data-driven wireless communication in Section II and then give the mathematical background of GPs in Section III. In Section IV and Section V, we present the advances of GPs. A demonstration of wireless communication using GPs is given in Section VI. In Section VII and Section VIII, we present existing GP applications and future research directions in wireless communications, respectively.
In this section, we present the unique features of next-generation data-driven wireless communication using machine learning methods with expressiveness, scalability, uncertainty modeling, and interpretability (see Table 1).
Due to the inherent intelligence requirements of data-driven wireless communication systems, there are three levels of motivation for applying GPs. First, the low-level motivation is based on the demand for smart, efficient, and flexible decision making, planning, and prediction in future wireless communication systems [3], which cannot be met by traditional paradigms. Then, the comparison between GP and other machine learning methods brings the middle-level motivation and comprehensively explains why we tend to choose the GP model for data-driven wireless communication systems [25, 27]. As shown in Section VI, the high-level motivation is derived from the competitive applications empowered by GPs in wireless communication. Specifically, the motivations can be summarized as follows:
• For future wireless communication systems, many latent demands and personalized services driven by diversified applications are expected. These latent demands and personalized services can be further modeled and improved by machine learning methods as historical data grows and computing power keeps increasing. Several features describe future wireless communication: (a) expressiveness, correlated with the model complexity that results from diversified application scenarios [4,58]; (b) scalability on big data, due to the ever-growing network size with network densification and an increasing number of connected intelligent devices [3,59]; (c) uncertainty resulting from a dynamic communication environment [60,61]; and (d) interpretable knowledge discovery and representation for understanding the mechanisms of complex systems [62,63]. In particular, uncertainty modeling is critical for state prediction in wireless networks, since multiple noise sources and dynamic factors always affect the status of the system and the mobile users' experience.
• As a class of Bayesian nonparametric models, the GP provides a principled, practical, probabilistic approach for learning the patterns encoded by the kernel structure [27]. Among machine learning models, the GP has a tight connection with various learning models [25-27], including spline models, support vector machines (SVMs), regularized least-squares models, relevance vector machines (RVMs), autoregressive moving averages (ARMAs), and deep neural networks (DNNs). In particular, GPs have advantages with respect to the interpretation of model learning, model selection, and uncertainty prediction from a Bayesian point of view. With an appropriate kernel structure and computational approximation, a GP can flexibly and scalably model a wide range of functions arising in wireless communication systems. Owing to Bayes' rule, a GP with a measure of uncertainty is more robust to overfitting. In comparison with other machine learning models, the GP model can simultaneously meet the requirements of expressiveness, scalability, uncertainty modeling, and interpretability [25, 27] for data-driven wireless communications.
• Thanks to its Bayesian properties, the GP model offers attractive interpretability in terms of model construction, selection, and hyper-parameter adaptation (see Section III). Such interpretability promotes a large number of GP models that empower diversified wireless communication applications. There are five popular GP models using different kernels to support various wireless communication tasks, namely GP models with stationary spectral mixture (SM) kernels [40,64] and compositional kernels [65] (see Section 4.1), non-stationary (NS) kernels [66-69] (see Section 4.2), deep kernels [44,70] (see Section 4.3), and multi-task kernels [34,46] (see Section 4.4). Furthermore, GPs have scalable variants with distributed inference that scale to large data on a large number of edge devices (see Section V). Distributed GPs can make full use of the computational resources of local edge devices in wireless networks to gain efficiency as well as privacy protection [35,71].
There are multiple uncertainty issues in the modeling of wireless communication: (1) functional uncertainty, describing the gap between the true function and the learned model; (2) prediction uncertainty, whose range depends on the number of observations collected; (3) input uncertainty due to the noise generated during wireless propagation; and (4) output uncertainty due to unstable wireless propagation and the poor precision of measuring sensors. Theoretically, these uncertainties, as well as interpretability, can be well represented by a GP model.
In this section, we briefly describe the background of the Gaussian process for machine learning in terms of its mathematical definition, kernel function, and model inference.
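From the function-space view, a Gaussian process [26,27] defines a distribution over functions, $p(f(\mathbf{x}_1), f(\mathbf{x}_2), \ldots, f(\mathbf{x}_n)) \sim \mathcal{N}(m(\mathbf{x}), K(\mathbf{x}, \mathbf{x}'))$, that is completely specified by its first- and second-order statistics, namely, the mean function $m(\mathbf{x})$ and the covariance function $k(\mathbf{x}, \mathbf{x}')$ [72]. For a given input location $\mathbf{x} \in \mathbb{R}^P$ of a real stochastic process $f(\mathbf{x})$, the mean $m(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$ are defined as:

$$
\begin{aligned}
m(\mathbf{x}) &= \mathbb{E}[f(\mathbf{x})], \\
k(\mathbf{x}, \mathbf{x}') &= \mathbb{E}\big[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))\big].
\end{aligned} \tag{1}
$$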
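Thus, a GP is expressed as $f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$. Without loss of generality, the mean of a GP is often assumed to be zero everywhere because we usually do not have any prior knowledge about the mean. The covariance function (also called the kernel) between function values is used to construct a positive definite covariance matrix on the input points $X$ for the joint Gaussian distribution, here denoted by the Gram matrix $K = K(X, X)$. By placing a GP prior over functions through the kernel design and parameter initialization, from the training data $\{X, \mathbf{y}\}$ we can predict the unknown function value $\tilde{y}_*$ and its variance $\mathbb{V}[y_*]$ (that is, its uncertainty) for a test point $\mathbf{x}_*$. Specifically, we have the following predictive equations for GP regression [27,48]:

$$\tilde{y}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}, \tag{2a}$$

$$\mathbb{V}[y_*] = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*, \tag{2b}$$

where $\mathbf{k}_* = K(X, \mathbf{x}_*)$ is the vector of covariances between the training inputs and the test point, and $\sigma_n^2$ is the noise variance.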
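The predictive mean and variance of GP regression can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the surveyed works: the squared-exponential kernel, the function names `rbf_kernel` and `gp_predict`, the toy data, and the default noise variance are all assumptions made for the example. A Cholesky factorization replaces the explicit matrix inverse for numerical stability.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=1e-2, **kernel_args):
    """Zero-mean GP regression: predictive mean and latent variance at x_test."""
    K = rbf_kernel(x_train, x_train, **kernel_args)
    K_s = rbf_kernel(x_train, x_test, **kernel_args)        # n x m cross-covariances
    k_ss = rbf_kernel(x_test, x_test, **kernel_args).diagonal()
    # Factorize (K + noise_var * I) once instead of inverting it explicitly.
    L = np.linalg.cholesky(K + noise_var * np.eye(len(x_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                                    # k_*^T (K + s^2 I)^{-1} y
    v = np.linalg.solve(L, K_s)
    var = k_ss - np.sum(v ** 2, axis=0)                     # k_** - k_*^T (K + s^2 I)^{-1} k_*
    return mean, var

# Toy data (hypothetical): three noiseless samples of sin(x).
x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 7)
mean, var = gp_predict(x_train, y_train, x_test)
```

The variance is small at test points that coincide with training inputs and grows with distance from them; adding `noise_var` to `var` would give the variance of a noisy observation instead of the latent function value.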
Basically, the smoothness and generalization properties of a GP depend on the kernel function and its hyper-parameters $\Theta$. Choosing an appropriate kernel function and the corresponding initial hyper-parameters is crucial to GP design, since the posterior distribution can vary significantly for different kernels. The most widely used covariance functions are stationary. We introduce a generalized theory of both stationary and non-stationary covariance functions in the later sections. The underlying function to be modeled by a GP can have many characteristics, such as exponentially decaying dependency and periodic dependency, which can be encoded by specific covariance functions.
To make the GP model applicable in practice, the inference of the GP model is also very important. During the inference phase, the freedom of model selection is considerable even when an appropriate covariance has been specified in advance. Typically, GPs contain hyper-parameters $\Theta$ describing the properties of the kernel and the noise of the GP. Suppose we have chosen a covariance function $k(\mathbf{x}, \mathbf{x}')$ with hyper-parameters $\Theta_k$. The inference of the GP then amounts to Bayesian model selection, i.e., finding the best possible values of $\Theta = \{\Theta_k, \sigma_n^2\}$. Such selection can be accomplished by minimizing the negative log marginal likelihood (NLML), $\mathcal{L}_{\mathrm{NLML}} = -\log p(\mathbf{y}|X, \Theta)$. The inference and posterior sampling of a GP model are illustrated in Figure 1.
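As a rough sketch of this model selection step, the NLML of a zero-mean GP can be evaluated with a Cholesky factorization and compared across candidate hyper-parameters. The squared-exponential kernel, the synthetic data, and the coarse grid over the length scale are assumptions made for illustration; in practice a gradient-based optimizer is normally used instead of a grid.

```python
import numpy as np

def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential kernel between 1-D input arrays (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)

def nlml(theta, x, y):
    """Negative log marginal likelihood -log p(y | X, Theta) of a zero-mean GP.

    theta = (length_scale, signal_var, noise_var), all assumed positive.
    """
    length_scale, signal_var, noise_var = theta
    n = len(x)
    K_y = rbf_kernel(x, x, length_scale, signal_var) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = 0.5 * y @ alpha                      # 0.5 * y^T K_y^{-1} y
    complexity = np.sum(np.log(np.diag(L)))         # equals 0.5 * log|K_y|
    return data_fit + complexity + 0.5 * n * np.log(2.0 * np.pi)

# Synthetic data (hypothetical): a noisy sine wave.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

# Crude model selection: keep the length scale with the smallest NLML on a grid.
grid = [0.05, 0.3, 1.0, 3.0, 10.0]
scores = [nlml((ls, 1.0, 0.01), x, y) for ls in grid]
best_length_scale = grid[int(np.argmin(scores))]
```

The data-fit term rewards explaining the observations, while the log-determinant term penalizes overly flexible models, which is why minimizing the NLML acts as an automatic trade-off between fit and complexity.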
Figure 1. Samples from the GP prior distribution and the GP posterior distribution based on three observations (black crosses). Subplot (a) shows the prior distribution (in cyan) and samples (in light blue, dark blue, and red); subplots (b), (c), and (d) show the posterior distribution and samples with one, two, and three observations, respectively. The shaded area (in cyan) can be seen as the uncertainty bound of the predictive function value. As more observations are collected, the GP adapts smoothly to the underlying function.
Figure 2. Spectral densities (left) modeled as a mixture of Gaussians and the corresponding covariance functions (right) in the SM kernel. For SM, the location (black dot) of each component denotes the inverse period of the underlying function.
Figure 3. Spectrogram (left), depending on both input x and spectral density s, and the corresponding covariance functions (right) for NSM.
The NLML can be used to assess the goodness of fit of the GP model. To evaluate a GP model, we usually apply the mean squared error (MSE) and the mean absolute error (MAE) to measure prediction performance. In addition, the predictive uncertainty described in Eq. (2b) scores the confidence of the prediction.
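These evaluation quantities are straightforward to compute from the predictive mean and variance. The sketch below is illustrative only: the function name `evaluate_predictions`, the toy numbers, and the use of a mean ± 1.96 standard deviations interval as an approximate 95% confidence interval are assumptions for the example.

```python
import numpy as np

def evaluate_predictions(y_true, mean, var, z=1.96):
    """MSE, MAE, and empirical coverage of the interval mean +/- z * std."""
    err = y_true - mean
    std = np.sqrt(var)
    inside = np.abs(err) <= z * std       # is the true value within the interval?
    return {
        "mse": float(np.mean(err ** 2)),
        "mae": float(np.mean(np.abs(err))),
        "coverage": float(np.mean(inside)),
    }

# Toy predictions (hypothetical numbers): the last point is badly missed.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
mean = np.array([1.1, 1.9, 3.0, 5.0])
var = np.array([0.04, 0.04, 0.04, 0.04])
report = evaluate_predictions(y_true, mean, var)
```

A coverage well below the nominal level suggests overconfident predictive variances, whereas a coverage near 100% with very wide intervals suggests the model is underconfident.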
Data generated in wireless communication systems often exhibit the following patterns: (1) weekly periodic trends across weekdays and weekends, (2) daily periodic trends across working hours and spare time, (3) decaying deviations in terms of small-scale variation, and (4) noise that introduces irregular fluctuations. These patterns are generally stationary and can be captured by a GP with a flexible kernel structure (see Section VI). However, without tangible prior information, the number of patterns and their signal features are not clear enough to define and construct a GP model. Alternatively, we can apply a universal representation of stationary kernels and then automatically infer the latent patterns through optimization, which can simplify the practice of machine learning in wireless communication systems and enhance the efficiency of interpretable knowledge discovery.
In this section, we review the theoretical foundation of stationary covariance functions and recent GP works. A stationary covariance is a function of $\boldsymbol{\tau} = \mathbf{x} - \mathbf{x}'$ rather than of the input location $\mathbf{x}$, and is therefore invariant to translations in the input space [27]. For each covariance function of a stationary process, there is a corresponding representation in the frequency domain, namely the Fourier transform of a positive finite measure $\psi$. Referring to [73,74], Bochner's theorem establishes the connection between the covariance function and its spectral density.
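Theorem 1 (Bochner's Theorem [73,74]). A complex-valued function $k$ on $\mathbb{R}^P$ is the covariance function of a weakly stationary mean square continuous complex-valued random process on $\mathbb{R}^P$ if and only if it can be represented as

$$k(\boldsymbol{\tau}) = \int_{\mathbb{R}^P} e^{2\pi \jmath \mathbf{s}^\top \boldsymbol{\tau}} \, \psi(\mathrm{d}\mathbf{s}), \tag{3}$$

where $\psi$ is a positive finite measure and $\jmath$ denotes the imaginary unit.
If $\psi$ has a density $S(\mathbf{s})$, called the spectral density or power spectrum of $k$, Theorem 1 implies the following Fourier dual:

$$k(\boldsymbol{\tau}) = \int S(\mathbf{s}) \, e^{2\pi \jmath \mathbf{s}^\top \boldsymbol{\tau}} \, \mathrm{d}\mathbf{s}, \qquad S(\mathbf{s}) = \int k(\boldsymbol{\tau}) \, e^{-2\pi \jmath \mathbf{s}^\top \boldsymbol{\tau}} \, \mathrm{d}\boldsymbol{\tau}.$$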
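The SM kernel [40] models the spectral density as a mixture of $Q$ Gaussians (see Figure 2), which yields

$$k_{\mathrm{SM}}(\boldsymbol{\tau}) = \mathcal{F}^{-1}_{\mathbf{s} \to \boldsymbol{\tau}}\Big[\sum_{i=1}^{Q} w_i \, \hat{s}_{\mathrm{SM},i}(\mathbf{s})\Big](\boldsymbol{\tau}) = \sum_{i=1}^{Q} w_i \cos\big(2\pi \boldsymbol{\tau}^\top \boldsymbol{\mu}_i\big) \exp\big(-2\pi^2 \boldsymbol{\tau}^\top \Sigma_i \boldsymbol{\tau}\big), \tag{4}$$

where $\hat{s}_{\mathrm{SM},i}(\mathbf{s}) = [\mathcal{N}(\mathbf{s}; \boldsymbol{\mu}_i, \Sigma_i) + \mathcal{N}(-\mathbf{s}; \boldsymbol{\mu}_i, \Sigma_i)]/2$ is the symmetrized $i$-th Gaussian component and $\mathcal{F}^{-1}_{\mathbf{s} \to \boldsymbol{\tau}}$ denotes the inverse Fourier transform operator from the frequency domain to the time domain. For the SM kernel, we can interpret $w_i$, $\boldsymbol{\mu}_i$, and $\Sigma_i$ as the signal variance, inverse period, and inverse length scale of the $i$-th covariance component, respectively. In summary, the SM kernel can be seen as a generalization of existing stationary kernels. Note that the GP model with an SM kernel has been used for wireless traffic prediction [34] and has proven reliable in wireless communication applications. In Section VI, we predict the number of online 5G users by using a GP model with an SM kernel.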
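To make the interpretation of the SM hyper-parameters concrete, the following sketch evaluates a one-dimensional SM kernel, in which each component is a cosine at frequency $\mu_i$ damped by a Gaussian envelope of spectral variance $v_i$ and weighted by $w_i$. The function name `sm_kernel` and the two-component daily/weekly setting on an hourly grid are hypothetical choices made for illustration.

```python
import numpy as np

def sm_kernel(tau, weights, means, variances):
    """1-D spectral mixture kernel evaluated at lags tau.

    k(tau) = sum_i w_i * cos(2*pi*tau*mu_i) * exp(-2*pi^2 * tau^2 * v_i)
    """
    tau = np.asarray(tau, dtype=float)[..., None]     # extra axis for the components
    w, mu, v = (np.asarray(p, dtype=float) for p in (weights, means, variances))
    return np.sum(
        w * np.cos(2.0 * np.pi * tau * mu) * np.exp(-2.0 * np.pi ** 2 * tau ** 2 * v),
        axis=-1,
    )

# Hypothetical two-component example: a daily and a weekly pattern (unit: hours).
weights = [1.0, 0.5]                   # signal variances
means = [1.0 / 24.0, 1.0 / 168.0]      # inverse periods
variances = [1e-5, 1e-6]               # spectral variances (inverse length scales)

t = np.arange(0.0, 72.0, 3.0)
K = sm_kernel(t[:, None] - t[None, :], weights, means, variances)
```

At zero lag the kernel equals the sum of the weights, and with zero spectral variance a component becomes a pure cosine that repeats exactly after one period.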
In addition to stationary patterns, there are also complex non-stationary patterns with time-varying characteristics in wireless communication, for instance, mmWave massive MIMO channel modeling [79], 5G wireless channel modeling [80], wireless control systems [81], 3D non-stationary unmanned aerial vehicle (UAV) MIMO channels [82], non-stationary mobile-to-mobile channels allowing for velocity and trajectory variations in mobile stations [83], and non-stationary channel modeling for vehicle-to-vehicle communications [84]. In contrast to a stationary kernel, which depends only on the distance $\boldsymbol{\tau}$, the signal characteristics of a non-stationary GP, such as frequencies, amplitudes, and spectral densities, depend directly on the input locations $\mathbf{x}$. The extension of Bochner's theorem (see Theorem 1) to the non-stationary domain has a generalized spectral representation on the $\mathbb{R}^P \times \mathbb{R}^P$ surface,

$$k(\mathbf{x}, \mathbf{x}') = \int_{\mathbb{R}^P \times \mathbb{R}^P} e^{2\pi \jmath (\mathbf{x}^\top \mathbf{s} - \mathbf{x}'^\top \mathbf{s}')} \, u_S(\mathrm{d}\mathbf{s}, \mathrm{d}\mathbf{s}'), \tag{5}$$

where $u_S$ is a positive finite measure on the spectral surface $\mathbb{R}^P \times \mathbb{R}^P$.
Arguably, the dot product kernel is the simplest non-stationary kernel [27]. The well-known and extensively used non-stationary kernels are the linear and polynomial kernels [27], which have too few parameters to represent complex patterns. Since the introduction of the neural network (NN) kernel [28], GPs have been able to represent both DNNs and one-hidden-layer neural networks (known as nonlinear universal approximators) with infinitely many neurons. After that, Gibbs [85] developed the non-stationary covariance function shown in Eq. (6) by considering a grid of exponential basis functions and parameterizing its length scale as a positive function $\ell(x)$ of the input,

$$k_{\mathrm{Gibbs}}(x, x') = \sqrt{\frac{2\,\ell(x)\,\ell(x')}{\ell(x)^2 + \ell(x')^2}} \, \exp\left(-\frac{(x - x')^2}{\ell(x)^2 + \ell(x')^2}\right). \tag{6}$$
Then, Higdon [86] proposed a non-stationary, spatially evolving GP using a process convolution to model toxic waste remediation. Based on [86], Paciorek [87] generalized the Gibbs kernel by using the non-stationary quadratic form $Q_{\mathbf{x},\mathbf{x}'} = (\mathbf{x} - \mathbf{x}')^\top ((\Sigma_{\mathbf{x}} + \Sigma_{\mathbf{x}'})/2)^{-1} (\mathbf{x} - \mathbf{x}')$ in place of $\boldsymbol{\tau}$ in any stationary kernel, where $\Sigma_{\mathbf{x}}$ is the positive length-scale function of the input $\mathbf{x}$. Following the SM kernel (see Eq. (4)), a non-stationary SM (NSM, see Figure 3) kernel was introduced in [66] by modeling the spectral surface as a two-dimensional Gaussian mixture model (GMM).
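The idea of an input-dependent length scale can be illustrated with the one-dimensional Gibbs kernel in its standard textbook form. In the sketch below, the function names and the particular length-scale function (short near the origin, longer away from it) are hypothetical; with a constant length scale the kernel reduces to the ordinary squared-exponential kernel.

```python
import numpy as np

def gibbs_kernel(x1, x2, length_fn):
    """1-D Gibbs kernel with an input-dependent length scale length_fn(x) > 0."""
    l1 = length_fn(x1)[:, None]
    l2 = length_fn(x2)[None, :]
    sq_sum = l1 ** 2 + l2 ** 2
    prefactor = np.sqrt(2.0 * l1 * l2 / sq_sum)
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return prefactor * np.exp(-sq_dist / sq_sum)

def length_fn(x):
    """Hypothetical length-scale function: rapid variation near 0, smoother far away."""
    return 0.2 + 0.5 * np.abs(x)

x = np.linspace(-3.0, 3.0, 25)
K = gibbs_kernel(x, x, length_fn)

# Sanity check of the stationary special case: a constant length scale of 0.7.
K_const = gibbs_kernel(x, x, lambda z: np.full_like(z, 0.7))
```

Function samples drawn with `K` vary quickly where the length scale is short and slowly where it is long, which is the kind of behavior a single stationary kernel cannot express.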
Figure 4. The covariance structure of the optimal DKGP [45], where a multilayer fully connected feed-forward NN is applied as the universal approximator of the underlying function f(x).
For the aforementioned non-stationary kernels, the hyper-parameters can be parameterized as positive functions described by stationary GPs. For example, we can parameterize $\theta_\ell$ as $\theta_\ell \sim \mathcal{GP}(0, k_\ell(\mathbf{x}, \mathbf{x}'))$. Recently, the harmonizable kernel [88] provided a novel spectral representation of non-stationary kernels by incorporating a locally stationary kernel with an interpretation based on the Wigner distribution function. In [67], another convolutional spectral kernel was proposed to give a concise representation of the input frequency spectrogram, but it offers less insight into a prespecified complex-valued radial basis. To meet the development needs of non-stationary GPs, the non-separable and non-stationary kernel [89], which includes varying non-separability and local structure, has a natural interpretation through the spectral representation of stochastic differential equations (SDEs).
Neal [28] proved that a Bayesian neural network with infinitely many hidden neurons converges to a GP. In practice, GPs with popular kernels are mostly used as simple nonlinear interpolation models. Deep neural networks (DNNs) have demonstrated strong learning and representation ability in many application domains, including computer vision [90], speech recognition [91], language processing [92], and recommendation systems [93]. The most interesting capability of DNNs is feature discovery and representation. However, DNNs have a well-known interpretability weakness: the mechanism of model learning and inference is a black box, which depends heavily on hyper-parameter tuning techniques. Therefore, the deep kernel GP (DKGP) [44,94,95] combines the nonparametric flexibility of kernel methods with the inductive biases of deep learning architectures, which brings benefits in both expressive power and interpretability. As a result, the DKGP can draw on the strengths of both to learn models of complicated wireless mechanisms, such as 5G and vehicle-to-everything (V2X) channel impulse responses, multipath radio signal propagation, radio feature maps (such as signal quality, uplink/downlink traffic, and wireless resource demand/supply) over time and space, and indoor pedestrian motion.
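For DKGP, a typical framework extracts features from DNNs and then treats the features as inputs of multiple GPs [44,95]. The model comes from linearly mixing these GPs and jointly optimizing the hyper-parameters through a marginal likelihood objective. This kind of deep kernel is straightforward to understand: it can be seen as a GP that applies complicated feature engineering or transformation before learning. The popular structure of the deep kernel can be written as

$$k_{\mathrm{DK}}(\mathbf{x}, \mathbf{x}') = k_i\big(g_{\mathrm{NN}}(\mathbf{x}, \mathbf{w}_{\mathrm{NN}}), \, g_{\mathrm{NN}}(\mathbf{x}', \mathbf{w}_{\mathrm{NN}})\big), \tag{8}$$

where $g_{\mathrm{NN}}(\mathbf{x}, \mathbf{w}_{\mathrm{NN}})$ denotes a nonlinear feature mapping given by a DNN with weights $\mathbf{w}_{\mathrm{NN}}$. Note that the kernel $k_i$ used in Eq. (8) can be arbitrary. As for a DNN, the chain rule is also applicable to deep kernel learning. According to the chain rule, the derivatives of the NLML with respect to the deep kernel hyper-parameters are given as follows:

$$\frac{\partial \mathcal{L}_{\mathrm{NLML}}}{\partial \Theta_i} = \frac{\partial \mathcal{L}_{\mathrm{NLML}}}{\partial K} \frac{\partial K}{\partial \Theta_i}, \qquad \frac{\partial \mathcal{L}_{\mathrm{NLML}}}{\partial \mathbf{w}_{\mathrm{NN}}} = \frac{\partial \mathcal{L}_{\mathrm{NLML}}}{\partial K} \frac{\partial K}{\partial g_{\mathrm{NN}}(\mathbf{x}, \mathbf{w}_{\mathrm{NN}})} \frac{\partial g_{\mathrm{NN}}(\mathbf{x}, \mathbf{w}_{\mathrm{NN}})}{\partial \mathbf{w}_{\mathrm{NN}}},$$

where the derivative of the NLML with respect to the covariance matrix (here taken to include the noise term) is

$$\frac{\partial \mathcal{L}_{\mathrm{NLML}}}{\partial K} = \frac{1}{2}\big(K^{-1} - K^{-1} \mathbf{y} \mathbf{y}^\top K^{-1}\big).$$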
Another deep kernel, which uses a finite-rank Mercer kernel function with orthogonal embeddings on the last layer, has better learning efficiency and expressiveness [96]. However, incorporating a DNN into a GP leads to poor interpretability due to the black-box nature of the DNN. To enrich the interpretability of deep kernels, a second class of deep kernels was proposed to reveal the learning dynamics of the DNN by building connections between the GP and the DNN. Furthermore, considerable attention has been paid to interaction detection in DKGP to enhance its interpretability. Interestingly, a recently proposed optimal DKGP (see Figure 4) [45] demonstrates better model interpretability. The resulting kernel has a non-stationary dot product structure with minimized test mean squared error, shallow DNN subnetworks with feature interaction detection, a much reduced hyper-parameter space, and good interpretability.
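The basic deep kernel construction, a base kernel applied to DNN features, can be sketched as follows. This is a toy illustration under stated assumptions: the one-hidden-layer `tanh` network, the squared-exponential base kernel, and the random (untrained) weights are all hypothetical; in deep kernel learning the network weights and kernel hyper-parameters would be trained jointly by minimizing the NLML.

```python
import numpy as np

def feature_map(x, W1, b1, W2, b2):
    """Small fully connected network g_NN(x, w_NN) with one tanh hidden layer."""
    hidden = np.tanh(x @ W1 + b1)
    return hidden @ W2 + b2

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential base kernel on feature vectors (rows of a and b)."""
    sq = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale ** 2)

def deep_kernel(x1, x2, params, length_scale=1.0):
    """Deep kernel: base kernel evaluated on the network features of the inputs."""
    return rbf_kernel(feature_map(x1, *params), feature_map(x2, *params), length_scale)

rng = np.random.default_rng(0)
# Untrained random weights (hypothetical): 3 inputs -> 8 hidden units -> 2 features.
params = (
    rng.standard_normal((3, 8)), np.zeros(8),
    rng.standard_normal((8, 2)), np.zeros(2),
)
X = rng.standard_normal((10, 3))
K = deep_kernel(X, X, params)
```

Because the base kernel is positive semi-definite on the feature space, the composed kernel remains a valid covariance function for any choice of network weights.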
In wireless communication systems, adjacent devices are generally not independent but correlated, because they share patterns and environmental factors. For example, connected smartphones, robots, drones, vehicles, intelligent home systems, and NB-IoT sensors in the same wireless network may have dependent behaviors or trends impacted by the status of the wireless network. Hence, a joint learning model can make full use of the data collected from adjacent devices to achieve collective intelligence. Knowledge obtained from different edge devices can be transferred to improve the overall prediction performance and system understanding. Therefore, a multi-task learning paradigm, and in particular the multi-task GP (MTGP), can empower such collaboration in wireless communication.
For MTGP, a crucial point is how to jointly encode the shared structure and the differences between tasks in the kernel [98]. Kernel design should consider both the cross-covariance between tasks and the auto-covariance within each task. Early MTGP approaches mainly focus on linear combinations and convolutions of independent single-source GPs, which correspond to the linear model of coregionalization (LMC) framework [46, 99, 100] and the convolved GP [101, 102], respectively. Many improvements and applications of MTGPs have been introduced in previous works, such as [46,99,102,103]. One way to improve the representation ability of the MTGP model is to use SM kernels. First, the SM-LMC kernel [99] models the covariance of a single task with an SM kernel, linearly combines these single tasks with LMC, and provides an interpretation of the Gaussian process regression network (GPRN) from the perspective of a neural network with

$$K_{\mathrm{SM\text{-}LMC}} = \sum_{i=1}^{Q} B_i \otimes k_{\mathrm{SM},i}(\boldsymbol{\tau}; \Theta_i),$$

where $k_{\mathrm{SM},i}$ is a covariance structure shared by tasks and $B_i$ encodes the cross-covariance between tasks. Then, the cross-spectral mixture (CSM) kernel [100] additionally introduced a phase factor into $B_i$ to encode amplitude and phase for the cross-covariance with

$$K_{\mathrm{CSM}} = \mathrm{Re}\Big(\sum_{i=1}^{Q} B_i \otimes k_{\mathrm{SG},i}(\boldsymbol{\tau}; \Theta_i)\Big),$$

where $k_{\mathrm{SG},i}(\boldsymbol{\tau}; \Theta_i)$ is the phasor notation of a spectral Gaussian kernel. The multioutput spectral mixture (MOSM) kernel [103] further represents both time and phase delays in the cross-covariance between tasks by using a complex-valued matrix decomposition.
However, MOSM has a compatibility drawback in that it does not reduce to the SM kernel when only one task is available. Therefore, a multioutput convolution spectral mixture (MOCSM) kernel [104] was proposed, which retains this compatibility property through the cross convolution of time- and phase-delayed SM components. Another important extension of MTGP is the multi-task generalized convolution SM (MT-GCSM) kernel [105], which models nonlinear task correlations and dependence between arbitrary components and provides a framework for heterogeneous tasks with different levels of complexity. The latter convolved GP is more flexible and expressive because it allows each task to have its own kernel and complexity.
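The LMC-style construction, in which a task-coupling matrix is combined with a shared input kernel through a Kronecker product, can be sketched as follows. The example assumes that all tasks are observed at the same inputs, uses single SM components as the shared kernels, and builds each coregionalization matrix as $B_i = a_i a_i^\top + \mathrm{diag}(\kappa_i)$ so that it is positive semi-definite by construction; these choices and all names are assumptions made for illustration.

```python
import numpy as np

def sm_component(tau, mean, variance):
    """Single 1-D spectral mixture component (unit weight)."""
    return np.cos(2.0 * np.pi * tau * mean) * np.exp(-2.0 * np.pi ** 2 * tau ** 2 * variance)

def lmc_kernel(t, coregional, means, variances):
    """K = sum_i B_i kron k_i(t, t') for tasks observed at the shared inputs t."""
    tau = t[:, None] - t[None, :]
    n_tasks = coregional[0].shape[0]
    K = np.zeros((n_tasks * len(t), n_tasks * len(t)))
    for B, mu, v in zip(coregional, means, variances):
        K += np.kron(B, sm_component(tau, mu, v))   # block (p, q) = B[p, q] * k_i
    return K

rng = np.random.default_rng(1)
n_tasks, Q = 3, 2
coregional = []
for _ in range(Q):
    a = rng.standard_normal((n_tasks, 1))
    # Hypothetical coregionalization matrix: rank-one part plus positive diagonal.
    coregional.append(a @ a.T + np.diag(rng.uniform(0.1, 0.5, n_tasks)))

t = np.linspace(0.0, 48.0, 20)
K = lmc_kernel(t, coregional, means=[1.0 / 24.0, 1.0 / 12.0], variances=[1e-4, 1e-3])
```

The off-diagonal entries of each `B` control how strongly two tasks share a given component, so a device with few observations can borrow statistical strength from correlated neighbors.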
The distributed Gaussian process (DGP) in wireless communication involves learning on distributed edge devices. The use of DGP can avoid frequent interactions with a central server and allows each edge device to possess a local learning model. For delay-sensitive applications such as self-driving vehicles and unmanned aircraft, a local learning model can respond rapidly to a local request. In particular, the DGP can reduce the overall time cost of wireless communication when the central server is not available or network congestion occurs. Therefore, DGP can be seen as a form of on-device intelligence, which addresses the major concerns of scalable computation and privacy protection in wireless communication.
In this section,we introduce the framework of DGP,which has shown significant advantages in computational efficiency [1, 35, 106, 107].There are many reasons for the selection of DGP, such as scaling ordinary GP to large datasets,applying ordinary GP to distributed edge dataset,preventing access to privacysensitive data and making full use of multicore highperformance computers(HPCs).In general,DGP splits big data into multiple (M) smaller pieces computed on local computing nodes to speed up the inference of the whole model[108],which refrains from centrally collecting and storing massive data.The initial aim of DGP is to make GP scalable to big data.However,with the development of multicore computing architecture and edge computing in IoT networks,DGP is gradually receiving attention from research and industrial applications because it provides a more practical machine learning framework than the existing GPs.Some representative DGP works have been published recently[1,35,106-109].By using the map-reduce framework and decoupling the data conditioned on the inducing points,a distributed variational inference for GP and latent variable models (LVMs) was proposed [108] .The distributed variational inference for GP still has the limitation of scalable inference when the data size isn ≥107.Another DGP is based on the mixture-ofexperts(MoE)model[110].The MoE model weights the predictions of all local expert models(node)to give the final prediction.For MoE,confusion is how to specify the number of experts and weight of each expert.Compared with MoE,product-of-GP-experts models(PoEs)[109]that multiply predictions of independent GP experts can avoid assigning weight to experts but are inevitably overconfident.The marginal likelihoodp(y|X,Θ)of PoEs is written as follows:
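where $M$ is the number of GP experts and $p^{(i)}(y^{(i)}|X^{(i)},\Theta)$ is the marginal likelihood of the $i$-th GP expert using the $i$-th partition $\{X^{(i)},y^{(i)}\}$ of the dataset $\{X,y\}$. Additionally, the predictive probability of PoEs is the product of the predictive probabilities of all independent GP experts,

$$p(f_*|x_*,\mathcal{D}) \propto \prod_{i=1}^{M} p^{(i)}\left(f_*|x_*,\mathcal{D}^{(i)}\right),$$

where $f_*$ is the function value at a test input $x_*$ and $\mathcal{D}^{(i)}=\{X^{(i)},y^{(i)}\}$ is the $i$-th data subset.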
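The product rule above has a simple closed form when every expert returns a Gaussian prediction. The following is a minimal sketch, assuming each expert supplies a predictive mean and variance at the same test inputs; the function names `poe_aggregate` and `poe_log_marginal` are illustrative and do not come from the cited works.

```python
import numpy as np

def poe_aggregate(means, variances):
    """Fuse M independent Gaussian predictions by multiplying their densities.

    means, variances: array-like of shape (M, N), one row per GP expert and
    one column per test input (hypothetical layout chosen for this sketch).
    Returns the fused mean and variance, each of shape (N,).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    precisions = 1.0 / variances
    # A product of Gaussians is Gaussian: precisions add up,
    # and the mean is the precision-weighted average.
    fused_var = 1.0 / precisions.sum(axis=0)
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var

def poe_log_marginal(local_log_marginals):
    """Factorized log marginal likelihood: the sum of the experts' local terms."""
    return float(np.sum(local_log_marginals))
```

Because the precisions add, the fused variance shrinks as more experts are multiplied in, even when they carry no new information. This is the overconfidence mentioned above.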
Similarly, the Bayesian committee machine (BCM) [111] combines independent estimators trained on different datasets by using Bayes' rule. BCM has a better interpretation because it takes the GP prior $p(f_*)$ into account. Furthermore, the robust BCM [112] generalizes the original BCM and PoE-GP by incorporating both a GP prior and the importance of each GP expert. To achieve a much better approximation of a full GP, other improved DGP works include: (1) the asynchronous distributed variational GP [107], which uses weight-space augmentation to scale GPs up to billions of samples; (2) the generalized robust BCM [113], which obtains a consistent aggregated predictive distribution by randomly selecting a subset $\mathcal{D}^{(1)}$ as a global node that communicates with the remaining subsets; (3) nested kriging predictors, which aggregate submodels based on subsets of observation points [114].
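As a rough illustration of how a prior term and expert importance can enter the aggregation, the sketch below follows the commonly used robust BCM form for Gaussian predictions. The entropy-based weights are one common choice and are an assumption here, not necessarily the exact setting of [112]. The function name `rbcm_aggregate` is hypothetical.

```python
import numpy as np

def rbcm_aggregate(means, variances, prior_var):
    """Robust-BCM-style fusion of M Gaussian expert predictions.

    means, variances: array-like of shape (M, N), one row per expert.
    prior_var: GP prior variance at the test inputs (scalar or shape (N,)).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    prior_var = np.asarray(prior_var, dtype=float)
    # Assumed importance weight: entropy reduction from prior to expert posterior.
    beta = 0.5 * (np.log(prior_var) - np.log(variances))
    # Weighted expert precisions plus a correction that restores the prior
    # when the experts are uninformative.
    precision = (beta / variances).sum(axis=0) + (1.0 - beta.sum(axis=0)) / prior_var
    fused_var = 1.0 / precision
    fused_mean = fused_var * (beta * means / variances).sum(axis=0)
    return fused_mean, fused_var
```

When an expert's predictive variance equals the prior variance, its weight is zero. If this holds for all experts, the fused prediction falls back to the zero-mean prior instead of becoming overconfident.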
In this section, we present an exemplary case on a real-world wireless communication dataset recording the number of online 5G users at a 5G base station. This experiment substantiates the features of data-driven wireless communication using GPs in terms of expressiveness, interpretability, and uncertainty modeling. The dataset was collected in a southern city in China in 2021. As shown in Figure 5, the number of online 5G users exhibits multiple patterns at different time scales, such as long-term (subplot (b)), short-term (subplot (c)), and mid-term (subplot (d)) trends. We use 70% of the dataset as training data (in blue, top subplots) and the remaining 30% as testing data (in cyan, top subplots). We use the SM kernel with $Q=6$ components and initialize the hyper-parameters by fitting a Gaussian mixture model to the empirical spectral densities (in blue, bottom subplots). Subplot (a) in Figure 5 indicates that the GP model (in dashed red) extrapolates the patterns of the testing data well. The predictive trend (in dashed red) is very close to the ground-truth trend (in cyan). Moreover, the predictive 95% confidence interval (CI) (in grey shade) of the GP model provides uncertainty bounds for the prediction and completely covers the ground truth.
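To make the kernel used in this experiment concrete, here is a minimal sketch of a one-dimensional spectral mixture covariance in its standard form. The parameter names `weights`, `means`, and `variances` (one entry per component) are assumptions for illustration. The code does not reproduce the authors' implementation or their fitted values.

```python
import numpy as np

def sm_kernel(tau, weights, means, variances):
    """One-dimensional spectral mixture covariance evaluated at lags tau.

    weights, means, variances: length-Q sequences holding the weight, the
    centre frequency, and the bandwidth (variance) of each spectral peak.
    tau may be a scalar or an array of time differences.
    """
    tau = np.asarray(tau, dtype=float)[..., None]   # add a trailing axis for the Q components
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    v = np.asarray(variances, dtype=float)
    # Each component is a Gaussian envelope (decay set by v) times a cosine (period 1/mu).
    components = w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return components.sum(axis=-1)
```

A covariance matrix over time stamps `t` is obtained with `sm_kernel(t[:, None] - t[None, :], weights, means, variances)`. With $Q=6$, each argument would hold six values.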
The interpretability of the GP model can be seen in the structure of its spectral density (in dashed red, bottom subplots). As shown in subplot (i) of Figure 5, the GP model captures six significant peaks, which denote different patterns with different periods. The peaks located at low frequencies reveal long-term trends with long periods. On the contrary, the peaks located at high frequencies describe short-term trends. These trends at different time scales are connected to the periodicity of human activities. The covariance of the long-term trend in subplot (f) decays much more slowly than that of the mid-term trend in subplot (g) and the short-term trend in subplot (h). The amplitudes of the spectral peaks determine the scales of their corresponding function values. Specifically, we exhibit three patterns (the 1st, 2nd, and 4th) learned by the GP model and their corresponding spectral densities. For instance, the 4th pattern in subplot (d) demonstrates a mid-term evolution of online 5G users, which has a spectral density peak located at $\mu_4=0.04$. The sum of all six spectral peaks constitutes the learned spectral structure of the GP model, namely, the Fourier transform of the covariance of the GP model. In this context, all learned patterns have clear physical interpretations. For neural networks, however, we cannot obtain such insightful interpretations for each neuron. The expressiveness of the GP model is guaranteed by theory, namely, the generalization of the Fourier transform and the Gaussian mixture approximation of the spectral distribution. In particular, the expressiveness of the GP model can be further improved by adding more components when learning from highly complicated wireless tasks. By using the distributed framework shown in Section V, the scalability of the GP model is easily achieved for large wireless datasets.
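The link between spectral peaks and temporal patterns can be written out for the standard one-dimensional spectral mixture form. The symbols $w_q$ (weight) and $v_q$ (bandwidth) are assumed notation for this sketch; only $\mu_q$ and $Q$ appear in the text above. The period is expressed in sampling intervals, assuming that frequency is measured in cycles per sample.

```latex
\begin{align}
k(\tau) &= \sum_{q=1}^{Q} w_q \exp\!\left(-2\pi^2 \tau^2 v_q\right) \cos\!\left(2\pi \tau \mu_q\right), \\
S(s)    &= \sum_{q=1}^{Q} \frac{w_q}{2} \left[ \mathcal{N}(s;\, \mu_q, v_q) + \mathcal{N}(s;\, -\mu_q, v_q) \right], \\
T_q     &= \frac{1}{\mu_q}, \qquad \text{e.g. } \mu_4 = 0.04 \;\Rightarrow\; T_4 = 25 .
\end{align}
```

Under this form, a peak at a lower frequency $\mu_q$ corresponds to a longer period $T_q$, and a narrower peak (smaller $v_q$) gives a covariance that decays more slowly with the lag $\tau$.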
In this section, to show the widespread applications of GPs in wireless communication, we further review and discuss three typical applications: wireless traffic prediction, localization, and trajectory planning.
Wireless traffic prediction has been a long-standing demand for wireless network planning and management. Highly accurate wireless traffic prediction can reduce the uncertainty of network load and reflect the traffic behavior in a wireless network, which matches the strengths of GPs in uncertainty modeling and interpretability. Many traffic-related problems in wireless communication are suitable for GP models, such as wireless traffic analysis [1], cellular traffic load prediction [115], traffic load balancing for multimedia multipath systems [119], channel prediction for communication-relay UAVs [117], and stochastic link modeling of static wireless sensor networks [118]. We survey some representative examples as follows:
• Wireless traffic prediction. In [1], a GP model with the alternating direction method of multipliers (ADMM) for distributed hyper-parameter optimization was proposed to predict 4G wireless traffic, and it shows better performance than DNN models such as long short-term memory (LSTM).
• Cellular traffic load prediction. In [115], a scheme combining GP and LSTM was proposed to generate accurate cellular traffic load predictions, which are important for efficient and automatic network planning and management. Compared with benchmark schemes, the proposed scheme achieves state-of-the-art performance.
• Traffic load balancing for multimedia multipath systems. In [119], an adaptive load balancing algorithm using an online GP was proposed to estimate the path status and allocate the traffic load properly to each path in multimedia multipath systems. The proposed GP-based algorithm helps to offer higher reliability and stability by utilizing a variety of communication media and paths.
Another representative wireless scenario using GPs is wireless localization. Wireless localization has become a cornerstone of modern life due to the increasing demand for location-based applications, e.g., shopping and industrial activities. GPs have been successfully applied to indoor wireless localization owing to their uncertainty modeling and nonlinear regression abilities, in tasks such as wireless tracking [35], online radio map update [120], and calibrating multichannel RSS observations for localization [116]. We review various indoor localization frameworks using GPs as follows:
• Wireless target tracking. In [35], a framework of distributed recursive GP was proposed to build multiple local received signal strength (RSS) maps, which reduces the computational complexity for big data generated by large-scale sensor networks. Then, a global map is constructed by fusing all the local RSS maps. The proposed framework shows excellent positioning accuracy in both static fingerprinting and mobile target tracking scenarios.
• Online radio map update. In [120], a novel scheme combining crowdsourcing and GP regression was proposed to adapt radio maps to environmental dynamics in an online fashion by recursively fusing crowdsourced fingerprints with an existing offline radio map. The scheme has particular advantages in efficiency and scalability.
• Calibrating multichannel RSS observations for localization. In [116], a GP model was proposed to compensate for frequency-dependent shadowing effects and multipath in RSS observations. By applying the GP model, multichannel RSS observations can be combined more effectively for localization over a large space.
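The frameworks above all build on GP regression of RSS over space. As a generic illustration only, and not the method of [35], [120], or [116], the sketch below fits an exact GP with a squared-exponential kernel to RSS values measured at known positions and returns the predictive mean and variance at query positions. The function names and default hyper-parameter values are assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential kernel between position sets A (N, D) and B (P, D)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_dist = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def gp_predict(X, y, X_star, noise_var=0.1, lengthscale=1.0, signal_var=1.0):
    """Exact GP regression: predictive mean and variance of the latent RSS at X_star."""
    y = np.asarray(y, dtype=float)
    y_mean = y.mean()  # RSS in dBm is far from zero, so centre the targets first
    K = rbf_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(len(y))
    K_s = rbf_kernel(X, X_star, lengthscale, signal_var)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - y_mean))
    mean = K_s.T @ alpha + y_mean
    V = np.linalg.solve(L, K_s)
    # The prior variance at any point equals signal_var for this stationary kernel.
    var = signal_var - np.sum(V**2, axis=0)
    return mean, np.maximum(var, 0.0)
```

An approximate 95% interval for the latent RSS at each query position is `mean ± 1.96 * np.sqrt(var)`. Far from any measurement, the variance returns to the prior level, which is the uncertainty awareness exploited in localization.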
UAVs are an emerging technology strongly correlated with wireless communication. Due to the complex communication environment and irregular time delays in wireless networks, UAV trajectory planning is rapidly gaining attention in wireless communication. To meet the expressiveness requirements of wireless networks, popular RL and DNN methods [6] have recently been considered as panaceas for wireless trajectory planning. In addition, GP-based approaches are also attractive for wireless trajectory planning because of their unique features, namely, model expressiveness, uncertainty representation, and sample efficiency. We review recent advances in wireless trajectory planning using GPs as follows:
• GP-based runtime planning, learning, and recovery for safe UAV operations. In [121], a framework based on recursive dynamic GP regression, with a fast online planning, learning, and recovery approach, was proposed for safe UAV operations under unknown runtime disturbances. The framework can estimate the behavior of the UAV system and provide safe plans at runtime under unseen disturbances.
• Networked operation of a UAV using GPs. In [122], a novel GP model predictive control (MPC) scheme for path planning was introduced to deal with the time-varying network delay, non-linearity, and time-sensitive characteristics of multirotor-type UAVs. The scheme increased the accuracy of path planning and state estimation for multirotor-type UAVs.
• Obstacle-aware informative path planning using GPs for UAV-based target search. In [123], an algorithm was proposed that leverages a layered planning strategy with a GP-based model of target occupancy to generate informative paths in continuous 3D space. The algorithm can achieve a balance between information gain, field coverage, sensor performance, and collision avoidance for efficient target detection.
From the motivations for using GPs in wireless communication, we note that many difficulties are emerging. We outline a few challenging open issues of GP models for future data-driven wireless communication.
• Ultra large-scale distributed GPs for dense and decentralized wireless communication systems. In future data-driven wireless communication, ubiquitous sensors will gather considerable data at all times, which leads to a large amount of data transmission and storage. An effective and pragmatic way to reduce the cost of data transmission and storage is to perform ultra large-scale distributed machine learning. Even though scalable GPs are currently available, the ultra large-scale distributed GP is still an open research issue.
• GPs for multimodal data in wireless communication systems. Currently, the GP model can only learn from structured data generated by wireless communication systems. There are also multimodal data collected from different types of sensors, such as numerical raw data from smartphones, ultrasound data from UAV ultrasonic sensors, images and videos from surveillance cameras, and natural language from speech sensors. In particular, the signaling and data transmitted via the interfaces of both LTE/5G wireless and core networks are always nonstructured. Therefore, learning from nonstructured and multimodal data in wireless communication systems is another challenge for GP models.
• Highly interpretable GPs with deep structure in wireless communication systems. The difficulty with a deep kernel is that it increases the flexibility of the GP model but also makes the model harder to interpret. From the theorems for both stationary and non-stationary kernels, the mathematical definition of deep kernels in the frequency domain remains unclear. As with DNNs, sacrificing interpretability in data-driven wireless communication is usually a compromise between learning and understanding the network, which is less tolerable for high-complexity state prediction. Hence, pursuing a highly interpretable GP with deep structures will be a critical open issue in future data-driven wireless communication.
• Privacy-enhancing GPs to address privacy issues and legal concerns. Since machine learning models depend on the sharing of the required training and prediction data, concerns about potential privacy disclosure must be addressed. One type of privacy-enhancing GP solution uses fully or partially homomorphic encryption (HE) [124], and another type adopts the differential privacy (DP) technique [125]. However, HE techniques may introduce heavy computational overheads, while DP techniques may degrade the performance of the training process. Besides, an emerging type of privacy-enhancing technique exploits federated learning [126, 127], which collaboratively learns a shared model while keeping all the training data on local devices. However, the design of efficient privacy-preserving federated learning-based GPs for wireless communications remains an open challenge.
In this paper, we comprehensively review data-driven wireless communication using GPs in terms of motivation, the definition and construction of a GP model, GP expressiveness using different kernels, and distributed GP scalability. A GP, with its Bayesian nature, can model a large class of wireless communication systems through the design of its covariance function. By using a distributed approach, GP models are capable of performing scalable inference on big data in a wireless network.
Data-driven wireless communication systems using GPs can achieve the desired properties of expressiveness, scalability, interpretability, and uncertainty modeling. These characteristics are crucial for models in wireless communication because of the rich data collected and the modeling complexity of wireless networks. In particular, interpretability and uncertainty modeling are inherent advantages of GPs due to their mathematical definition. Based on existing applications of GP models in wireless communication, we show that GP models cover the aforementioned properties of data-driven wireless communication very well and have proven to be valuable.
The work was supported in part by the National Key R&D Program of China with grant No. 2018YFB1800800, by the Basic Research Project No. HZQB-KCZYZ-2021067 of Hetao Shenzhen-HK S&T Cooperation Zone, by Natural Science Foundation of China (NSFC) with grants No. 92067202 and No. 62106212, by Shenzhen Outstanding Talents Training Fund 202002, by Guangdong Research Projects No. 2017ZT07X152 and No. 2019CX01X104, and by China Postdoctoral Science Foundation with grant No. 2020M671899.