Xuanjie Li and Yuedong Xu
School of Information Science and Technology,Fudan University,Shanghai,China
ABSTRACT We are investigating the distributed optimization problem,where a network of nodes works together to minimize a global objective that is a finite sum of their stored local functions.Since nodes exchange optimization parameters through the wireless network,large-scale training models can create communication bottlenecks,resulting in slower training times.To address this issue,CHOCO-SGD was proposed,which allows compressing information with arbitrary precision without reducing the convergence rate for strongly convex objective functions.Nevertheless,most convex functions are not strongly convex (such as logistic regression or Lasso),which raises the question of whether this algorithm can be applied to non-strongly convex functions.In this paper,we provide the first theoretical analysis of the convergence rate of CHOCO-SGD on non-strongly convex objectives.We derive a sufficient condition,which limits the fidelity of compression,to guarantee convergence.Moreover,our analysis demonstrates that within the fidelity threshold,this algorithm can significantly reduce transmission burden while maintaining the same convergence rate order as its no-compression equivalent.Numerical experiments further validate the theoretical findings by demonstrating that CHOCO-SGD improves communication efficiency and keeps the same convergence rate order simultaneously.And experiments also show that the algorithm fails to converge with low compression fidelity and in time-varying topologies.Overall,our study offers valuable insights into the potential applicability of CHOCO-SGD for non-strongly convex objectives.Additionally,we provide practical guidelines for researchers seeking to utilize this algorithm in real-world scenarios.
KEYWORDS Distributed stochastic optimization;arbitrary compression fidelity;non-strongly convex objective function
In modern machine learning problems,datasets are frequently too large to be processed by a single machine or may be distributed across multiple computing nodes due to legal restrictions or cost considerations.In such scenarios,distributed learning has emerged as a viable solution to train models in a decentralized fashion.Each computing node trains on its local data samples,and communicates with other nodes to exchange information and update model parameters.This approach can significantly reduce processing time and has gained increasing attention in recent years.
Formally,we consider an optimization problem of the form
wherefi: Rd→R is the local objective function of computing node (alternatively referred to as a worker)ifori∈{1,···,n},x usually refers to the neural network model parameter,and eachfiis determined by the local dataset of nodei,i.e.,
wheremidenotes the number of data samples of nodei,ζjrepresents one data sample,andFi: Rd×Ω→R is the loss function.For example,in a regression problem,a square loss function is commonly used asFi(·).
Consensus-based gradient descent algorithms are extensively researched and employed in distributed learning owing to their simplicity and efficiency.These algorithms enable nodes to exchange parameters with neighboring nodes and incorporate new information from received messages to drive the parameters toward the global average parameter value.Meanwhile,each node computes gradients using its local data samples and updates its parameters using the gradient descent methods.
One limitation of these algorithms is the requirement for exact parameter transmission among nodes,which can be challenging in scenarios where there are restrictions on data transmission amount or high communication latency(e.g.,with large neural networks[1]or limited network bandwidth).In such cases,compressing the parameters becomes necessary.However,parameter compression usually results in a decrease in the convergence rate.To address this challenge,Koloskova et al.proposed CHOCO-SGD [2],which was a compressed distributed algorithm designed for strongly convex objective functions.CHOCO-SGD allows for both biased and unbiased compression operators and can achieve arbitrary compression precision.Arbitrary compression precision means that any compression fidelity 0<ω≤ 1 (w=1 indicating no compression) can guarantee the same convergence rate order as the no-compression equivalent.
However,strong convexity is a restrictive assumption that may not hold in various practical applications,such as in operations research and machine learning.For instance,consider the Lasso problem[3]with an objective function represented as(〈di,x〉-li)2+β‖x‖1,wherei∈{1,···,m}
denotes the data index,βdenotes the regularization coefficient,anddiandlirepresent data features and labels,respectively.In this case,each component function is non-strongly convex,as well as the global objective.Another well-known example is logistic regression [4],where the objective islog(1+exp(-li〈di,x〉).Therefore,it is crucial to investigate the behavior and performance of CHOCO-SGD for objectives that are non-strongly convex in nature.The main contributions are summarized as follows:
· We establish a compression fidelity bound for CHOCO-SGD,ensuring its convergence over non-strongly convex functions.This criterion serves as a valuable guideline for researchers and engineers when applying CHOCO-SGD in practical scenarios.
· We rigorously prove that within the compression fidelity threshold,CHOCO-SGD achieves the same convergence rate order as DSGD over non-strongly convex objectives,and compression fidelity only affects the higher-order terms in the rate.This means that CHOCO-SGD effectively reduces the transmitted data without compromising training efficiency.
· We conduct comprehensive experiments to validate our theoretical results.The numerical outcomes show that when controlling compression fidelity above the lower bound,CHOCOSGD effectively reduces communication costs,while remains a comparable convergence rate to that of DSGD.Additionally,our experiments reveal that CHOCO-SGD fails to converge when the fidelity falls below acceptable limits,implying that this algorithm cannot achieve arbitrary compression fidelity on non-strongly convex objectives.Moreover,we also observe that CHOCO-SGD cannot converge in time-varying network topologies.
Notations.In this paper,we use uppercase letters likeX,Ato denote various matrices,including parameter matrix,gradient matrix,weight matrix,etc.Lowercase letters with subscripts are used to denote vectors,for example,ximeans the parameter vector of nodei.In terms of some special matrices and vectors,we use 1 for a column vector whose elements are all 1 and I for the identity matrix whose size can be inferred from the context.We uniformly use‖·‖without subscripts for norm notations.When the norm is on a vector,it should represent theℓ2norm,but when it is on a matrix,it should be the Frobenius norm.
Consensus algorithm.For the distributed optimization problem,we want every node to reach an agreement on the global parameter value.The agreement means finding the average ofninitial values stored onnnodes [5,6].So,the consensus problem can be regarded as a subproblem of distributed optimization problem.There are a plethora of research works conducted to explore the consensus problem.It has been proved that the consensus problem can be addressed in the sense of linear convergence on undirected graph[6].
Distributed optimization.On the basis of consensus algorithms,researchers proposed many distributed gradient-based algorithms[7-12].In undirected topology,Nedic et al.[7]and Yuan et al.[8]analyzed the convergence of Decentralized Stochastic Gradient Descent algorithm(DSGD)on nonstrongly convex objective functions.Lian et al.proved that for nonconvex and Lipschitz-smooth objective functions,DSGD can converge to a stationary point with the rate[13],whereTdenotes the number of episodes.Furthermore,Shi et al.proposed the EXTRA algorithm that had linear convergence rate for strongly convex objectives[14].Assran et al.further addressed challenges introduced by directed graphs by designing the SGP algorithm based on DSGD [15].More related algorithms has been reviewed in[16,17].
Parameter/gradient compression.With the rapid growth of learning models,traditional distributed learning faces significant communication pressure.As a result,Researchers have increasingly focused on designing communication-efficient algorithms[2,18-24].The proposed methods can generally be classified into three main categories.The first category is quantization.Instead of transmitting precise vectors (such as model parameters or gradients),nodes exchange limited bits that approximate the original vectors.The second category is sparsification,where only a few elements of a vector are transmitted accurately,while the remaining elements are forced to be zero.The third category involves transmitting parameters in some iterations rather than in every iteration.This aims to reduce the amount of transmitted information.All these methods aim to find the optimal parameter value by transmitting less information.Therefore,in this paper,we use the termcompressionto encompass these methods.Kolovaskova et al.proposed the CHOCO-SGD algorithm[2],a distributed stochastic algorithm over undirected graphs.This algorithm allows for arbitrary compression precision and has been proven to converge for strongly convex objectives.Taheri et al.[25] extended this algorithm to directed graphs for both convex and nonconvex objectives based on the SGP algorithm[15].Different from this work,our study is based on undirected graphs.Table 1 further illustrates the highlights of this work and the differences between the work involved.
Table 1: Comparisons on different algorithms for decentralized optimization problem
In this section,we will introduce the basic knowledge of distributed optimization and present the classic Distributed Stochastic Gradient Descent(DSGD)algorithm.Additionally,we will discuss the fundamental ideas underlying the CHOCO-SGD algorithm.
We assume that the computing nodes are distributed across an undirected graphG=(V,E),whereVandErepresent the set of nodes and edges,respectively.The DSGD algorithm is an efficient method to solve the distributed optimization problem in Eq.(1).It consists of two key components:consensus,also referred to asgossip,and parameter updates.The consensus step aims to achieve parameter agreement among all nodes.Specifically,we denote the local parameter of nodeias xi,and perform the following parameter aggregation process:
wheretdenotes the iteration number andaijdenotes the weight assigned by nodeito nodej.For compatibility with the underlying graph structure,aijis set to be positive if and only if edge(i,j) ∈Eori=j.This ensures that only neighboring nodes,as determined by the graph,can communicate and exchange their parameters.By using a matrixA=(aij)n×n1aij represents the entry in the ith row and the jth column of A.to incorporate all the weights,it has been proven that whenAis row stochastic(i.e.,each row ofAsums to one and all the entries are nonnegative),the parameters of all nodes can reach a consensus.This implies that each parameter xi(t)will converge to the average of the initial values(1)[26].This consensus algorithm enables the nodes to synchronize their parameter values.
The second component of the DSGD algorithm involves parameter updates using stochastic gradient descent.Combining these components,the update rule of DSGD is given as follows:
where ∇Fi(xi(t),ζi,t)represents the gradient of the local objective function evaluated at xi(t)using the data batchζi,t,andαdenotes the learning rate.This update rule combines the consensus step,which ensures parameter agreement,with the gradient descent step,which drives the parameters towards optimal values.We formally present the DSGD algorithm in Algorithm 1.Under some constraints,extensive studies [16] have demonstrated that the DSGD algorithm achieves a convergence rate ofon non-strongly convex objective functions,whereTdenotes the number of episodes.
However,when dealing with massive training models,the DSGD algorithm encounters a communication bottleneck.To address this limitation,a straightforward yet effective solution is to compress the parameters before transmitting them over the network.For this purpose,We introduce a compression operator denoted asQ,along with the definition of compression fidelityω,which indicates the ratio of preserved information.A higher value ofωsignifies a greater fidelity,implying that less information is discarded during the compression process.The value ofωsatisfies the following assumption.
Assumption 3.1.The compression operatorQsatisfies
EQ‖Q(x)-x‖2≤(1-ω)‖x‖2.
We take the expectation in this context because there might be inherent randomness in the compression operators,such as in the case of randomized gossip discussed later.It is worth noting that whenω=1,the compressed parameterQ(x) should ideally be the same as x (almost surely,although we may disregard negligible difference).And we state that the operatorQcan be either biased or unbiased,encompassing a wide range of compression techniques.
Example 3.1.Examples of compression
· randomized gossip:
Q(x)=
wherep∈(0,1].We can findω=p.
· Top-k sparsification[20]:for a parameter vector x ∈Rd,we selectkdimensions with the highest magnitude,wherek∈{1,2,···,d}.Subsequently,we transmit a compressed version of the parameter vector,retaining only the selectedkitems while setting all other positions to 0.It is worth noting that the compression fidelity in this case can be calculated asω
Through conducting simple experiments,one can observe that when integrating compression directly into the transmitted parameter in Algorithm 1 (i.e.,replacing the transmission of xi(t) withQ(xi(t))),the algorithm typically fails to converge.The primary reason behind this behavior stems from the magnitude of xi(t).In general,it lacks a bound,meaning its magnitude is not necessarily close to zero.As a result,the error introduced by compressing (such as forcing certain positions to be zero)becomes difficult to control.Hence,it is important to employ the compression operator appropriately and judiciously.
Based on Assumption 3.1,CHOCO-SGD was proposed and is presented in Algorithm 2.As the transmitted parameters are compressed and inexact,each nodeineeds to maintainnadditional auxiliary variables {ˆxj(t),j∈N} to approximate the true parameters of allnnodes.Specifically,for each nodei∈N,the newly added auxiliary variable ˆxj(t) represents an estimation of the actual local parameter of nodej.In Algorithm 2,instead of sending the compressed parameters,the nodes transmit the compressed value of the difference between the true local parameter and its estimator.Subsequently,upon receiving messages from neighboring nodes,every node updates its estimators of parameters of other nodes,denoted as{ˆx1(t+1),···,ˆxn(t+1)}.If the estimations maintained by each node are accurate,the error introduced by compression is minimal and will eventually vanish.It has been proven thatQ(ˆxi(t)-xi(t))asymptotically equals to ˆxi(t)-xi(t),which does not hold forQ(xi(t))and xi(t).
Koloskova et al.[2]has proved that Algorithm 2 achieved a convergence rate ofon strongly convex objective functions when specific parameters such as the step size and regularization parameter were appropriately chosen.Furthermore,it has been established that this algorithm can converge with arbitrary compression fidelity,indicating minimal communication pressure.To evaluate its potential in practical scenarios,we investigate its convergence properties on non-strongly convex objective functions.
In this section,we begin by introducing assumptions regarding the underlying graph structure.As mentioned previously,we useG=(V,E) to denote the undirected graph and useNito denote all the neighbors of nodeitogether with itself,i.e.,Ni={j: (i,j) ∈E}∪{i}.We have a couple of well-accepted assumptions in distributed optimization,which are related to the graph structure and the corresponding weight matrix.
Assumption 4.1.The underlying graph structureGis connected.
Assumption 4.2.Given the aggregation weightsaij,the weight matrixA=(aij)n×nis doubly stochastic and non-negative,i.e.,
A·1=1,AT1=1,aij≥0,whereAsatisfiesσ2<1,whereσ2denotes the second largest singular value ofA.
Remark 4.1.Under the undirected graph setting,a doubly stochastic matrix can be easily found.Given an undirected graphG,with its adjacent matrixA0,the degree matrixDand the maximum degreeΔ,it can be directly shown thatis a valid doubly stochastic matrix[27].
We next introduce several basic assumptions on the objective function and its gradients.Note that these assumptions are common in first-order continuous optimization.
Assumption 4.3.Each local objective functionfi(·)is convex,i.e.,
andL-smooth,i.e.,there exists a constantL>0(for alli∈N),such that
The definition ofL-smooth is equivalent to the following inequality:
Assumption 4.4.There exists a constantM,such that
Assumption 4.5.There exists a constantσ,such that
Assumptions 4.4 and 4.5 bound the local gradients and variance with respect to data samples,respectively.The latter assumption can be easily satisfied when the dataset across nodes is independent and identically distributed.Besides,the following asymptotic property ofAimplies that all the elements inAtwill be distributed in the vicinity ofast→∞.
Lemma 4.1.IfA∈Rn×nis a doubly stochastic matrix withσ2<1,then there exists constantsλ∈(0,1)andC>0,such that
whereAtrefers to the product oftmatrices.
This lemma states that as matrix multiplication progresses,all elements ofAtconverge toThe detailed proof is presented in the appendix.With these assumptions and lemma,we can analyze the convergence properties of CHOCO-SGD for non-strongly convex objectives.
Theorem 4.1.Under Assumptions 3.1 and 4.1-4.5,and assumeα≤the CHOCO-SGD algorithm satisfies
Furthermore,the expressionω>is a sufficient condition of achievingconvergence rate.In this constraint,ωneeds to be higher than a threshold related to the number of nodes and communication topology.We will demonstrate with experiments that CHOCOSGD cannot converge on non-strongly convex objectives with low compression fidelity.In other words,the CHOCO-SGD algorithm does not allow arbitrary compression fidelity.
Proof.For simplicity,we let
and then the updating rule of Algorithm 2 can be expressed as
Note that when we use the compression operatorQon a matrix,we mean compressing every column of that matrix.Since the underlying topology is fixed,nodeiand nodej(i,j∈Nk)have the same estimatorfor their common neighbor nodek.Then we can use one matrix ˆX(t) to denote the originaln×nauxiliary variables.Now,we start the proof by analyzing the distance between the average parameter in the(t+1)thiteration,denoted as(t+1),and the optimum,denoted as x*.By the updating rule and properties ofA,we have
where the inequality comes from Cauchy-Schwarz inequality.The two terms in the right-hand side(RHS)of the last equation can be bounded as follows:
Therefore,aftertiterations,the difference between the current parameter and the global average can be expressed as
where the last inequality applies Assumption 4.5.Then Eq.(25)can be written as
By substituting this result to Eq.(19),we can get Theorem 4.1.
Testbed setting and topology.Our experiments are conducted on a testbed of 8 servers,each with an 8-core Intel Xeon W2140b CPU at 3.20 GHz,a 32 GB DDR4 RAM and a Mellanox Connect-X3 NIC supporting 10 GbE links.We use PyTorch as the experiment platform and employ its Distributed Data-Parallel Training (DDP) paradigm to realize parameter communication among the 8 servers.The main topology in experiments is a ring topology,as shown in Fig.1a.Ring topology can be used in Local Area Network(LAN)or Wide Area Network(WAN)networks,and is also commonly used[28]in distributed learning.
Figure 1:Undirected topologies with 8 nodes
We consider a task where all nodes aim to classify images in the MNIST or CIFAR-10 dataset into 10 classes using a linear neural network.The key features of both datasets are listed in Table 2.The objective function for this task is cross-entropy loss.The cross-entropy loss of a data batch is defined as
wheremandlrepresent the batch size and the number of categories,respectively.In this equation,q(xij)denotes the predicted probability of sampleibelonging to classj,whilep(xij)represents the true label,which is equal to 1 when sampleibelongs to classjand 0 otherwise.The entire training dataset is divided into 8 equal partitions,each stored in an independent node.Unless otherwise specified,the learning rate and batch size are set to be 0.01 and 64,respectively.
Table 2: Key features of datasets
We evaluate the performance of the CHOCO-SGD algorithm in comparison to the DSGD baseline.In CHOCO-SGD,we examine the efficacy of two compression operators outlined in Example 1:randomized gossip withp=0.7 and Top-k sparsification withω=0.7.To align with our theoretical convergence criterion,the performance is evaluated based on the global loss computed on the global average parameter,denoted asf(¯x(t)).The model is trained for 25 episodes,where each episode corresponds to a pass through the entire dataset.
The results in Fig.2a demonstrate that CHOCO-SGD and DSGD exhibit comparable convergence rates,with similar reductions in loss after the same number of episodes.To distinguish the curves,we scale the loss values to the range[0,1]with the following formula:
wherevis the real loss value in the training process,vminandvmaxdenote the minimal and the maximal loss value achieved by three algorithms,respectively,andvscalerepresents the scaled loss value.The scaled results are depicted in Fig.2b on a logarithmic scale.In the following context,the scaled training loss is processed likewise.In Fig.2b,CHOCO-SGD and DSGD show only marginal differences in convergence rate.Furthermore,Figs.2c and 2d (scaled) illustrate the comparison of the loss with respect to the number of transmitted bits.Remarkably,CHOCO-SGD,employing either randomized gossip or Top-k sparsification methods,achieves lower loss values while transmitting the same volume of data.This outcome can be attributed to the compression of certain values,which enables nodes to transmit fewer data in each communication round without adversely affecting the convergence rate.Similar results also appear on CIFAR-10 datasets,as shown in Fig.3.CHOCO-SGD transmits about 1.2×1010bits,while DSGD needs to transmit 1.7×1010bits,when they achieve the same training error.These results align with our theoretical analysis,indicating that compression can effectively reduce the amount of transmitted data while exerting minimal influence on the convergence rate for non-strongly convex objectives,provided that the fidelity threshold is appropriately maintained.
Figure 2:Training loss vs.episodes/transmitted data amount of CHOCO-SGD and DSGD on MNIST Dataset
Figure 3:Training loss vs.episodes/transmitted data amount of CHOCO-SGD and DSGD on CIFAR-10 Dataset
To further investigate the impact of compression fidelity,we conduct several experiments by controlling the values ofpandωwithin the range of 0.9 to 0.5 for two compression operators.In randomized gossip,a smaller value ofpincreases the probability of transmitting full-zero parameters,which could lead to inaccuracies in ˆxjas described in Algorithm 2.The change in (scaled) training loss with respect to iterations and transmitted bits is depicted in Figs.4a,4b(MNIST)and Figs.5a,5b(CIFAR-10),respectively.We observe that lowering the value ofpresults in fewer transmission bits without slowing down the convergence rate.Similarly,the results obtained using Top-k sparsification,as shown in Figs.4c,4d and 5c,5d indicate that reducing the number of bits,while maintainingωwithin a certain threshold,ensures satisfactory convergence rates and reduces communication burden.
Figure 4:Training loss of CHOCO-SGD with randomized gossip(a),(b)and with Top-k sparsification(c),(d)on MNIST dataset.Both p and ω vary from 0.9 to 0.5
Figure 5:Training loss of CHOCO-SGD with randomized gossip(a),(b)and with Top-k sparsification(c),(d)on CIFAR-10 dataset.Both p and ω vary from 0.9 to 0.5
Furthermore,we conduct experiments to determine if CHOCO-SGD fails to converge when the compression fidelity is small for non-strongly convex objectives,as stated in Eq.(6).Specifically,for randomized gossip and Top-k sparsification,we set bothpandωto 0.4.Fig.6 demonstrates that in an over-compression scenario,the training loss of CHOCO-SGD continues to increase,implying that the algorithm fails to converge.With these results,we empirically verify that within the context of non-strongly convex functions,CHOCO-SGD requires control of the compression fidelity within a threshold to ensure convergence,which differs from the situation with strongly convex functions.In other words,CHOCO-SGD cannot achieve arbitrary compression in a non-strongly convex scenario.
Figure 6:Convergence of CHOCO-SGD with low compression fidelity on MNIST dataset
In this section,we study the influence of hyperparameters(learning rateαand batch sizeb),on the convergence performance of the CHOCO-SGD algorithm for non-strongly convex functions.The compression fidelity is set to bep=ω=0.7.We test on three learning ratesα=0.01,0.05,0.1 and three batch sizesb=32,64,128.The results are displayed in Tables 3 and 4.With the increase of the batch size,the final loss values of three algorithms after 25 training episodes become larger.Because each episode is one pass through the whole dataset,and larger batch size leads to fewer batches and thus fewer times of gradient descent and data transmission.Furthermore,the loss goes smaller when the learning rate gets larger,since the global parameter takes a larger step towards the global minimum.No matter in which hyperparameter setting,CHOCO-SGD and DSGD algorithms can achieve similar convergence error with the same training episodes,but CHOCO-SGD transmits less data amount.Furthermore,according to Table 4,the change of the learning rate has negligible influence on communication load,while larger batch size leads to fewer transmission data.
Table 3:Summary of training loss with different hyperparameter settings.The loss values are recorded after 25 training episodes and are rounded to three decimal places
Table 4: Summary of Transmitted bits with different hyperparameter settings.The bits values are recorded after 25 training episodes and are rounded to three decimal places
5.5.1InvestigatingDifferentFixedTopologies
To explore the algorithm’s applicability for non-strongly convex objectives,we test the convergence performance of CHOCO-SGD on two different topologies,as displayed in Fig.7.Fig.7a shows a bipartite topology,where nodes 1-4 and 5-8 are divided into two groups and each edge can only connect nodes in different groups.Besides,Fig.7b displays a more sparse topology,where parameters can only be transmitted along one path in two directions.The compression fidelity is set to bep=ω=0.7 which is the same as basic experiments.Figs.8 and 9 show the scaled training loss of MNIST dataset training on two topologies,respectively.In Fig.8,CHOCO-SGD and DSGD have the same convergence speed,since they have similar loss values after the same training episodes.But the transmission amount of CHOCO-SGD,about 5 × 109,is much less than 7 × 109of DSGD.Besides,in the sparser topology,Fig.9b demonstrates that both algorithms can transmit fewer data to achieve convergence,but the transmitted amount of DSGD is 40% more than that of CHOCOSGD.Therefore,these results further illustrate the robustness and applicability of CHOCO-SGD for non-strongly convex functions on various topologies.
Figure 7:Different undirected and connected topologies
Figure 8:Training loss vs.episodes/transmitted data amount of CHOCO-SGD and DSGD on bipartite topology Fig.7a
Figure 9:Training loss vs.episodes/transmitted data amount of CHOCO-SGD and DSGD on sparse topology Fig.7b
5.5.2Time-VaryingTopologies
We further explore the impact of time-varying topologies on the convergence of CHOCO-SGD.The training topology alternates between the two graphs in Fig.1.For instance,if the nodes are initially distributed over the left topology,in the next iteration,they will be connected according to the right topology.It is important to note that this setup still ensures connectivity (as stated in Assumption 4.1).We set the compression coefficients top=ω=0.9 to reduce their influence on the convergence.Fig.10 illustrates that CHOCO-SGD is not able to converge in a non-strongly convex and time-varying environment.Because some nodes receive parameters from specific neighbors inconsistently,resulting in an uncorrectable bias from true parameters and the algorithm diverges eventually.
Figure 10:Convergence of CHOCO-SGD on time-varying topologies
In this work,we study CHOCO-SGD algorithm in the case that the objective function is nonstrongly convex.We provide the first theoretical analysis of this situation and prove that when controlling compression fidelity within a certain threshold,it has the same convergence rate order as DSGD.Experimental results validate the theoretical analysis by demonstrating that CHOCOSGD converges more quickly than DSGD when transmitting the same data amount.Besides,a low compression fidelity and time-varying topology can make CHOCO-SGD not converge in the end.
There are several open topics worthy of investigating in our future work.The first one is to explore CHOCO-SGD’s performance in more complex non-convex scenarios,which can provide a deeper understanding of its capabilities.Besides,it will be meaningful to improve this algorithm to achieve arbitrary compression fidelity for non-strongly convex functions,since most real-world problems are non-strongly convex and such improvement can enhance CHOCO-SGD’s durability and applicability.
Acknowledgement:The authors are grateful for the theoretical support by Simiao Jiao.Besides,this paper’s logical rigor,integrity,and content quality have been greatly enhanced,so the authors also wish to express their appreciation to anonymous reviewers and journal editors for assistance.
Funding Statement:This work was supported in part by the Shanghai Natural Science Foundation under the Grant 22ZR1407000.
Author Contributions:The authors’contributions to this manuscript are as follows:study conception:X.L.,Y.X.;experiment design and data collection:X.L.;theoretical analysis,interpretation of results,and draft manuscript preparation:X.L.,Y.X.All authors reviewed the results and approved the final version of the manuscript.
Availability of Data and Materials:The data,including MNIST and CIFAR-10 datasets,used in this manuscript is accessible via the following link:https://www.kaggle.com/datasets.
Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.
Appendix A.Proof of Lemma 4.1
We firstly prove that for any matrixM∈Rd×n,ifM1=0,we have‖MA‖≤σ2‖M‖.Assume that with singular value decomposition,A=UΣVT,whereUis a matrix consists of eigenvectors ofAAT.IfU=(u1,···,un),becauseAAThas eigenvalue 1,we can chooseu1to be 1,which is the eigenvector of eigenvalue 1.Then
Computer Modeling In Engineering&Sciences2024年4期