Learning to Branch in Combinatorial Optimization With Graph Pointer Networks

IEEE/CAA Journal of Automatica Sinica, January 2024

Rui Wang, Zhiming Zhou, Kaiwen Li, Tao Zhang, Ling Wang, Xin Xu, and Xiangke Liao

Abstract—Traditional expert-designed branching rules in branch-and-bound (B&B) are static, often failing to adapt to diverse and evolving problem instances. Crafting these rules is labor-intensive, and may not scale well with complex problems. Given the frequent need to solve varied combinatorial optimization problems, leveraging statistical learning to auto-tune B&B algorithms for specific problem classes becomes attractive. This paper proposes a graph pointer network model to learn the branching rules. Graph features, global features and historical features are designed to represent the solver state. The graph neural network processes the graph features, while the pointer mechanism assimilates the global and historical features to finally determine the variable on which to branch. The model is trained to imitate the expert strong branching rule by a tailored top-k Kullback-Leibler divergence loss function. Experiments on a series of benchmark problems demonstrate that the proposed approach significantly outperforms the widely used expert-designed branching rules. It also outperforms state-of-the-art machine-learning-based branch-and-bound methods in terms of solving speed and search tree size on all the test instances. In addition, the model can generalize to unseen instances and scale to larger instances.

I.INTRODUCTION

COMBINATORIAL optimization seeks to explore discrete decision spaces and find the optimal solution within an acceptable execution time. Combinatorial optimization problems arise in diverse real-world domains such as manufacturing, telecommunications, transportation and various types of planning problems [1], [2]. These problems can be immensely difficult to solve, since it is computationally impractical to find the best combination of discrete variables through exhaustive enumeration. In fact, most of the NP-hard problems in the mathematical and operational research fields are typical examples of combinatorial optimization, such as the traveling salesman problem (TSP), maximum independent set [3], graph coloring [4], Boolean satisfiability [4], etc.

A large number of approaches have been proposed to tackle combinatorial optimization challenges in recent years [5]. Exact algorithms can always find the optimal solution to a combinatorial optimization problem [6]. A naive way is to search all possible solutions through enumeration, which results in an intractable solving time. Some advanced techniques have been proposed, such as branch-and-bound [7], to efficiently prune the search space. Approximation algorithms are used when the optimal solution cannot be found in polynomial time [8]. They guarantee a solution within a certain ratio of the optimal one. Examples include algorithms for problems like vertex cover or set cover. However, such algorithms may not exist for all real-world combinatorial optimization problems. Heuristic algorithms are designed to find a good (if not optimal) solution quickly, and are useful when the problem is too large or complex for exact methods. Greedy algorithms, local search, and hill climbing are examples of heuristic methods [9]. Metaheuristic algorithms provide a higher-level, general strategy that can be applied to many different types of combinatorial optimization problems. These are often inspired by natural processes and include genetic algorithms, simulated annealing, ant colony optimization, particle swarm optimization, and more. Metaheuristic algorithms cannot guarantee finding the optimal solution; however, they require less computing time than exact algorithms [10].

As exact algorithms can always solve a problem to optimality, modern optimization solvers generally employ exact algorithms to solve combinatorial optimization problems, which can be formulated as mixed-integer linear programs (MILPs). The branch-and-bound (B&B) method is a typical exact method that solves MILPs in a divide-and-conquer manner. B&B [7] recursively splits the search space of the problem into smaller regions in a tree structure, where each node represents a subproblem that searches a subset of the solution set. A subtree can be pruned once it provably cannot produce a better solution than the current best solution; otherwise, the subtree is further partitioned into subproblems until an integral solution is found or the subproblem is infeasible. In this solving process, several decision-making problems should be considered to improve performance: the node selection problem, i.e., which node/subproblem should be processed next given a set of leaf nodes in the search tree; and the variable selection problem (a.k.a. branching), i.e., which variable should be branched on to partition the current node/subproblem.

Historically, decisions related to MILP instances have relied on meticulously crafted heuristics. The creation of these hardcoded expert heuristics often requires considerable design efforts and extensive trial-and-error. However, the advent of artificial intelligence has sparked increased interest in using machine learning models to learn these heuristics, rather than relying on expert design. This shift is logical, as heuristics often consist of a series of rules, which could feasibly be parameterized by models such as deep neural networks.

Recent investigations have indeed begun to explore these learning-based approaches [11]-[14]. Yet, these pioneering efforts still pose significant challenges: identifying effective features to represent the current state of the B&B process on which the branching decision is based; and developing effective models that can map the B&B state to the branching decision. These challenges underline the need for continued research and development in this promising area.

This paper presents an innovative approach to addressing the branching problem in B&B using a graph pointer network model, trained to imitate the effective yet computationally intensive “expert” strong branching heuristic. We aim to produce strategies that result in a small number of search nodes, approaching the performance of the strong branching heuristic, while maintaining a low computation cost. Though this idea is not new [13]-[15], we improve the performance of the learning model in a novel way. The contributions of this work are summarized as follows:

1) Traditional B&B methods, which depend on manually designed branching heuristics, often struggle with adaptability and efficiency across diverse problem scenarios. In contrast, we propose using neural networks to automatically learn these branching heuristics.

2) In addition to the typical graph features that are commonly explored, global and historical features are designed in this work, providing a more comprehensive and richer representation of the problem state.

3) We introduce an innovative model that combines the graph neural network with a pointer mechanism. The graph neural network processes the graph features, while the pointer mechanism assimilates the global and historical features to finally determine the variable on which to branch.

4) Our research presents a top-k Kullback-Leibler divergence loss function, specifically designed to train the model to imitate the strong branch heuristic effectively.

5) Notably, our proposed method consistently surpasses both expert-crafted branching rules and contemporary machine learning techniques across all tested problems. Once trained, our model demonstrates remarkable generalization abilities, effortlessly adapting to even unseen, larger instances.

The remainder of the paper is organized as follows. Section II reviews the recent advances in applying artificial intelligence methods to combinatorial optimization. Section III introduces the preliminaries of the work. The proposed graph pointer network model is described in Section IV. Section V outlines the imitation learning method for optimizing the model parameters. The experiment setup and numerical results are presented in Section VI. The last section gives the concluding remarks and future perspectives.

II.RELATED WORK

Recent years have seen a surge in the application of artificial intelligence methods to combinatorial optimization.

Vinyals et al. [16] developed a pointer network model for solving small-scale combinatorial optimization problems such as the traveling salesman problem (TSP). It borrowed the idea of the widely used sequence-to-sequence model from the machine translation field, and used the attention mechanism to map the input sequence to the output sequence. This work inspired a number of subsequent studies that used machine/deep learning methods for combinatorial optimization.

Most of the current works focus on solving combinatorial optimization problems in an end-to-end manner. Bello et al. [17] first proposed the use of a deep reinforcement learning (DRL) method to optimize the pointer network model, which can output the solution sequence directly. Nazari et al. [18] investigated the vehicle routing problem (VRP) by modifying the pointer network and the attention mechanism. Dai et al. [19] developed a structure2vec graph neural network (GNN) model for combinatorial optimization. The GNN model can encode the graph features of the problem and aid the decisions. Other works [20]-[23] explored advanced GNN models such as graph convolutional networks (GCNs) and diverse training methods to solve combinatorial optimization problems more effectively. Moreover, the authors in [24], [25] improved the attention mechanism of the pointer network by leveraging the recent advances of the famous Transformer model [26] in the field of sequence-to-sequence learning. The attention model developed by Kool et al. [25] achieved state-of-the-art performance among the above approaches. This model can solve a number of combinatorial optimization problems, such as the TSP, the VRP, the orienteering problem, etc. In addition, Li et al. [27] extended this line of work to a multiobjective version.

The initial efforts in using statistical learning for devising branching rules in B&B were spearheaded by Khalil et al. [12]. They devised a branching rule tailored to an individual instance during the B&B procedure. This approach was also explored by Alvarez et al. [28] and Hansknecht et al. [29], where they focused on learning a branching rule offline from a set of related instances. In all these studies, the aim was to learn a branching strategy by emulating the expertise of strong branching. Specifically, both Khalil et al. [12] and Hansknecht et al. [29] approached it as a ranking challenge, aimed at learning a hierarchical order of the candidates given by the expert. On the other hand, Alvarez et al. [28] approached this as a regression task, intending to directly ascertain the strong branching scores for each candidate. Gasse et al. [13] pioneered the application of GNNs to branching policies in MIPs. Their approach leveraged the inherent variable-constraint bipartite graph representation, utilizing imitation learning to swiftly approximate strong branching. Extending upon this foundation, Gupta et al. [14] introduced a hybrid framework. In their system, the GNN model operates at the root node, while a computationally efficient (albeit weaker) model governs the subsequent nodes. Moreover, Nair et al. [30] presented two innovative models: neural diving and neural branching, to augment the capabilities of traditional MIP solvers. The neural diving model is adept at predicting partial assignments for integer variables, facilitating the generation of more compact MIPs. Conversely, neural branching harnesses a GNN model to approximate optimal branching policies.

In a broader context, numerous researchers have developed machine learning techniques to refine exact optimization methods, even outside the scope of general MILPs.A comprehensive review on this topic was recently presented by Blum and Roli [10].

III.PRELIMINARIES

A. Problem Definition

1) Mixed Integer Linear Program: A combinatorial optimization problem can always be modeled as a MILP of the form

$$\min_{x}\ c^{\top}x \quad \text{s.t.}\ Ax \le b,\ l \le x \le u,\ x_i \in \mathbb{Z},\ \forall i \in I \tag{1}$$

The aim is to find an optimal assignment of x that minimizes the objective function, with c the objective coefficient vector. There are m constraints and n decision variables. A subset of the decision variables are integers; I ⊆ {1,...,n} is their index set, and Z is their decision space. A and b are the coefficient matrix and the right-hand-side vector of the constraints, and l, u bound the decision variables.

2) LP Relaxation of a MILP: Given the complexity of MILP problems, where some variables are required to take integer values, they can be quite challenging and computationally expensive to solve exactly. LP relaxation simplifies these problems by ignoring the integer constraints and treating all variables as continuous. This results in a standard linear programming problem, which is significantly easier and faster to solve using well-established algorithms such as the Simplex method. The LP relaxation provides a lower bound (in minimization problems) for the original MILP problem, because relaxing the integer constraints allows for a broader feasible region and hence, potentially, a smaller minimum value.

3) Branch-and-Bound: Branch-and-bound begins by solving the LP relaxation of the original MILP. The obtained solution x⋆ provides a lower bound to the problem. If the obtained solution respects all the MILP integrality constraints, it is the optimal solution to (1), and the algorithm terminates. If not, the LP relaxation is further partitioned into two subproblems by branching on an integer variable that does not respect the integrality constraints of the MILP. This is done by adding the following two constraints into the LP relaxation, respectively [13]:

$$x_i \le \lfloor x_i^\star \rfloor, \qquad x_i \ge \lceil x_i^\star \rceil$$

where $\lfloor x_i^\star \rfloor$ refers to the maximum integer value that is smaller than $x_i^\star$, and $\lceil x_i^\star \rceil$ is the minimum integer value that is larger than $x_i^\star$. Here, i is called the branching variable.

By branching on i, two new LPs are constructed, which become the leaf nodes/subproblems of the search tree. The next step is to pick one leaf node and repeat the above steps. Once a feasible solution x̂ is found, that is, all the MILP integrality constraints are satisfied, it provides an upper bound to the problem. If a solution is found with a lower objective value than x̂, the upper bound is updated. On the other hand, if a subproblem's solution has a worse objective value than the current upper bound, this subproblem is pruned and no longer branched. The subproblem is also fathomed if the solution is integral or the LP is infeasible. The above procedure is repeated until no subproblems remain. The incumbent solution with the best bound is returned [13].
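To make the relationship between a MILP, its LP relaxation, and the bounds used by B&B concrete, the following minimal sketch builds a two-variable MILP and its relaxation with PySCIPOpt (the Python interface to SCIP used later in the paper). The specific instance and the helper function are illustrative only.

    from pyscipopt import Model

    def build_model(vtype):
        # minimize x + y  s.t.  3x + 2y >= 7,  x, y >= 0
        m = Model()
        m.hideOutput()
        x = m.addVar(vtype=vtype, lb=0, name="x")
        y = m.addVar(vtype=vtype, lb=0, name="y")
        m.addCons(3 * x + 2 * y >= 7)
        m.setObjective(x + y, "minimize")
        return m

    mip = build_model("I")   # integer variables: the original MILP
    lp = build_model("C")    # continuous variables: the LP relaxation
    mip.optimize()
    lp.optimize()
    print("MILP optimum       :", mip.getObjVal())  # 3.0
    print("LP relaxation bound:", lp.getObjVal())   # about 2.33, a valid lower bound

Branching then tightens the relaxation by adding the floor/ceiling bound constraints above on a fractional variable.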

B. Branching Strategies

In the branching variable selection decision process, an integer variable i is selected among the candidate variables C that do not satisfy the integer constraint. Existing methods usually score each candidate variable in C according to some handcrafted heuristics, and the variable with the largest score is selected for branching. The most commonly used scoring criterion is the change of the lower bound of the sub-problem after the variable is branched [31]. Based on this criterion, a series of branching rules have been designed to improve the efficiency of B&B.

The strong branching (SB) rule solves the two LP sub-problems obtained by branching on each candidate variable i, and scores the variable by the resulting changes Δ_i^- and Δ_i^+ of the LP objective value:

$$SB_i = \text{score}\big(\max\{\Delta_i^-, \epsilon\},\ \max\{\Delta_i^+, \epsilon\}\big)$$

where the product function is usually considered as the scoring function, that is, score(a,b) = a×b, and ϵ is a small constant. The SB rule computes the SB scores for all the candidate variables in the candidate set C, and selects the decision variable with the largest SB score to branch on. Each branching decision requires a long computational time, since computing each SB score requires solving two LP sub-problems. In this case, the SB-based B&B algorithm usually suffers a heavy computational burden, although SB can greatly reduce the search tree size.
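A minimal sketch of the SB scoring loop described above is given below. It assumes a hypothetical solve_child_lp helper that returns the LP objective of a child node after adding one bound constraint; the sketch illustrates the two-LP-per-candidate cost, not SCIP's actual implementation.

    import math

    def strong_branching(node_lp_obj, candidates, solve_child_lp, eps=1e-6):
        """candidates: dict {var_index: fractional LP value of that variable}."""
        scores = {}
        for i, xi in candidates.items():
            obj_down = solve_child_lp(i, "<=", math.floor(xi))   # down-child LP objective
            obj_up = solve_child_lp(i, ">=", math.ceil(xi))      # up-child LP objective
            delta_down = obj_down - node_lp_obj                  # lower-bound improvement (down)
            delta_up = obj_up - node_lp_obj                      # lower-bound improvement (up)
            # product scoring rule score(a, b) = a * b, guarded by the small constant eps
            scores[i] = max(delta_down, eps) * max(delta_up, eps)
        best = max(scores, key=scores.get)                       # variable with the largest SB score
        return best, scores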

In view of the heavy computational burden of the SB method, calculating the pseudocost instead of the SB score is another commonly used method in current optimization solvers. Pseudocost branching (PB) estimates the score of a variable according to its historical scores during the previous search process. Instead of solving the two sub-problems obtained by branching on i, the upwards (downwards) score of variable i is the average value of the objective value changes when upwards (downwards) branching on variable i in the previous branching process. This can greatly shorten the calculation time. Denote the downwards and upwards average scores of variable i as Ψ_i^- and Ψ_i^+, where its pseudocost score PB_i is calculated as [12]

$$PB_i = \text{score}\big((x_i^\star - \lfloor x_i^\star \rfloor)\,\Psi_i^-,\ (\lceil x_i^\star \rceil - x_i^\star)\,\Psi_i^+\big)$$
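The pseudocost bookkeeping can be sketched as follows. The per-unit normalization by the fractionality and the default value used for variables without history are assumptions made for illustration, not the exact rules of [12] or SCIP.

    import math
    from collections import defaultdict

    class Pseudocost:
        def __init__(self, eps=1e-6):
            self.eps = eps
            self.sums = defaultdict(lambda: [0.0, 0.0])    # per variable: [down, up] gain sums
            self.counts = defaultdict(lambda: [0, 0])      # per variable: [down, up] counts

        def update(self, i, direction, frac, obj_gain):
            """Record the objective gain observed after branching on variable i.
            direction: 0 for the down branch, 1 for the up branch."""
            self.sums[i][direction] += obj_gain / max(frac, self.eps)
            self.counts[i][direction] += 1

        def score(self, i, xi):
            f_down, f_up = xi - math.floor(xi), math.ceil(xi) - xi
            psi_down = self.sums[i][0] / self.counts[i][0] if self.counts[i][0] else 1.0
            psi_up = self.sums[i][1] / self.counts[i][1] if self.counts[i][1] else 1.0
            # PB_i = score(f_down * psi_down, f_up * psi_up) with the product rule
            return max(f_down * psi_down, self.eps) * max(f_up * psi_up, self.eps)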

It can be seen that there is a trade-off between the branching performance and the time cost of making each branching decision. This work aims to leverage the power of neural networks to emulate the SB rule, striving for comparable performance but with significantly reduced computational expenditure.

IV.MODEL

A graph pointer network (GPN) model is proposed to simulate the SB strategy previously mentioned. This model's input is the current state of the solver, while the output corresponds to the decision on variable selection. First, the B&B procedure is structured as a Markov decision process. At each stage, the model interprets the current state and selects an appropriate variable, causing a corresponding change in the solver's state. Then, the state of the solver is defined to encompass the graph structure features, global features, and historical features. Finally, a graph pointer neural network model is developed in accordance with the state definition, enabling the perception of the solver's current state and facilitating the execution of branching decisions.

A. Markov Decision Process Modeling

B&B can be modeled as a Markov decision process [13], as shown in Fig.1.

At each decision step t, the current state of the solver is s_t, which represents the state of the current search tree. Based on the current state of the solver s_t, the agent selects a variable a_t = i from the candidate set C = {i | x_i^⋆ ∉ Z, i ∈ I} according to the strategy π(a_t|s_t).

The solver solves the two LP sub-problems after branching on variable i. Subsequently, the solver updates the upper and lower bounds, prunes the search tree, and selects the next leaf node to branch. At this point, the solver has transitioned to a new state s_{t+1}. Then, the solver applies the branching strategy π(a_{t+1}|s_{t+1}) again to make the next branching decision. This process is looped until all the leaf nodes are explored.

The initial state s_0 of the Markov decision process corresponds to the root node of the B&B search tree, and the final state is the end of the optimization process, i.e., when no leaf node can be branched further. Denote the branching policy as π, where the Markov decision process can be modeled as [13]

$$p_\pi(\tau) = p(s_0) \prod_{t=0}^{T-1} \sum_{a \in C(s_t)} \pi(a \mid s_t)\, p(s_{t+1} \mid s_t, a)$$

where τ = (s_0, ..., s_T) denotes a B&B trajectory.

Fig.1.Markov decision process of the branch-and-bound.

In this paper, we learn the branching policy π to imitate the SB rule, which is realized through the following steps: 1) Define the problem state s_t. At each step, the branching decision needs to be made according to the current problem state. However, there is no standardized definition of the solver state. It is necessary to extract effective features to better represent the solver state, so as to make better decisions accordingly. 2) Parameterize the branching policy π via a novel model. The model should be able to map the problem state s_t to the branching action a_t correctly. The models, such as neural networks, random forests and support vector machines, need to be designed according to the characteristics of B&B. 3) Optimize the parameters of the model by an effective training algorithm. The model π can be learned through a variety of machine learning methods to minimize the size of the search tree or reduce the total run-time of the B&B algorithm.

The proposed deep learning-based B&B method consists of the above three parts, introduced as follows.

B. State Definition

First, the state s_t of B&B at decision-making step t is defined in this section. In addition to the graph features introduced in [13], global features and historical features are designed, which can provide a more thorough representation of the solver state. Therefore, s_t is composed of variable features, constraint features, edge features, global features, and historical features, namely s_t = (V, C, E, G, H).

The graph features (V, C, E) of the problem are defined by the bipartite graph of the current solver state, as shown in Fig. 2. The bipartite graph is composed of m constraints and n variables. Variables x_1, x_2, ..., x_n are on the left side of the graph. The constraints, with their right-hand-side (constant) terms, are on the right side of the graph. An edge (i, j) ∈ E of the graph is the connection of variable i and constraint j, i.e., whether constraint j includes variable i. The weight of the edge is the coefficient of variable i in constraint j.

Fig.2.Graph structure of the MILP state.

According to the bipartite graph structure, the solver state is comprised of variable features, constraint features, edge features, global features, and historical features.

1) Variable features represent the attributes of the candidate variables at branching step t, including the variable type, the variable coefficient, the current value of the variable, whether the current value of the variable is at its bound, the fractional part of the solution value of the variable, etc. There are n candidate variables in total, and the feature dimension is d. Therefore, the variable feature dimension is n×d. The detailed description of the variable features is listed in Table I.

2) Constraint features represent the attributes of the LP constraints at branching step t, such as the right-hand-side value of the constraint, whether the left-hand-side value of the constraint exactly reaches the bound, the similarity between the constraint coefficients and the objective coefficients, etc. The current LP problem has a total of m constraints, and the feature dimension is c. Thus, the dimension of the constraint features is m×c, and the description of the constraint features can be found in Table II. Refer to [13] for a detailed description of the variable and constraint features.

TABLE I VARIABLE FEATURES

TABLE II CONSTRAINT FEATURES

3) The edge feature is the coefficient of each variable in each constraint. Therefore, there are m×n edges in total, and the feature dimension is 1. The coefficient value is 0 if the constraint does not contain a certain variable.

4) Global features G mainly include two parts: a) Global features of the whole MILP, including the gap between the upper and lower bounds at the current stage of the MILP, the number of feasible/infeasible solutions, etc.; b) Global features of the current LP sub-problem node, including the depth of the current node, the LP objective value information of the current node, etc.

The depth of the current node and the gap between the upper and lower bounds can be directly obtained by calling the PySCIPOpt interface. The number of feasible/infeasible solutions is computed as the proportion of leaf nodes that produce feasible/infeasible solutions:

$$\frac{N_{feas}}{N_{leaves}}, \qquad \frac{N_{infeas}}{N_{leaves}}$$

TABLE III GLOBAL FEATURES

where N_leaves is the number of all leaves, and N_feas (N_infeas) is the number of leaves yielding feasible (infeasible) solutions. The gap between the current node's LP objective value v_LP and the global upper/lower bound v_b is calculated by the following formula according to [15]:

where the current node's LP objective value and the global upper/lower bounds are obtained from the PySCIPOpt interface.

The relative position pos of the current node's LP objective value v_LP between the global upper bound v_u and lower bound v_l is computed as

$$pos = \frac{v_{LP} - v_l}{v_u - v_l}$$

5) The historical features consist of two parts. The first part comprises the past branching decisions C_1 = {a_1, ..., a_{t-1}} made at the previous steps 1, ..., t-1. The second part comprises the variables C_2 = {x_1, x_2, ...} whose values have changed when generating the current node. That is, C_2 is the set of variables whose values have changed in the solution of the new sub-problem after adding an integer constraint to the parent problem.

Traditional approaches only consider variable features, constraint features and edge features [13]. This work further extracts global features and historical features, so as to obtain a richer representation of the environment state s_t. The global status of the current search tree and the current node can provide more information for the agent to make branching decisions. Moreover, observing the variables whose values have changed when generating the current node, and observing the variables selected during the historical branching process, can also provide effective information for making branching decisions. Therefore, it is expected that adding the global and historical features can better describe the state of the current problem.
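The following small sketch shows how the global features discussed above could be computed from quantities exposed by the solver. The leaf proportions and relative position follow the text directly; the exact gap formula of [15] did not survive extraction, so the relative-gap definition below is an assumption.

    def leaf_proportions(n_feasible_leaves, n_infeasible_leaves, n_leaves):
        # proportions of leaves that produced feasible / infeasible solutions
        n = max(n_leaves, 1)
        return n_feasible_leaves / n, n_infeasible_leaves / n

    def relative_gap(v_lp, v_bound, eps=1e-9):
        # a common relative-gap definition; the exact formula used in [15] may differ
        return abs(v_lp - v_bound) / max(abs(v_lp), abs(v_bound), eps)

    def relative_position(v_lp, v_lower, v_upper, eps=1e-9):
        # relative position of the node's LP value between the global bounds
        return (v_lp - v_lower) / max(v_upper - v_lower, eps)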

C. Graph Pointer Network Model

In this section, a GPN that combines the graph neural network and the pointer mechanism is proposed to model the branching policy, which can map the solver state to the branching decisions effectively.


From the features extracted in the previous section, it can be seen that the solver state has a bipartite graph structure, that is, the left nodes (variables) and the right nodes (constraints) are connected by edges, as shown in Fig. 2. Graph neural networks can effectively process the information of such graph structures, and have been successfully applied to various machine learning tasks with graph-structured input, such as social networks and citation networks. Therefore, we encode the graph structure of the solver state by a graph neural network model.

In addition, we take the global and historical features as a query, and compute the attention value, which is then normalized as a softmax probability distribution, as a pointer to the input sequence.In this way, the variable with the largest probability is selected as the branching variable.

The proposed graph pointer neural network model is composed of two parts: 1) The graph neural network calculates the feature vector for each variable based on the variable features, constraint features and edge features; 2) The pointer mechanism outputs the variable selection probabilities by computing the attention values according to the variables' feature vectors and the query, which is constructed from the global and historical features. The detailed process of modeling the branching policy is as follows.

1) Initial Embedding Calculation: Variable features, constraint features, edge features, and global features have different dimensions. For example, the variable feature is 13-dimensional, and the global feature is 9-dimensional. Therefore, the d_h-dimensional embeddings of the variable features x_v, constraint features x_c, edge features x_e and global features x_g are computed as follows:

$$\mathbf{x}_v \leftarrow \text{EMBEDDING}(x_v),\quad \mathbf{x}_c \leftarrow \text{EMBEDDING}(x_c),\quad \mathbf{x}_e \leftarrow \text{EMBEDDING}(x_e),\quad \mathbf{x}_g \leftarrow \text{EMBEDDING}(x_g)$$

where EMBEDDING(·) is a two-layer fully connected neural network. The hidden dimension is d_h and the activation function between the layers is LeakyReLU.
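A minimal PyTorch sketch of one such embedding module (two fully connected layers with LeakyReLU) is given below. The input dimensions follow the text (13 for variables, 9 for global features, 1 for edges), while d_h = 64 is taken from the experimental settings; the constraint input dimension c is left to the reader.

    import torch
    import torch.nn as nn

    class Embedding(nn.Module):
        """Two-layer fully connected embedding with LeakyReLU activation."""
        def __init__(self, in_dim, d_h=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, d_h),
                nn.LeakyReLU(),
                nn.Linear(d_h, d_h),
            )

        def forward(self, x):
            return self.net(x)

    # one embedding module per feature type
    embed_var, embed_edge, embed_global = Embedding(13), Embedding(1), Embedding(9)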

2) Graph Neural Network: Next, the final variable features are computed by a graph convolutional neural network similar to [13]

$$\mathbf{x}_i^c \leftarrow f\Big(\mathbf{x}_i^c, \sum_{j:(i,j)\in E} g\big(\mathbf{x}_i^c, \mathbf{x}_j^v, \mathbf{x}_{ij}^e\big)\Big), \qquad \mathbf{x}_j^v \leftarrow f\Big(\mathbf{x}_j^v, \sum_{i:(i,j)\in E} g\big(\mathbf{x}_i^c, \mathbf{x}_j^v, \mathbf{x}_{ij}^e\big)\Big) \tag{10}$$

Function g (·) is defined as

where FF is a two-layer fully connected neural network with LeakyReLU activation. Function f(·) is also a two-layer fully connected neural network with LeakyReLU activation. As demonstrated in (10), the graph embedding is computed by two successive convolution passes, one from variables to constraints and the next from constraints to variables. The first convolution step computes the features x_i^c of constraint i according to the features x_j^v of its connected variables j, the features of the edge, and its own features. The second step computes the embedding of variable j according to the obtained features of its connected constraints i, the features of the edge, and its own features. Through the graph convolution process, the final variable features aggregate the original variable features, constraint features and coefficient features of the problem, so as to effectively encode the graph information of the MILP state.
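A sketch of one half-convolution pass over the bipartite graph (variables to constraints, or constraints to variables) in the spirit of [13] is shown below. The precise message and update functions f(·) and g(·) of the paper are not reproduced, so the two-layer MLPs here are illustrative only.

    import torch
    import torch.nn as nn

    class HalfConv(nn.Module):
        """One pass of the bipartite convolution: messages flow from source nodes
        (e.g., variables) to destination nodes (e.g., constraints); a second,
        symmetric pass sends them back."""
        def __init__(self, d_h=64):
            super().__init__()
            self.g = nn.Sequential(nn.Linear(3 * d_h, d_h), nn.LeakyReLU(), nn.Linear(d_h, d_h))
            self.f = nn.Sequential(nn.Linear(2 * d_h, d_h), nn.LeakyReLU(), nn.Linear(d_h, d_h))

        def forward(self, h_src, h_dst, h_edge, edge_index):
            # edge_index: LongTensor of shape (2, n_edges) holding (src, dst) node indices
            src, dst = edge_index
            msg = self.g(torch.cat([h_src[src], h_dst[dst], h_edge], dim=-1))  # per-edge messages
            agg = torch.zeros_like(h_dst).index_add_(0, dst, msg)              # sum messages per dst node
            return self.f(torch.cat([h_dst, agg], dim=-1))                     # updated dst features

    # usage: h_c = conv_vc(h_v, h_c, h_e, edges); then h_v = conv_cv(h_c, h_v, h_e, edges_flipped)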

3) Historical Feature Calculation: At branching step t, the first part of the historical features is the set of past branching decisions C_1 = {a_1, ..., a_{t-1}} at steps 1, ..., t-1. We compute this part of the d_h-dimensional historical features as

4) Pointer Mechanism: The attention value, which can be seen as a pointer to the candidate variables, is computed by a compatibility function of the query with the key. The query, which is composed of the global features and the historical features, represents the current state of the solver. The key represents the feature of each candidate variable. Specifically, the query vector is calculated as the weighted average of the global and historical features

where u_i^t is the attention value computed by the compatibility function. Note that other compatibility functions can also be applied to compute the attention; refer to [26] for more details. The softmax is used to normalize the attention values into the probability distribution p_i^t, representing the probability of selecting variable i at branching step t. In this case, we can choose the variable with the highest probability p_i^t as the branching variable.
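An illustrative pointer/attention step in PyTorch is given below, using the additive compatibility function of pointer networks; the compatibility function actually used in the paper may differ (cf. [26]), and the construction of the query from the global and historical embeddings is simplified here.

    import torch
    import torch.nn as nn

    class Pointer(nn.Module):
        def __init__(self, d_h=64):
            super().__init__()
            self.W_q = nn.Linear(d_h, d_h, bias=False)   # projects the query (global + history)
            self.W_k = nn.Linear(d_h, d_h, bias=False)   # projects each candidate variable (key)
            self.v = nn.Linear(d_h, 1, bias=False)

        def forward(self, query, var_feats, candidate_mask):
            # query: (d_h,); var_feats: (n_vars, d_h); candidate_mask: (n_vars,) bool
            u = self.v(torch.tanh(self.W_k(var_feats) + self.W_q(query))).squeeze(-1)
            u = u.masked_fill(~candidate_mask, float("-inf"))   # restrict to fractional candidates
            p = torch.softmax(u, dim=-1)                        # selection probabilities p_i^t
            return p, int(p.argmax())                           # branch on the most probable variable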

In addition, it is necessary to normalize the variable features, constraint features, edge features, and global features due to their different data range.To this end, the prenorm layer is applied as introduced in [13] to normalize the variable, constraint, and edge features.We also add a prenorm layer of global features accordingly, so that the neural network model can deal with problem instances with global features of different scales.

D. Branch and Bound Algorithm Based on GPN

The proposed GPN model is then used to select the branching variable in B&B.The GPN-based B&B is illustrated in Algorithm 1.

Algorithm 1 Branch-and-Bound Algorithm Based on GPN
Input: Root node R, representing the LP relaxation of the original MILP
Output: Optimal solution S∗
1: R.lowerBound ← −∞ /*Initialize the lower bound of R*/
2: Queue ← {R} /*Store the unexplored nodes in the Queue*/
3: UpperBound ← ∞ /*Initialize the global upper bound*/
4: S∗ ← null
5: while Queue is not empty do
6:   N ← Queue.get() /*Dequeue a node*/
7:   if N.lowerBound ≥ UpperBound then
8:     /*If node N's parent node's lower bound is greater than the global upper bound, prune this node*/
9:     continue
10:  end if
11:  S_r ← solve(N)
12:  if S_r is not feasible then
13:    /*Prune this node*/
14:    continue
15:  end if
16:  O_r ← S_r.objectiveValue
17:  if O_r > UpperBound then
18:    /*If node N's lower bound is greater than the global upper bound, prune this node*/
19:    continue
20:  end if
21:  if S_r satisfies all integrality constraints then
22:    UpperBound ← O_r
23:    S∗ ← S_r
24:    /*Update the global upper bound and S∗*/
25:    continue
26:  end if
27:  Extract features of the solver state, state = (V, C, E, G, H)
28:  V ← GPN(S_r, state) /*Select variable V by the GPN model*/
29:  a ← floor(V.value)
30:  L ← addConstraint(N, V ≤ a)
31:  R' ← addConstraint(N, V ≥ a + 1)
32:  /*Branch on V and obtain the two LP sub-problems*/
33:  L.lowerBound ← O_r, R'.lowerBound ← O_r
34:  Queue.add(L), Queue.add(R')
35: end while
36: return S∗

First, the LP relaxation of the original MILP problem is set as the root node. A queue data structure is maintained to store the sub-problem nodes to be solved. Each node records an initial lower bound l, which is the lower bound of its parent node. After the global upper bound is updated, if l is greater than the global upper bound, the node will be pruned. When a node is taken out of the queue, its lower bound is compared with the global upper bound, and the node is pruned if the lower bound is greater than the global upper bound. The global upper bound is initialized to ∞, and is updated every time a better feasible solution is obtained. The variable, constraint, edge, global and historical features of the candidate variables are extracted, and subsequently input to the GPN model. The model outputs a probability distribution over the candidate variables. The one with the highest probability is selected as the variable to branch on, and two sub-problems are generated accordingly. This process loops until the queue is empty, i.e., all leaves of the search tree are explored.

With n variables and m constraints, the computational complexity of the GPN model is O(mnd_h), where d_h is the dimension of the hidden layers. When applying the GPN model for branching, a single forward pass through the model yields the result. Therefore, it costs much less running time than the classical SB rule, which requires solving a number of sub-problems.

V.TRAINING METHOD

The model is trained by imitation learning. Given a dataset D = {(s_i, a_i^*)} of solver states labeled with the expert (SB) decisions, the model parameters θ are optimized by minimizing

$$\theta^* = \arg\min_\theta \sum_{(s_i, a_i^*) \in D} \text{loss}\big(a_i^*, \pi_\theta(s_i)\big)$$

where loss(∗) is a function that defines the difference between the true value and the predicted value. For classification problems, there are a number of loss(∗) functions, such as the accuracy and cross entropy.

However, in B&B, the SB scores of different variables might be the same or very close, in which case branching on any of these variables is equivalent. But only one variable is selected when constructing the labeled dataset. By applying loss functions such as cross entropy, the similarities of SB scores between different variables cannot be leveraged. In this case, we choose to imitate the distribution of the SB scores instead of the branching actions. To this end, the SB scores of all the candidate variables are recorded to construct the training set, and the Kullback-Leibler (KL) divergence is used as a measure of the difference between the SB score distribution and the predicted probability distribution. By minimizing the KL divergence, the model can handle the above situation where multiple variables have the same or similar SB scores, and can better imitate the SB scores.

Denote P as the true distribution of the data and Q as the predicted distribution of the model fitted to P. The KL divergence is defined as

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Therefore, given the probabilities π_θ(s) of the candidate variables output by the model, the model parameters θ are optimized by minimizing

$$\min_{\theta} \sum_{s} D_{KL}\big(P(s) \,\|\, \pi_\theta(s)\big)$$

where P(s) denotes the normalized distribution of the SB scores in state s.

In addition, we only care about the variables with high SB scores. The probability distribution over the other variables has no effect on the branching variable selection. Thereby, we emphasize the similarity loss of the variables with high SB scores in the training phase. Specifically, the variables are sorted according to the probabilities output by the model, and more attention should be paid to the first few variables. To this end, the KL divergence of the top-k variables is added to the loss. In particular, the probabilities π_θ(s) output by the model are sorted, and the first k variables I_k are selected. The KL divergence of the variables in I_k, denoted D_KL^{I_k}(P ∥ Q), is computed by (16), and the loss for training the model is defined as

$$L(\theta) = D_{KL}(P \,\|\, Q) + D_{KL}^{I_k}(P \,\|\, Q)$$

The first term of the loss makes the overall predicted distribution similar to the distribution of the SB scores, while the second term makes the model pay more attention to the variables with large probabilities and weakens the influence of the variables that are irrelevant for selecting the branching variable. This alleviates the situation where a large amount of training time is spent fitting the distribution of irrelevant variables.
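A sketch of the resulting training loss in PyTorch is given below: the full KL term plus the KL term restricted to the k candidates the model ranks highest. Normalizing the SB scores with a softmax and renormalizing both distributions on the top-k support are assumptions made for illustration; the paper does not spell these steps out.

    import torch
    import torch.nn.functional as F

    def topk_kl_loss(sb_scores, log_probs, k=10):
        """sb_scores, log_probs: 1-D tensors over the candidate variables of one sample."""
        p = torch.softmax(sb_scores, dim=-1)                  # target distribution P from SB scores
        kl_full = F.kl_div(log_probs, p, reduction="sum")     # D_KL(P || Q)

        top_idx = log_probs.topk(min(k, log_probs.numel())).indices   # model's top-k candidates
        p_top = p[top_idx] / p[top_idx].sum()                          # renormalized P on the top-k support
        logq_top = log_probs[top_idx] - torch.logsumexp(log_probs[top_idx], dim=-1)
        kl_top = F.kl_div(logq_top, p_top, reduction="sum")            # top-k KL term

        return kl_full + kl_top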

VI.EXPERIMENTAL RESULTS AND DISCUSSION

A. Experiment Settings

1) Comparison Algorithms: The proposed approach is compared against the following approaches.

a) First, the proposed approach is compared against the classic B&B algorithm with the branching rules of reliability branching (RB), strong branching (SB) and pseudocost branching (PB), respectively. They are all implemented in the well-known SCIP solver. Cutting planes are only allowed at the root node, and other heuristics are disabled during the branching process for a fair comparison. Our method is also implemented in the SCIP solver, and uses the same set of parameters as the competitor methods.

b) Next, the proposed approach is compared with the state-of-the-art machine learning-based B&B algorithms, often regarded as standard baselines in the literature. The algorithms under consideration are: the branching method [28] based on the ExtraTrees model [33] (TREES); the branching methods [12] (SVMRANK) and [29] (LMART) based on the SVMrank [34] and LambdaMART [35] models; and the branching method based on a graph neural network [13] (GNN).

2) Test Problems: The effectiveness of the proposed method is evaluated on the following three benchmark combinatorial optimization problems.

a) Set covering problem [36]: The set covering instances contain 1000 columns. The model is trained on instances with 500 rows, and is evaluated on instances with 500 and 1000 rows, respectively.

b) Capacitated facility location problem [37]: The instances are generated with 100 facilities. The model is trained on instances with 100 customers, and is evaluated on instances with 100 and 200 customers, respectively.

c) Maximum independent set problem [38]: The instances are generated following the process in [38]. The model is trained on instances with 500 nodes, and is evaluated on instances with 500 and 1000 nodes, respectively.

3) Experimental Parameter Settings: All compared algorithms are implemented in Python on top of the SCIP solver, and SCIP uses its default parameters. The hidden dimension of the models is set to d_h = 64. The Adam optimizer is used for training with a learning rate of 0.001, and k = 10 is set for the top-k imitation learning. The learning rate is decreased by 80% if the loss does not decrease for 10 epochs, and the training is terminated if the loss does not decrease for 20 epochs.

4) Training Data Generation: The SCIP solver with default settings is used to collect training samples offline. Random instances are generated and solved using SCIP. During the collection procedure, the RB branching rule is adopted with a probability of 95%, and the SB rule with a probability of 5%; only the samples generated by SB are collected. The variable, constraint, edge, global and historical features, the candidate variable sets, and the SB scores of the variables are collected.

Instances are randomly generated and solved until 140 000 samples are collected. 100 000 samples are used as the training set, 2000 samples are used as the validation set, and 2000 samples are used as the test set.

5) Evaluation: First, the capability of the GPN model in imitating the SB rule is examined. Since multiple variables may have the same or similar SB scores, the following indices are used to evaluate the model accuracy [13]: a) the percentage of times the output of the model is exactly the variable with the highest SB score (acc@1); b) the percentage of times the output of the model is one of the five variables with the highest SB scores (acc@5); c) the percentage of times the output of the model is one of the ten variables with the highest SB scores (acc@10). Moreover, the total solving time of the GPN-based B&B is evaluated in comparison with the benchmark methods.
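The acc@k metrics can be computed as sketched below, given the recorded SB scores and the model's chosen variable for each test sample; the array layout is assumed for illustration.

    import numpy as np

    def acc_at_k(all_sb_scores, all_choices, k):
        """all_sb_scores: list of 1-D arrays (SB scores of the candidates of each sample);
        all_choices: list of indices chosen by the model for each sample."""
        hits = 0
        for scores, choice in zip(all_sb_scores, all_choices):
            topk = np.argsort(scores)[::-1][:k]      # indices of the k highest SB scores
            hits += int(choice in topk)
        return hits / len(all_choices)

    # acc@1, acc@5 and acc@10 as reported in the experiments:
    # accs = [acc_at_k(scores_list, choices_list, k) for k in (1, 5, 10)]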

B. Results

Figs.3-5 present the training performances of the proposed GPN model in comparison to the classic GNN model across three test problems.The convergence of the loss and model accuracy on the validation set are both compared.

Fig.3.Training performances of the models on the set covering.

Results indicate that the proposed GPN model surpasses the conventional GNN in both convergence speed and final convergence value during training. Additionally, GPN consistently outperforms GNN in model accuracy on the validation set for all three problems. The superiority of GPN is particularly evident on the facility location problem and the maximum independent set problem, where GPN converges to loss values of 0.66 and 0.003, respectively, while GNN only reaches 0.72 and 0.0047.

Fig.4.Training performances of the models on the capacitated facility location.

The superior convergence performance of the GPN over the GNN model stems from its stronger modeling capability and the more informative features that represent the problem state. By incorporating an attention-based pointer mechanism, the GPN model comprehends the graph, global and historical characteristics of the problem more effectively, thereby facilitating more accurate decision-making. Therefore, compared to the GNN model, which solely employs a graph convolutional network to process constraint and variable features, the GPN model exhibits a faster convergence rate and a lower validation loss. This highlights the advantage of the GPN model in fostering computational efficiency while maintaining high-quality performance.

Tables IV-VI present the model accuracy of GPN, TREES,SVMRANK, LMART, and GNN methods on the test set.Results of TREES, SVMRANK and LMART are from [13].Results of acc@1, acc@5 and acc@10 are listed respectively.

Results outlined in Tables IV-VI indicate that for the set covering, capacitated facility location, and maximum independent set problems respectively, the GPN method consistently outperforms other approaches.This superiority is evident across all metrics: acc@1, acc@5, and acc@10.This empirical evidence solidifies the GPN’s position as the leading method in these scenarios.Its performance is distinctly superior in the maximum independent set problem.This observation aligns seamlessly with the performance of the model during the training and validation process, indicating a consistent model behaviour.This empirical evidence affirms our premise that the GPN method is adept at imitating the SB rule, surpassing the alternative methods examined.

Fig.5.Training performances of the models on the maximum independent set.

TABLE IV RESULTS OF MODEL ACCURACY ON SET COVERING

TABLE V RESULTS OF MODEL ACCURACY ON CAPACITATED FACILITY LOCATION

TABLE VI RESULTS OF MODEL ACCURACY ON MAXIMUM INDEPENDENT SET

In addition, the running time of the approaches is evaluated, since the aim of the branching models is to reduce the overall solving time of B&B. The solving time is determined by the size of the search tree, that is, the number of explored nodes, as well as by the time consumed in making the branching decisions. Therefore, a good branching model should reduce the size of the search tree while making fast branching decisions.

Tables VII-IX list the results of the solving time and the number of explored nodes when using GPN and the compared approaches for solving the three test problems. The “TimeB” index represents the duration required to make a single branching decision. Results are obtained by solving 100 randomly generated problems and taking the average.

TABLE VII RESULTS OF RUNNING TIME ON SET COVERING

Table VII shows that, in comparison with the PB and RB rule, the proposed GPN method achieves at least 40% increase in the solution speed when solving the 500-row and 1000-row set covering instances.In terms of the number of explored nodes, GPN outperforms all of the compared methods except for the SB rule on the set cover instances.SB can always get the smallest search tree.But its total solving time has no advantage over our method due to its long computation time of making branching decisions.It is obvious that the GPN method outperforms all the compared machine learning methods in terms of solving speed and the ability to reduce the search tree on the set covering instances.

It can be seen from Table VIII that the GPN method shows greater advantages in solving the 100-customer and 200-customer capacitated facility location instances compared to the other methods. Specifically, GPN runs twice as fast as the PB and RB methods. Compared with the machine learning methods, GPN has the fastest solving speed and the fewest number of explored nodes.

On the maximum independent set instances, GPN achieves nearly a 10% improvement in solution speed and a 20% reduction in the number of nodes, as seen in Table IX. The solving time is nearly halved when using GPN compared with the PB and RB methods.

TABLE VIII RESULTS OF RUNNING TIME ON CAPACITATED FACILITY LOCATION

TABLE IX RESULTS OF RUNNING TIME ON MAXIMUM INDEPENDENT SET

Upon analysis of the results, it is evident that the proposed GPN method is capable of delivering branching performance commensurate with the SB rule, but with markedly less computational time.As a result, the solving speed of the B&B method is significantly enhanced.This illustrates the efficiency of the GPN approach.

Note that, the test instances are generated randomly, and are different from the training set.Once the model is trained, it can be generalized to unseen instances, and scale to larger instances.Although the RB heuristic is carefully handcrafted by experts, it is still defeated by the proposed GPN method,which can learn heuristics from the data.Experiments validate the novelty and efficiency of the GPN method.

The goal of B&B is to solve the combinatorial optimization problem as fast as possible, so the branching strategy should strike a trade-off between the quality of each decision and the time spent on it. An extreme example is the SB rule: by calculating the SB score for variable selection, the final solution can be obtained with a small number of search nodes, but each decision step is very time-consuming, so the overall running time is very long. From the results of the 500-row set covering problem in Table VII, the SB rule takes 0.562 s on average to make a single branching decision, while our GPN method takes just 0.048 s. Despite GPN exploring more nodes (42.3) than SB (12.5), its total solving time is shorter. Similarly, PB, although faster in decision-making than GPN, explores more nodes (98.6), leading to a longer overall solving time. Comparatively, while SB makes the best branching decisions, it is the slowest per decision; PB is the quickest per decision but the least effective. GPN presents the best balance of decision quality and speed, offering the shortest solving time for B&B. This observation remains consistent across all other instances, validating that the proposed method achieves a better balance between decision quality and decision time, thereby reducing the overall solving time of B&B.

VII.CONCLUSION

This paper has presented a novel method of modeling the variable selection strategy in B&B using a deep neural network model. We made use of graph features and introduced global and historical features to represent the solver state. The architecture combines a graph neural network with a pointer mechanism, enabling effective variable selection. Our experimental results on benchmark problems show that our approach surpasses traditional expert-designed branching rules and also outperforms state-of-the-art machine-learning-based B&B methods.

Looking ahead, there are several promising avenues for extending and refining our work.

1) The process of constructing labeled datasets for imitation learning can be computationally intensive.To mitigate this, a potential solution would be to integrate reinforcement learning techniques.By using reinforcement learning, we could harness unsupervised training strategies that could alleviate the need for labor-intensive labeled datasets, paving the way for more scalable and efficient learning.

2) While our model has demonstrated its effectiveness on benchmark problems, it is essential to test it on a wider spectrum of practical problems.Exploring its applicability in diverse domains will provide insights into the generalizability and robustness of our approach.

3) The current model leverages graph, global, and historical features.Future work can explore the inclusion of other types of features, potentially capturing more nuanced aspects of the problem space, leading to even more informed decisions during the branching process.

4) One exciting direction would be the introduction of mechanisms allowing the neural network model to adapt in real-time based on feedback during the B&B process.This could lead to models that fine-tune their strategies on-the-fly,adjusting to the unique characteristics of the problem being solved.

5) As deep learning models can be computationally intensive, especially in real-time scenarios, future work can also explore optimizations tailored to specific hardware platforms,ensuring efficient execution and reduced computational overhead.

In conclusion, the present work lays a foundation upon which numerous exciting advancements can be built.We believe that the interplay of optimization and deep learning has much more to offer, and we are enthusiastic about the potential breakthroughs the future may bring.
