A Machine Learning-Based Approximation of Strong Branching

程序猿声 2021-08-09

These are reading notes on the paper and reflect my own understanding; corrections are welcome and much appreciated. The paper falls under the category of machine learning alongside optimization algorithms.

01 Introduction

Most current mixed-integer programming (MIP) solvers are based on the branch-and-bound (B&B) algorithm. In recent years, many features such as cutting planes, presolve, heuristics, and advanced branching strategies have been added to MIP solvers to improve solving efficiency.

The branching strategy has a large impact on how effectively a MIP is solved, and it has been studied extensively in the literature.

The simplest branching strategy is most-infeasible branching, which at every node selects the variable whose fractional part is closest to 0.5. Another is pseudocost branching, which keeps a record of the dual bound increases observed during past branchings and uses this history to estimate the dual bound increase of each candidate variable; its drawback is that little information is available at the beginning of the search. Strong branching overcomes this drawback: for every fractional variable it solves the linear programming (LP) relaxations of the two child nodes obtained by branching on that variable, explicitly evaluating the dual bound increases, and then branches on the variable with the largest increase. Its effectiveness is evident, but because there are many candidate variables and each LP relaxation takes time to solve, strong branching spends far too much time per node and is therefore inefficient overall. There are of course many other branching strategies, such as reliability branching, inference branching, and nonchimerical branching; interested readers can consult the literature.
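
To make the cost of strong branching concrete, here is a minimal sketch of how the dual-bound increases could be evaluated at a node. `solve_lp_relaxation` is a hypothetical helper standing in for an LP solver call, and the product scoring rule is just one common choice, not necessarily the exact rule used in the paper.

```python
import math

def strong_branching_select(model, lp_value, fractional_vars, solve_lp_relaxation):
    """Pick a branching variable by explicitly evaluating dual-bound increases."""
    best_var, best_score = None, -math.inf
    for var, value in fractional_vars:              # (variable, fractional LP value) pairs
        # Hypothetical helper: re-solve the LP relaxation with the variable
        # bounded down or up, returning the child's optimal objective value.
        down_obj = solve_lp_relaxation(model, var, upper=math.floor(value))
        up_obj = solve_lp_relaxation(model, var, lower=math.ceil(value))
        down_gain = down_obj - lp_value             # dual-bound increase of the "down" child
        up_gain = up_obj - lp_value                 # dual-bound increase of the "up" child
        score = down_gain * up_gain                 # one common scoring rule (product of gains)
        if score > best_score:
            best_var, best_score = var, score
    return best_var
```

The expense is visible here: two extra LP solves per candidate variable at every node, which is exactly what the learned approximation tries to avoid.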

The contribution of this paper is to use machine learning to learn an approximation of strong branching and to integrate the learned model into the B&B framework, so as to speed up the solution process.

The binary MILP problems addressed in this paper have the following form:

$$\min_{x}\; c^{\top}x \quad \text{s.t.}\quad Ax \le b,\qquad x_i \in \{0,1\}\;\; \forall\, i \in \mathcal{I},\qquad x_j \ge 0\;\; \forall\, j \in \mathcal{C},$$

where $c$ and $A$ denote the cost coefficients and the coefficient matrix, $b$ is the right-hand side, and $\mathcal{I}$ and $\mathcal{C}$ are the index sets of the integer (binary) variables and of the continuous variables, respectively.

02 Learning Branching Decisions

Rather than proposing a new branching heuristic, this paper uses machine learning to construct a branching strategy by imitating the decisions generated by another strategy. The goal is to build an approximation of strong branching that is much cheaper to evaluate.

2.1 Data Set Generation

Machine learning first requires collecting data pairs $(\phi_i, y_i)$, where $\phi_i$ is the feature vector describing a candidate branching variable at the current node and $y_i$ is the corresponding label, and then training a model on these pairs. The data can be collected by solving a set of training problems with strong branching and recording the intermediate information of the search: at every node, the features of each candidate variable and its label are stored, and this data is finally fed to the machine learning algorithm to train the model.
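
A rough sketch of this data-collection loop might look as follows; `compute_features` and `strong_branching_scores` are hypothetical placeholders for the feature extraction of Section 3 and for the strong-branching evaluation, respectively.

```python
def collect_training_pairs(node, candidates, compute_features, strong_branching_scores):
    """Record one (feature vector, label) pair per candidate variable at a B&B node."""
    scores = strong_branching_scores(node, candidates)   # labels derived from strong branching
    pairs = []
    for var in candidates:
        phi = compute_features(node, var)                # features of `var` at this node (Section 3)
        pairs.append((phi, scores[var]))
    return pairs
```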

2.2 Machine Learning Algorithm

The algorithm used in this paper is Extremely Randomized Trees (ExtraTrees). ExtraTrees is a direct modification of random forests, and it is chosen because it is quite robust to the choice of its parameters.
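
Assuming the $(\phi_i, y_i)$ pairs have been collected as above, a minimal training sketch with scikit-learn's ExtraTreesRegressor (an implementation of Extremely Randomized Trees) could look like this; the hyperparameter values here are illustrative defaults, not the ones used in the paper.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def train_branching_model(pairs, n_estimators=100):
    """Fit an ExtraTrees model on the (feature vector, label) pairs from Section 2.1."""
    X = np.array([phi for phi, _ in pairs])
    y = np.array([label for _, label in pairs])
    model = ExtraTreesRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X, y)
    return model

def choose_branching_variable(model, candidate_vars, candidate_features):
    """At branching time, score every candidate and branch on the best one."""
    scores = model.predict(np.array(candidate_features))
    return candidate_vars[int(np.argmax(scores))]
```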

03 Features Describing the Current Subproblem

Features are essential to a machine learning algorithm and play a key role in the effectiveness of the method. On the one hand, the features must be complete and precise enough to describe the subproblem accurately; on the other hand, they must be cheap to compute. These two requirements have to be traded off, because some features contribute greatly to the quality of the method but are very expensive to compute. For example, the increase in the LP objective value obtained when branching on a variable is an excellent feature, and it is exactly what strong branching uses; however, computing it is far too expensive, so it is not suitable for the method in this paper.

The author discusses at length which features are chosen and why; interested readers can refer to the paper. The detailed feature descriptions below are important, so the original text is reproduced here.

3.1 Static Problem Features

The first set of features is computed from the sole parameters $c$ and $A$. They are calculated once and for all, and they represent the static state of the problem. Their goal is to give an overall description of the problem.

Besides the sign of the cost element $c_i$, two further cost-based quantities are used (their exact definitions are given in the paper). Distinguishing both is important, because the sign of the coefficient in the cost function is of utmost importance for evaluating the impact of a variable on the objective value.

The second class of static features is meant to represent the influence of the coefficients of variable $i$ in the coefficient matrix $A$. Three measures are developed that describe variable $i$ within the problem in terms of the constraints. Once the values of a measure have been computed, the corresponding features added to the feature vector are its minimum and maximum over the constraints. The rationale behind this choice is that, when it comes to describing the constraints of a given problem, only the extreme values are relevant.

The first measure is composed of two parts, each computed under its own condition (the formulas are given in the paper). The minimum and maximum values of both parts are used as features, to indicate by how much a variable contributes to the constraint violations.

The second measure models the relationship between the cost of a variable and the coefficients of the same variable in the constraints. Similarly to the first measure, it is split into two parts, and the feature vector contains both the minimum and the maximum values of each part.

Finally, the third measure represents the inter-variable relationships within the constraints. It is split into two parts, each of which is in turn divided into two sub-measures, giving four values per constraint. Again, the minimum and maximum of these four values computed over all constraints are added to the features.
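
The exact formulas of the three measures are not reproduced here; the sketch below only illustrates the aggregation pattern described above, i.e. evaluating some per-constraint measure for variable $i$ and keeping only its minimum and maximum over the constraints. The `constraint_measure` argument is a hypothetical stand-in for the paper's measures.

```python
import numpy as np

def static_features_for_variable(c, A, i, constraint_measure):
    """Min/max aggregation of a per-constraint measure, plus simple cost features."""
    # Sign of c_i; |c_i| stands in here for the paper's cost-based quantities.
    feats = [float(np.sign(c[i])), abs(float(c[i]))]
    # Evaluate the per-constraint measure for every constraint in which variable i
    # appears, then keep only the extreme values, as described above.
    values = [constraint_measure(A, i, j) for j in range(A.shape[0]) if A[j, i] != 0]
    # Zero default when the variable appears in no constraint (my assumption).
    feats.extend([min(values), max(values)] if values else [0.0, 0.0])
    return np.array(feats)
```
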
3.2 Dynamic Problem Features

The second type of features is related to the solution of the problem at the current B&B node. Those features contain the proportion of fixed variables at the current solution, the up and down fractionalities of variable $i$, the up and down Driebeek penalties (Driebeek 1966) corresponding to variable $i$, normalized by the objective value at the current node, and the sensitivity range of the objective function coefficient of variable $i$, also normalized by the current objective value.
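
For instance, the up and down fractionalities are cheap to compute from the LP value alone; a tiny sketch is given below (the Driebeek penalties and the sensitivity range require additional solver information and are omitted).

```python
import math

def fractionalities(x_i):
    """Up and down fractionality of the LP value x_i of an integer variable."""
    down = x_i - math.floor(x_i)   # distance from x_i to its rounded-down value
    up = math.ceil(x_i) - x_i      # distance from x_i to its rounded-up value
    return up, down
```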

3.3 Dynamic Optimization Features

The last set of features is meant to represent the overall state of the optimization. These features summarize global information that is not available from the single current node. When branching is performed on a variable, the objective increases are stored for that variable. From these numbers, we extract statistics for each variable: the minimum, the maximum, the mean, the standard deviation, and the quartiles of the objective increases. These statistics are used as features to describe the variable for which they were computed.

As those features should be independent of the scale of the problem, we divide each objective increase by the objective value at the current node, such that the computed statistics correspond to the relative objective increase for each variable. Finally, the last feature added to this subset is the number of times variable i has been chosen as branching variable, normalized by the total number of branchings performed.
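
A small sketch of how these per-variable statistics could be assembled, assuming the raw objective increases recorded for a variable and the current objective value are available; the zero default for variables that were never branched on is my assumption, not from the paper.

```python
import numpy as np

def objective_increase_stats(increases, current_obj, times_branched, total_branchings):
    """Per-variable statistics of relative objective increases (Section 3.3)."""
    if len(increases) == 0:
        rel = np.zeros(1)                 # no branching history yet (assumed default)
    else:
        rel = np.asarray(increases, dtype=float) / current_obj   # scale-independent increases
    q1, q2, q3 = np.percentile(rel, [25, 50, 75])
    return [rel.min(), rel.max(), rel.mean(), rel.std(),  # min, max, mean, standard deviation
            q1, q2, q3,                                   # quartiles
            times_branched / total_branchings]            # branching frequency of this variable
```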

04 Experiments

The experiments proceed in three steps: (1) we generate a data set using strong branching, (2) we learn a branching heuristic from it, and (3) we compare the learned heuristic with other branching strategies on various problems.

4.1 Problem Sets

Two types of problem sets are used, namely randomly generated problem sets and standard benchmark problems from the MIPLIB. The random problem sets are used for both training (steps 1 and 2) and assessing (step 3) the heuristics. MIPLIB problems are only used for assessment (step 3).

The possible constraints are chosen among set covering (SC), multiknapsack (MKN), bin packing (BP), and equality (EQ) constraints. We generated problems that contain constraints of type BP-EQ, BP-SC, and MKN-SC.

As some of those problems are going to be used to generate the training data set, we randomly split each family into a train and a test set. In the end, we have six data sets: BPEQ_train, BPEQ_test, BPSC_train, BPSC_test, MKNSC_train, and MKNSC_test.

The ExtraTrees hyperparameters are fixed to the values reported in the paper.

4.3 Results

We compare our approach to five other branching strategies, namely random branching (random), most-infeasible branching (MIB), nonchimerical branching (NCB) (Fischetti and Monaci 2012), full strong branching (FSB), and reliability branching (RB) (Achterberg et al. 2005).

In these tables (Tables 4, 5, and 7), Cl. Gap refers to the closed gap and S/T indicates the number of problems solved within the provided nodes or time limit versus the total number of problems. Nodes and Time, respectively, represent the number of explored nodes and the time spent (in seconds) before the optimization either finds the optimal solution or stops earlier because of one stopping criterion.

The normal learned branching strategy is trained on a data set containing samples from all three types of training problems (i.e., BPSC, BPEQ, and MKNSC).

Finally, Tables 6 and 7 report the results from our last set of experiments.

Additionally, in another experiment, we let CPLEX use cuts and heuristics (with default CPLEX parameters) in the course of the optimization to observe their impact on the efficiency of each branching strategy.

Overall, our method compares favorably to its competitors when cuts and heuristics are used by CPLEX. Indeed, in that case, our learned branching strategy is the fastest (almost three times faster than the second fastest method (i.e., MIB)) to solve all 30 considered problems.

Note that the apparent bad results of RB are due to three problems that are especially hard for that branching heuristic (air04, air05, and mod011).

Things appear to be different when cuts and heuristics are not used. Indeed, based on the results of Table 7, our method seems to be very slow, but the large number of nodes and the large amount of time are actually due to a small number of problems for which the method does not work well.

A possible explanation for why our method does not perform well on those problems is that these problems, because they are too large, are not well represented in the data set used to learn the branching strategy.

Reference

[1] Alejandro Marcos Alvarez, Quentin Louveaux, Louis Wehenkel (2017) A Machine Learning-Based Approximation of Strong Branching. INFORMS Journal on Computing 29(1):185-195. http://dx.doi.org/10.1287/ijoc.2016.0723

