Ameer Hamza Khan, Xinwei Cao, Shuai Li, Vasilios N. Katsikis, and Liefa Liao
Abstract— In this paper, we propose an enhancement to the beetle antennae search (BAS) algorithm, called BAS-ADAM, to smoothen its convergence behavior and avoid trapping in local minima for highly non-convex objective functions. We achieve this by adaptively adjusting the step size in each iteration using the adaptive moment estimation (ADAM) update rule. The proposed algorithm also increases the convergence rate in narrow valleys. A key feature of the ADAM update rule is its ability to adjust the step size for each dimension separately instead of using the same step size for all dimensions. Since ADAM is traditionally used with gradient-based optimization algorithms, we first propose a gradient estimation model that does not require differentiating the objective function. As a result, the algorithm demonstrates excellent performance and a fast convergence rate in searching for the optimum of non-convex functions. The efficiency of the proposed algorithm was tested on three different benchmark problems, including the training of a high-dimensional neural network. Its performance is compared with the particle swarm optimizer (PSO) and the original BAS algorithm.
OPTIMIZATION plays an integral part in the efficient operation of almost all real-world systems [1]–[3]. Additionally, with the rise of machine learning in recent years, the development of numerically efficient and robust optimization algorithms has been a topic of great research interest [4]–[8]. Conventional optimization techniques use gradient-based methods to search for the optimum value [9]–[12]. Although these algorithms have proven to be quite stable [13], [14], they require the symbolic or numerical computation of the gradient direction. Most practical optimization problems are highly nonlinear, with multimodal objective functions and non-convex constraints. Numerically evaluating the gradient of such functions can be a computationally intensive task [15], [16]. Additionally, the computation of the gradient imposes several conditions on the continuity and differentiability of the objective function [17]. Such conditions do not hold for a vast number of systems, e.g., integer programming [18]. In fact, for a large class of optimization problems, specifically in control systems [19], an accurate model of the system might be unknown in advance and require real-time estimation of parameters [20], [21].
The large majority of gradient-based optimization algorithms are well-suited to computing systems with high computational power. However, their implementation on low-end embedded systems is challenging because of the limited computational power of such devices. To overcome these challenges, a new class of optimization algorithms, called metaheuristic optimization, has gained the attention of researchers [22]–[24]. These algorithms are mostly inspired by natural phenomena and do not require the computation of the gradient of the objective function. They have been shown in the literature to possess excellent convergence properties and high numerical efficiency. Even for problems in control systems where the model is unknown a priori [25], such algorithms have been employed for parameter estimation and for tuning controller gains [26]. One of the significant advantages offered by metaheuristic optimization is the relaxation of the conditions of continuity and differentiability; these algorithms can be effectively employed to solve discrete optimization problems. Additionally, because of their low complexity, they can be efficiently implemented on embedded systems with limited computational power. Given these advantages, metaheuristic algorithms have found applications in several real-world systems [27]–[32].
Almost all the metaheuristic optimization algorithms proposed in [33]–[38] have found their inspiration in natural processes. Most of these algorithms have been inspired by the behavior of animals, whereas others found their inspiration in biological and chemical systems [39]. For example, biological evolution has given rise to a complete class of metaheuristic algorithms [40]–[42], e.g., genetic algorithms (GAs) [43], [44]. Other commonly used metaheuristic algorithms inspired by the behavior of living organisms include particle swarm optimization (PSO) [45], [46]. PSO was inspired by the swarming behavior commonly observed in birds and insects. Swarming behavior [47] has been of particular interest to researchers because large groups of birds and insects are able to coordinate by following a set of straightforward rules. Although each member of the swarm is only aware of its limited surrounding environment, their combined effort can solve a large-scale, complicated task. This behavior is termed swarm intelligence [48], [49]. Swarm behavior can be considered an optimization process in which a group of birds works together to maximize the survivability of the whole swarm. Swarm intelligence has inspired several other algorithms, e.g., ant colony optimization (ACO) [50], which is based on the social behavior of ants. An ant colony can contain millions of ants, but their social behavior and swarm intelligence help them efficiently search for food, thus maximizing the productivity of the colony. Similarly, the artificial fish swarm algorithm [51] is based on the swarming behavior of fish and other aquatic life. Several other nature-inspired metaheuristic algorithms have been presented in the literature; a few of them are mentioned here: beetle antennae search (BAS) [52], [53], cuckoo search [54], [55], invasive weed optimization (IWO) [56], the honey bee algorithm (HBA) [57], [58], and firefly algorithms (FAs) [59]. Additionally, optimization algorithms inspired by the hunting behavior of animals have also been proposed in recent years, e.g., the grey wolf optimizer (GWO) [60] and its variations [61], [62], which are inspired by the hierarchical structure and hunting mechanisms adopted by wolves. Other optimization algorithms are based on sinusoidal functions [63], [64] and use them to converge an initial population of random particles to the real solution. These metaheuristic algorithms have been shown to perform well in real-world practical scenarios. However, their application in machine learning is still limited. Machine learning models are usually very large-scale, having a large number of parameters; training them requires searching very high-dimensional spaces, and conventional metaheuristic algorithms do not scale well to such high dimensions [65].
As already mentioned in the abstract, in this paper we consider the BAS algorithm [52], [53], which was inspired by the food-foraging behavior observed in beetles. Beetle behavior is of particular research interest because, unlike other insects, beetles usually do not work in a swarm and have the ability to search for food individually. A beetle can search for food using its antennae and sensitive sense of smell. The beetle antennae carry several small fibers, and by sensing the difference of smell at each fiber, the beetle can develop a map of the smell intensity of its surroundings. This map helps in finding the direction of maximum smell change. Inspired by this concept, an estimation model can be developed that approximates the gradient direction. The estimation of an approximate gradient is a key feature of BAS and distinguishes it from other metaheuristic algorithms. Since its introduction, BAS has found applications in several real-world systems [21], [66]–[75]. The working of the original BAS can be described as follows: at each iteration, the value of the objective function is computed at each antennae-fiber location; a vector is drawn from the fiber with the lowest value toward the fiber with the highest value; this vector represents the direction of the approximated gradient, which is then used to update the location of the beetle. The working of the original BAS is shown in Fig. 1(a).
Fig. 1. Illustration of the working of the original BAS and the BAS-ADAM algorithm proposed in this paper using the beetle analogy. (a) Original BAS; (b) BAS-ADAM.
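To make the mechanism concrete, the following Python sketch implements one iteration of the classic two-antenna BAS step described above. It is an illustrative approximation, not the authors' reference implementation; the function names and parameter values are assumptions of this example.

```python
import numpy as np

def bas_step(f, x, d, delta, rng):
    """One iteration of the classic two-antenna BAS for maximizing f.

    f     : objective function, R^n -> R
    x     : current position of the beetle (numpy array)
    d     : antenna length (distance from x to each sensing point)
    delta : step size for this iteration
    rng   : numpy random generator
    """
    # Random unit vector giving the orientation of the antennae.
    b = rng.standard_normal(x.shape)
    b /= np.linalg.norm(b)

    # Sense the objective at the two antenna tips.
    f_right = f(x + d * b)
    f_left = f(x - d * b)

    # Move along the antenna axis toward the better-smelling side.
    return x + delta * b * np.sign(f_right - f_left)

# Example usage on a toy concave objective (maximum at the origin).
rng = np.random.default_rng(0)
x = np.array([3.0, -2.0])
for t in range(200):
    x = bas_step(lambda z: -np.sum(z**2), x, d=0.5, delta=0.95**t, rng=rng)
```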
One of the key limitations of the BAS algorithm is the selection of a suitable initial step size and a proper step-size annealing schedule. Careful tuning of these parameters is required to achieve a fast convergence rate and stable performance. Previous works on BAS [52], [53] use exponential-decay and power-law annealing. However, these schedules are not adaptive to the shape of the objective function, i.e., the step size changes at the same rate regardless of the local landscape. This behavior can result in slow convergence if the objective function has a narrow valley. From the optimization literature, we know that for such objective functions gradient-based methods, especially stochastic gradient descent (SGD), show poor performance, since the estimated point can bounce back and forth between the walls of the valley during iterations [76]. This increases the number of iterations required to reach the optimal point. To improve the convergence rate, several modifications to the SGD algorithm have been proposed in the literature that adaptively adjust the step size and direction. One such popular algorithm is adaptive moment estimation (ADAM) [9]. The ADAM algorithm updates the step size based on all past values of the gradient and of the squared gradient. The idea of ADAM is borrowed from momentum SGD [76], but it shows more robustness and stability near optimal points. Several other step-size update rules, e.g., RMSProp, AdaGrad [77], and AdaDelta [78], have also been introduced in the literature. ADAM, however, combines the features of several of these algorithms: it computes first- and second-order moment estimates of the gradient and also applies bias correction to these estimates. For this reason, we choose the ADAM algorithm to adaptively update the step size at runtime instead of manually adjusting the parameters at the beginning. The working of BAS-ADAM is shown in Fig. 1(b); it can be seen that it converges faster than the original BAS.
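For reference, the sketch below shows the standard ADAM moment estimates and bias correction that the preceding paragraph describes, applied to a generic (estimated) gradient. The hyper-parameter names and defaults (eta, beta1, beta2, eps) follow common usage and are assumptions of this example, not values prescribed by the paper.

```python
import numpy as np

def adam_update(x, g, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step for maximizing an objective with (estimated) gradient g.

    x : current point, g : gradient estimate at x,
    m : running first moment, v : running second moment, t : iteration (1-based).
    Returns the updated point and the updated moments.
    """
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    # Per-dimension effective step size: eta / (sqrt(v_hat) + eps).
    x = x + eta * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```

Because the second moment is accumulated per dimension, directions with consistently small gradients automatically receive larger effective steps, which is the behavior exploited by BAS-ADAM.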
To demonstrate the fast convergence, efficiency, and effectiveness of the proposed BAS-ADAM algorithm, we performed several sets of experiments and compared its performance with other benchmark metaheuristic algorithms. In the literature on metaheuristic algorithms, performance is commonly verified using low-dimensional benchmark functions [57]. In this paper, we took the rigor of the testing process one step further and used BAS-ADAM to train a single-hidden-layer neural network with nonlinear activation functions. Additionally, we used two other benchmark optimization problems similar to those in other works in the metaheuristic literature. We solved these benchmark optimization problems using PSO, a state-of-the-art metaheuristic algorithm, the original BAS, and BAS-ADAM. The results show that BAS-ADAM achieves superior performance compared with the original BAS and PSO in solving the benchmark optimization problems, including the training of neural networks. The main contributions of this paper are highlighted as follows:
1) The proposed algorithm incorporates the concept of gradient estimation into the metaheuristic optimization framework to intelligently decide a search direction. Traditional metaheuristic algorithms are mostly gradient-free and use random search directions.
2) The algorithm adaptively adjusts the step size according to the value of the estimated gradient. Traditional metaheuristic optimization algorithms primarily use predefined rules to calculate the step size and require manual tuning of hyper-parameters.
3) The proposed ADAM update rules help avoid the slow convergence behavior caused by narrow valleys in the objective function. Therefore, the proposed update rules show smoother convergence behavior compared with traditional algorithms.
The rest of the paper is organized as follows: Section II presents the details of the BAS-ADAM algorithm, Section III presents the benchmark optimization problems we used in this paper, Section IV presents the experimental results, and Section V concludes the paper.
In this section, we will describe the technical details and mathematical formulation of the BAS-ADAM algorithm and outline its steps.
Here we will describe the BAS algorithm [52] and present the gradient estimation model. Consider the optimization problem
$$\max_{x}\; f(x), \qquad x \in \mathbb{R}^{n},$$
where $f(\cdot)$ is the objective function and $n$ is the dimension of the optimization variable.
Now we will formulate the strategy to estimate the gradient direction using the elements of the sets $X$ and $F$. Note that there is a one-to-one correspondence between the elements of $X$ and $F$. Based on this correspondence, we first create a subset $X_l \subset X$ corresponding to the $k$ lowest values in the set $F$. Similarly, we create a subset $X_h \subset X$ corresponding to the $k$ highest values in the set $F$. The value of $k$ controls the robustness of the gradient estimation. In mathematical formulation, we create the following sets:
$$F_l = \{\,k \text{ smallest elements of } F\,\}, \qquad F_h = \{\,k \text{ largest elements of } F\,\},$$
$$X_l = \{\,x_{b_i} \in X : f(x_{b_i}) \in F_l\,\}, \qquad X_h = \{\,x_{b_i} \in X : f(x_{b_i}) \in F_h\,\}.$$
The estimated gradient direction then points from the centroid of $X_l$ toward the centroid of $X_h$, consistent with the beetle analogy of moving from the weakest-smelling fibers toward the strongest-smelling ones.
Based on the value of the estimated gradient $\hat{g}_t$, we can formulate an update rule of the form
$$x_{t+1} = x_t + \delta_t\,\hat{g}_t,$$
where $\delta_t$ is the step size at iteration $t$. In BAS-ADAM, this step size is adjusted per dimension at runtime using the ADAM moment estimates.
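The following Python sketch illustrates one way to implement the multi-fiber gradient estimate and the resulting update, under the assumptions stated in the comments; averaging the $k$ best and $k$ worst fibers is our reading of the construction above, not a verbatim transcription of the paper's equations.

```python
import numpy as np

def estimate_gradient(f, x, d, m, k, rng):
    """Estimate the gradient of f at x from m antennae fibers of length d.

    Fibers are placed at x + d*b_i for random unit directions b_i. The
    estimate points from the centroid of the k worst fibers toward the
    centroid of the k best fibers (maximization convention).
    """
    B = rng.standard_normal((m, x.size))
    B /= np.linalg.norm(B, axis=1, keepdims=True)   # normalized directions
    fibers = x + d * B                              # antennae-fiber locations
    values = np.array([f(p) for p in fibers])       # objective at each fiber

    order = np.argsort(values)
    x_low_mean = fibers[order[:k]].mean(axis=0)     # centroid of X_l
    x_high_mean = fibers[order[-k:]].mean(axis=0)   # centroid of X_h
    g = x_high_mean - x_low_mean
    return g / (np.linalg.norm(g) + 1e-12)          # normalized direction

# A plain (non-ADAM) update using the estimate would then be:
# x_next = x + delta_t * estimate_gradient(f, x, d, m, k, rng)
```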
The presented BAS-ADAM algorithm is given in Algorithm 1; it can be summarized as follows:
Algorithm 1: BAS-ADAM optimizer
Input: An objective function $f(x)$, $x \in \mathbb{R}^{n}$, and values of the parameters $x_0$, $d_0$, $\eta$, $\gamma$, $\beta_1$, $\beta_2$, and $t_{\mathrm{stop}}$.
Output: Optimal solution $x^{*}$ of $\max_{x} f(x)$ and the optimal value $f(x^{*})$.
Initialization: $t \leftarrow 1$, $x_{\mathrm{best}} \leftarrow x_0$, $f_{\mathrm{best}} \leftarrow f(x_{\mathrm{best}})$.
Step 1: Begin with the random starting point $x_0$.
Step 2: Generate a set of $m$ normalized random direction vectors $B \subset \mathbb{R}^{n}$.
Step 3: Calculate the set $X \subset \mathbb{R}^{n}$ of antennae-fiber locations.
Step 4: Calculate the set of objective-function values $F = \{f(x_{b_1}), f(x_{b_2}), \ldots, f(x_{b_m})\}$.
Step 5: Calculate the sets $X_l \subset X$ and $X_h \subset X$ corresponding to the $k$ lowest and $k$ highest values $F_l$ and $F_h$ of the set $F$ using (3).
The remaining steps estimate the gradient from these sets, apply the ADAM update of (9) and (10) to adjust the step size, and update $x_{\mathrm{best}}$ and $f_{\mathrm{best}}$ until $t$ reaches $t_{\mathrm{stop}}$.

D. Complexity Analysis

Now we will calculate the computational complexity of BAS-ADAM. As demonstrated in [21], the original BAS algorithm has a complexity of $O(n)$, where $n$ is the dimensionality of the optimization variable. BAS-ADAM additionally computes the second moment of the gradient according to (9) and the bias correction according to (10). The computation of the second moment requires a number of floating-point operations proportional to $n$, while the bias correction requires 4 floating-point operations. This implies that a single particle of BAS-ADAM requires only a few additional floating-point operations compared with the original BAS, i.e., the complexity of a single particle of BAS-ADAM is still $O(n)$. Since BAS-ADAM employs a total of $m$ search particles, the effective complexity of BAS-ADAM is $O(mn)$. This shows that, although the complexity of an iteration of BAS-ADAM is higher than that of the original BAS, BAS-ADAM requires far fewer iterations to reach the optimal solution, as demonstrated in the Results section. Additionally, the gains in robustness and smooth convergence toward the optimal point also demonstrate the efficacy of BAS-ADAM over the original BAS algorithm.

III. Validation Methodology

This section presents the validation methodology used to test the efficiency, convergence, and efficacy of the proposed BAS-ADAM algorithm. We selected a set of three benchmark problems. The first is the search for the optimum value of the Michalewicz function [79]. The second is a linear regression optimization problem, and the third involves training a single-hidden-layer neural network.

A. Michalewicz Function

The Michalewicz function [79] is a highly nonlinear, multimodal, non-convex function. This problem involves searching for the minimum of the Michalewicz function. Since it is a multimodal function, there is a high chance that the searching algorithm gets trapped in a local minimum. Therefore, it is heavily used in the literature to test the performance of metaheuristic algorithms. The Michalewicz function is defined as
$$f(x) = -\sum_{i=1}^{n} \sin(x_i)\left[\sin\!\left(\frac{i\,x_i^{2}}{\pi}\right)\right]^{2p},$$
where $p$ is a steepness parameter (commonly set to 10).

Fig. 2. Plot of the Michalewicz function selected as the validation problem formulated in Section III-A. It is a non-convex function with several local optima. Its theoretically known optimal solution is shown on the graph.
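As a concrete reference, the following sketch evaluates the Michalewicz function in Python with the commonly used steepness exponent of 10 (an assumption of this example); it could serve as the objective handed to any of the optimizers discussed in this paper.

```python
import numpy as np

def michalewicz(x, p=10):
    """Michalewicz benchmark function (to be minimized).

    x : array of length n; p : steepness parameter (p = 10 is the usual choice).
    """
    i = np.arange(1, x.size + 1)
    return -np.sum(np.sin(x) * np.sin(i * x**2 / np.pi) ** (2 * p))

# For n = 2 the global minimum is approximately -1.8013 near x = (2.20, 1.57).
print(michalewicz(np.array([2.20, 1.57])))   # ≈ -1.80
```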
B. Linear Regression

For the Michalewicz function, the objective function was fixed and the optimum value is already known in the literature. In contrast, our second problem involves solving a linear regression optimization problem in which the objective function is defined using randomly generated data points. Since linear regression is essentially a least-squares optimization problem, the objective function is convex. The linear regression problem can be stated as follows: given a vector of independent variables $x \in \mathbb{R}^{n}$ and one dependent variable $y$, estimate the parameters $w \in \mathbb{R}^{n}$ and $b \in \mathbb{R}$ that give the best fit to the linear model
$$\hat{y} = w^{\top}x + b.$$
The criterion for the best fit is usually defined in the least-squares sense. If we are provided with a set of $N$ data points for the independent and dependent variables, i.e., $\{x_i\}$ and $\{y_i\}$, respectively, we can define a vector of residual errors as
$$e_i = y_i - \bigl(w^{\top}x_i + b\bigr), \qquad i = 1, 2, \ldots, N.$$
Based on this vector, we can define a cost function as the square sum of the residual values,
$$f(w, b) = \sum_{i=1}^{N} e_i^{2}.$$
Therefore, linear regression can be written as the following optimization problem:
$$\min_{w,\,b}\; \sum_{i=1}^{N} \bigl(y_i - w^{\top}x_i - b\bigr)^{2}.$$
During the experiments we used $N = 300$, i.e., 300 sample points. The randomly generated dataset is shown in Fig. 3. It can be seen from the model above that for $n = 2$, three parameters need to be estimated, i.e., $w_1$, $w_2$, and $b$. In Section IV we present the numerical results for this problem.

Fig. 3. Point cloud used in validation problems two and three in Sections III-B and III-C, respectively. $x_1$ and $x_2$ are the independent variables and $y$ is the dependent variable.

C. Neural Network Training

As the third benchmark problem, we chose the training of a single-hidden-layer neural network with nonlinear activation functions. Most metaheuristic algorithms show poor performance when training a machine learning model because these models are highly nonlinear and contain a large number of parameters. Without information about the gradient direction, searching a high-dimensional space becomes very challenging. Since our metaheuristic algorithm involves a rough estimation of the gradient direction, it is expected to show better performance in high-dimensional search spaces. Additionally, because we use the ADAM algorithm to adjust the step size adaptively, the proposed algorithm also does an excellent job of avoiding local minima. It is known from the machine learning literature that hidden-layer neural networks are excellent at end-to-end learning [80], i.e., they can model an unknown process by just observing input-output data points, without needing an a priori mathematical model. Therefore, we used the same point cloud of Fig. 3 as in the linear regression problem. However, this time, instead of using the linear model of (14), we used a neural network. It is expected that the neural network will provide a better fit to the point cloud and therefore produce a lower value of the cost function. The architecture of the neural network used in this paper has two inputs, ten hidden neurons, and one output. Now we derive the objective function for this optimization problem. Let $x$ and $\hat{y}$ be the input and output of the neural network, respectively. The feedforward equations of the neural network can be written as
$$h = \sigma\!\left(W_1 x + b_1\right), \qquad \hat{y} = w_2^{\top} h + b_2,$$
where $W_1 \in \mathbb{R}^{10\times 2}$, $b_1 \in \mathbb{R}^{10}$, $w_2 \in \mathbb{R}^{10}$, and $b_2 \in \mathbb{R}$ are the trainable parameters and $\sigma(\cdot)$ is the nonlinear activation function. Now consider a set of $N$ samples of input and output data points. Similar to the approach used in Section III-B, we can define the following cost function for training the neural network:
$$f\bigl(W_1, b_1, w_2, b_2\bigr) = \sum_{i=1}^{N} \bigl(y_i - \hat{y}_i\bigr)^{2}.$$
In our experiments, we used a neural network with ten hidden neurons, two inputs, and one output, which has a total of 41 trainable parameters. In summary, problem three has the following final values for the optimization problem: $n = 41$ and $N = 300$, i.e., 300 training samples. The results are presented in Section IV.
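To illustrate how such a network can be trained by a derivative-free optimizer, the sketch below packs the 41 parameters of the 2-10-1 architecture into a single vector and evaluates the sum-of-squares cost; the activation choice (tanh) and the helper names are assumptions of this example rather than details taken from the paper.

```python
import numpy as np

def unpack(theta):
    """Split a 41-dimensional parameter vector into the 2-10-1 network weights."""
    W1 = theta[:20].reshape(10, 2)   # hidden-layer weights
    b1 = theta[20:30]                # hidden-layer biases
    w2 = theta[30:40]                # output weights
    b2 = theta[40]                   # output bias
    return W1, b1, w2, b2

def nn_cost(theta, X, y):
    """Sum-of-squares training cost of the network for parameters theta.

    X : (N, 2) matrix of inputs, y : (N,) vector of targets.
    """
    W1, b1, w2, b2 = unpack(theta)
    H = np.tanh(X @ W1.T + b1)       # hidden activations, shape (N, 10)
    y_hat = H @ w2 + b2              # network outputs, shape (N,)
    return np.sum((y - y_hat) ** 2)

# Any of the optimizers discussed here can then maximize -nn_cost(theta, X, y)
# over theta in R^41.
```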
IV. Results and Discussion

Now we will present and discuss the experimental results for the validation problems presented in Sections III-A, III-B, and III-C. Additionally, we compare the performance of the proposed BAS-ADAM algorithm with the PSO and original BAS algorithms. For the PSO algorithm, we used the implementation provided by MATLAB's global optimization toolbox [81], whereas for BAS-ADAM and the original BAS we wrote the code in MATLAB. For the Michalewicz function, we used the two-dimensional case in our experimentation.

A. Solution: Michalewicz Function

From [79] it is known that this function has a global minimum at approximately $x = (2.20, 1.57)$, where the minimum value of the function is −1.8013. The performance of BAS-ADAM, along with PSO and the original BAS, is shown in Fig. 4. Fig. 4(a) shows the decrease in the residual error with increasing iterations. The plot of the error residual clearly shows that BAS-ADAM achieves the best convergence performance among the three optimization algorithms. The optimal point and minimum value reached by BAS-ADAM come quite close to those given in the literature: BAS-ADAM is able to reach the theoretical minimum value of −1.8013, whereas for the original BAS the final value is −1.7669. For PSO, the final value is −1.4581, and the final point is a local optimum, very far from the known global optimum. Similarly, Fig. 4(b) shows the convergence trajectories of BAS-ADAM and the original BAS. This plot shows that BAS-ADAM reaches the optimum point quite smoothly, whereas the convergence trajectory of the original BAS shows many oscillations; this demonstrates the robustness of the BAS-ADAM optimizer. The convergence trajectory for PSO is not included, for the purpose of visualization, since its final point lies far off the graph.

Fig. 4. Performance of the optimization algorithms for the Michalewicz function minimization given in Section III-A.

B. Solution: Linear Regression

The second validation problem involves solving the linear regression optimization problem given in (15). It is a convex optimization problem with a single global optimum and no local optima. We tuned the parameters for BAS and BAS-ADAM until we reached the best performance. The experimental results are shown in Fig. 5. The plot in Fig. 5(a) shows the decrease in the error residual with iterations. The plot shows that BAS-ADAM has the fastest rate of convergence, followed by PSO and the original BAS. However, the final value reached by the original BAS is much higher than the values reached by BAS-ADAM and PSO. BAS-ADAM took about 33 iterations to reach the neighborhood of the optimum point, while PSO took around 43 iterations. These results again demonstrate the faster rate of convergence and superior performance of the BAS-ADAM algorithm. Using the final parameters estimated by BAS-ADAM, we can plot a linear fit through the point cloud of Fig. 3; the fitted plane is shown in Fig. 5(b).

Fig. 5. Performance of the optimization algorithms for the linear regression validation problem given in Section III-B.
C. Solution: Neural Network

Among the three validation problems, the training of a hidden-layer neural network is the most challenging one. As mentioned in Section III-C, we used a neural network with ten hidden neurons to model the point cloud in Fig. 3. The topology of the network includes two inputs, one fully connected hidden layer with ten neurons, and one output, for a total of 41 trainable parameters. Fig. 6 shows the experimental results for training the neural network using the three optimization algorithms. The plot of error residuals is shown in Fig. 6(a). According to the plot, BAS-ADAM shows the best performance, followed by BAS and PSO. BAS-ADAM is able to reach an objective-function value of around 6.5, whereas the original BAS and PSO get stuck in local minima and reach minimum values of around 16 and 80, respectively. It is also worth noting that the linear regression and neural network problems model the same point cloud, using a linear model and a neural network, respectively. The reason for the poor performance of PSO and the original BAS here is the high dimensionality of the objective function: the huge search space decreases the efficiency of PSO, the original BAS, and other similar metaheuristic optimization algorithms. Fig. 6(b) shows the nonlinear surface fitted to the point cloud of Fig. 3, drawn using the final parameters estimated by the BAS-ADAM algorithm.

Fig. 6. Performance of the optimization algorithms for the neural network validation problem given in Section III-C.

V. Conclusion and Future Work

In this paper, we presented a robust nature-inspired metaheuristic algorithm called BAS-ADAM. The proposed algorithm is inspired by the BAS algorithm and improves its performance on objective functions having very steep valleys. We achieve this by using the adaptive moment estimation (ADAM) technique widely applied in gradient-based optimization algorithms. The proposed algorithm estimates a gradient direction at each iteration using the antennae fibers of the beetle. The estimated gradient is then used as input to the ADAM algorithm, which computes its first and second moments. These moments are then used to adjust the step size for the next iterations. This technique offers a significant advantage over the original BAS because it automatically adapts the step size for each dimension according to the shape of the objective function. Therefore, BAS-ADAM shows much smoother convergence behavior near the optimum points, as shown in the experiments. For the original BAS, the step size is calculated using an exponential-decay rule, independent of the shape of the objective function. It is demonstrated through extensive experiments that the proposed algorithm offers fast and robust convergence compared with other metaheuristic optimization algorithms. We selected three benchmark optimization problems to test the performance of the proposed algorithm, and we repeated the same experiments with the original BAS and PSO optimizers to present comparative results. The experimental results show that, on average, BAS-ADAM takes fewer iterations and is able to reach better optimum values than the original BAS and PSO. Additionally, we trained a hidden-layer neural network, which shows the potential of the proposed algorithm in real-time machine learning applications.

Potential future research directions include analyzing the performance of other step-size update rules, e.g., momentum, AdaGrad, and AdaDelta, in comparison with ADAM. Additionally, extending the concept of gradient estimation to other metaheuristic optimization algorithms, such as PSO, differential evolution, and various swarm-based algorithms, will also constitute an important research direction. Similarly, formulating a strategy to improve the robustness of the gradient estimation while using a small number of particles will also be an interesting topic to explore.