ELEKTROTEHNIŠKI VESTNIK 79(5): 231–236, 2012 ENGLISH EDITION

Meta-Optimisation on a High-Performance Computing System

Árpád Bűrmen, Tadej Tuma, Iztok Fajfar
Univerza v Ljubljani, Fakulteta za elektrotehniko, Tržaška cesta 25, 1000 Ljubljana, Slovenija
E-mail: arpad.buermen@fe.uni-lj.si

Received November 19, 2012; accepted January 9, 2013

Abstract. All optimisation algorithms have parameters that affect their performance and reliability. Usually the optimal values of these parameters are problem-dependent. Regardless of this fact, it is common practice to use the default values that are provided with the optimisation algorithm. Finding the optimal values of these parameters is a computationally expensive optimisation problem also known as meta-optimisation. The computational complexity comes from the fact that every cost-function evaluation in meta-optimisation involves several runs of an optimisation algorithm that evaluate its behavior for the given values of the algorithm parameters. The most common approach to making meta-optimisation feasible is the use of parallel computing. The paper presents the construction of the cost function for meta-optimisation of direct search optimisation algorithms. We demonstrate the approach by optimising the parameters of the Nelder-Mead simplex algorithm using a high-performance computing system comprising 100 processing units. The results of the meta-optimisation are surprising: the obtained parameter values differ greatly from the values published 50 years ago, which are still in use despite their suboptimality.

Keywords: meta-optimisation, global optimisation, simplex algorithm

1 INTRODUCTION

Optimisation is the search for the minimum of a real-valued function of $n$ variables. The function subject to optimisation is also referred to as the cost function (CF). The performance of an optimisation algorithm can be expressed with the number of CF evaluations ($N$) and the quality of the final result (CF value) returned by the optimisation algorithm. A better optimisation algorithm obtains a lower final CF value with fewer CF evaluations. Its performance directly depends on a set of real-valued algorithmic parameters, also referred to as the optimisation algorithm's strategy.

The optimal strategy generally depends on the optimisation problem that is being solved. It is commonplace to find the suggested values of the (default) strategy published together with the algorithm. These values are often obtained with limited numerical trials or some (overly) simplified analysis. Despite this, the published strategies are rarely challenged by optimisation practitioners. Such a careless choice of strategy is often the reason why optimisation algorithms are deemed inefficient.

Since the optimal strategy depends on the nature of the optimisation problem, the only option is to find it before the optimisation itself. The problem of finding the optimal strategy is also referred to as the meta-optimisation problem, while the process of solving it is called meta-optimisation. The basic (optimisation) algorithm is the one for which the optimal strategy is being sought. Meta-optimisation also involves a CF that expresses the quality of a strategy with a real number. It maps vectors that represent strategies to real numbers that represent the basic algorithm's performance. Lower values of the CF correspond to better strategies.
The calculation of the CF value involves several runs of the basic algorithm that capture its performance on a family of optimisation problems. This makes meta-optimisation a computationally intensive task where even a single CF evaluation can take hours. The CF in meta-optimisation is often discontinuous, multimodal, and littered with numerical noise. This usually leaves no other choice but to use a global optimisation algorithm for finding its minimum. The main disadvantage of these algorithms is the large number of CF evaluations needed for finding the solution of a meta-optimisation problem (10000 and more). The run times can be significantly shortened by the use of parallel computing, which is a viable option because many global optimisation algorithms can be efficiently parallelised.

In this paper we focus on meta-optimisation of local optimisation algorithms. These algorithms search for a local minimum of the CF. Usually, they require a significantly smaller number of CF evaluations when compared to global optimisation algorithms. Although a local optimum does not represent the best possible solution of an optimisation problem, local optimisation algorithms are still often used in engineering practice. In many engineering optimisation problems the evaluation of the CF is computationally expensive while the computing power is limited. In such cases local optimisation is the only viable choice. A local optimisation algorithm is also adequate if finding a decrease in the CF value is sufficient for deeming the run successful.

2 COST FUNCTION IN META-OPTIMISATION

Before any meta-optimisation, one has to choose the family of optimisation problems for which the optimal basic algorithm strategy is being sought. Real-world optimisation problems are often computationally too expensive and cannot be used for constructing the CF for meta-optimisation. A much better choice are mathematical test functions that are designed for the purpose of evaluating the basic optimisation algorithm. Two sets of test functions are commonly used: the Moré-Garbow-Hillstrom set [1] and the CUTEr set [2]. The former comprises 35 unconstrained optimisation problems, while the latter offers over 1000 unconstrained and constrained problems. A subset of these problems $\{f_1, f_2, \ldots, f_m\}$ is used for constructing the CF in meta-optimisation. Local optimisation algorithms usually require an initial point. We denote the corresponding initial points with $x^0_1, x^0_2, \ldots, x^0_m$.

The CF in meta-optimisation measures the performance of a strategy with respect to the performance of some reference strategy (or algorithm). In this sense, a possible choice is the performance of the basic algorithm using the default strategy. We denote the (reference) numbers of CF evaluations with $N'_1, N'_2, \ldots, N'_m$. The quality of the final result returned by a local optimisation algorithm can be expressed with the norm of the corresponding CF gradient. Lower values of this norm correspond to a better approximation of a local minimum. We denote the (reference) final results and the CF gradients with $x'_1, x'_2, \ldots, x'_m$ and $g'_1, g'_2, \ldots, g'_m$, respectively. Denote the final results and the corresponding CF gradients obtained by the basic algorithm with $x_1, x_2, \ldots, x_m$ and $g_1, g_2, \ldots, g_m$, respectively. Let $N_1, N_2, \ldots, N_m$ be the numbers of CF evaluations in the basic algorithm. A sketch of how the reference data can be collected is given below.
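For illustration, the following minimal Python/NumPy sketch shows how the reference values $N'_i$, $x'_i$ and $\|g'_i\|$ could be gathered by running the basic algorithm with its default strategy on every test problem. The `basic_algorithm` interface and the problem triples are assumptions made for this sketch, not part of the original paper.

```python
import numpy as np

def collect_reference_data(problems, basic_algorithm, default_strategy):
    """Run the basic algorithm with the default strategy on every test
    problem and record the reference data (N'_i, x'_i, ||g'_i||).

    problems is assumed to be a list of (f, grad_f, x0) triples, and
    basic_algorithm a callable returning the final point and the number
    of CF evaluations; both interfaces are hypothetical.
    """
    refs = []
    for f, grad_f, x0 in problems:
        x_final, n_evals = basic_algorithm(f, x0, default_strategy)
        refs.append((n_evals, x_final, np.linalg.norm(grad_f(x_final))))
    return refs
```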
It can be expected that many strategies tried during the course of meta-optimisation result in an excessive number of CF evaluations. This results in prolonged meta-optimisation runs. Such situations can be avoided if one places an upper bound ($N^{\max}_i$) on the number of evaluations of $f_i$. A possible choice would be

$$N^{\max}_i = \max(5 N'_i, 10000). \quad (1)$$

The goal of meta-optimisation is to find a strategy for which the basic algorithm's performance exceeds that obtained with the default strategy. This requirement represents a constraint in meta-optimisation and can be handled with a CF that comprises several penalty contributions ($C_i$) [3]. If some strategy behaves worse than the default strategy, the corresponding penalty contribution is positive. On the other hand, if a strategy outperforms the default strategy, a small negative penalty contribution rewards its behavior. In case a strategy exhibits the same performance as the default strategy, the penalty contribution is zero. The CF in meta-optimisation can be expressed as a sum of $m$ penalty contributions representing the performance on the individual test problems:

$$C = \sum_{i=1}^{m} C_i. \quad (2)$$

$C = 0$ corresponds to a strategy that exhibits the same performance on the $m$ test problems as the default strategy. The contribution of the $i$-th test problem to the CF in meta-optimisation can be expressed as

$$C_i = \begin{cases} (N_i/N'_i - 1) \cdot 10; & N_i > N'_i \\ (N_i/N'_i - 1)/10; & N_i \le N'_i \end{cases} \quad (3)$$

$$\;\; + \begin{cases} \log_{10}(\|g_i\|/\|g'_i\|) \cdot 10; & \|g_i\| > \|g'_i\| \\ \log_{10}(\|g_i\|/\|g'_i\|)/10; & \|g_i\| \le \|g'_i\| \end{cases} \quad (4)$$

One can use an arbitrary constant $P > 1$ instead of the fixed value (10) in the above equations. The absolute ratio between a penalty ($C_i > 0$) and a reward ($C_i < 0$) attributed to a deterioration/improvement of the same magnitude with respect to the default strategy's performance is $P^2$. One CF evaluation in meta-optimisation involves $m$ runs of the basic optimisation algorithm.

3 COMPUTER HARDWARE

Meta-optimisation is a computationally intensive procedure. To make it viable one has to take advantage of parallel processing. Sufficient computing power can be obtained by utilizing contemporary high-performance computing (HPC) systems. Such systems comprise a large number of processing units. The role of a processing unit can be taken by an ordinary desktop computer. Nowadays, the price of a typical desktop computer with a quad-core CPU is around €600 (or €150 per CPU core). Such computers usually come with a built-in network adapter (most commonly 1 Gb Ethernet). By adding a network switch (its cost is usually lower than the price of a single desktop computer), one gets a fully functional HPC system. The price of a system comprising 100 CPU cores is around €15000.

Figure 1. Block diagram of a HPC system based on desktop computers and gigabit Ethernet.

The interconnection between the processing units is the main bottleneck. Despite its speed (1 Gbit/s), the latency is quite high (around 20 μs). Part of the latency is inherent to Ethernet, while the rest (around 10 μs, [4]) is caused by inefficient drivers. This is quite large when compared to the latency of the HPC systems using InfiniBand (e.g. [5]), where it can be as low as 1 μs. The throughput is also much higher in InfiniBand systems (2 Gb/s for a basic SDR link). The cost, however, is much greater if InfiniBand links are used for interconnecting the processing units.
Regardless of the lower throughput and higher latency, an HPC system with an Ethernet interconnection is capable of running parallel optimisation algorithms. Let $R$ denote the throughput of the interconnection and $\tau$ its latency. The interconnection does not become the system's bottleneck if two conditions are satisfied. First, the time it takes for a processing unit to complete a task ($T$) must be significantly longer than the latency. Second, the total amount of data $D$ associated with a task (data sent to a processing unit describing the task and data sent back with the results of the task) must be small enough so that it can be transferred in a fraction of the time it takes for the task to complete. Both conditions can be formulated mathematically as

$$T \gg \tau, \quad (5)$$
$$D/R \ll T. \quad (6)$$

Meta-optimisation, along with most engineering optimisation problems, satisfies conditions (5)-(6). Condition (5) is satisfied because $T$ is of the order of 1 s or greater, while $\tau$ is of the order of 10 μs (even for "slow" gigabit Ethernet links). Every task (CF evaluation) can be specified with $n'$ real numbers, where $n'$ is the dimension of the meta-optimisation problem, i.e. the number of the basic algorithm's parameters. A task produces a result which is a CF value and can be represented as a single real number. Assuming that 64-bit real numbers are used, the amount of data transferred to and from every task is $64 \cdot (n'+1)$ bits. This gives us $D/R = (n'+1) \cdot 64\,\mathrm{ns}$. If completing one task takes $T = 1\,\mathrm{ms}$, condition (6) is satisfied for all $n' < 15624$.

4 OPTIMISATION SOFTWARE

Parallel optimisation algorithms on HPC systems consist of multiple programs (tasks) running in parallel. Every program runs on a single processor and solves a part of the problem. The programs communicate among themselves and coordinate their work. The communication is usually implemented with messages. When a task receives a message, it processes the message by performing a (lengthy) computation, upon which it sends back a message with the results. Implementing parallel algorithms can be significantly simplified if one uses a library that wraps the process of sending and receiving messages into a simple application programming interface (API). This hides the details of the communication protocols. Currently, two solutions are available: PVM [6] and MPI [7]. PVM is a somewhat older library, while MPI is only an API specification. Several implementations of the MPI specification are available.

Development of optimisation algorithms requires an environment that 1) simplifies the implementation of complex mathematical concepts and 2) speeds up debugging of the implemented code. The first requirement is satisfied by several programming languages in combination with appropriate mathematical libraries. The second requirement is usually fulfilled by various scripting languages. We chose Python [8] in combination with the NumPy/SciPy [9] mathematical libraries for implementing all the described algorithms due to its simplicity, generality, and the large number of available extensions.

Meta-optimisation can be parallelised using two approaches. The first approach distributes the evaluation of the basic algorithm's performance for one candidate strategy among multiple processors. Every processor runs the basic algorithm on one of the $m$ test problems. When the $m$ optimisation runs are complete, the corresponding value of the CF is computed using (2), as sketched below.
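As an illustration of this first approach, the following hedged sketch combines the penalty contribution from (3)-(4) with a synchronous master-worker distribution using mpi4py (an MPI binding for Python, in line with the paper's choice of Python/NumPy). The helper `run_basic_algorithm` and the stored reference arrays `N_REF` and `G_REF` are hypothetical stand-ins, not the authors' code.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # rank 0 is the master, ranks 1..m are workers

P = 10.0  # constant from (3)-(4); any P > 1 may be used

def penalty_contribution(N, N_ref, g_norm, g_norm_ref):
    """C_i from (3)-(4): deteriorations are penalised P times more
    strongly than improvements of the same magnitude are rewarded."""
    c = (N / N_ref - 1.0) * (P if N > N_ref else 1.0 / P)
    r = np.log10(g_norm / g_norm_ref)
    c += r * (P if g_norm > g_norm_ref else 1.0 / P)
    return c

def evaluate_strategy(strategy=None):
    """One meta-optimisation CF evaluation: m parallel runs of the
    basic algorithm, one per worker, summed according to (2)."""
    strategy = comm.bcast(strategy, root=0)  # master sends the strategy
    if rank == 0:
        Ci = None                            # master runs no test problem
    else:
        # run_basic_algorithm is a hypothetical helper that returns N_i
        # and ||g_i|| for test problem f_rank under the given strategy.
        N, g_norm = run_basic_algorithm(problem=rank, strategy=strategy)
        Ci = penalty_contribution(N, N_REF[rank], g_norm, G_REF[rank])
    parts = comm.gather(Ci, root=0)          # synchronisation point
    return sum(c for c in parts if c is not None) if rank == 0 else None
```

The `comm.gather` call is the synchronisation point mentioned in the text: the master cannot compute $C$ before the slowest of the $m$ runs has finished.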
Once the CF value is available, the meta-optimisation algorithm determines the next candidate strategy and distributes its evaluation among the parallel processors. This way a speedup factor of up to $m$ can be achieved. Unfortunately, the speedup is limited by the fact that not all of the $m$ optimisation runs finish at the same time. Some processors therefore remain idle until all $m$ processors finish the evaluation (synchronisation penalty). As a consequence, the actual speedup factor is smaller than $m$.

To avoid the synchronisation penalty, one can use an asynchronous optimisation algorithm for the meta-optimisation. Such algorithms keep all processors busy all the time. The evaluation of the meta-optimisation CF is not distributed among processors. This brings an additional advantage: it greatly increases $T$, thus making conditions (5) and (6) satisfied even for HPC systems with slow interconnections. An example of such an algorithm can be found in [10].

5 EXAMPLE

The Nelder-Mead simplex algorithm [11] searches for a local minimum of a function of $n$ variables by moving $n+1$ points (also referred to as a simplex). We assume that the points are ordered so that the largest and the smallest function value correspond to $x_{n+1}$ (the worst point) and $x_1$ (the best point), respectively. Most of the time the simplex moves around by replacing the worst point with one of the points lying on a ray with its origin at $x_{n+1}$ that runs through the centroid $\overline{x}$ of the $n$ best points ($\overline{x} = (x_1 + \ldots + x_n)/n$). The value $\gamma \ge -1$ determines a point $x$ on this ray according to

$$x = \overline{x} + \gamma(\overline{x} - x_{n+1}). \quad (7)$$

Figure 2. Reflection, expansion, outer contraction, and inner contraction steps in the Nelder-Mead simplex algorithm.

The values of $\gamma$ denoted by $\gamma_r$, $\gamma_e$, $\gamma_{oc}$, and $\gamma_{ic}$ correspond to the points ($x_r$, $x_e$, $x_{oc}$, and $x_{ic}$) that represent the possible replacements for the worst point (see Fig. 2). The corresponding steps are also referred to as reflection, expansion, outer contraction, and inner contraction. If none of these steps improves the worst function value (at $x_{n+1}$), the simplex is shrunk towards its best point ($x_1$) according to

$$x_i \to x_1 + \gamma_s (x_i - x_1), \quad i = 2, \ldots, n+1. \quad (8)$$

$\gamma_s$ denotes the shrink factor. The values of $\gamma$ that correspond to the above-mentioned steps must satisfy $0 < \gamma_r < \gamma_e$ and $0 < \gamma_{oc}, -\gamma_{ic}, \gamma_s < 1$. The recommended values of these parameters found in [11] represent the default strategy

$$[\gamma_r, \gamma_e, \gamma_{oc}, \gamma_{ic}, \gamma_s] = [1, 2, 1/2, -1/2, 1/2]. \quad (9)$$

The algorithm can be described with the following steps (a NumPy sketch of the corresponding simplex moves is given after the listing).

1) Order the points so that they satisfy $f(x_1) \le f(x_2) \le \ldots \le f(x_{n+1})$.
2) Evaluate $f_r = f(x_r)$. If $f_r$
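A minimal NumPy sketch of the simplex moves described by (7)-(9) follows. It only computes the candidate points and the shrink step; the acceptance logic of the step listing above is not reproduced here. The function names and the array-based simplex representation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Default strategy (9): [gamma_r, gamma_e, gamma_oc, gamma_ic, gamma_s]
GAMMA_R, GAMMA_E, GAMMA_OC, GAMMA_IC, GAMMA_S = 1.0, 2.0, 0.5, -0.5, 0.5

def candidate_point(simplex, gamma):
    """Point on the ray (7); simplex is an (n+1) x n array whose rows
    are ordered by function value so that the last row is the worst
    point x_{n+1}."""
    centroid = simplex[:-1].mean(axis=0)  # centroid of the n best points
    return centroid + gamma * (centroid - simplex[-1])

def shrink(simplex):
    """Shrink step (8): move all points towards the best point x_1
    (broadcasting leaves the best point itself unchanged)."""
    return simplex[0] + GAMMA_S * (simplex - simplex[0])

# Example: the four candidate points for a simplex in n = 2 dimensions.
S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # ordered, worst last
x_r = candidate_point(S, GAMMA_R)    # reflection
x_e = candidate_point(S, GAMMA_E)    # expansion
x_oc = candidate_point(S, GAMMA_OC)  # outer contraction
x_ic = candidate_point(S, GAMMA_IC)  # inner contraction
```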