Informatica 32 (2008) 307-317

A Distributed Multilevel Ant Colonies Approach

Katerina Taškova, Peter Korošec and Jurij Šilc
Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
E-mail: katerina.taskova@ijs.si

Keywords: ant-colony optimization, distributed computing, mesh partitioning, multilevel approach

Received: May 20, 2008

The paper presents distributed implementations of an ant-colony optimization metaheuristic for the solution of the mesh-partitioning problem. The usefulness and efficiency of the algorithm, in its sequential form, for solving this particular optimization problem have already been shown in previous work. In this paper, a straightforward implementation on a distributed architecture is presented and the main algorithmic issues that had to be addressed are discussed. The algorithms are evaluated on a set of well-known graph-partitioning problems from the Graph Collection Web page.

Povzetek: The paper presents a distributed implementation of ant-colony metaheuristic optimization, applied to solving the mesh-partitioning problem.

1 Introduction

Real engineering problems have complex dynamic behavior, and one of the widely accepted formalisms for modeling them is partial differential equations (PDEs). The fraction of PDEs that have solutions in a closed analytical form is quite small, and in general their solution relies on numerical approximations. The finite-element method is a well-known numerical method that efficiently solves complex PDE problems. In order to find an approximation of an unknown solution function f(x), this method discretizes the underlying domain into a set of geometrical elements consisting of nodes. This process is known as meshing. The value of the function f(x) is then computed for each of these nodes, and the solutions for the other data points are interpolated from these values [4].

Generated mesh structures can have a large number of elements; therefore, a common approach involves a mesh-partitioning step, so that the finite-element method can be run on multiple parallel processors. Consequently, the mesh-partitioning task aims to achieve minimal inter-processor communication and, at the same time, to maintain a balanced processor workload.

The mesh-partitioning problem is a combinatorial optimization problem. Namely, it is a special case of the well-known graph-partitioning problem, which is NP-hard and is defined as follows: If G(V, E) denotes an undirected graph consisting of a non-empty set V of vertices and a set E ⊆ V × V of edges, then a k-partition D of G comprises k mutually disjoint subsets D1, D2, ..., Dk (domains) of V whose union is V. The set of edges that connect the different domains of a partition D is called an edge-cut. A partition D is balanced if the sizes of the domains are roughly the same, i.e., if the balance b(D) = max_i|Di| - min_i|Di|, the difference in size between the largest and the smallest domain, is small.

Figure 2: Multilevel framework

All nonempty buckets, implemented as doubly-linked lists of vertices, are organized in a 2-3 tree. Additionally, MACA keeps a separate 2-3 tree for each colony on every grid cell that has vertices, in order to gain even faster searches.
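The multilevel framework that Figure 2 refers to proceeds in three phases (also named in Figure 3 below): the graph is coarsened level by level, the coarsest graph is partitioned into k domains, and the partition is then expanded and refined back through the levels. A minimal sketch of this generic scheme, not the paper's actual pseudocode, is given below in Python; coarsen, initial_partition, expand and refine are hypothetical placeholder procedures supplied by the caller.

# Generic multilevel k-way partitioning skeleton (illustrative sketch only).
def multilevel_partition(graph, k, num_levels, coarsen, initial_partition, expand, refine):
    # Coarsening phase: build a hierarchy of successively smaller graphs.
    levels = [graph]
    for _ in range(num_levels - 1):
        levels.append(coarsen(levels[-1]))
    # Initial partitioning of the coarsest graph into k domains.
    partition = initial_partition(levels[-1], k)
    # Uncoarsening phase: project the partition onto the next finer graph and refine it
    # (in MACA the per-level refinement is done by the ant colonies).
    for i in range(len(levels) - 1, 0, -1):
        partition = expand(partition, levels[i], levels[i - 1])
        partition = refine(partition, levels[i - 1])
    return partition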
3 Distributed multilevel ant-colony approaches

In general, ant-colony optimization algorithms can be parallelized on four different levels [5, 15, 16]: (i) parallelism on the colony level, (ii) parallelism on the ant level, (iii) data-level parallelization, and (iv) functional parallelization, each differing in granularity and in the communication overhead between processors. The first subsection briefly describes all four parallelization approaches, providing the basis for the introduction of the proposed Semi-Independent Distributed MACA and Interactive Distributed MACA approaches in the second and the third subsection, respectively.

3.1 Parallelization strategies

(i) Parallelism on the colony level is the simplest coarse-grained parallelization of ant-colony optimization algorithms, where the problem is instantiated and solved simultaneously on all available processors. If no communication is required between the processors (parallel independent algorithm runs, introduced by Stützle [16]), the approach is referred to as parallel independent ant colonies; it is suitable for algorithms that perform stochastic searches. Otherwise, if the colonies, while searching for food, exchange information at specified iterations (which requires synchronized communication and hence a master/slave implementation), the approach is referred to as parallel interactive ant colonies. The communication cost of the second variant can become very high due to the required broadcasting of entire pheromone structures.

(ii) Parallelism on the ant level is the first proposed parallel implementation [3] of an ant-colony optimization algorithm, where each ant (or group of ants) is assigned to a separate processor to build a solution. This implies maintaining a separate pheromone structure on every processor, so the approach requires a master processor that synchronizes the work of the remaining (slave) processors, including ant-processor scheduling, initialization, global pheromone updates and production of the final solution.

(iii) Data-level parallelization is suitable for multi-objective optimization problems, since it divides the main problem into a number of subproblems (objectives to optimize), each solved by a colony on a separate processor.

(iv) Functional parallelization introduces concurrent execution of specific operations (local search, solution construction, solution evaluation) performed by a single colony on a master/slave architecture. When local heuristic searches are computationally expensive, so-called parallel local search is the preferred variant: each slave processor refines the solutions received from the master with local-search heuristics, while the master is responsible for solution construction, pheromone updates and collection of the refined solutions. Parallel solution construction is a second variant that organizes the available slave processors into two operational groups: processors in the first group are responsible for solution construction, while processors in the second group are further grouped and scheduled to refine the solutions constructed by the first group. The last functional parallelization variant is called parallel evaluation of solution elements; it gives the best performance when solution evaluation is computationally expensive.
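Strategy (i) in its simplest, communication-free form can be sketched as follows (Python with the standard multiprocessing module; run_colony is a hypothetical stand-in for one independent MACA run, not the paper's Delphi implementation). Each worker performs an independent stochastic search and the master simply keeps the best partition found.

# Illustrative sketch of parallel independent ant colonies (strategy (i)).
import random
from multiprocessing import Pool

def run_colony(seed):
    # Hypothetical stand-in for one independent MACA run; returns (edge_cut, partition).
    rng = random.Random(seed)
    partition = [rng.randint(0, 1) for _ in range(100)]   # dummy 2-partition of 100 vertices
    edge_cut = rng.randint(10, 50)                         # dummy quality value
    return edge_cut, partition

if __name__ == "__main__":
    with Pool(processes=4) as pool:                        # one colony per processor
        results = pool.map(run_colony, range(4))           # no communication during the runs
    best_cut, best_partition = min(results, key=lambda r: r[0])
    print("best edge-cut:", best_cut)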
Compared to the aforementioned parallel strategies, parallel evaluation of solution elements is the only approach that does not exploit parallelism within the metaheuristic algorithm itself.

An efficient parallelization of a given algorithm depends mainly on the available computing platform, the underlying problem and the algorithm itself. A large communication overhead between the processors can degrade parallel performance. When the algorithm uses global structures, such as the pheromone matrix or, in the MACA case, the grid matrix of 2-3 trees, a shared-memory system requires less communication than a distributed-memory system. On the other hand, the most common and at the same time cheaper approach is parallelization on distributed-memory systems, i.e., MIMD architectures such as clusters of workstations. Our proposed MACA parallelization also assumes a distributed-memory system and is implemented on a cluster of workstations.

3.2 The semi-independent distributed MACA

The Semi-Independent Distributed MACA (SIDMACA) is basically a distributed MACA approach that allows the exchange of the best solution found so far at the end of every level of the multilevel optimization process. This exchange requires that the parallel executions of the MACA instances on the available processors be synchronized once per level. The master processor is responsible for synchronizing the work of all slave processors that execute a copy of MACA, by managing the information exchange and the communication process (sending commands and confirmations such as Start, Stop, Initialize, Goto New Level, Best Partition, etc.), while the slave processors execute the instances of the MACA code, signal when they have finished the current-level optimization and send their best partition to the master. When all slaves have finished the current level, the master determines the best solution and broadcasts it to the slaves. In order to proceed with the next-level optimization, the slave processors first have to update their local memory structures (grid matrix) and afterwards perform partition expansion (refinement).

Figure 3: The three phases of multilevel k-way graph partitioning.

3.3 The interactive distributed MACA

The Interactive Distributed MACA (ItDMACA) is based on the parallel interactive colony approach, which by definition implies a master/slave implementation and synchronized communication. The information exchange between the colonies across the concurrent processors is initiated every time a piece of food is picked up or dropped at a new position. The information about the specific piece of food, its new position and its owner is part of the message sent to and received from the master processor on every pick-up or drop. The master keeps and updates its own local grid matrix of current food positions (playing the role of shared memory) in order to keep the slaves' activities consistent. The master processor is responsible for the synchronized work and communication of the slave processors, which includes listening to, processing and broadcasting the incoming client messages during level optimization; a minimal sketch of this per-move exchange is given below.
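The sketch (Python; illustrative only, with hypothetical class and message names, not the paper's Delphi/Indy socket code) shows the exchange just described: a slave reports every pick-up or drop of a piece of food to the master, the master updates its grid matrix of current food positions and forwards the change to the other slaves, which update their local grids.

# Illustrative sketch of the ItDMACA per-move exchange (hypothetical message format).
from dataclasses import dataclass

@dataclass
class FoodMove:                  # message sent by a slave on every pick-up / drop
    food_id: int
    new_cell: tuple              # grid cell where the piece of food now lies
    owner: int                   # colony that performed the move

class Master:
    def __init__(self, slaves):
        self.grid = {}           # food_id -> (cell, owner): master's view of food positions
        self.slaves = slaves

    def on_move(self, move, sender):
        self.grid[move.food_id] = (move.new_cell, move.owner)   # update the shared view
        for s in self.slaves:                                    # broadcast to the other slaves
            if s is not sender:
                s.apply_remote_move(move)
        return True                                              # confirmation to the sender

class Slave:
    def __init__(self, cid):
        self.cid = cid
        self.local_grid = {}

    def attach(self, master):
        self.master = master

    def pick_or_drop(self, food_id, new_cell):
        move = FoodMove(food_id, new_cell, self.cid)
        if self.master.on_move(move, self):        # wait for the master's confirmation
            self.local_grid[food_id] = (new_cell, self.cid)

    def apply_remote_move(self, move):
        self.local_grid[move.food_id] = (move.new_cell, move.owner)

# Hypothetical usage with two slave colonies:
s1, s2 = Slave(1), Slave(2)
m = Master([s1, s2])
s1.attach(m); s2.attach(m)
s1.pick_or_drop(food_id=7, new_cell=(3, 5))        # s2's local grid is updated as well

In the real system each of these calls corresponds to a TCP/IP message and its confirmation (see Section 4.1).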
When all slave processors have finished a level or a run, the master collects the best per-level solutions, determines and broadcasts the global best-level solution to the slaves, and guides them when to start the refinement procedure and all the necessary updates for the next-level optimization or for a new run. A slave processor executes a single instance of the MACA code. During the optimization it informs the master and waits for the master's confirmation on every potential drop/pick, signals when it finishes the current-level optimization and sends its best partition to the master. In the meantime, while waiting to proceed to the next level, it listens for possible changes sent by the unfinished clients and performs the corresponding updates on its grid. When the master signals that the current level is finished, by sending the new best solution found so far, the slave processor performs partition expansion (refinement) in order to start the next-level optimization.

4 Experimental evaluation

The proposed distributed versions of MACA were applied to a set of well-known graph problems, and the results of their experimental evaluation are presented and discussed in this section. The section is structured in two subsections: the first describes the implementation of the distributed code, the experimental setting and the benchmark suite, whereas the second presents and discusses the evaluation results.

4.1 Setup

Based on the MACA sequential code, both proposed distributed versions, SIDMACA and ItDMACA, were implemented in Borland® Delphi™, using the TCP/IP protocol for the server/client communication, based on the open-source library Indy Sockets 10 (which supports clients and servers of TCP, UDP and raw sockets as well as over 100 higher-level protocols). All experiments were performed on an 8-node cluster connected via a Gigabit switch, where each node consists of two AMD Opteron™ 1.8-GHz processors, 2 GB of RAM, and the Microsoft® Windows® XP operating system.

The benchmark graphs used in the experimental analysis were taken from the Graph Collection Web page [8] and are described in Table 1.

Graph G(V, E)    |V|      |E|
grid1             252      476
grid2            3296     6432
crack           10240    30380
U1000.05         1000     2394
U1000.10         1000     4696
U1000.20         1000     9339

Table 1: Benchmark graphs

Parameters           Number of processors
                       1     2     4     6     8
ants/colony          120    60    30    20    15
iterations/level     600   600   600   600   600
runs                  20    20    20    20    20

Table 2: Distribution of ants per colony and number of iterations per level w.r.t. the number of processors

The total number of ants was 120. As presented in Table 2, the number of ants per sub-colony depends on the number p of processors, i.e., each sub-colony gets 1/p of the total number of ants, while the total number of iterations per level per colony is constant. All experiments were run 20 times on each graph with each algorithm, and the final results are reported as the mean values (and, for the edge-cut, also the best and worst values) of the considered evaluation criteria over all runs.

4.2 Results

The results presented in the following tables show the performance of the introduced distributed MACA approaches on the 2-partitioning and 4-partitioning graph problems. The quality of the partitioned graph is described by the edge-cut, Z(D), and the balance, b(D). The balance is defined as the difference (in the number of vertices) between the largest and the smallest domain.
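Both quality measures are simple to compute; the following sketch (Python; the small graph and partition are hypothetical and not taken from the benchmark set) illustrates the edge-cut Z(D) and the balance b(D) on a toy example.

# Edge-cut and balance of a k-partition (illustrative sketch).
def edge_cut(edges, domain_of):
    # Z(D): number of edges whose endpoints lie in different domains.
    return sum(1 for u, v in edges if domain_of[u] != domain_of[v])

def balance(domain_of, k):
    # b(D): difference (in number of vertices) between the largest and the smallest domain.
    sizes = [0] * k
    for d in domain_of.values():
        sizes[d] += 1
    return max(sizes) - min(sizes)

# Hypothetical 6-vertex graph and a 2-partition.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (1, 4)]
domain_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(edge_cut(edges, domain_of))   # 3 edges cross the cut: (2,3), (5,0), (1,4)
print(balance(domain_of, 2))        # 0, i.e. a perfectly balanced 2-partition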
Besides the quality, the second evaluation criterion is the effectiveness of the parallel algorithm, which in our case is given by the speed-up measure S, defined as

    S(p) = ts / tT(p),

and by the relative speed-up measure Sr, defined as

    Sr(p) = tT(1) / tT(p),

where ts is the time to solve a problem with the sequential code, tT(1) is the time to solve the problem with the parallel code on one processor, and tT(p) is the time to solve the same problem with the parallel code on p processors. Note that S(p) and Sr(p) were calculated from the average time values over the 20 runs.

In theory, the correct speed-up metric should be calculated with respect to the performance (elapsed computational time) of the best serial code for the underlying algorithm, as defined above and denoted by S(p); in practice this usually reduces to calculating the relative speed-up metric Sr(p), since the best serial code is not available and writing two codes is not acceptable. In our case the serial code is available, and the values of both speed-up metrics are included in the result tables. Additionally, for comparison, the tables give the measured CPU time for the computation of the obtained solutions, tT, which comprises the time spent on pure computation, the time for communication with the master processor, tc, and the time for internal updates caused by the synchronization, tu. Note that tc and tu are the parts of tT spent on communication and updates, respectively.

The results in Table 3 and Table 4 summarize the performance of SIDMACA for solving the 2-partitioning and 4-partitioning graph problems, respectively, on the given graph set. The general observation is that the parallel performance of the system w.r.t. speed-up over the serial MACA is poor compared to the theoretically expected speed-up of p when p processors are used: over all considered graphs and parallel scenarios (p = 2, 4, 6, 8), the maximal speed-up is 2.72 (graph U1000.05, p = 8) for the 2-partitioning problem and 2.29 (graph crack, p = 8) for the 4-partitioning problem. When more than 2 processors are employed, S > 1 (except for graph U1000.10, p = 4, k = 4), while the 2-processor parallelization shows a slow-down of up to 27% in the case of 4-partitioning of graph grid2.

On the other hand, the SIDMACA results show overall comparable or improved quality of the obtained solutions. The best solutions found in the case of 2-partitioning are equal to or better than the best solutions produced by the serial code (except for graph U1000.10, p = 8, and crack, p = 6). Moreover, the worst solutions found by SIDMACA are better than those found by MACA on the U1000 graph set and the crack graph. For the 4-partitioning problem, best-found solutions better than the best ones from the serial code are observed for the graphs grid2, U1000.05 and U1000.10. The remark about the better quality of the worst-case solutions is confirmed for graphs U1000.10 and U1000.20, and partially for graphs grid2, U1000.05 and crack.

Correspondingly, Table 5 and Table 6 illustrate the ItDMACA performance on the same graph set for the 2- and 4-partitioning graph problems with 2, 4 and 8 processors employed in parallel. Note that for p = 8 results are available only for the graphs grid1, U1000.10 and U1000.20.
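As a check of the two measures defined above, the short sketch below (Python) recomputes S(p) and Sr(p) from the measured average times of the crack graph for the 2-partitioning problem, as reported in Table 3.

# Recomputing the speed-up measures from measured average times (crack, 2-partitioning).
t_s  = 64.91          # sequential MACA
t_T1 = 85.02          # parallel code on 1 processor
t_T8 = 32.04          # parallel code on 8 processors

S_8  = t_s  / t_T8    # speed-up over the sequential code
Sr_8 = t_T1 / t_T8    # relative speed-up over the 1-processor parallel run

print(round(S_8, 2), round(Sr_8, 2))   # 2.03 2.65, matching the last crack row of Table 3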
In Tables 3-6, p = 1* denotes the sequential code; best, mean and worst refer to the edge-cut Z(D); b(D), tT, tc, tu, S(p) and Sr(p) are mean values over the 20 runs; a dash marks a value that is not reported.

Graph      p    best  mean  worst  b(D)   tT [s]   tc [s]   S(p)   Sr(p)
grid1      1*    18    18    18     0      9.80    0        1.00    -
grid1      1     18    18    18     0     10.03    0.10      -     1.00
grid1      2     18    18    18     0     10.41    0.69     0.94   0.96
grid1      4     18    18    19     0     10.00    2.67     0.98   1.00
grid1      6     18    18    21     0      7.17    1.75     1.37   1.40
grid1      8     18    19    21     0      5.60    1.44     1.75   1.79
grid2      1*    35    44    68     0     20.37    0        1.00    -
grid2      1     34    41    68     0     23.81    0.21      -     1.00
grid2      2     35    40    69     0     23.28    1.58     0.88   1.02
grid2      4     35    40    69     0     15.86    2.99     1.28   1.50
grid2      6     35    41    70     0     11.73    2.51     1.74   2.03
grid2      8     35    49    70     0      9.31    2.22     2.19   2.56
U1000.05   1*     1     1     2     0     87.53    0        1.00    -
U1000.05   1      1     1     3     0     88.80    0.39      -     1.00
U1000.05   2      1     1     2     0     83.10    1.81     1.05   1.07
U1000.05   4      1     1     1     0     60.86    4.54     1.44   1.46
U1000.05   6      1     1     1     0     42.15    6.09     2.08   2.11
U1000.05   8      1     1     1     0     32.19    6.16     2.72   2.76
U1000.10   1*    50    62    78     1     14.49    0        1.00    -
U1000.10   1     39    62    73     1     15.03    0.17      -     1.00
U1000.10   2     40    59    76     1     14.97    1.16     0.97   1.00
U1000.10   4     40    59    71     1     11.88    2.57     1.22   1.27
U1000.10   6     50    61    72     1      8.72    1.95     1.66   1.72
U1000.10   8     57    61    72     1      7.12    1.76     2.04   2.11
U1000.20   1*   221   277   370     8     12.14    0        1.00    -
U1000.20   1    221   256   337     6     13.03    0.15      -     1.00
U1000.20   2    219   259   373     7     12.48    0.99     0.97   1.04
U1000.20   4    219   266   369     7     10.67    2.51     1.14   1.22
U1000.20   6    219   288   368    10      7.75    1.75     1.58   1.68
U1000.20   8    219   278   370     9      6.05    1.34     2.01   2.15
crack      1*   185   211   234     1     64.91    0        1.00    -
crack      1    184   204   277     1     85.02    0.29      -     1.00
crack      2    184   195   231     0     80.48    6.28     0.81   1.06
crack      4    185   203   246     0     52.25   10.40     1.24   1.63
crack      6    186   202   230     0     39.70    9.25     1.64   2.14
crack      8    185   203   225     0     32.04    8.45     2.03   2.65

* sequential code
Table 3: Experimental results: 2-partitioning problem with SIDMACA

Graph      p    best  mean  worst  b(D)   tT [s]   tc [s]   S(p)   Sr(p)
grid1      1*    38    39    41     1     18.01    0        1.00    -
grid1      1     38    39    42     1     19.67    0.08      -     1.00
grid1      2     38    39    41     0     19.38    0.67     0.93   1.01
grid1      4     38    39    41     0     18.12    2.72     0.99   1.09
grid1      6     38    39    41     0     13.50    1.69     1.33   1.46
grid1      8     38    39    41     0     10.42    1.14     1.73   1.89
grid2      1*    96   104   118     4     47.38    0        1.00    -
grid2      1     95   102   111     3     63.08    0.23      -     1.00
grid2      2     94   106   116     3     65.21    4.78     0.73   0.97
grid2      4     92   105   123     2     47.96   10.14     0.99   1.32
grid2      6     93   106   116     2     35.35    8.18     1.34   1.78
grid2      8     93   103   115     2     28.16    6.89     1.68   2.24
U1000.05   1*     9    14    20     3     50.78    0        1.00    -
U1000.05   1      7    14    22     2     57.88    0.20      -     1.00
U1000.05   2      9    14    23     2     60.96    8.28     0.83   0.95
U1000.05   4      7    14    21     1     45.15   12.00     1.12   1.28
U1000.05   6      8    13    18     0     36.26    9.99     1.40   1.60
U1000.05   8      7    11    17     0     36.03   11.15     1.41   1.61
U1000.10   1*    95   114   166     3     35.17    0        1.00    -
U1000.10   1    102   113   133     3     43.09    0.12      -     1.00
U1000.10   2     98   110   133     2     40.76    2.89     0.86   1.06
U1000.10   4     92   112   163     2     39.03   10.12     0.90   1.10
U1000.10   6    101   113   162     2     27.14    6.64     1.30   1.59
U1000.10   8     91   115   161     3     19.96    4.64     1.76   2.16
U1000.20   1*   485   580   856     6     32.27    0        1.00    -
U1000.20   1    479   592   838     6     36.77    0.13      -     1.00
U1000.20   2    485   586   817     6     36.19    1.85     0.89   1.02
U1000.20   4    490   593   687     5     32.29    7.85     1.00   1.14
U1000.20   6    490   632   730     6     22.65    4.70     1.42   1.62
U1000.20   8    491   649   727     8     17.02    3.34     1.90   2.16
crack      1*   373   415   522    15    191.07    0        1.00    -
crack      1    374   425   496    14    259.03    0.27      -     1.00
crack      2    377   426   495    11    217.40   14.39     0.88   1.19
crack      4    373   423   506     8    139.52   25.92     1.34   1.86
crack      6    384   431   493     6    109.29   23.66     1.75   2.37
crack      8    378   429   526     6     83.35   18.40     2.29   3.11

* sequential code
Table 4: Experimental results: 4-partitioning problem with SIDMACA

Graph      p    best  mean  worst  b(D)   tT [s]   tc [s]   tu [s]   S(p)   Sr(p)
grid1      1*    18    18    18     0      9.80    0        0        1.00    -
grid1      1     18    18    18     0     44.79   34.45     0.11      -     1.00
grid1      2     18    18    18     0     37.61   17.87     1.54     0.26   1.19
grid1      4     18    18    18     0     18.86    8.26     3.01     0.52   2.38
grid1      8     18    18    18     0     11.83    5.01     3.00     0.83   3.79
grid2      1*    35    44    68     0     20.37    0        0        1.00    -
grid2      1     35    45    69     0    143.34  118.24     0.29      -     1.00
grid2      2     34    42    68     0     92.74   56.55     9.08     0.22   1.55
grid2      4     35    39    53     0     52.29   26.57    13.44     0.39   2.74
U1000.05   1*     1     1     2     0     87.53    0        0        1.00    -
U1000.05   1      1     1     2     0    463.82  373.45     0.32      -     1.00
U1000.05   2      1     1     2     0    295.38  190.25    41.64     0.30   1.57
U1000.05   4      1     1     2     0    182.04   90.89    61.65     0.48   2.55
U1000.10   1*    50    62    78     1     14.49    0        0        1.00    -
U1000.10   1     39    60    76     1     44.47   27.11     0.20      -     1.00
U1000.10   2     39    63    77     1     30.27   11.54     4.58     0.48   1.47
U1000.10   4     40    59    71     1     21.16    6.63     4.77     0.68   2.10
U1000.10   8     40    59    70     1     14.73    4.45     4.32     0.98   3.02
U1000.20   1*   221   277   370     8     12.14    0        0        1.00    -
U1000.20   1    219   268   373     7     24.41   11.23     0.15      -     1.00
U1000.20   2    219   272   371     8     21.83    5.43     2.58     0.56   1.12
U1000.20   4    219   255   368     7     16.48    2.77     3.92     0.74   1.48
U1000.20   8    235   262   308     5     10.65    1.73     3.14     1.14   2.29
crack      1*   185   211   234     1     64.91    0        0        1.00    -
crack      1    184   191   262     0    312.25  205.08     0.42      -     1.00
crack      2    184   189   211     0    223.60   95.82    57.03     0.29   1.40
crack      4    184   187   207     0    150.94   48.21    63.33     0.43   2.07

* sequential code
Table 5: Experimental results: 2-partitioning problem with ItDMACA

The results show no speed-up for the 2-processor and 4-processor parallelizations. A speed-up S > 1 is evident when 8 processors are applied, for the 4-partitioning problem on the tested graphs and for the 2-partitioning of graphs U1000.10 and U1000.20. The slow-down and the low speed-ups are due to the large amount of time spent on communication and memory updates (synchronization) during the level-optimization activities.

The performance of ItDMACA w.r.t. the quality of the obtained solutions confirms the observations from the SIDMACA results. More specifically, for the 2-partitioning problem, equal partition solutions in all runs are obtained for graphs grid1 and U1000.05, a significant improvement is evident for U1000.10, and slightly better solutions are obtained for the rest of the graphs. In general, comparable or improved solution quality is also observed when solving the 4-partitioning problem with ItDMACA. For p = 8, we gain speed-up and (i) a better solution for graph U1000.20, (ii) an equal best-found solution for graph grid1, and (iii) comparable solutions for graph U1000.10.

As expected, the relative speed-up Sr(p) results are better than the speed-up S(p) results. How big this difference is depends on the size of the problem and on the algorithm implementation. For SIDMACA the difference is not significant (except for graphs grid2 and crack), whereas for ItDMACA it is: in the case of the grid2 graph, ItDMACA yields an Sr(p) about 7 times higher than S(p). This difference reveals that ItDMACA suffers from communication/update overhead, which for specific problems can be a disadvantage.

Additional experiments are needed in order to confirm the conclusions drawn from the initial experimental evaluation, which was based on a small number of processing nodes and a small set of graphs. There is a large space of possible directions for further work, such as:

- application to additional graph problems, especially large and complex ones,

- solving the partitioning problem with more than 8 processors in parallel, to find how the number of processors influences the solution quality and speed-up,

- a shared-memory implementation, since distributed implementations suffer from increased communication
and local memory updates,

- an analysis of how statistically significant the differences in performance are, among the proposed parallel implementations and/or with respect to the sequential MACA algorithm.

Graph      p    best  mean  worst  b(D)   tT [s]   tc [s]   tu [s]   S(p)   Sr(p)
grid1      1*    38    39    41     1     18.01    0        0        1.00    -
grid1      1     38    40    42     0     58.03   38.39     0.28      -     1.00
grid1      2     38    40    43     0     47.97   16.25     1.38     0.38   1.21
grid1      4     38    39    41     1     26.45   11.07     3.35     0.68   2.19
grid1      8     38    40    43     1     14.00    5.43     2.22     1.29   4.15
grid2      1*    96   104   118     4     47.38    0        0        1.00    -
grid2      1     94   106   116     4    332.74  259.84     0.89      -     1.00
grid2      2     95   103   114     4    210.78  110.65    45.41     0.22   1.58
grid2      4     95   105   118     4    132.13   66.09    30.91     0.36   2.52
U1000.05   1*     9    14    20     3     50.78    0        0        1.00    -
U1000.05   1      9    14    21     3    342.12  278.76     0.62      -     1.00
U1000.05   2      7    16    33     3    225.56  134.97    37.33     0.23   1.52
U1000.05   4      7    15    22     2    160.29   91.36    38.15     0.32   2.13
U1000.10   1*    95   114   166     3     35.17    0        0        1.00    -
U1000.10   1     93   116   159     3     79.74   34.02     0.53      -     1.00
U1000.10   2     96   112   129     3     63.46   17.23     7.32     0.55   1.26
U1000.10   4     98   117   197     5     46.29    8.99     9.70     0.76   1.73
U1000.10   8     98   118   157     3     26.08    5.13     7.34     1.35   3.06
U1000.20   1*   485   580   856     6     32.27    0        0        1.00    -
U1000.20   1    480   594   888     8     63.64   25.31     0.49      -     1.00
U1000.20   2    487   583   759     6     51.82   11.07     4.39     0.62   1.22
U1000.20   4    486   594   762     5     36.72    6.65     7.51     0.88   1.73
U1000.20   8    474   584   805     5     24.25    3.52     6.50     1.33   2.62
crack      1*   373   415   522    15    191.07    0        0        1.00    -
crack      1    372   415   507    15    720.23  401.47     1.11      -     1.00
crack      2    377   433   496    11    565.13  194.86   150.72     0.34   1.27
crack      4    382   415   492     9    411.00  104.70   197.98     0.46   1.75

* sequential code
Table 6: Experimental results: 4-partitioning problem with ItDMACA

5 Conclusions

An efficient parallelization of a given algorithm depends mainly on the available computing platform, the underlying problem and the algorithm itself. A large communication overhead between the processors can degrade parallel performance. When the algorithm uses global structures, such as the pheromone matrix or, in the MACA case, the grid matrix of 2-3 trees, a shared-memory system requires less communication than a distributed-memory system. On the other hand, the most common and at the same time cheaper approach is parallelization on distributed-memory systems, i.e., MIMD architectures such as clusters of workstations.

In this paper, two distributed MACA versions were presented, Semi-Independent and Interactive, both implemented on a cluster of workstations. The initial experimental evaluation confirms that parallelization efficiency is problem dependent. Overall, both approaches show comparable or better (stable) solution quality. While ItDMACA is more sensitive with respect to parallel efficiency, due to the synchronization overhead, SIDMACA can obtain the same or better quality in less computational time, which is a gain on both scales: quality and cost. In order to see how significant this improvement is and how robust the approach is, additional experimental analysis with different problem types (large and complex) and experimental setups should be performed.

References

[1] A. Bahreininejad, B.H.V. Topping, and A.I. Khan. Finite Element Mesh Partitioning Using Neural Networks. Adv. Eng. Softw., 27(1-2):103-115, 1996.

[2] S.T. Barnard and H.D. Simon. A Fast Multilevel Implementation of Recursive Spectral Bisection for Partitioning Unstructured Problems. Concurr. Comp.-Pract. E., 6(2):101-117, 1994.

[3] M. Blondi and M. Bondanza. Parallelizzazione di un Algoritmo per la Risoluzione del Problema del Commesso Viaggiatore. Master's thesis, Politecnico di Milano, 1993.

[4] R.D. Cook, D.S. Malkus, M.E. Plesha, and
R.J. Witt. Concepts and Applications of Finite Element Analysis. John Wiley & Sons, 2001.

[5] M. Dorigo and G. Di Caro. The Ant Colony Optimization Meta-Heuristic. In D. Corne, M. Dorigo, and F. Glover, editors, New Ideas in Optimization, McGraw-Hill, 1999.

[6] M. Dorigo. Optimization, Learning and Natural Algorithms. PhD thesis, Dipartimento di Elettronica, Politecnico di Milano, 1992.

[7] C.M. Fiduccia and R.M. Mattheyses. A Linear Time Heuristic for Improving Network Partitions. In Proc. 19th IEEE Design Automation Conf., Las Vegas, NV, pages 175-181, 1982.

[8] Graph Collection. wwwcs.uni-paderborn.de/cs/ag-monien/RESEARCH/PART/graphs.html.

[9] B. Hendrickson and R. Leland. A Multilevel Algorithm for Partitioning Graphs. In Proc. ACM/IEEE Conf. Supercomputing, San Diego, CA, 1995.

[10] P. Kadluczka and K. Wala. Tabu Search and Genetic Algorithms for the Generalized Graph Partitioning Problem. Control Cybern., 24(4):459-476, 1995.

[11] G. Karypis and V. Kumar. Multilevel k-way Partitioning Scheme for Irregular Graphs. J. Parallel Distr. Com., 48(1):96-129, 1998.

[12] B.W. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell Sys. Tech. J., 49(2):291-307, 1970.

[13] A.E. Langham and P.W. Grant. Using Competing Ant Colonies to Solve k-way Partitioning Problems with Foraging and Raiding Strategies. Lect. Notes Comp. Sc., 1674:621-625, 1999.

[14] P. Korošec, J. Šilc, and B. Robič. Solving the Mesh-partitioning Problem with an Ant-colony Algorithm. Parallel Comput., 30(5-6):785-801, 2004.

[15] M. Randall and A. Lewis. A Parallel Implementation of Ant Colony Optimization. J. Parallel Distr. Com., 62(9):1421-1432, 2002.

[16] T. Stützle. Parallelization Strategies for Ant Colony Optimization. Lect. Notes Comp. Sc., 1498:722-741, 1998.

[17] C. Walshaw and M. Cross. Mesh Partitioning: A Multilevel Balancing and Refinement Algorithm. SIAM J. Sci. Comput., 22(1):63-80, 2001.