Informatica 38 (2014) 229-240

Genetic Algorithm with Fast Greedy Heuristic for Clustering and Location Problems

Lev A. Kazakovtsev and Alexander N. Antamoshkin
Siberian State Aerospace University named after Academician M. F. Reshetnev
prosp. Krasnoyarskiy rabochiy, 31, Krasnoyarsk 660014, Russian Federation
E-mail: levk@bk.ru

Keywords: continuous location problem, p-median problem, genetic algorithms, discrete optimization, clustering, k-means

Received: April 4, 2014

The authors propose a new genetic algorithm for solving the planar p-median location problem and the k-means clustering problem. The ideas of the algorithm are based on the genetic algorithm with greedy heuristic for the p-median problem on networks and on information bottleneck (IB) clustering algorithms. The proposed algorithm uses the standard k-means procedure, or any other similar algorithm, for local search. The efficiency of the proposed algorithm in comparison with known algorithms was demonstrated by experiments on large-scale location and clustering problems.

Povzetek: A new algorithm for clustering and location with a fast greedy heuristic is developed.

1 Introduction

The aim of a continuous location problem [18] is to determine the location of one or more new facilities in a continuum of possible locations. The main parameters of such problems are the coordinates of the facilities and the distances between them [54, 19, 22]. Examples of location problems include the location of warehouses [22], computer and communication networks, base stations of wireless networks [44, 30], statistical estimation problems [41], signal and image processing, and other engineering applications. In addition, many problems of cluster analysis [21, 34, 47] can be considered as location problems [37, 32] with squared Euclidean [21, 26], Euclidean [37, 48] or other distance functions [23].

The Weber problem [52, 54] is the problem of searching for a point such that the sum of weighted Euclidean distances from this point to the given points (the existing facilities, also called "demand points" or "data vectors" in the case of a clustering problem) is minimal:

$$\arg\min_X F(X) = \sum_{i=1}^{N} w_i L(X, A_i). \quad (1)$$

Here, $L(\cdot)$ is a distance function (norm), Euclidean in the case of the Weber problem. For solving this problem (searching for its center), we can use the iterative Weiszfeld procedure [53] or its improved modifications [51, 17]. Analogous problems with Manhattan and Chebyshev distances are well investigated [55, 50, 11]. Convergence of this algorithm has been proved for various distance metrics [40].

One possible generalization [22, 14] of the Weber problem is the p-median problem [22], where the aim is to find optimal locations of p new points (facilities):

$$\arg\min_{X_1,\ldots,X_p} F(X_1,\ldots,X_p) = \sum_{i=1}^{N} w_i \min_{j\in\{1,\ldots,p\}} L(X_j, A_i). \quad (2)$$

Here, $\{A_i \mid i = 1,\ldots,N\}$ is the set of demand points (data vectors), $\{X_j \mid j = 1,\ldots,p\}$ is the set of new placed facilities, $w_i$ is the weight coefficient of the $i$th demand point, and $L(\cdot)$ is a distance function defined on a continuous or discrete set [24]. In this paper, we consider continuous problems in an $n$-dimensional space.

In the simplest case, $L(\cdot)$ is the Euclidean norm. In this case, the Weiszfeld procedure is run up to p times at each iteration of the iterative alternating location-allocation (ALA) method [13]. If the distance function is squared Euclidean ($\ell_2^2$), then the solution of the single-facility problem (1) is the mean point (centroid) [22]:

$$X = \sum_{i=1}^{N} w_i A_i \Big/ \sum_{i=1}^{N} w_i. \quad (3)$$

Here, we assume that $X = (x_1, \ldots, x_d)$ and $A_i = (a_{i,1}, \ldots, a_{i,d})\ \forall i = 1,\ldots,N$.
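As a concrete illustration of the single-facility building block, the following minimal Python/NumPy sketch implements the classic Weiszfeld iteration for problem (1); the function name, starting point and tolerances are our illustrative choices, not code from the paper:

```python
import numpy as np

def weiszfeld(A, w, max_iter=1000, tol=1e-9):
    """Approximate the weighted geometric median (Weber point) of the
    demand points A (N x d array) with positive weights w (length N)."""
    X = np.average(A, axis=0, weights=w)        # start from the centroid (3)
    for _ in range(max_iter):
        dist = np.linalg.norm(A - X, axis=1)    # L(X, A_i), Euclidean
        if np.any(dist < 1e-12):                # X hits a demand point; improved
            break                               # variants [51, 17] handle this case
        g = w / dist                            # Weiszfeld weights w_i / L(X, A_i)
        X_new = (A * g[:, None]).sum(axis=0) / g.sum()
        if np.linalg.norm(X_new - X) < tol:
            return X_new
        X = X_new
    return X
```

For the squared Euclidean distance, no iteration is needed: the optimum is the weighted centroid (3), i.e. `np.average(A, axis=0, weights=w)`.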
The simplest and probably most popular clustering [9, 49] model is k-means [33, 34], which can be formulated as a p-median problem (2) with $w_i = 1\ \forall i = 1,\ldots,N$ and $L(\cdot)$ the squared Euclidean norm $\ell_2^2$:

$$L(X, Y) = \sum_{i=1}^{d} (x_i - y_i)^2,$$

where $X = (x_1, \ldots, x_d) \in \mathbb{R}^d$ and $Y = (y_1, \ldots, y_d) \in \mathbb{R}^d$. Searching for the centroid takes less computational resources than searching iteratively for the center in the case of the Weber problem, so the ALA method works faster in this case.

The p-median problem with Euclidean ($\ell_2$), squared Euclidean ($\ell_2^2$) or other $\ell_p$ distances [16] is a problem of global optimization: the objective function is neither convex nor concave [13]. The ALA method and analogous algorithms find one local minimum of the objective function, and their result depends on the initial solution. Moreover, such global optimization problems are proved to be NP-hard [20, 2, 35] for both continuous and discrete location [45, 46, 25], which makes the use of brute-force methods impossible for large datasets. Therefore, many heuristics [39] are used to improve the obtained results. One widely used heuristic for initial seeding is k-means++ [8].

The most popular ALA procedures for the k-means problem are based on an algorithm proposed by Lloyd [33]. The algorithm known as the standard k-means procedure [34] is a fast local search procedure based on Lloyd's procedure. However, many authors have proposed faster methods based on this standard procedure [56, 12, 1] for datasets and continuously supplemented data streams. Many authors propose approaches based on data sampling [38]: solving a reduced problem with a randomly selected part of the initial dataset and using this result as the initial solution of the ALA algorithm on the whole dataset [15, 29, 43]. An analogous approach was proposed for discrete p-median problems [6].

Many authors propose various genetic algorithms (GAs) for improving the results of the local search [28, 36, 31, 42]. Most of these algorithms use an evolutionary approach for recombination of the initial solutions of the ALA algorithm. Hosage and Goodchild [27] proposed the first genetic algorithm for the p-median problem. Genetic algorithms are based on the idea of recombination of elements of a set ("population") of candidate solutions called "individuals", coded in a special alphabet. In [10], the authors propose a genetic algorithm providing rather precise results, but its convergence is slow. Alp et al. [3] proposed a quick and precise genetic algorithm with a special "greedy" heuristic for solving discrete p-median problems on networks, which was improved by Antamoshkin and Kazakovtsev [4]. This algorithm can also be used for generating the initial solutions of the ALA algorithm [42].

The idea of the greedy heuristic is as follows. After selecting two "parent" solutions, a new infeasible candidate solution is composed as the union of the facility sets of the parent solutions. Facilities are then eliminated from this new solution until it becomes feasible. At each step, the algorithm eliminates the facility whose elimination gives the minimal addition to the objective function. If this heuristic is used for the continuous p-median problem, it generates the initial solution for the ALA algorithm [42], which must be run at each step to estimate the result of eliminating each facility from the candidate solution.
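A compact sketch may clarify this union-and-eliminate idea before it is formalized (as Algorithms 6 and 7 in Section 2). The Python illustration below works on data-vector indexes, as in the initial-seeding variant, and assumes unit weights and squared Euclidean distances; all names are ours:

```python
import numpy as np

def greedy_crossover(A, p, parent1, parent2):
    """Merge two parent solutions (sets of data-vector indexes) and then
    repeatedly drop the candidate center whose removal increases the
    objective (2) the least, until only p centers remain."""
    chi = sorted(set(parent1) | set(parent2))     # infeasible union, |chi| <= 2p
    while len(chi) > p:
        best_j, best_F = None, np.inf
        for j in chi:                             # tentatively remove each center
            rest = [k for k in chi if k != j]
            d = ((A[:, None, :] - A[rest][None, :, :]) ** 2).sum(2).min(axis=1)
            if d.sum() < best_F:                  # objective after removing j
                best_j, best_F = j, float(d.sum())
        chi.remove(best_j)                        # eliminate the cheapest center
    return chi
```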
In this paper, we present a new genetic algorithm with a floating-point alphabet based on the ideas of the algorithm proposed by Alp et al. [3]. The original algorithm of Alp et al. uses an integer alphabet (the numbers of the vertices of the network) in the "chromosomes" (interim solutions) of the GA. Its version for planar location problems [42] uses an integer alphabet for coding the numbers of the data vectors used as initial solutions of the ALA algorithm. In our algorithm, we use a floating-point alphabet: the elements of the "chromosomes" of our genetic algorithm are the coordinates of the centers or centroids of the interim solution, which are altered by steps of the ALA algorithm and eliminated until a feasible solution is obtained. Such a combination of the greedy heuristic and the ALA procedure allows the algorithm to obtain more precise results. In the case of continuous location problems, the greedy heuristic is a computationally intensive procedure. We propose a new procedure which eliminates whole sets of centers or centroids from the candidate solution, which reduces the running time several-fold.

2 Known algorithms

The basic idea of the alternating location-allocation (ALA) method is to recalculate the centers or centroids of the clusters and to reallocate the data vectors to the closest center or centroid:

Algorithm 1. ALA method [28].
Require: Set $V = \{A_1, \ldots, A_N\}$ of N data vectors in d-dimensional space, $A_i = (a_{i,1}, \ldots, a_{i,d})$, and an initial set of p centers or centroids $\chi = \{X_1, \ldots, X_p\}$.
1: Allocate each data vector to the closest center or centroid: $C_i = \arg\min_{X\in\chi} L(X, A_i)\ \forall i = 1,\ldots,N$.
2: Recalculate the center (with the Weiszfeld procedure) or the centroid (3) of each cluster.
3: If at least one data vector has changed its cluster, go to Step 1; otherwise, STOP. The solution is $\chi$.

For the initial seeding of the ALA method, the k-means++ procedure can be used.

Algorithm 2. k-means++ seeding [8].
Require: Set V, number p of clusters.
1: Choose the first center at random from V: $\chi = \{A_i\}$, i drawn uniformly from $\{1,\ldots,N\}$.
2: For each data vector, calculate the distance to the closest already chosen center, $p_i = \min_{X\in\chi} L(X, A_i)$, and the sum $S = \sum_{i=1}^{N} p_i$.
3: Generate a random value $r \in [0; S)$ with the uniform distribution and use $i_{min} = \min\{i \in \{1,\ldots,N\} : \sum_{j=1}^{i} p_j > r\}$ as the index of the next center: $\chi = \chi \cup \{A_{i_{min}}\}$.
4: If $|\chi| < p$, go to Step 2; otherwise, STOP and return $\chi$.

For large datasets, a sampling procedure reduces the running time of the local search.

Algorithm 3. Sampling ALA (sampling k-means) procedure [15, 29, 43].
Require: Set V, number p of clusters, sample size $N' < N$.
1: Select a random sample of $N'$ data vectors from V.
2: Run the ALA method (Algorithm 1) on the sample.
3: Use its result as the initial solution of the ALA method on the whole dataset V; STOP and return the result.

The simplest multistart approach restarts the local search from random initial solutions.

Algorithm 4. Random multistart.
Require: Set V, number p of clusters.
1: Set $F^{**} = +\infty$.
2: Generate a random initial solution (for example, with Algorithm 2).
3: Run the ALA procedure; store its result to $\chi^*$ and the corresponding objective value to $F^*$.
4: If $F^{**} > F^*$, then set $F^{**} = F^*$; $\chi^{**} = \chi^*$.
5: Check the stop conditions. If they are not reached, then go to Step 2.
6: Otherwise, STOP. The solution is $\chi^{**}$.
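For the squared Euclidean case, Algorithms 1 and 2 can be sketched in a few lines of Python/NumPy; the empty-cluster handling and the stopping rule here are our simplifying assumptions:

```python
import numpy as np

def ala(A, X, max_iter=100):
    """Algorithm 1 (ALA) for squared Euclidean distances: alternate the
    allocation of data vectors to the nearest centroid (Step 1) and the
    recomputation of centroids (Step 2) until no vector changes cluster."""
    for _ in range(max_iter):
        d2 = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                       # allocation step
        X_new = np.array([A[labels == j].mean(axis=0) if (labels == j).any()
                          else X[j] for j in range(len(X))])
        if np.allclose(X_new, X):                        # stable clusters: stop
            return X_new, labels
        X = X_new
    return X, labels

def kmeanspp_seed(A, p, rng=None):
    """Algorithm 2 (k-means++ [8]): each next center is a data vector drawn
    with probability proportional to its squared distance to the closest
    already chosen center."""
    rng = rng or np.random.default_rng()
    centers = [A[rng.integers(len(A))]]
    for _ in range(p - 1):
        C = np.array(centers)
        d2 = ((A[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        r = rng.uniform(0.0, d2.sum())                   # r in [0; S)
        i_min = int(np.searchsorted(np.cumsum(d2), r, side='right'))
        centers.append(A[i_min])
    return np.array(centers)
```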
The scheme of the genetic algorithm with greedy heuristic proposed by Alp et al., adapted for continuous location problems, is as follows [3, 42].

Algorithm 5. GA with greedy heuristic.
Require: Set $V = \{A_1, \ldots, A_N\} \subset \mathbb{R}^d$, number p of clusters, population size $N_P$.
1: Initialize $N_P$ sets of data-vector indexes $\chi_1, \ldots, \chi_{N_P}$: $\chi_k \subset \{1,\ldots,N\}$, $|\chi_k| = p\ \forall k = 1,\ldots,N_P$. For each $k \in \{1,\ldots,N_P\}$, calculate the fitness function. In the case of the continuous p-median problem, to obtain the fitness function value for $\chi_k$, the algorithm performs the ALA procedure with the initial solution $\chi_k$ and calculates

$$F_k = F(\bar{\chi}_k) = \sum_{i=1}^{N} w_i \min_{X \in \bar{\chi}_k} L(X, A_i). \quad (4)$$

Here, $\bar{\chi}_k$ is the result of running the ALA procedure with the initial solution $\chi_k$.
2: If the stop conditions are reached, then go to Step 7.
3: Choose randomly two "parent" sets $\chi_{k_1}$ and $\chi_{k_2}$, $k_1, k_2 \in \{1,\ldots,N_P\}$, $k_1 \neq k_2$. Running the special crossover procedure with greedy heuristic, obtain a "child" set of indexes $\chi_c$. Calculate its fitness function value $F_c$ in accordance with (4).
4: If $\exists k \in \{1,\ldots,N_P\} : \chi_k = \chi_c$, then go to Step 2.
5: Choose the index $k_{worst} = \arg\max_{k=1,\ldots,N_P} F_k$. If $F_{k_{worst}} < F_c$, then go to Step 2.
6: Replace $\chi_{k_{worst}}$ with $\chi_c$, set $F_{k_{worst}} = F_c$ and go to Step 2.
7: STOP. The result is the set $\chi_{k^*}$, $k^* = \arg\min_{k=1,\ldots,N_P} F_k$.

In the version of the algorithm given above, at Steps 5 and 6 the worst solution is replaced by the new solution. In our experiments, we used another procedure at Step 5 (the simplest tournament selection): choose randomly two indexes $k_1$ and $k_2$, $k_1 \neq k_2$, and set $k_{worst} = \arg\max_{k\in\{k_1,k_2\}} F_k$. This version of Step 5 gives better results.

In both the random multistart and the genetic algorithms, various stop conditions can be used; we used a maximum running time limit. Unlike most genetic algorithms, this method does not use any mutation procedure. However, the crossover procedure uses a special heuristic:

Algorithm 6. Greedy crossover heuristic.
Require: Set $V = \{A_1, \ldots, A_N\} \subset \mathbb{R}^d$, number p of clusters, two "parent" sets of centers or centroids $\chi_{k_1}$ and $\chi_{k_2}$.
1: Set $\chi_c = \chi_{k_1} \cup \chi_{k_2}$. Note that $p \leq |\chi_c| \leq 2p$, i.e., in general the candidate solution $\chi_c$ is not feasible.
2: If $|\chi_c| = p$, then STOP and return the solution $\chi_c$.
3: Calculate $j^* = \arg\min_{j\in\chi_c} F(\chi_c \setminus \{j\})$.
4: Set $\chi_c = \chi_c \setminus \{j^*\}$.
5: Go to Step 2.

At each iteration, one index of a center or centroid is eliminated (Step 4). At Step 3, Algorithm 6 chooses the index of the center or centroid which can be eliminated with the minimum change of the fitness function. To estimate the fitness function, the ALA procedure must be performed; thus, Step 3 of Algorithm 6 is computationally intensive. In the case of the Euclidean metric, the iterative Weiszfeld procedure must run at each iteration of the iterative ALA procedure, which is performed $|\chi_c|$ times. Therefore, Algorithm 6 is computationally expensive: it is slow for very large datasets in the case of the k-means problem and almost inapplicable in the case of large-scale continuous p-median problems with the Euclidean metric.

The idea of this heuristic correlates with the ideas of the information bottleneck (IB) clustering method [48]. In the IB algorithms, at the start, all data vectors form individual clusters. At each step, one cluster is eliminated and its members join other clusters. To choose such a cluster, the algorithm calculates the "information loss" for each of them. In the case of "geometric" clustering based on distance metrics, this loss can be estimated as the increase of the distance function. The computational load of IB clustering allows this method to be applied to small datasets only (N < 1000). The form of the method proposed by Alp et al., adapted for continuous location problems, is a compromise between IB clustering and simpler heuristics such as traditional genetic algorithms or random multistart.

This algorithm can also be used for solving the discrete p-median problem (2) with the additional condition

$$X_j \in V\ \ \forall j \in \{1,\ldots,p\}, \quad (5)$$

whose solution can then be used as an initial solution of the ALA method.

Algorithm 7. Greedy crossover heuristic for initial seeding.
Require: Set $V = \{A_1, \ldots, A_N\} \subset \mathbb{R}^d$, number p of clusters, two "parent" sets of centers or centroids $\chi_{k_1}$ and $\chi_{k_2}$.
1: Set $\chi_c = \chi_{k_1} \cup \chi_{k_2}$.
2: If $|\chi_c| = p$, then STOP and return the solution $\chi_c$.
3: Calculate
$$j^* = \arg\min_{j \in \chi_c} \sum_{i=1}^{N} w_i \min_{k \in (\chi_c \setminus \{j\})} L(A_i, A_k).$$
4: Set $\chi_c = \chi_c \setminus \{j^*\}$.
5: Go to Step 2.

In this case, the ALA method always starts from a local minimum of the discrete problem (2) with the additional constraint (5). This version of the algorithm is much faster, and it gives better results than the random multistart (Algorithm 4) for the most popular test datasets (see Section 4). However, such results can be improved. We propose two modifications: the first decreases the computational intensiveness of the algorithm, and the second improves its accuracy. Their combination makes the new algorithm faster and more precise in the case of large datasets.
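Putting the pieces together, the outer loop of Algorithm 5 (with the tournament variant of Step 5 that we used) can be sketched as follows; `init`, `crossover` and `fitness` are plug-in procedures (e.g. k-means++ seeding, Algorithm 6 or 7, and the ALA-based fitness (4)), and all names here are our illustrative assumptions:

```python
import random, time

def ga_greedy(n_pop, init, crossover, fitness, time_limit):
    """Skeleton of the GA with greedy heuristic (Algorithm 5); solutions
    are assumed to be hashable tuples so the duplicate check is cheap."""
    pop = [init() for _ in range(n_pop)]            # Step 1: initial population
    fit = [fitness(c) for c in pop]
    deadline = time.time() + time_limit
    while time.time() < deadline:                   # Step 2: stop condition
        k1, k2 = random.sample(range(n_pop), 2)     # Step 3: two random parents
        child = crossover(pop[k1], pop[k2])
        fc = fitness(child)
        if child in pop:                            # Step 4: reject duplicates
            continue
        t1, t2 = random.sample(range(n_pop), 2)     # Step 5: tournament selection
        k_worst = t1 if fit[t1] >= fit[t2] else t2
        if fit[k_worst] < fc:                       # child worse than loser: skip
            continue
        pop[k_worst], fit[k_worst] = child, fc      # Step 6: replacement
    k_best = min(range(n_pop), key=fit.__getitem__)
    return pop[k_best], fit[k_best]                 # Step 7
```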
3 Our contribution

Let us consider Steps 3 and 4 of Algorithms 6 and 7. At each iteration, Step 3 selects one index of the data vectors and eliminates it from the candidate solution. Let us assume that at some kth iteration, the $j^*$th index is eliminated, and at the (k+1)th iteration, the algorithm eliminates the $j^{**}$th index. Our first modification is based on the supposition that if $A_{j^*}$ is distant from $A_{j^{**}}$ (i.e., $L(A_{j^*}, A_{j^{**}}) > L_{min}$, where $L_{min}$ is some constant), then the decision to eliminate or keep the $j^{**}$th index "almost" does not depend on the elimination or keeping of the $j^*$th index at the previous iteration. If the choices of the indexes of two distant data vectors at Step 3 in two successive iterations are independent, then the decisions on their elimination (or keeping) can be made simultaneously. We propose the following modification of Steps 3 and 4.

Algorithm 8. Fast greedy heuristic crossover: modified steps of the greedy heuristic procedure (Algorithm 6).
3: For each $j \in \chi_c$, calculate $\delta_j = F(\chi_c \setminus \{j\})$.
4.1: Sort the values $\delta_j$ and select a subset $\chi_{elim} = \{e_1, \ldots, e_{n_a}\} \subset \chi_c$ of $n_a$ indexes with the minimal values of $\delta_j$. The value $n_a \in \{1, \ldots, |\chi_c| - p\}$ must be recalculated in each iteration: the maximum number of extra elements of the set $\chi_c$ must be eliminated in the first iterations and only one element in the final iterations (so that the final iterations coincide with those of Algorithm 6 or 7):

$$n_a = \max\{\lceil (|\chi_c| - p) \cdot \alpha_e \rceil,\ 1\}. \quad (6)$$

We ran Algorithm 8 with $\alpha_e = 0.2$. Smaller values slow down the algorithm, while larger values ($\alpha_e > 0.3$) change the order of eliminating the clusters and reduce the accuracy.
4.2: From $\chi_{elim}$, remove close data vectors: for each $j \in \{2, \ldots, |\chi_{elim}|\}$, if $\exists k \in \{1, \ldots, j-1\} : L(A_{e_j}, A_{e_k}) < L_{min}$, then remove $e_j$ from $\chi_{elim}$.
4.3: Set $\chi_c = \chi_c \setminus \chi_{elim}$.

Algorithm 6 performs up to p iterations. For real large datasets, computational experiments demonstrate that p or p−1 iterations are performed in most cases (the data vectors of the "parent" solutions at Step 3 of Algorithm 5 do not coincide). In each iteration, the ALA algorithm runs $|\chi_c|$ times; thus, the ALA algorithm runs up to $2p^2 - p + 1$ times in total. The popular test datasets BIRCH 1-3 are generated for testing algorithms on problems with 100 clusters; for them, the ALA algorithm must run up to 19901 times.

Depending on the parameter $L_{min}$, each iteration of Algorithm 8 eliminates up to $\lceil \alpha_e (|\chi_c| - p) \rceil$ members of $\chi_c$. If $L_{min}$ is small enough and $\alpha_e = 0.2$, in the first iteration the ALA procedure runs 2p times, in the second iteration $\lceil 1.8p \rceil$ times, then $\lceil 1.64p \rceil$, $\lceil 1.512p \rceil$ times, etc. In the case of 100 clusters, the ALA procedure then runs 200 + 180 + 164 + 152 + 142 + 134 + 128 + 123 + 118 + 116 + 113 + 111 + 109 + 108 + 107 + 106 + 105 + 104 + 103 + 102 + 101 = 2626 times only. Taking into account the computational intensiveness of the ALA procedure (for example, the smoothed running time of the standard k-means algorithm is estimated as $O(N^{34} p^{34} \log^4(N) / \sigma^6)$ for data vectors independently perturbed by a normal distribution with variance $\sigma^2$ [7]), reducing the number of runs of the local search procedure is crucial for large-scale problems.

Step 3 of Algorithm 7 can be modified analogously.

Algorithm 9. Fast greedy heuristic crossover for initial seeding: modified steps of the greedy heuristic procedure (Algorithm 7).
3: For each $j \in \chi_c$, calculate
$$\delta_j = \sum_{i=1}^{N} w_i \min_{k \in (\chi_c \setminus \{j\})} L(A_i, A_k).$$
4.1: Sort the values $\delta_j$ and select a subset $\chi_{elim} = \{e_1, \ldots, e_{n_a}\} \subset \chi_c$ of $n_a$ indexes with the minimal values of $\delta_j$.
4.2: For each $j \in \{2, \ldots, |\chi_{elim}|\}$, if $\exists k \in \{1, \ldots, j-1\} : L(A_{e_j}, A_{e_k}) < L_{min}$, then remove $e_j$ from $\chi_{elim}$.
4.3: Set $\chi_c = \chi_c \setminus \chi_{elim}$.

The aim of Step 4.2 of Algorithms 8 and 9 is to preserve the order of elimination of the clusters provided by Algorithms 6 and 7 (see the sketch below and the example in Fig. 1).
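The modified elimination steps 4.1-4.3 can be sketched in Python as follows, assuming the losses $\delta_j$ have already been computed; the rounding in (6), the Euclidean comparison of candidates and all names are our illustrative choices:

```python
import numpy as np

def fast_elimination(chi, deltas, coords, p, alpha_e=0.2, L_min=0.0):
    """Steps 4.1-4.3 of Algorithms 8/9: pick the n_a candidates with the
    smallest elimination losses (formula (6)), then drop from that list
    every candidate lying closer than L_min to one selected before it."""
    n_a = max(int(np.ceil((len(chi) - p) * alpha_e)), 1)   # formula (6)
    order = np.argsort(deltas)[:n_a]                # Step 4.1: smallest losses
    selected = []
    for idx in order:                               # Step 4.2: distance filter
        if all(np.linalg.norm(coords[idx] - coords[s]) >= L_min
               for s in selected):
            selected.append(idx)
    dropped = set(selected)
    return [c for i, c in enumerate(chi) if i not in dropped]   # Step 4.3
```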
In Fig. 1, two cases of running Algorithm 8 are shown. Let us assume that p = 4 and that the distances between the centers of clusters 1 and 3, 3 and 4, 1 and 4, and 6 and 7 are less than $L_{min}$. Let us also assume that the parameter $\alpha_e$ allows eliminating up to 3 clusters simultaneously in the 1st iteration. After Step 3 of Algorithm 8 and sorting the values $\delta_j$, we have the sequence of clusters 4, 3, 6, .... If Step 4.2 is included in Algorithm 8, then only one of the clusters 1, 3 and 4 can be removed in the 1st iteration (Case A). Thus, only two clusters (4 and 7) are eliminated in the 1st iteration. If we remove Step 4.2 from our algorithm, or assign a value of $L_{min}$ close to zero, then the simultaneous elimination of clusters 3 and 4 is allowed (Case B), which gives a worse value of the sum of squared distances. If the original Algorithm 7 runs, it eliminates cluster 4 first and then cluster 6; in its 3rd iteration, Algorithm 7 eliminates cluster 1, and we obtain the set of clusters shown in Fig. 1, Case A, i.e., the result reached by Algorithm 8 after two iterations.

[Figure 1: Succeeding and simultaneous elimination of clusters.]

Algorithm 6 starts the ALA procedure many times; it is a precise but slow method. Having included the modified steps of Algorithm 8 into Algorithm 6, we reduce the number of starts of the ALA procedure; however, as explained above, at least 2626 starts of the local search algorithm in each iteration of the genetic algorithm (in the case of 100 clusters) make this method unusable for very large datasets, especially with the Euclidean metric. Algorithm 7 optimizes the fitness function calculated for the initial seeding of the ALA procedure. This approach is fast; however, an optimal value of the fitness function for the initial seeding does not guarantee an optimal value for the final result of the ALA procedure.

We propose a compromise between the two algorithms which implements one step of the ALA procedure after each elimination of clusters. Since the result of the ALA procedure does not, in general, coincide with the data vectors $A_i$, using integer numbers as the alphabet of the GA (i.e., for coding the solutions forming the population of the GA) is impossible, and we use real vectors (the coordinates of the interim solutions of the ALA procedure) for coding the solutions in the population of the GA. The whole algorithm is as follows.

Algorithm 10. GA with greedy heuristic and floating-point alphabet.
Require: Set $V = \{A_1, \ldots, A_N\} \subset \mathbb{R}^d$, number p of clusters, population size $N_P$.
1: Initialize $N_P$ sets of coordinates $\chi_1, \ldots, \chi_{N_P}$: $\chi_k \subset \mathbb{R}^d$, $|\chi_k| = p\ \forall k = 1,\ldots,N_P$, with solutions of the ALA algorithm initialized by the k-means++ procedure (Algorithm 2). Thus, each $\chi_k$ is a local minimum of (2). Store the corresponding values of the objective function (2) in the array $F_1, \ldots, F_{N_P}$.
2: If the stop conditions are reached, then go to Step 7.
3: Choose randomly two "parent" sets $\chi_{k_1}$ and $\chi_{k_2}$, $k_1, k_2 \in \{1,\ldots,N_P\}$, $k_1 \neq k_2$. Running Algorithm 11, obtain a "child" set of coordinates $\chi_c$, which is a local minimum of (2). Store the corresponding value of (2) in $F_c$.
4: If $\exists k \in \{1,\ldots,N_P\} : \chi_k = \chi_c$, then go to Step 2.
5: Choose randomly two indexes $k_1$ and $k_2$, $k_1 \neq k_2$; set $k_{worst} = \arg\max_{k\in\{k_1,k_2\}} F_k$. If $F_{k_{worst}} < F_c$, then go to Step 2.
6: Replace $\chi_{k_{worst}}$ with $\chi_c$, set $F_{k_{worst}} = F_c$ and go to Step 2.
7: STOP. The result is the set $\chi_{k^*}$, $k^* = \arg\min_{k=1,\ldots,N_P} F_k$.

The greedy crossover procedure is modified as follows.

Algorithm 11. Greedy crossover heuristic with floating-point alphabet.
Require: Set $V = \{A_1, \ldots, A_N\} \subset \mathbb{R}^d$, number p of clusters, two "parent" sets of centers or centroids $\chi_{k_1}$ and $\chi_{k_2}$, parameters $\alpha_e$ and $L_{min}$.
1: Set $\chi_c = \chi_{k_1} \cup \chi_{k_2}$. Run the ALA procedure for $|\chi_c|$ clusters starting from $\chi_c$; store its result to $\chi_c$.
2: If $|\chi_c| = p$, then run the ALA procedure with the initial solution $\chi_c$, STOP and return its result.
2.1: Calculate the distance from each data vector to the closest element of $\chi_c$:
$$d_i = \min_{X \in \chi_c} L(X, A_i)\ \ \forall i = 1,\ldots,N.$$
2.2: Assign each data vector to the cluster with its center in the closest element of $\chi_c$:
$$C_i = \arg\min_{X \in \chi_c} L(X, A_i)\ \ \forall i = 1,\ldots,N.$$
2.3: Calculate the distance from each data vector to the second-closest element of $\chi_c$:
$$D_i = \min_{Y \in (\chi_c \setminus \{C_i\})} L(Y, A_i)\ \ \forall i = 1,\ldots,N.$$
3: For each $X \in \chi_c$, calculate the elimination loss
$$\delta_X = F(\chi_c \setminus \{X\}) - F(\chi_c) = \sum_{i:\, C_i = X} (D_i - d_i).$$
4.1: Calculate $n_a$ in accordance with (6). Sort the values $\delta_X$ and select a subset $\chi_{elim} = \{X_1, \ldots, X_{n_a}\} \subset \chi_c$ of $n_a$ coordinates with the minimal values of $\delta_X$.
4.2: For each $j \in \{2, \ldots, |\chi_{elim}|\}$, if $\exists k \in \{1, \ldots, j-1\} : L(X_j, X_k) < L_{min}$, then remove $X_j$ from $\chi_{elim}$.
4.3: Set $\chi_c = \chi_c \setminus \chi_{elim}$.
4.4: Reassign the data vectors to the closest remaining centers or centroids:
$$C^*_i = \arg\min_{X \in \chi_c} L(X, A_i)\ \ \forall i = 1,\ldots,N.$$
For each $X \in \chi_c$, if $\exists i \in \{1,\ldots,N\} : C_i \neq X$ and $C^*_i = X$, then recalculate the center or centroid $X^*$ of the cluster $C^{clust}_X = \{A_i \mid C^*_i = X,\ i = 1,\ldots,N\}$ and set $\chi_c = (\chi_c \setminus \{X\}) \cup \{X^*\}$.
5: Go to Step 2.
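Step 3 of Algorithm 11 does not need a full re-run of the objective for every candidate: once the closest and second-closest centers are known (Steps 2.1-2.3), all losses $\delta_X$ follow from one pass over the data. A Python/NumPy sketch under the squared Euclidean, unit-weight assumption (names are ours):

```python
import numpy as np

def elimination_losses(A, centers):
    """Compute delta_X for every center X (Step 3 of Algorithm 11): for each
    data vector assigned to X, removing X costs the gap between its
    second-closest and closest centers, summed over the cluster of X."""
    d2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                     # C_i, Step 2.2
    d_i = d2[np.arange(len(A)), nearest]            # d_i, Step 2.1
    d2[np.arange(len(A)), nearest] = np.inf
    D_i = d2.min(axis=1)                            # D_i, Step 2.3
    deltas = np.zeros(len(centers))
    np.add.at(deltas, nearest, D_i - d_i)           # delta_X = sum(D_i - d_i)
    return deltas
```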
An important parameter of Algorithms 8 and 11 is $L_{min}$. Based on a series of experiments on various data, we propose the following method of determining it for each pair of centers or centroids $X_j$ and $X_k$ (see Step 4.2 of Algorithm 11):

$$L_{min} = \min_{X \in \chi_c} \max\{L(X, X_j),\ L(X, X_k)\}.$$

We ran this algorithm on large datasets and confirmed its efficiency experimentally.

4 Computational experiments

4.1 Datasets and computing facilities

For testing purposes, we used real and generated datasets collected by the Speech and Image Processing Unit of the School of Computing of the University of Eastern Finland¹ and by the UCI Machine Learning Repository². Other authors use these problems for their experiments [56, 1, 43]. The number of data vectors N varies from 150 (the classical Iris plant problem) to 581013 (the Cover type dataset), the number of dimensions d varies from 2 to 54, and the number of clusters from 3 to 1000. In addition, we used specially generated datasets for p-median problems (uniformly distributed data vectors in $\mathbb{R}^2$, each coordinate in the interval [0; 10), with uniformly distributed weights in the range [0; 10)). Computational experiments were performed for problems with Euclidean (ℓ2) and squared Euclidean (ℓ2²) distances (the p-median and k-means problems, correspondingly).

For our experiments, we used a Depo X8Sti computer (6-core Xeon X5650 2.67 GHz CPU, 12 GB RAM) with hyper-threading disabled and the ifort compiler with full optimization and implicit parallelism (option -O3). For algorithm comparison purposes, we ran each algorithm on each dataset 30 times.

¹ http://cs.joensuu.fi/sipu/datasets/
² https://archive.ics.uci.edu/ml/datasets.html

4.2 Algorithm parameters tuning

An important parameter of the genetic algorithm is the number of individuals (candidate solutions) $N_P$ in its population (the population size). Running Algorithm 10 on the generated datasets (d = 2, N = 1000 and N = 10000, p = 10 and p = 100) and on real datasets (MissAmerica1 with p = 100) shows that large populations ($N_P$ > 50) slow down the convergence. Analogously, running with very small populations ($N_P$ < 10) decreases the accuracy. The results of running our algorithm on a generated problem with the squared Euclidean metric are shown in Fig. 2. For this diagram, we recorded in a special array the best result achieved by the algorithm and the time spent after each improvement of the result; the diagram shows the average or worst results of 30 starts of the algorithms.

[Figure 2: Results of Algorithm 10 with various population sizes (5, 10, 15, 20 and 100 individuals): objective function value vs. running time, sec.]
Experiments show that a population of 10-25 individuals is optimal for all tested problems and for all greedy crossover heuristics considered in this paper (Algorithms 6, 7, 8 and 11). There is no obvious relation between d and $N_P$, p and $N_P$, or N and $N_P$. In all experiments below, we used $N_P$ = 15.

4.3 Numerical results

For all datasets, we ran the genetic algorithm with greedy heuristic (Algorithm 5) with various crossover heuristics (Step 3 of Algorithm 5): its original version proposed by Alp et al. [3, 42] (Algorithm 6), its version for initial cluster center seeding only (Algorithm 7), our modification allowing the elimination of many cluster centers in one step (Algorithm 8), and our new genetic algorithm with floating-point alphabet (Algorithm 11). The results for each dataset are shown in Tables 1 and 2.

We used the sampling k-means procedure (Algorithm 3) for datasets with N > 10000 as the ALA procedure at Step 1 of Algorithms 5, 10 and 11. For smaller datasets, we ran all algorithms without sampling. However, running the algorithms without the sampling k-means procedure for large datasets delays the genetic algorithm equally for all considered greedy crossover heuristics.

The computation process with each of the algorithms was performed 30 times. The time limit shown in the first column was used as the stop condition. The value of this maximum running time was chosen so that adding an equal amount of additional time does not yield better results when the original greedy crossover heuristic for initial seeding (Algorithm 7) is used, in at least 27 of 30 attempts. In addition, for the problems listed in Table 1, we recorded the average time needed to reach the average result of the original genetic algorithm with greedy crossover heuristic (Algorithm 5 + Algorithm 6, see [3, 42]). For the more complex problems listed in Table 2, where the original greedy crossover procedure is inapplicable due to its huge computational complexity, we recorded the average time needed to reach the average result of the original genetic algorithm with the greedy crossover heuristic applied to optimizing the fitness function value of the initial solution of the ALA procedure (Algorithm 5 + Algorithm 7).

Algorithm 5 with the original greedy crossover heuristic (Algorithm 6) proposed by Alp et al. [3, 42] shows excellent results for comparatively small datasets (see Table 1). For the least complex problems (the "Iris" dataset), using the algorithm proposed in this article (Algorithm 10, Problems 1 and 3) reduces the accuracy of the solution in comparison with the original algorithm of Alp et al. [3, 42] (Algorithms 5 and 6). For larger datasets, the new algorithm (Algorithm 10 + Algorithm 11) is faster and more precise. Moreover, using the original greedy crossover heuristic is impossible for large datasets (for all larger datasets with p > 30, N > 10000) due to the very intensive computation at each iteration. For such datasets, we used the algorithm of Alp et al. applied to optimizing the fitness function value of the initial solution of the ALA procedure (Algorithms 5 and 7) for comparison purposes.

In this case, for all solved large-scale test problems with both Euclidean (ℓ2, planar p-median problem) and squared Euclidean (ℓ2², k-means problem) metrics, our Algorithm 10 with floating-point alphabet and modified greedy crossover heuristic (Algorithm 11) works faster and gives more precise results than Algorithm 5 with the greedy heuristic implemented for the initial seeding only (Algorithm 7, [3, 42]).
Table 1: Results of running the new Algorithm 11 and the original genetic algorithm with greedy crossover heuristic.

Dataset (parameters), time limit | Distance | Method | Average result | Avg. time to reach the original method's result, sec. | Worst result | Avg. speedup (new vs. original)
Iris (N = 150, d = 4, p = 3), 100 sec | ℓ2² | Original | 1.40026039044·10^14 | 0.0096 | 1.40026039044·10^14 | -
 | | ALA mult. | 1.40026039044·10^14 | 0.0103 | 1.40026039044·10^14 | -
 | | New | 1.400262·10^14 | - | 1.4002858·10^14 | -
Iris (N = 150, d = 4, p = 10), 100 sec | ℓ2 | Original | 46916083985700 | 2.4 | 46916083985700 | -
 | | ALA mult. | 46917584582154 | - | 46935815209300 | -
 | | New | 46916083985700 | 2.5 | 46916083985700 | -
MissAmerica1 (N = 6480, d = 16, p = 30), 1500 sec | ℓ2² | Original | 105571815.061 | 603 | 105663081.95 | -
 | | ALA mult. | 105714622.427 | - | 106178506.965 | -
 | | New | 105440299.602 | 13.8 | 105440299.601 | 43.69
Europe (N = 169309, d = 2, p = 10), 1500 sec | ℓ2² | Original | 1099348562.46 | 1050.8 | 1099355026.03 | -
 | | ALA mult. | 1099345009.09 | 15.6 | 1099345033.08 | -
 | | New | 1099345067.99 | 123.8 | 1099345210.55 | 8.48

Note: "Original" is Algorithm 5 with the original greedy crossover heuristic (Algorithm 6); "ALA mult." is the multiple start of the ALA procedure (Algorithm 4); "New" is the genetic algorithm with floating-point alphabet (Algorithm 10 with Algorithm 11 as the greedy crossover procedure).

To illustrate the dynamics of the solving process, we present timing diagrams which show the average results of 30 runs of each algorithm on various datasets in Fig. 3 and 4. The diagrams show that the new algorithm with floating-point alphabet increases the accuracy at the early stages of the computation process in comparison with the known methods, which allows using it for obtaining quick approximate solutions. In addition, the results of the fast greedy heuristic (Algorithm 8) are shown in the diagrams: using this heuristic without the other modifications of the genetic algorithm can reduce the accuracy of the results.

[Figure 3: Comparison of the algorithms in time, Europe dataset (ℓ2, d = 2, N = 169309, p = 100), average results.]
Table 2: Results of running the new Algorithm 11 and the original genetic algorithm with greedy crossover heuristic used for initial seeding.

Dataset (parameters), time limit | Distance | Method | Average result | Avg. time to reach the result of the original method for init. seeding, sec. | Worst result | Avg. speedup (new vs. orig.)
Europe (N = 169309, d = 2, p = 100), 1200 sec | ℓ2 | Orig. init. | 400370576 | 397.6 | 400904292 | -
 | | ALA mult. | 400755480 | - | 401007437 | -
 | | New | 400345100 | 193.6 | 400595350 | 2.05
Generic (N = 10000, d = 2, p = 100), 300 sec | ℓ2 | Orig. init. | 85318.44 | 47.65 | 85482.67 | -
 | | ALA mult. | 85374.62 | - | 86114.83 | -
 | | New | 85167.01 | 0.895 | 85179.72 | 53.24
Europe (N = 169309, d = 2, p = 100), 1200 sec | ℓ2² | Orig. init. | 2.2767·10^12 | 557.6 | 2.2933·10^12 | -
 | | ALA mult. | 2.3039·10^12 | - | 2.3111·10^12 | -
 | | New | 2.2752·10^12 | 143.1 | 2.2862·10^12 | 3.89
BIRCH1 (N = 100000, d = 2, p = 100), 120 sec | ℓ2² | Orig. init. | 9.277283·10^13 | 46.08 | 9.277287·10^13 | -
 | | ALA mult. | 9.746921·10^13 | - | 9.276386·10^13 | -
 | | New | 9.277282·10^13 | 31.56 | 9.274231·10^13 | 1.46
BIRCH3 (N = 100000, d = 2, p = 1000), 2400 sec | ℓ2² | Orig. init. | 4.08652·10^12 | 802.8 | 4.105231·10^12 | -
 | | ALA mult. | 4.16053·10^12 | - | 4.162683·10^12 | -
 | | New | 4.02040·10^12 | 8.81 | 4.022190·10^12 | 91.12
MissAmerica2 (N = 6480, d = 16, p = 100), 300 sec | ℓ2² | Orig. init. | 80264286 | 85.8 | 81039148 | -
 | | ALA mult. | 81869326 | - | 82316364 | -
 | | New | 79837119 | 0.752 | 80147971 | 114.1
CoverType (N = 581013, d = 54, p = 100), 2400 sec | ℓ2² | Orig. init. | 3122934.7 | 905.4 | 3146107.4 | -
 | | ALA mult. | 3163271.9 | - | 3182076.8 | -
 | | New | 3069213.6 | 53.1 | 3072299.5 | 17.05

Note: "Orig. init." is Algorithm 5 with the greedy crossover heuristic for initial seeding (Algorithm 7); "ALA mult." is the multiple start of the ALA procedure (Algorithm 4); "New" is the genetic algorithm with floating-point alphabet (Algorithm 10 with Algorithm 11 as the greedy crossover procedure).

5 Conclusion

The new genetic algorithm, based on the ideas of the p-median genetic algorithm with greedy heuristic and on two original procedures, can be used for fast and precise solving of large-scale p-median and k-means problems. For the least complex problems, the results of our method are less precise than those of the original GA with greedy heuristic proposed by Alp et al.; however, the new algorithm provides more precise results in appropriate time for large-scale problems.

References

[1] M. R. Ackermann et al. (2012) StreamKM++: A Clustering Algorithm for Data Streams, J. Exp. Algorithmics, Vol. 17, Article 2.4, DOI: 10.1145/2133803.2184450
[2] D. Aloise, A. Deshpande, P. Hansen, P. Popat (2009) NP-Hardness of Euclidean Sum-of-Squares Clustering, Machine Learning, Vol. 75, pp. 245-249, DOI: 10.1007/s10994-009-5103-0
[3] O. Alp, E. Erkut and Z. Drezner (2003) An Efficient Genetic Algorithm for the p-Median Problem, Annals of Operations Research, Vol. 122, Issue 1-4, pp. 21-42, DOI: 10.1023/A:1026130003508
[4] A. Antamoshkin, L. Kazakovtsev (2013) Random Search Algorithm for the p-Median Problem, Informatica (Ljubljana), Vol. 37, No. 3, pp. 267-278
[5] A. Antamoshkin, H. P. Schwefel, A. Toern, G. Yin, A. Zhilinskas (1993) Systems Analysis, Design and Optimization. An Introduction, Krasnoyarsk, Ofset
[6] P. Avella, M. Boccia, S. Salerno and I. Vasilyev (2012) An Aggregation Heuristic for Large Scale p-median Problem, Computers & Operations Research, Vol. 39, No. 7, pp. 1625-1632, DOI: 10.1016/j.cor.2011.09.016
[7] D. Arthur, B. Manthey and H. Röglin (2009) k-Means Has Polynomial Smoothed Complexity, Proceedings of the 2009 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS '09), IEEE Computer Society, Washington, DC, USA, pp. 405-414, DOI: 10.1109/FOCS.2009.14
[8] D. Arthur and S. Vassilvitskii (2007) k-Means++: The Advantages of Careful Seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA '07, pp. 1027-1035
[9] K. Bailey (1994) Numerical Taxonomy and Cluster Analysis, in: Typologies and Taxonomies, Sage Pubns, DOI: 10.4135/9781412986397
[10] B. Bozkaya, J. Zhang and E. Erkut (2002) A Genetic Algorithm for the p-Median Problem, in Z. Drezner and H. Hamacher (eds.), Facility Location: Applications and Theory, Springer
[11] A. V. Cabot, R. L. Francis and M. A. Stary (1970) A Network Flow Solution to a Rectilinear Distance Facility Location Problem, American Institute of Industrial Engineers Transactions, Vol. 2, pp. 132-141
[12] L. O'Callaghan, A. Meyerson, R. Motwani, N. Mishra and S. Guha (2002) Streaming-Data Algorithms for High-Quality Clustering, Proceedings of the 18th International Conference on Data Engineering (ICDE 2002)