Informatica 23 (1999) 507-511

Adapted Methods for Clustering Large Datasets of Mixed Units

Simona Korenjak-Černe
IMFM Ljubljana, Dept. of TCS, Jadranska 19, 1000 Ljubljana, Slovenia
E-mail: simona.korenjak@fmf.uni-lj.si

Keywords: clustering, large datasets, mixed units, hierarchical clustering, cluster description compatible with merging of clusters, leaders method, adding clustering method

Edited by: Cene Bavec and Matjaž Gams
Received: October 17, 1999   Revised: October 30, 1999   Accepted: December 11, 1999

The proposed clustering methods are based on recoding the original mixed units and their clusters into a uniform representation. The description of a cluster consists, for each variable, of the frequencies of the variable's values over a partition of its range. The proposed representation can also be used for clustering symbolic data. On the basis of this representation, adapted versions of the leaders method and of the adding clustering method were implemented. We describe both approaches, which were successfully applied to several large datasets.

1 Introduction

Abstraction is the main tool for dealing with large amounts of data. The first step is to identify groups of similar units, called clusters. In data analysis this is the task of clustering methods. The most popular are hierarchical clustering methods. Because they usually require a similarity/dissimilarity matrix, they are appropriate only for clustering datasets of moderate size (some hundreds of units). On the other hand, the well-known nonhierarchical methods are mostly implemented for datasets in which all variables are measured in the same scale type (for example, the k-means method). Because of these limitations we are searching for new clustering methods, or at least trying to adapt known methods, so that they are appropriate for clustering large datasets of mixed units, where the variables (properties) of the units are measured in different scales.

Let E be a finite set of units. A nonempty subset C ⊆ E is called a cluster. A set of clusters C = {C_i} forms a clustering. In this paper we require that every clustering C is a partition of E. The clustering problem can be formulated as an optimization problem: determine the clustering C* ∈ Φ for which

P(C*) = min_{C ∈ Φ} P(C),

where Φ is the set of feasible clusterings and P : Φ → ℝ⁺₀ is a criterion function. In many clustering methods the criterion function measures the deviation of the units from the representatives (leaders) of the corresponding clusters. In our method we select the criterion function in one of its most frequent forms:

P(C) = Σ_{C ∈ C} Σ_{X ∈ C} d(X, R_C),

where R_C is a representative of the cluster C and d is a dissimilarity.

The cluster representatives usually consist of variable-wise summaries of the variable values over the cluster. For homogeneous units with only numerical variables, their means are usually selected as the representatives of clusters. For mixed (nonhomogeneous) units a new description has to be selected. In this paper we investigate a description satisfying two additional requirements:

1. it should require a fixed amount of space per variable;
2. it should be compatible with merging of clusters: knowing the descriptions of two disjoint clusters we can, without additional information, produce the description of their union.

Note that only some cluster descriptions are compatible with merging, for example the mean (stored as the sum and the number of units) for numerical variables, and (min, max) intervals for ordinal variables.
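To illustrate the second requirement, here is a minimal Python sketch of a merge-compatible description for a numerical variable (the names and the representation are our own illustration, not part of the paper's programs): the mean is stored as a (sum, count) pair, so the description of the union of two disjoint clusters is obtained by componentwise addition.

```python
# A minimal sketch (hypothetical names) of a cluster description that
# is compatible with merging: the mean of a numerical variable is
# stored as the pair (sum, count), so two disjoint clusters can be
# merged without access to the original units.

def describe(values):
    """Description of a cluster: (sum of values, number of units)."""
    return (sum(values), len(values))

def merge(d1, d2):
    """Description of the union of two disjoint clusters."""
    return (d1[0] + d2[0], d1[1] + d2[1])

def mean(d):
    return d[0] / d[1]

c1 = describe([2.0, 4.0])   # cluster C1
c2 = describe([6.0])        # cluster C2
assert mean(merge(c1, c2)) == mean(describe([2.0, 4.0, 6.0]))  # 4.0
```

Storing only the mean itself would not be merge-compatible, since the mean of a union cannot be recovered from the two means without the cluster sizes.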
2 A description of a cluster

For our adaptation of the clustering methods to be appropriate for clustering large datasets of mixed units, we choose a cluster description based on frequencies. For this purpose, the range of each variable is partitioned into a selected number of classes. Let {V_i, i = 1, ..., k(V)} be a partition of the range of values of the variable V (the number of classes k(V) depends on the variable). Then we can define for a cluster C the sets

Q(i, C; V) = {X ∈ C : V(X) ∈ V_i},   i = 1, ..., k(V),

where V(X) denotes the value of the variable V on the unit X.

In the case of an ordinal variable V (numerical scales are a special case of ordinal scales) the partition {V_i, i = 1, ..., k(V)} usually consists of intervals determined by selected threshold values

t_0 < t_1 < t_2 < ... < t_{k(V)-1} < t_{k(V)},   t_0 = inf V,   t_{k(V)} = sup V.

For nominal variables we can obtain the partition, for example, by selecting k(V) − 1 values t_1, t_2, t_3, ..., t_{k(V)-1} from the range of the variable V (usually the most frequent values on E), setting V_i = {t_i}, i = 1, ..., k(V) − 1, and putting all the remaining values in the class V_{k(V)}.

Units are not necessarily represented by a single value for each variable; they can also be represented by frequencies over the classes of the variables' ranges. Using the classes of the ranges we get the frequencies

q(i, C; V) = card Q(i, C; V)

and the relative frequencies

p(i, C; V) = q(i, C; V) / card C.

Note that Σ_{i=1}^{k(V)} p(i, C; V) = 1. A missing value of the variable V on a unit can be represented by the uniform distribution [1/k(V), ..., 1/k(V)] (or by some other distribution).

It is easy to see that such a description is compatible with merging, because for two disjoint clusters C_1 and C_2 we have

Q(i, C_1 ∪ C_2; V) = Q(i, C_1; V) ∪ Q(i, C_2; V),
q(i, C_1 ∪ C_2; V) = q(i, C_1; V) + q(i, C_2; V).

The threshold values are usually determined in such a way that, for the given set of units E (or the space of units ℰ), it holds that p(i, E; V) ≈ 1/k(V), i = 1, ..., k(V). As a merge-compatible description of a nominal variable over a cluster C its range V(C) can also be used, since V(C_1 ∪ C_2) = V(C_1) ∪ V(C_2).

Example: Recoding of the flags dataset

The original data are taken from ftp://ftp.ics.uci.edu/pub/machine-learning-databases/flags (Flags from Collins Gem Guide to Flags, donated by Richard S. Forsyth). Let us consider the following three variables:

- population (in round millions),
- mainhue (the predominant color in the flag; tie-breaks are decided by taking the topmost hue, if that fails the most central hue, and if that fails the leftmost hue),
- text (1 if there are any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise).

The range of the variable population is divided into 5 classes with approximately the same number of units in each of them. The ranges of the other variables are so small that, for their discretization, we put each possible value in a separate class:

var = population      var = mainhue      var = text
1 = {0}               1 = {red}          1 = {0}
2 = (0, 4]            2 = {green}        2 = {1}
3 = (4, 18]           3 = {blue}
4 = (18, 158]         4 = {gold}
5 = (158, 1100]       5 = {white}
                      6 = {black}
                      7 = {orange}

ORIGINAL DATA

unit ID         population   mainhue   text
Austria              8       red        0
New-Zealand          2       blue       0
Saudi-Arabia         9       green      1
Switzerland          6       red        0
USA                231       white      0

RECODED DATA

unit ID         population   mainhue   text
Austria              3       1          1
New-Zealand          2       3          1
Saudi-Arabia         3       2          2
Switzerland          3       1          1
USA                  5       5          1

In our case each unit is represented, for each variable, by the index of the appropriate class.
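The recoding and the frequency description can be made concrete with the following minimal Python sketch (the function names and the bisect-based discretization are our own illustration, not the paper's program): it recodes the units above into class indices and computes q(i, C; V) for a cluster.

```python
# A minimal sketch (hypothetical names) of the recoding from the flags
# example and of the frequency description q(i, C; V) from Section 2.

import bisect

# Upper endpoints of the first four population classes
# {0}, (0,4], (4,18], (18,158], (158,1100].
POP_THRESHOLDS = [0, 4, 18, 158]
MAINHUE = ["red", "green", "blue", "gold", "white", "black", "orange"]

def recode(pop, hue, text):
    """Recode one unit into 1-based class indices, as in the paper."""
    pop_class = bisect.bisect_left(POP_THRESHOLDS, pop) + 1
    return (pop_class, MAINHUE.index(hue) + 1, text + 1)

def frequencies(cluster, var, k):
    """q(i, C; V): the number of units of the cluster in class i."""
    q = [0] * k
    for unit in cluster:
        q[unit[var] - 1] += 1
    return q

cluster = [recode(8, "red", 0),      # Austria      -> (3, 1, 1)
           recode(2, "blue", 0),     # New-Zealand  -> (2, 3, 1)
           recode(9, "green", 1),    # Saudi-Arabia -> (3, 2, 2)
           recode(6, "red", 0),      # Switzerland  -> (3, 1, 1)
           recode(231, "white", 0)]  # USA          -> (5, 5, 1)

print(frequencies(cluster, 0, 5))    # q(i, C; population) = [0, 1, 3, 0, 1]
```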
The description of the cluster C_6 (only for the considered variables) obtained with the leaders method is

q(C_6; population) = (8, 1, 1, 0, 0)
q(C_6; mainhue)    = (1, 0, 9, 0, 0, 0, 0)
q(C_6; text)       = (6, 4).

From this description we can see that in eight countries the population is less than a million, in one country it is between 1 and 4 millions, and in one country it is between 4 and 18 millions. This cluster is one of the seven clusters obtained with the adapted version of the leaders method, with the maximal allowed dissimilarity between a unit and its nearest leader set to 0.5. In one of the countries' flags red is the dominant color, while all of the remaining units have blue as mainhue. Six units have no text in their description, and the remaining four units have some text on their flags. For better understanding, the cluster C_6 consists of: Bermuda, British Virgin Islands, Cayman Islands, Falklands (Malvinas), Fiji, Hong Kong, Montserrat, St. Helena, Turks and Caicos Islands, and Tuvalu.

3 Dissimilarity between clusters

Let us return to our approach to the clustering problem as an optimization problem. After deciding to use the uniform representation of units and clusters, we have to define a measure of dissimilarity between clusters (a unit is a special case of a cluster with only one element). First, the dissimilarity between clusters with respect to an individual variable V is defined as

d(C_1, C_2; V) = (1/2) Σ_{i=1}^{k(V)} |p(i, C_1; V) − p(i, C_2; V)|.

We shall use the abbreviation d(X, C; V) = d({X}, C; V). In both cases it can be shown that:

1. d(C_1, C_2; V) is a semidistance on clusters, i.e.
   (a) d(C_1, C_2; V) ≥ 0,
   (b) d(C, C; V) = 0,
   (c) d(C_1, C_2; V) + d(C_2, C_3; V) ≥ d(C_1, C_3; V);
2. d(C_1, C_2; V) ∈ [0, 1], and for the representation of a single unit X ∈ Q(i, E; V) also

   d(X, C; V) = 1 − p(i, C; V).

The semidistances on clusters for the individual variables can be combined into a semidistance on clusters for complete descriptions by

d(C_1, C_2) = Σ_{j=1}^{m} α_j d(C_1, C_2; V_j),

where m is the number of variables and the α_j are weights (α_j ≥ 0 and Σ_j α_j = 1); often α_j = 1/m. We can use the weights to account for dependencies among the variables or to tune the dissimilarity to a given learning set in AI applications.
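For concreteness, here is a minimal Python sketch (hypothetical names, following the representation of the previous sketches) of the variable-wise dissimilarity and of its weighted combination:

```python
# A minimal sketch (hypothetical names) of the dissimilarities from
# Section 3 on clusters described by relative frequencies p(i, C; V).

def d_var(p1, p2):
    """d(C1, C2; V) = 1/2 * sum_i |p(i, C1; V) - p(i, C2; V)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))

def d(desc1, desc2, weights):
    """Weighted combination over all variables; the weights sum to 1."""
    return sum(w * d_var(p1, p2)
               for w, p1, p2 in zip(weights, desc1, desc2))

# Two clusters described over two variables with k(V1) = 3, k(V2) = 2:
c1 = [[0.5, 0.5, 0.0], [1.0, 0.0]]
c2 = [[0.0, 0.5, 0.5], [0.8, 0.2]]
print(d(c1, c2, [0.5, 0.5]))   # 0.5*0.5 + 0.5*0.2 = 0.35
```

A single unit is represented by indicator vectors (its relative frequency is 1 in its own class), so d_var then reduces to 1 − p(i, C; V), as stated above.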
4 Clustering procedures

In the proposed approach the original nonhomogeneous data are first recoded into the uniform representation. For the recoded data, efficient clustering procedures can be built by adapting the leaders method (Hartigan, 1975) or the adding clustering method (Zupan 1982, Jambu and Lebeaux 1983, Batagelj and Mandelj 1993).

4.1 The adapted version of the leaders method

The adapted version of the leaders method is a variant of the dynamic clustering method (Diday 1979, Batagelj 1985). To describe the dynamic clustering method for solving the clustering problem, let us denote: Λ a set of representatives; L ⊆ Λ a representation; Ψ a set of feasible representations; P : Φ × Ψ → ℝ⁺₀ a criterion function; G : Φ → Ψ a representation function; and F : Ψ → Φ a clustering function; and let us suppose that the functions G and F tend to improve (diminish) the value of the criterion function P. Then a simple version of the dynamic clustering method can be described by the scheme:

L := L_0;
repeat
    C := F(L);
    L := G(C)
until the leaders stabilize

We begin with an initial representation and then repeatedly assign each unit to its nearest leader and select new leaders for each (new) cluster, until we reach the minimum of the criterion function or until the leaders do not change any more (a local minimum).

Let us assume the following model:

C = {C_i}_{i ∈ I},   L = {L_i}_{i ∈ I},   L(X) = L_i for X ∈ C_i (the nearest leader to the unit X),

L = [L(V_1), ..., L(V_m)],   L(V) = [s(1, L; V), ..., s(k(V), L; V)],   Σ_{j=1}^{k(V)} s(j, L; V) = 1

(the description of a leader has the same form as the description of a cluster), and

d(C, L; V) = (1/2) Σ_{j=1}^{k(V)} |p(j, C; V) − s(j, L; V)|.

For the selected criterion function

P(C) = Σ_{X ∈ E} d(X, L(X)) = Σ_{i ∈ I} p(C_i, L_i),   where   p(C, L) = Σ_{X ∈ C} d(X, L),

we define F(L) = {C_i} with

X ∈ C_i :   i = min Argmin {d(X, L_j) : L_j ∈ L}.

This means that each unit is assigned to the (first) nearest leader. We define G(C) = {L_i'} with

L_i' = argmin_{L ∈ Λ} p(C_i, L).

The unique symmetric optimal solution of this optimization problem is

s(j, L; V) = 1/t for j ∈ M,   and   s(j, L; V) = 0 otherwise,

where M = {j : q(j, C; V) = max_i q(i, C; V)} and t = card M. The representative (leader) of a cluster is thus obtained from the most frequent range(s) of the values of the variables on this cluster. (A code sketch of one iteration is given at the end of this section.)

Example: Leader of a cluster

For the description of the cluster C_6,

q(C_6; population) = (8, 1, 1, 0, 0)
q(C_6; mainhue)    = (1, 0, 9, 0, 0, 0, 0)
q(C_6; text)       = (6, 4),

the optimal leader L_6 is

s(L_6; population) = (1, 0, 0, 0, 0)
s(L_6; mainhue)    = (0, 0, 1, 0, 0, 0, 0)
s(L_6; text)       = (1, 0).

The characteristics of the cluster are:

population = less than a million   80 %
mainhue in the flag = blue         90 %
text in the flag = no              60 %

For example, 80% of all the countries in the cluster C_6 have less than a million inhabitants, 90% of all the countries' flags have blue as mainhue, and 60% of the flags in the cluster have no text in their description.

Properties of the leaders method

The main properties of the adapted version of the leaders method are:

1. The selection of the leaders and the formation of new clusters diminish the value of the criterion function.
2. The program always stops (converges). The number of iterations is usually less than 10.
3. The program is suitable for clustering (very) large datasets.
4. The leaders' descriptions provide us with simple interpretations of the clustering results.
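The following minimal Python sketch (hypothetical names; units and leaders are lists of relative-frequency vectors, one per variable, as in the previous sketches) shows one iteration of the adapted leaders method: F assigns each unit to the first nearest leader, and G recomputes each leader by placing the weight 1/t on the t modal classes of each variable.

```python
# A minimal sketch (hypothetical names) of one iteration of the
# adapted leaders method from Section 4.1.

def d_var(p1, p2):
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))

def d(x, leader, weights):
    return sum(w * d_var(p, s) for w, p, s in zip(weights, x, leader))

def assign(units, leaders, weights):
    """F(L): assign each unit to the (first) nearest leader."""
    clusters = [[] for _ in leaders]
    for x in units:
        i = min(range(len(leaders)), key=lambda j: d(x, leaders[j], weights))
        clusters[i].append(x)
    return clusters

def new_leader(cluster):
    """G(C): put the weight 1/t on the t modal classes of each variable."""
    leader = []
    for v in range(len(cluster[0])):
        q = [sum(x[v][i] for x in cluster) for i in range(len(cluster[0][v]))]
        modal = [i for i, f in enumerate(q) if f == max(q)]
        leader.append([1 / len(modal) if i in modal else 0.0
                       for i in range(len(q))])
    return leader

# Three units over one variable with k(V) = 3 (indicator vectors):
units = [[[1, 0, 0]], [[0, 1, 0]], [[0, 1, 0]]]
leaders = [units[0], units[1]]   # initial representation L0
for _ in range(10):              # a fixed iteration cap stands in for
    clusters = assign(units, leaders, [1.0])   # the until-stable loop
    leaders = [new_leader(c) for c in clusters if c]
```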
4.2 The adapted adding method

The adding clustering method is a hierarchical clustering method in which new units are added into a clustering tree. Each vertex corresponds to a cluster. For large datasets usually only the upper part of the hierarchy is maintained; the lower-level subtrees are replaced by 'bags' containing all the units from a subtree. We use the same description of a cluster (vertex) and the same definition of dissimilarity as in the leaders method. Every time we add a unit into a cluster (vertex), the frequencies are recalculated. There are two possible ways to add a new unit:

a) to maximize the dissimilarity between the clusters (sons) of the current vertex, or
b) to minimize the dissimilarity from the clusters (sons) of the current vertex.

In the first case (see Figure 1) the dissimilarities between both sons of the current vertex are calculated for each possible placement of the new unit X. Following the greedy approach, the placement with the largest dissimilarity is chosen:

max {d(C_p ∪ {X}, C_q), d(C_p, C_q ∪ {X}), d(C, {X})}.

[Figure 1: Maximize the dissimilarity between clusters]

In the second case (see Figure 2) the dissimilarities from each of the sons of the current vertex are calculated and the unit is added to the nearest one:

min {d(C_p, {X}), d(C_q, {X})}.

[Figure 2: Minimize the dissimilarity from clusters]

The proposed approaches can also be extended to non-binary trees. The adding clustering method has some advantages:

1. The result is presented with a tree.
2. It can be used for classification.
3. Speed-up: if the tree has many (hundreds of) leaves representing the leaders, it is more efficient to add a unit into the tree with this method than to calculate the dissimilarities to each of the leaders.

A drawback of the adding method is that the result strongly depends on the ordering of the input sequence of units. A possible way to avoid this problem is to select a 'good' initial tree. We suggest building the initial tree with some agglomerative hierarchical clustering method on the leaders obtained with the leaders method. Another possibility is to include rebalancing of the tree in the process of adding new units. Both possibilities are still under development.

5 Conclusion

We successfully applied the proposed approach on a dataset of types of cars (1 349 units, 26 variables), on the ISSP data (45 784 units, 21 variables), and also on some large datasets from the AI collection http://www.ics.uci.edu/~mlearn/MLRepository.html. The first version of the program ClaMix (based on the adapted version of the leaders method) and some of the results are available at http://www.educa.fmf.uni-lj.si/datana/.

Acknowledgment: This work was supported by the Ministry of Science and Technology of Slovenia, Project J1-8532.

References

[1] Batagelj, V. (1985) Notes on the dynamic clusters method. Proceedings of the IV Conference on Applied Mathematics, Split, May 28-30, 1984. University of Split, Split, p. 139-146.

[2] Batagelj, V. & Bren, M. (1995) Comparing Resemblance Measures. Journal of Classification, 12, 1, p. 73-90.

[3] Batagelj, V. & Mandelj, M. (1993) Adding Clustering Algorithm Based on L-W-J Formula. Paper presented at IFCS 93, Paris, 31 August - 4 September 1993.

[4] Brucker, P. (1978) On the complexity of clustering problems. Lecture Notes in Economics and Mathematical Systems 175: Optimization and Operations Research, Proceedings, Bonn. Henn, R., Korte, B., Oettli, W. (Eds.), Springer-Verlag, Berlin 1978.

[5] Diday, E. (1979) Optimisation en classification automatique, Tomes 1, 2. INRIA, Rocquencourt (in French).

[6] Diday, E. (1997) Extracting Information from Extensive Datasets by Symbolic Data Analysis. Indo-French Workshop on Symbolic Data Analysis and its Applications, Paris, 23-24 September 1997, Paris IX, Dauphine, p. 3-12.

[7] Hartigan, J.A. (1975) Clustering Algorithms. Wiley, New York.

[8] Jambu, M. & Lebeaux, M.O. (1983) Cluster Analysis and Data Analysis. North-Holland Publishing Company.

[9] Korenjak-Černe, S. & Batagelj, V. (1998) Clustering large datasets of mixed units. Advances in Data Science and Classification. Rizzi, A., Vichi, M. and Bock, H.-H. (Eds.), Springer, Berlin, 1998, p. 43-48.

[10] Tukey, J.W. (1977) Exploratory Data Analysis. Addison-Wesley, Reading, MA.

[11] Zupan, J. (1982) Clustering of Large Data Sets. Research Studies Press, John Wiley & Sons Ltd.

[12] Flags from Collins Gem Guide to Flags. Collins Publishers (1986). Donated by Richard S. Forsyth. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/flags