Metodološki zvezki, Vol. 17, No. 1, 2020, 33–54

Associations among similarity and distance measures for binary data in cluster analysis

Jana Cibulková¹, Zdeněk Šulc², Hana Řezanková³, Sergej Sirota⁴

¹ Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic; jana.cibulkova@vse.cz
² Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic; zdenek.sulc@vse.cz
³ Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic; hana.rezankova@vse.cz
⁴ Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic; xsirs00@vse.cz

Abstract

The paper focuses on similarity and distance measures for binary data and their application in cluster analysis. There are 66 measures for binary data analyzed in the paper in order to provide a comprehensive insight into the problematics and to create a well-arranged overview of them. For this purpose, the formulas by which they were defined are studied. In the next part of the research, the results of object clustering on generated datasets are compared, and the ability of the measures to create similar or identical clustering solutions is evaluated. This is done by using chosen internal and external evaluation criteria, and by comparing the assignments of objects into clusters in the process of hierarchical clustering. The paper shows which similarity measures and distance measures for binary data lead to similar or even identical results in hierarchical cluster analysis.

1 Introduction

A binary vector is one of the most common representations of patterns in datasets. Therefore, many similarity and distance measures for binary data have been proposed over the years. They are usually well examined, and they are often implemented in both commercial (e.g., SPSS, MatLab) and non-commercial (e.g., R, Python) software. This paper focuses on associations among selected similarity measures and distance measures for binary data in the field of cluster analysis. Similarity measures and distance measures for binary data are routinely used not only in the cluster analysis of binary data but in clustering nominal data as well. In fact, it is considered a standard procedure to perform a binary transformation when clustering nominal data.

The demand for processing binary (or nominal) data has led to numerous similarity measures and distance measures for binary data being proposed over the years in various fields for various purposes. For example, the Jaccard similarity measure was designed for clustering flowers (Jaccard, 1901), while Forbes (1925) proposed a coefficient for clustering ecologically related species of fishes. However, an enormous quantity of available measures for binary data might be counterproductive, and it can cause confusion. Duplicities can easily occur, there might be a functional relationship between many measures for binary data, a certain measure can be referred to by several different names in the literature, etc.

On the one hand, there are many authors who defined their own measures for binary data, often by adjusting existing ones for the needs of a specific study or not knowing that the measure had already been defined before (maybe in a different field of study). On the other hand, only a handful of authors studied and compared these measures among each other. Jackson et al. (1989) compared eight binary similarity measures before choosing the best measure for their ecological study. Hubálek (1982) collected and studied 43 measures.
Twenty of them were used for cluster analysis on real data to produce five clusters of related measures. His study proved that many measures might lead to similar clustering solutions. Todeschini et al. (2012) studied similarity measures for binary data on a chemoinformatics dataset. Based on their study, they claim that binary data measures are often linearly dependent and, thus, that the majority of them produce the same clusters. The best summary of similarity measures and distance measures for binary data so far was created by Choi et al. (2010). They collected 76 measures used over the last century and revealed their relationships through a simple hierarchical clustering of values that were calculated by applying the given measures on the same dataset. All of these authors provide some kind of survey of similarity measures; however, their studies contained a limited number of similarity measures or were applied to one specific dataset. Furthermore, no one (except Hubálek, 1982) studied the measures' formulas that create these associations in the first place.

This paper uses the studies of the previously mentioned authors as a baseline for further research that aims to be more robust and grounded in the mathematics and definitions that form these associations among measures for binary data. The paper aims to determine which measures for binary data lead to different clustering results and which ones lead to similar or even identical results of cluster analysis. Firstly, the theory of generating datasets, the clustering method, and the evaluation criteria are briefly presented in Section 2; similarity and distance measures for binary data are also introduced in this section. Section 3 is focused on revealing associations among the measures in their formulas. We expect the formulas that are functions of the same statistics to lead to the same or similar results. This expectation is verified and analyzed later in the paper on clustering solutions of generated datasets. The measures are used in hierarchical clustering of generated datasets in Section 4. The results of object clustering on generated datasets are compared, the quality of the clustering solutions is evaluated, and the ability of the measures to create similar or identical clusters is examined. This approach makes our research different from the previous research mentioned above.

2 Theoretical background

The theoretical background for the experiment, such as the data generating process, similarity and distance measures for binary data, the chosen clustering method, and the evaluation criteria, is briefly introduced in this section. The vocabulary presented in this section is used in the rest of the paper.

2.1 Data generator

Numerous datasets with a given specific set of features are needed in order to make the results of the analysis reliable. For this purpose, a data generator suitable for the needs of the experiment is used (Cibulková and Řezanková, 2018). The data generator was designed mainly for generating multivariate datasets appropriate for cluster analysis.

A generated dataset consists of a given number of clusters, where each cluster corresponds to one sample of a given multivariate distribution. The idea of a dataset being a mixture of several samples from given multivariate distributions follows the logic of finite mixture models from model-based clustering. Finite mixture models assume that the population is made up of several distinct clusters, each following a different multivariate probability density distribution (Stahl and Sallis, 2012).
Therefore, the problem of generating datasets with given features can actually be reduced to generating samples from given multivariate distributions.

The algorithm for generating samples from given multivariate distributions is inspired by the NORTA (NORmal-To-Anything) transformation in combination with Cholesky's decomposition. The NORTA algorithm, following Cario and Nelson (1997), is used in order to generate samples from a given multivariate distribution. The Cholesky matrix transforms a vector of uncorrelated normally-distributed random variables into a vector of correlated normally-distributed random variables.

1. Generate a multivariate vector of uncorrelated normally-distributed random variables $Z^{ind} = (Z_1^{ind}, Z_2^{ind}, \ldots, Z_n^{ind})$.
2. Suppose that $(\rho_{ij})_{i,j=1}^{n}$ is the given correlation matrix. Positive definiteness of the matrix is verified, and (if necessary) the matrix is adjusted to the closest positive definite correlation matrix $(\rho^{*}_{ij})_{i,j=1}^{n}$.
3. Get a multivariate standard-normal random vector $Z = (Z_1, Z_2, \ldots, Z_n)$ such that $\mathrm{Corr}(Z_i, Z_j) = \rho^{*}_{ij}$ for $1 \le i, j \le n$, using Cholesky's decomposition following the algorithm described in Higham (2009).
4. Compute $U_i = \Phi(Z_i)$ for $i = 1, 2, \ldots, n$, where $\Phi(\cdot)$ is the standard-normal cumulative distribution function.
5. Compute $X_i = F_i^{-1}(U_i)$ for $i = 1, 2, \ldots, n$, where $F_i^{-1}$ is the inverse of the given marginal cumulative distribution function $F_i$.

Assuming each cluster in the dataset is generated from a given multivariate distribution, the generated dataset is a mixture of several samples obtained by this approach. This generator allows us to generate numerous datasets with desired features covering a wide range of dataset types, making the results of the analysis more robust.

2.2 Hierarchical clustering method and evaluation criteria

Agglomerative hierarchical cluster analysis is used in this paper. Its algorithm starts with each object in its own cluster, and at each step the nearest two clusters are combined into a higher-level cluster (Sokal and Michener, 1958).

The average linkage method was chosen for the experiment since it can be seen as a compromise between the sensitivity to outliers of the complete linkage method and the tendency of the single linkage method to form long chains (which do not correspond to the intuitive notion of clusters as compact, spherical objects). This method takes the average pairwise dissimilarity between objects in two different clusters. Let us denote by $D_{average}(C_p, C_q)$ the distance between clusters $C_p$ and $C_q$, with $n_p$ objects in the $p$-th cluster and $n_q$ objects in the $q$-th cluster. Then the dissimilarity between two clusters follows the formula:

$$D_{average}(C_p, C_q) = \frac{\sum_{x_i \in C_p} \sum_{x_j \in C_q} D(x_i, x_j)}{n_p n_q}, \tag{2.1}$$

where $D(x_i, x_j)$ is the distance between objects $x_i$ and $x_j$.

Since the analyses are performed on the generated datasets, and thus the objects' cluster memberships are known, the produced clusters can be evaluated using both internal and external evaluation criteria.

The purity is an external evaluation criterion of cluster quality. It is the percentage of the total number of objects that were classified correctly; the larger the purity, the better the clustering performance. Let us suppose there are $l$ classes (real clusters) while the clustering method generates $k$ clusters; then the purity of the clustering solution with respect to the known classes is given by the formula:

$$purity = \frac{1}{n} \sum_{q=1}^{k} \max_{p}\left(n_q^p\right), \tag{2.2}$$

where $n$ is the total number of objects and $n_q^p$ is the number of objects in the $q$-th cluster that belong to the $p$-th class $(p = 1, 2, \ldots, l)$.
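To make the purity computation concrete, here is a minimal sketch in R (the language used for the paper's calculations, cf. Section 4.1); the function name `purity` and the example labels are illustrative, not part of the original study:

```r
# Minimal sketch of the purity computation from (2.2); the function name and
# the example labels are illustrative only.
purity <- function(clusters, classes) {
  tab <- table(clusters, classes)        # contingency table of n_q^p counts
  sum(apply(tab, 1, max)) / length(clusters)
}

# Example: 10 objects, k = 2 clusters, l = 2 known classes
clusters <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
classes  <- c("A", "A", "A", "B", "B", "B", "B", "B", "A", "A")
purity(clusters, classes)                # (3 + 4) / 10 = 0.7
```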
The entropy is another external evaluation criterion. It is a measure of uncertainty, and the smaller the entropy, the better the clustering performance (Shannon, 1948). The entropy of the clustering solution with respect to the known classes follows the formula:

$$entropy = -\frac{1}{n \log_2 l} \sum_{q=1}^{k} \sum_{p=1}^{l} n_q^p \log_2 \frac{n_q^p}{n_q}, \tag{2.3}$$

where $n_q$ is the total number of objects in the $q$-th cluster $(q = 1, 2, \ldots, k)$.

The Dunn index is an internal evaluation criterion (Dunn, 1974). It is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. The Dunn index takes on values between zero and infinity and should be maximized. Let us denote a particular clustering partition $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$ of $n$ objects into $k$ disjoint clusters. Then the Dunn index is computed as:

$$Dunn\ index = \frac{\min_{C_q, C_p \in \mathcal{C};\, C_q \neq C_p} \left( \min_{x_i \in C_q,\, x_j \in C_p} D(x_i, x_j) \right)}{\max_{C_q \in \mathcal{C}} \operatorname{diam}(C_q)}, \tag{2.4}$$

where $\operatorname{diam}(C_q)$ is the maximum distance between observations in cluster $C_q$.

The silhouette coefficient is an internal evaluation criterion which is calculated as the average of each object's silhouette width (which is usually displayed as the width in the silhouette graph). The silhouette width measures the degree of confidence in the clustering assignment of a particular object (Rousseeuw, 1987). Well-clustered objects have values near 1, and poorly clustered objects have values near −1. For object $x_i$, it is defined as:

$$S(i) = \frac{b_i - a_i}{\max(b_i, a_i)}, \tag{2.5}$$

where $a_i$ is the average distance between object $x_i$ and all other objects in the same cluster, and $b_i$ is the average distance between $x_i$ and the objects in the nearest neighboring cluster, following the formula $b_i = \min_{C_k \in \mathcal{C};\, C_k \neq C(i)} \frac{\sum_{x_j \in C_k} D(x_i, x_j)}{n_k}$; here $C(i)$ is the cluster containing observation $x_i$, $D(x_i, x_j)$ is the distance between observations $x_i$ and $x_j$, and $n_k$ is the number of objects in cluster $C_k$. The average silhouette width thus lies in the interval $[-1, 1]$ and should be maximized.

2.3 Similarity and distance measures for binary data

Let us denote the data matrix $X = [x_{ic}]$, where $i = 1, 2, \ldots, n$ and $c = 1, 2, \ldots, m$; $n$ is the total number of objects and $m$ is the total number of variables. Suppose that two objects, $x_i$ and $x_j$, are represented in binary vector form. The symbols used for the numbers of variables with certain combinations of categories for objects, as shown in Table 1, are used for the definitions of binary distance measures in this paper (Dunn and Everitt, 1982). In Table 1, $a$ is the number of variables where the values of $x_i$ and $x_j$ are both equal to 1, meaning "positive matches"; $b$ is the number of variables where the values of $x_i$ and $x_j$ are $(0, 1)$, meaning "$x_i$ absence mismatches"; $c$ is the number of variables where the values of $x_i$ and $x_j$ are $(1, 0)$, meaning "$x_j$ absence mismatches"; and $d$ is the number of variables where both $x_i$ and $x_j$ are equal to 0, meaning "negative matches". Then $m = a + b + c + d$.

Table 1: Symbols used for the numbers of variables with certain combinations of categories for objects $x_i$ and $x_j$

| $x_i \backslash x_j$ | 1 (presence) | 0 (absence) |
|---|---|---|
| 1 (presence) | $a$ | $c$ |
| 0 (absence) | $b$ | $d$ |

Tables 2–4 provide an overview of the formulas of the 66 similarity measures or distance measures for binary data. The main source for the formulas was Choi et al. (2010). The second column in Tables 2–4 gives a definition of a measure using the symbols from Table 1. The third column determines whether a measure is defined as a similarity measure or a distance measure.
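As an illustration of Table 1, the following R sketch counts $a$, $b$, $c$ and $d$ for two binary vectors and evaluates one measure from the overview; the input vectors and the function name are hypothetical examples, not data from the study:

```r
# Counts a, b, c, d from Table 1 for two binary objects x_i and x_j, and the
# JACCARD similarity a / (a + b + c); the input vectors are hypothetical.
binary_counts <- function(xi, xj) {
  c(a = sum(xi == 1 & xj == 1),   # positive matches
    b = sum(xi == 0 & xj == 1),   # x_i absence mismatches
    c = sum(xi == 1 & xj == 0),   # x_j absence mismatches
    d = sum(xi == 0 & xj == 0))   # negative matches
}

xi <- c(1, 1, 0, 1, 0, 0, 1, 0)
xj <- c(1, 0, 0, 1, 1, 0, 1, 0)
n  <- binary_counts(xi, xj)                  # a = 3, b = 1, c = 1, d = 3
unname(n["a"] / (n["a"] + n["b"] + n["c"]))  # JACCARD = 3/5 = 0.6
```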
The transformation from a similarity measure $S(x_i, x_j)$ into a distance measure $D(x_i, x_j)$, which is necessary in order to be able to calculate a proximity matrix in the clustering process, is inspired by Deza and Deza (2013):

• If a similarity measure takes values from the interval $[0, 1]$, then the corresponding distance measure is obtained by the transformation $1 - S(x_i, x_j)$.
• If a similarity measure takes values from the interval $[-1, 1]$, then the corresponding distance measure is obtained by the transformation $\frac{1 - S(x_i, x_j)}{2}$.
• If a similarity measure takes values from a different interval, say $[\min, \max]$, then the similarity measure is first transformed by min-max normalization, so that its values lie in the interval $[0, 1]$. The corresponding distance measure is then obtained by subtracting the min-max normalized similarity measure from 1.
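A minimal R sketch of these three transformation rules; the function `sim_to_dist` and its arguments are illustrative, with `lo` and `hi` denoting the assumed bounds of the similarity measure's range:

```r
# Similarity-to-distance transformation following the three rules above;
# lo and hi are the bounds of the similarity measure's range (illustrative).
sim_to_dist <- function(s, lo, hi) {
  if (lo == 0 && hi == 1)  return(1 - s)        # S in [0, 1]
  if (lo == -1 && hi == 1) return((1 - s) / 2)  # S in [-1, 1]
  1 - (s - lo) / (hi - lo)                      # min-max normalization first
}

sim_to_dist(0.6, 0, 1)    # e.g., a JACCARD value      -> distance 0.4
sim_to_dist(0.5, -1, 1)   # e.g., a correlation-type S -> distance 0.25
```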
Table 2: Similarity/distance measures for binary data overview – part 1

| Measure | Formula | Type |
|---|---|---|
| JACCARD | $\frac{a}{a+b+c}$ | similarity |
| DICE | $\frac{2a}{2a+b+c}$ | similarity |
| CZEKANOWSKI | $\frac{2a}{2a+b+c}$ | similarity |
| JACCARD 3W | $\frac{3a}{3a+b+c}$ | similarity |
| NEI LI | $\frac{2a}{(a+b)+(a+c)}$ | similarity |
| SOKAL SNEATH I | $\frac{a}{a+2b+2c}$ | similarity |
| SOKAL MICHENER | $\frac{a+d}{a+b+c+d}$ | similarity |
| SOKAL SNEATH II | $\frac{2a+2d}{2a+b+c+2d}$ | similarity |
| ROGER TANIMOTO | $\frac{a+d}{a+2(b+c)+d}$ | similarity |
| FAITH | $\frac{a+0.5d}{a+b+c+d}$ | similarity |
| GOWER LEGENDRE | $\frac{a+d}{a+0.5(b+c)+d}$ | similarity |
| INTERSECTION | $a$ | similarity |
| INNERPRODUCT | $a+d$ | similarity |
| RUSSEL RAO | $\frac{a}{a+b+c+d}$ | similarity |
| HAMMING | $b+c$ | distance |
| EUCLID | $\sqrt{b+c}$ | distance |
| SQUARED EUCLID | $\sqrt{(b+c)^2}$ | distance |
| CANBERRA | $(b+c)^{2/2}$ | distance |
| MANHATTAN | $b+c$ | distance |
| MEAN MANHATTAN | $\frac{b+c}{a+b+c+d}$ | distance |
| CITY BLOCK | $b+c$ | distance |

Table 3: Similarity/distance measures for binary data overview – part 2

| Measure | Formula | Type |
|---|---|---|
| MINKOWSKI | $(b+c)^{1/1}$ | distance |
| VARI | $\frac{b+c}{4(a+b+c+d)}$ | distance |
| SIZE DIFFERENCE | $\frac{(b+c)^2}{(a+b+c+d)^2}$ | distance |
| SHAPE DIFFERENCE | $\frac{m(b+c)-(b-c)^2}{(a+b+c+d)^2}$ | distance |
| PATTERN DIFFERENCE | $\frac{4bc}{(a+b+c+d)^2}$ | distance |
| LANCE WILLIAMS | $\frac{b+c}{2a+b+c}$ | distance |
| BRAY CURTIS | $\frac{b+c}{2a+b+c}$ | distance |
| HELLINGER | $2\sqrt{1-\frac{a}{\sqrt{(a+b)(a+c)}}}$ | distance |
| CHORD | $\sqrt{2\left(1-\frac{a}{\sqrt{(a+b)(a+c)}}\right)}$ | distance |
| COSINE | $\frac{a}{\left(\sqrt{(a+b)(a+c)}\right)^2}$ | similarity |
| GILBERT WELLS | $\log a - \log m - \log\frac{a+b}{m} - \log\frac{a+c}{m}$ | similarity |
| OCHIAI I | $\frac{a}{\sqrt{(a+b)(a+c)}}$ | similarity |
| FORBESI | $\frac{ma}{(a+b)(a+c)}$ | similarity |
| FOSSUM | $\frac{m(a-0.5)^2}{(a+b)(a+c)}$ | similarity |
| SORGENFREI | $\frac{a^2}{(a+b)(a+c)}$ | similarity |
| MOUNTFORD | $\frac{a}{0.5(ab+ac)+bc}$ | similarity |
| OTSUKA | $\frac{a}{((a+b)(a+c))^{0.5}}$ | similarity |
| MCCONNAUGHEY | $\frac{a^2-bc}{(a+b)(a+c)}$ | similarity |
| TARWID | $\frac{ma-(a+b)(a+c)}{ma+(a+b)(a+c)}$ | similarity |
| KULCZYNSKI II | $\frac{\frac{a}{2}(2a+b+c)}{(a+b)(a+c)}$ | similarity |
| DRIVER KROEBER | $\frac{a}{2}\left(\frac{1}{a+b}+\frac{1}{a+c}\right)$ | similarity |
| JOHNSON | $\frac{a}{a+b}+\frac{a}{a+c}$ | similarity |
| DENNIS | $\frac{ad-bc}{\sqrt{m(a+b)(a+c)}}$ | similarity |
| SIMPSON | $\frac{a}{\min(a+b,\,a+c)}$ | similarity |
| BRAUN BANQUET | $\frac{a}{\max(a+b,\,a+c)}$ | similarity |
| FAGER MCGOWAN | $\frac{a}{\sqrt{(a+b)(a+c)}}-\frac{\max(a+b,\,a+c)}{2}$ | similarity |
| FORBES II | $\frac{ma-(a+b)(a+c)}{m\min(a+b,\,a+c)-(a+b)(a+c)}$ | similarity |

Table 4: Similarity/distance measures for binary data overview – part 3

| Measure | Formula | Type |
|---|---|---|
| SOKAL SNEATH IV | $\frac{1}{4}\left(\frac{a}{a+b}+\frac{a}{a+c}+\frac{d}{b+d}+\frac{d}{c+d}\right)$ | similarity |
| GOWER | $\frac{a+d}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| PEARSON I | $\chi^2$, where $\chi^2 = \frac{m(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}$ | similarity |
| PEARSON II | $\left(\frac{\chi^2}{m+\chi^2}\right)^{\frac{1}{2}}$ | similarity |
| PEARSON III | $\left(\frac{\rho}{m+\rho}\right)^{\frac{1}{2}}$, where $\rho = \frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| PEARSON HERON I | $\rho = \frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| PEARSON HERON II | $\cos\left(\frac{\pi\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\right)$ | similarity |
| SOKAL SNEATH III | $\frac{a+d}{b+c}$ | similarity |
| SOKAL SNEATH V | $\frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| COLE | $\frac{\sqrt{2}(ad-bc)}{\sqrt{(ad-bc)^2-(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| STILES | $\log_{10}\frac{m\left(|ad-bc|-\frac{m}{2}\right)^2}{(a+b)(a+c)(b+d)(c+d)}$ | similarity |
| OCHIAI II | $\frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| YULE Q | $\frac{2bc}{ad+bc}$ | distance |
| YULE W | $\frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}$ | similarity |
| KULCZYNSKI I | $\frac{a}{b+c}$ | similarity |
| TANIMOTO | $\frac{a}{(a+b)+(a+c)-a}$ | similarity |
| HAMANN | $\frac{(a+d)-(b+c)}{a+b+c+d}$ | similarity |
| MICHAEL | $\frac{4(ad-bc)}{(a+d)^2+(b+c)^2}$ | similarity |

3 Associations in formulas

It is possible to detect several groups of measures defined in Section 2.3 where functional relationships among the measures exist. The formulas were studied and compared in order to uncover any functional relationships among them, and various associations were discovered. These associations may provide a first categorization of similarity and distance measures for binary data. We may consider four groups of measures:

1. Euclid-based measures,
2. Pearson-based measures,
3. Hellinger-based measures,
4. other measures.

3.1 Euclid-based measures

Numerous measures for binary data proved to be dependent on the Euclid distance. Let us denote the squared Euclid distance measure as

$$eu = (b + c), \tag{3.1}$$

where $b$ and $c$ are the symbols used for the numbers of variables with certain combinations of categories for objects $x_i$ and $x_j$ described in Table 1. Table 5 shows how some of the measures presented in Section 2.3 depend on $eu$. The first column contains a measure's name; the measure is expressed in relation to the squared Euclid distance measure in the second column. If there is any restriction on the relationship between a measure from the first column and the squared Euclid distance measure, this restriction is noted in the third column.

Table 5: Association among the measures based on the Euclid distance

| Measure | Formula | Condition |
|---|---|---|
| SQUARED EUCLID | $eu$ | – |
| EUCLID | $\sqrt{eu}$ | – |
| HAMMING | $eu$ | – |
| CANBERRA | $eu$ | – |
| MANHATTAN | $eu$ | – |
| MEAN MANHATTAN | $\frac{eu}{m}$ | – |
| CITY BLOCK | $eu$ | – |
| MINKOWSKI | $eu$ | – |
| VARI | $\frac{eu}{4m}$ | – |
| SOKAL MICHENER | $1 - \frac{eu}{m}$ | – |
| INNERPRODUCT | $m - eu$ | – |
| SIZE DIFFERENCE | $\frac{eu^2}{m^2}$ | – |
| HAMANN | $\frac{m - 2eu}{m}$ | – |
| GOWER LEGENDRE | $\frac{2(m - eu)}{2m - eu}$ | – |
| SOKAL SNEATH II | $\frac{2(m - eu)}{2m - eu}$ | – |
| SOKAL SNEATH III | $\frac{m}{eu} - 1$ | – |
| ROGER TANIMOTO | $\frac{m - eu}{m + eu}$ | – |
| FAITH | $\frac{m - eu - 0.5d}{m}$ | – |
| SHAPE DIFFERENCE | $\frac{eu}{m}$ | if $b = c$ |
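The relations in Table 5 can be verified numerically; a short R sketch with hypothetical counts $a$, $b$, $c$, $d$ (not data from the experiment):

```r
# Numerical check of two rows of Table 5 on hypothetical counts.
a <- 3; b <- 1; c <- 1; d <- 3
m  <- a + b + c + d               # m = 8
eu <- b + c                       # squared Euclid distance, eu = 2

sokal_michener <- (a + d) / (a + b + c + d)        # definition from Table 2
all.equal(sokal_michener, 1 - eu / m)              # TRUE

roger_tanimoto <- (a + d) / (a + 2 * (b + c) + d)  # definition from Table 2
all.equal(roger_tanimoto, (m - eu) / (m + eu))     # TRUE
```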
3.2 Pearson-based measures

Other measures for binary data proved to be dependent on the Pearson correlation coefficient. Let us denote

$$\rho = \frac{ad - bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}, \tag{3.2}$$

where $a, b, c, d$ are the symbols used for the numbers of variables with certain combinations of categories for objects $x_i$ and $x_j$ described in Table 1. The measures that are based on the Pearson correlation coefficient can be rewritten using $\rho$ in their formulas, as shown in Table 6. Measures' names are in the first column; the measures are expressed in relation to $\rho$ in the second column.

Table 6: Association among the measures based on the Pearson correlation coefficient

| Measure | Formula |
|---|---|
| PEARSON HERON I | $\rho$ |
| PEARSON I | $m\rho^2$ |
| PEARSON II | $\left(\frac{m\rho^2}{m + m\rho^2}\right)^{\frac{1}{2}}$ |
| PEARSON III | $\left(\frac{\rho}{m + \rho}\right)^{\frac{1}{2}}$ |

3.3 Hellinger-based measures

Some distance and similarity measures for binary data are dependent on the Hellinger distance. Let us denote

$$hel = 2\sqrt{1 - \frac{a}{\sqrt{(a+b)(a+c)}}}, \tag{3.3}$$

where $a$, $b$ and $c$ are the symbols used for the numbers of variables with certain combinations of categories for objects $x_i$ and $x_j$ described in Table 1. The formulas of these measures demonstrate the dependency in Table 7. Measures' names are in the first column; the measures are expressed in relation to the Hellinger distance in the second column.

Table 7: Association among the measures based on the Hellinger distance

| Measure | Formula |
|---|---|
| HELLINGER | $hel$ |
| CHORD | $\frac{hel}{\sqrt{2}}$ |
| OCHIAI I | $1 - \left(\frac{hel}{2}\right)^2$ |
| OTSUKA | $1 - \left(\frac{hel}{2}\right)^2$ |
| FAGER MCGOWAN | $1 - \left(\frac{hel}{2}\right)^2 - \frac{\max(a+b,\,a+c)}{2}$ |
| SORGENFREI | $\left[1 - \left(\frac{hel}{2}\right)^2\right]^2$ |
| FORBESI | $\frac{m}{a}\left[1 - \left(\frac{hel}{2}\right)^2\right]^2$ |

3.4 Other measures and associations among them

Several groups of up to four similarity or distance measures from Section 2.3 show some kind of dependency among themselves.

The JACCARD similarity measure is identical to the TANIMOTO similarity measure:

$$\mathrm{JACCARD} = \frac{a}{a+b+c} = \mathrm{TANIMOTO} = \frac{a}{(a+b)+(a+c)-a}. \tag{3.4}$$

The similarity measures DICE, CZEKANOWSKI and NEI LI are identical:

$$\mathrm{DICE} = \frac{2a}{2a+b+c} = \mathrm{CZEKANOWSKI} = \frac{2a}{2a+b+c} = \mathrm{NEI\ LI} = \frac{2a}{(a+b)+(a+c)}. \tag{3.5}$$

Another pair of similarity measures that are defined by the same formula is LANCE WILLIAMS and BRAY CURTIS:

$$\mathrm{LANCE\ WILLIAMS} = \frac{b+c}{2a+b+c} = \mathrm{BRAY\ CURTIS} = \frac{b+c}{2a+b+c}. \tag{3.6}$$

Moreover, if $2a = b + c$, then all five measures DICE, CZEKANOWSKI, NEI LI, LANCE WILLIAMS and BRAY CURTIS would be equal to each other.

There is a linear dependency between the similarity measures INTERSECTION and RUSSEL RAO:

$$\mathrm{INTERSECTION} = a = m \cdot \mathrm{RUSSEL\ RAO} = m\left(\frac{a}{a+b+c+d}\right). \tag{3.7}$$

The similarity measures KULCZYNSKI II and DRIVER KROEBER are identical, and they are linearly dependent with the JOHNSON similarity measure:

$$\mathrm{KULCZYNSKI\ II} = \frac{\frac{a}{2}(2a+b+c)}{(a+b)(a+c)} = \mathrm{DRIVER\ KROEBER} = \frac{a}{2}\left(\frac{1}{a+b} + \frac{1}{a+c}\right) = 0.5\,\mathrm{JOHNSON} = 0.5\left(\frac{a}{a+b} + \frac{a}{a+c}\right). \tag{3.8}$$

The similarity measures SOKAL SNEATH V and OCHIAI II are identical:

$$\mathrm{SOKAL\ SNEATH\ V} = \frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}} = \mathrm{OCHIAI\ II} = \frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}. \tag{3.9}$$

If $d = 0$, then the measures KULCZYNSKI II and SOKAL SNEATH IV become linearly dependent (SOKAL SNEATH IV then equals one half of KULCZYNSKI II):

$$\mathrm{KULCZYNSKI\ II} = \frac{\frac{a}{2}(2a+b+c)}{(a+b)(a+c)}, \qquad \mathrm{SOKAL\ SNEATH\ IV} = \frac{1}{4}\left(\frac{a}{a+b} + \frac{a}{a+c} + \frac{d}{b+d} + \frac{d}{c+d}\right). \tag{3.10}$$

If $d = 0$, then the similarity measures KULCZYNSKI I and SOKAL SNEATH III are identical:

$$\mathrm{SOKAL\ SNEATH\ III} = \frac{a+d}{b+c} = \mathrm{KULCZYNSKI\ I} = \frac{a}{b+c}. \tag{3.11}$$
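Again, the identities above are easy to check numerically; a short R sketch with the same hypothetical counts as before:

```r
# Numerical check of identities (3.4) and (3.8) on hypothetical counts.
a <- 3; b <- 1; c <- 1; d <- 3

jaccard  <- a / (a + b + c)
tanimoto <- a / ((a + b) + (a + c) - a)
all.equal(jaccard, tanimoto)                                     # TRUE, (3.4)

kulczynski_ii <- (a / 2) * (2 * a + b + c) / ((a + b) * (a + c))
johnson       <- a / (a + b) + a / (a + c)
all.equal(kulczynski_ii, 0.5 * johnson)                          # TRUE, (3.8)
```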
4 Associations in clustering solutions

We have shown in Section 3 that some measures are functionally dependent, linearly dependent, or even identical. Therefore, we may expect that the processes of assigning objects into clusters will be somehow associated. If there is a linear dependence between measures, we may even assume that the clustering process will be identical, and thus that it will lead to the same clustering solutions and similar dendrograms. Figure 1 demonstrates the nearly identical outcome of a hierarchical clustering process (average linkage) on a sample dataset when four different (but functionally dependent) measures were used. The goal of this section is to express such a similarity of dendrograms for various measures in numerical terms.

[Figure 1: Example of four different measures used in hierarchical clustering of a sample dataset leading to the same result; panels: EUCLID, MANHATTAN, VARI, SOKAL MICHENER.]

4.1 Experiment design

The datasets were generated using the data generator described in Section 2.1. The desired features of the generated datasets were chosen with the aim to cover a wide range of possible situations that can occur. Since the paper focuses on measures suitable for binary data, all data are generated from the Bernoulli distribution, but the datasets differ in the number of objects, the number of variables, and the strength of the relationships among variables (expressed by a correlation matrix). The following features were used in the data-generating process:

1. probability distribution: Bernoulli,
2. number of variables: 5, 10, 15, 20,
3. number of clusters: 2, 3, 4, 5,
4. number of observations in a cluster: 30–60, 60–120, 30–120,
5. correlation among variables: ∈ [−1; 1], at random for each cluster.

This leads to 48 possible combinations of dataset features (numbers of variables, numbers of clusters, and numbers of observations). Each combination of features is generated ten times to make the outcomes of the experiment more robust; the final number of generated datasets is hence equal to 480. The information about the "real cluster" membership is stored for each observation of each dataset, and it is later used in the process of evaluation of the clustering solutions.

The average linkage method combined with every binary similarity or distance measure from Section 2.3 is applied to each dataset. Clustering solutions with three clusters are used to examine associations among similarity and distance measures in clustering solutions via internal evaluation criteria (average silhouette width, Dunn index), in order to mimic a real-life situation where the number of real clusters is usually unknown. The examination of associations among measures in clustering solutions via external evaluation criteria (purity, entropy) uses the information about the "real cluster" membership. The values of the four evaluation criteria from Section 2.2 are calculated for each clustering solution, and their averages and standard deviations across all datasets are compared with respect to the similarity/distance measure used.

Due to the complexity of the computations, the evaluation of the assignment of objects into clusters is performed on two-to-four-cluster clustering solutions. For each number of clusters in the clustering solution (the last three steps of the clustering process) and each pair of measures from Section 2.3, the percentage of identical objects-into-clusters assignments was calculated. The percentage gives a value between 0 and 1, where 1 means the two clustering outcomes match identically. The percentage of identical objects-into-clusters assignments is also known as the Rand index (Rand, 1971). The pairs of the last three steps of the clustering process for every two similarity measures are then compared, and measures that lead to an identical/similar clustering process are identified. All the calculations and visualizations were done in the R programming language (R Core Team, 2018).
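The following R sketch illustrates this pairwise comparison on a single randomly generated dataset; the data and the choice of the pair MANHATTAN/VARI are purely illustrative. Because VARI is a linear rescaling of MANHATTAN (Table 5), the two average-linkage dendrograms cut at $k = 3$ agree perfectly:

```r
# Sketch of the pairwise comparison from Section 4.1: cluster one dataset
# with two linearly dependent Euclid-based distances and compute the share
# of object pairs on which the two partitions agree (the Rand index).
# The dataset is random here, purely for illustration.
set.seed(42)
X <- matrix(rbinom(50 * 10, 1, 0.5), nrow = 50)  # 50 objects, 10 binary variables
m <- ncol(X)

# For 0/1 vectors the Manhattan distance equals b + c, and VARI is its
# linear rescaling (b + c) / (4m), so all pairwise orderings coincide.
d_manhattan <- dist(X, method = "manhattan")
d_vari      <- d_manhattan / (4 * m)

cl1 <- cutree(hclust(d_manhattan, method = "average"), k = 3)
cl2 <- cutree(hclust(d_vari,      method = "average"), k = 3)

# Rand index: proportion of object pairs treated the same by both partitions
rand_index <- function(c1, c2) {
  n  <- length(c1)
  p1 <- outer(c1, c1, "==")[lower.tri(matrix(0, n, n))]
  p2 <- outer(c2, c2, "==")[lower.tri(matrix(0, n, n))]
  mean(p1 == p2)
}
rand_index(cl1, cl2)   # 1: linearly dependent measures merge identically
```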
4.2 Results of the experiment

There are 66 similarity and distance measures for binary data examined in the paper and used in the clustering process. The ability of the measures to produce a quality clustering solution is measured by the four chosen evaluation criteria: purity, entropy, Dunn index, and silhouette coefficient.

Average values of the evaluation criteria (calculated as described in Section 4.1) are shown in Figure 2.

[Figure 2: Average values of the evaluation criteria (silhouette, Dunn index, purity, entropy) representing the cluster quality of the examined similarity or distance measures for binary data.]

In general, Euclid-based measures (red color in all graphs) produce clusters with the highest quality according to the evaluation criteria. The quality of Pearson-based measures (purple color in all graphs) is, on average, not very good, but it is steady across all four measures. The same applies to Hellinger-based measures (black color in all graphs) and to all pairs/groups of measures where any functional relationship was detected in Section 3. Although some measures lead to clustering solutions with low average quality, it should be kept in mind that these measures might perform very well when clustering datasets with specific features. Therefore, one should not jump to any conclusions or general recommendations for universal use.

Even though the average values of the evaluation criteria might not be beneficial for formulating general recommendations, they help reveal associations among the measures. The relationships among binary similarity/distance measures are quite evident based on the average values of the evaluation criteria and their standard deviations. Figure 3 shows the arrangement of the clusters produced by average-linkage hierarchical clustering with the Euclid distance in the dendrogram. From this graph, it is possible to see that some measures lead to more similar clustering solutions than others. All Euclid-based measures lead to clustering solutions with similar values of the evaluation criteria, except for SHAPE DIFFERENCE. That can be explained by the fact that there exists a functional relationship among all these measures, but only SHAPE DIFFERENCE has this relationship conditioned by a restriction (see Section 3.1). All identical or linearly dependent similarity and distance measures (see Section 3) lead to identical cluster solutions as well. Clearly, the Euclid-based measures lead to very similar clustering solutions in terms of cluster quality. Due to the non-linear relationships among Pearson-based measures, the Pearson-based measures are much less similar in clustering solution quality. Negative-matches exclusive measures (such as JACCARD, HELLINGER, DICE, ...) also produce clusters of similar quality.
Measures for which we have not found any association with other measures generally provide qualitatively different clustering solutions.

[Figure 3: Dendrogram representing the cluster quality of the examined similarity or distance measures for binary data.]

Figure 4 shows the average percentages of identical objects-into-clusters assignments for every pair of similarity or distance measures for binary data. The figure reflects the average similarity of dendrograms in the final three steps of the clustering process (2–4 clusters) for any two measures. The associations among all measures in the clustering process are indisputable. Almost all measures lead to more or less similar clustering solutions (at least in the last three steps of the clustering process). Especially the dendrograms for identical or linearly dependent measures are essentially all the same. At the same time, it can be seen that measures without any apparent dependency on any other measure (such as DENNIS, MICHAEL, or MOUNTFORD) produce the least similar dendrograms. Interestingly, the absence of negative matches in a measure's formula and the presence of only positive matches seem to be a common feature of the measures that produced clustering solutions of similar quality and almost identical dendrograms. These negative-matches exclusive measures produce very similar dendrograms regardless of whether they are Pearson-based or Hellinger-based.

[Figure 4: The average percentage of identical objects-into-clusters assignments for every pair of the examined measures; colour scale from 0.44 to 1.]

5 Discussion and conclusion

Many similarity measures and distance measures for binary data have been used in various fields of study. This has led to a situation where measures are duplicated or strongly dependent on each other. Despite the fact that these measures are widely known and often used, only a limited number of authors have aimed their research at this area. Even though their studies usually focused on a limited number of similarity/distance measures for binary data or were applied to only one specific dataset, they concluded that binary data measures are often linearly dependent and, thus, that they often produce the same clusters.

Not only were we able to support this claim on hundreds of generated datasets, but we also explained the reason for this behavior (which is rooted in the measures' definitions) while providing a comprehensive study of 66 selected measures for binary data. The measures were examined based on the quality of the clustering solutions they produce and based
on the percentage of identical objects-into-clusters assignments in the last three steps of the hierarchical clustering process. Based on this, we might claim that Euclid-based measures lead to identical (or very similar) clustering solutions. Another big group of measures that lead to very similar clustering results are the negative-matches exclusive measures.

Acknowledgment

This work was supported by the University of Economics, Prague under Grant IGA F4/44/2018.

References

[1] Aggarwal, C.C. and Reddy, C.K. (2013): Data Clustering: Algorithms and Applications. Boca Raton, FL: Chapman & Hall.

[2] Cario, M.C. and Nelson, B.L. (1997): Modeling and Generating Random Vectors with Arbitrary Marginal Distributions and Correlation Matrix. Technical Report. Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL.

[3] Chen, H. (2001): Initialization for NORTA: Generation of random vectors with specified marginals and correlations. INFORMS Journal on Computing, 13(4), 312–331.

[4] Choi, S.S., Cha, S.H. and Tappert, C.C. (2010): A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43–48.

[5] Cibulková, J. and Řezanková, H. (2018): Categorical data generator. In T. Löster, T. Pavelka (Eds.): International Days of Statistics and Economics 2018. Slaný: Melandrium, 288–296.

[6] Deza, M.M. and Deza, E. (2013): Encyclopedia of Distances. New York, NY: Springer.

[7] Dunn, G. and Everitt, B.S. (1982): An Introduction to Mathematical Taxonomy. Cambridge: Cambridge University Press.

[8] Dunn, J.C. (1974): Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4(1), 95–104.

[9] Forbes, S.A. (1925): Method of determining and measuring the associative relations of species. Science, 61, 5–24.

[10] Higham, N.J. (2009): Cholesky factorization. Wiley Interdisciplinary Reviews: Computational Statistics, 1(2), 251–254.

[11] Hubálek, Z. (1982): Coefficients of association and similarity, based on binary (presence-absence) data: An evaluation. Biological Reviews, 57(4), 669–689.
[12] Jaccard, P. (1901): Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société vaudoise des sciences naturelles, 37, 547–579.

[13] Jackson, D.A., Somers, K.M. and Harvey, H.H. (1989): Similarity coefficients: Measures of co-occurrence and association or simply measures of occurrence? The American Naturalist, 133(3), 436–453.

[14] R Core Team (2018): R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

[15] Rand, W.M. (1971): Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846–850.

[16] Rousseeuw, P.J. (1987): Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.

[17] Shannon, C.E. (1948): A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.

[18] Sokal, R. and Michener, C. (1958): A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38, 1409–1438.

[19] Stahl, D. and Sallis, H. (2012): Model-based cluster analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 4(4), 341–358.

[20] Todeschini, R., Consonni, V., Xiang, H., Holliday, J., Buscema, M. and Willett, P. (2012): Similarity coefficients for binary chemoinformatics data: Overview and extended comparison using simulated and real data sets. Journal of Chemical Information and Modeling, 52(11), 2884–2901.