Metodološki zvezki, Vol. 17, No. 1, 2020, 33–54

Associations among similarity and distance measures for binary data in cluster analysis

Jana Cibulková¹, Zdeněk Šulc², Hana Řezanková³, Sergej Sirota⁴

¹ Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic; jana.cibulkova@vse.cz
² Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic; zdenek.sulc@vse.cz
³ Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic; hana.rezankova@vse.cz
⁴ Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic; xsirs00@vse.cz

Abstract

The paper focuses on similarity and distance measures for binary data and their application in cluster analysis. There are 66 measures for binary data analyzed in the paper in order to provide a comprehensive insight into the problematics and to create a well-arranged overview of them. For this purpose, the formulas by which they were defined are studied. In the next part of the research, the results of object clustering on generated datasets are compared, and the ability of the measures to create similar or identical clustering solutions is evaluated. This is done by using chosen internal and external evaluation criteria, and by comparing the assignments of objects into clusters in the process of hierarchical clustering. The paper shows which similarity measures and distance measures for binary data lead to similar or even identical results in hierarchical cluster analysis.

1 Introduction

A binary vector is one of the most common representations of patterns in datasets. Therefore, many similarity and distance measures for binary data have been proposed over the years. They are usually well examined, and they are often implemented in both commercial (e.g., SPSS, MatLab) and non-commercial (e.g., R, Python) software. This paper focuses on associations among selected similarity measures and distance measures for binary data in the field of cluster analysis. Similarity measures and distance measures for binary data are routinely used not only in the cluster analysis of binary data but in clustering nominal data as well. In fact, it is considered a standard procedure to perform a binary transformation when clustering nominal data.

The demand for processing binary (or nominal) data has led to numerous similarity measures and distance measures for binary data being proposed over the years in various fields for various purposes. For example, the Jaccard similarity measure was designed for clustering flowers (Jaccard, 1901), while Forbes (1925) proposed a coefficient for clustering ecologically related species of fishes. However, an enormous quantity of available measures for binary data might be counterproductive, and it can cause confusion. Duplicities can easily occur, there might be a functional relationship between many measures for binary data, a certain measure can be referred to by several different names in the literature, etc.

On the one hand, there are many authors who defined their own measures for binary data, often by adjusting existing ones for the needs of a specific study or not knowing that the measure had already been defined before (maybe in a different field of study). On the other hand, only a handful of authors studied and compared these measures among each other. Jackson et al. (1989) compared eight binary similarity measures before choosing the best measure for their ecological study. Hubálek (1982) collected and studied 43 measures.
Twenty of them were used for cluster analysis on real data to produce five clusters of related measures. His study proved that many measures might lead to similar clustering solutions. Todeschini et al. (2012) studied similarity measures for binary data on a chemoinformatics dataset. Based on their study, they claim that binary data measures are often linearly dependent and, thus, that the majority of them produce the same clusters. The best summary of similarity measures and distance measures for binary data so far was created by Choi et al. (2010). They collected 76 measures used over the last century and revealed their relationships through a simple hierarchical clustering of values that were calculated by applying the given measures on the same dataset. All of these authors provide some kind of survey of similarity measures; however, their studies contained a limited number of similarity measures or were applied to one specific dataset. Furthermore, no one (except Hubálek, 1982) studied the measures' formulas that create these associations in the first place.

This paper uses the studies of the previously mentioned authors as a baseline for further research that aims to be more robust and grounded in the mathematics and definitions that form these associations among measures for binary data. The paper aims to determine which measures for binary data lead to different clustering results and which ones lead to similar or even identical results of cluster analysis. Firstly, the theory of generating datasets, the clustering method, and the evaluation criteria are briefly presented in Section 2; similarity and distance measures for binary data are also introduced in this section. Section 3 is focused on revealing associations among the measures in their formulas. We expect the formulas that are functions of the same statistics to lead to the same or similar results. This expectation is verified and analyzed later in the paper on clustering solutions of generated datasets. The measures are used in hierarchical clustering of generated datasets in Section 4. The results of object clustering on generated datasets are compared, the quality of the clustering solutions is evaluated, and the ability of the measures to create similar or identical clusters is examined. This approach makes our research different from the previous research mentioned above.

2 Theoretical background

The theoretical background for the experiment, such as the data generating process, similarity and distance measures for binary data, the chosen clustering method, and the evaluation criteria, is briefly introduced in this section. The vocabulary presented in this section is used in the rest of the paper.

2.1 Data generator

Numerous datasets with a given specific set of features are needed in order to make the results of the analysis reliable. For this purpose, a data generator suitable for the needs of the experiment is used (Cibulková and Řezanková, 2018). The data generator was designed mainly for generating multivariate datasets appropriate for cluster analysis.

A generated dataset consists of a given number of clusters, where each cluster corresponds to one sample of a given multivariate distribution. The idea of a dataset being a mixture of several samples from given multivariate distributions follows the logic of finite mixture models from model-based clustering. Finite mixture models assume that the population is made up of several distinct clusters, each following a different multivariate probability density distribution (Stahl and Sallis, 2012).
Therefore, the problem of generating datasets with given features can actually be reduced to generating samples from given multivariate distributions.

The algorithm for generating samples from given multivariate distributions is inspired by the NORTA (NORmal-To-Anything) transformation in combination with Cholesky's decomposition. The NORTA algorithm, following Cario and Nelson (1997), is used in order to generate samples from a given multivariate distribution. The Cholesky matrix transforms a vector of uncorrelated normally-distributed random variables into a vector of correlated normally-distributed random variables.

1. Generate a multivariate vector of uncorrelated normally-distributed random variables $Z^{ind} = (Z_1^{ind}, Z_2^{ind}, \ldots, Z_n^{ind})$.
2. Suppose that $(\rho_{ij})_{i,j=1}^{n}$ is the given correlation matrix. Positive definiteness of the matrix is verified, and (if necessary) the matrix is adjusted to the closest positive definite correlation matrix $(\rho^{*}_{ij})_{i,j=1}^{n}$.
3. Get a multivariate standard-normal random vector $Z = (Z_1, Z_2, \ldots, Z_n)$ such that $\mathrm{Corr}(Z_i, Z_j) = \rho^{*}_{ij}$ for $1 \le i, j \le n$, using Cholesky's decomposition following the algorithm described in Higham (2009).
4. Compute $U_i = \Phi(Z_i)$ for $i = 1, 2, \ldots, n$, where $\Phi(\cdot)$ is the standard-normal cumulative distribution function.
5. Compute $X_i = F_i^{-1}(U_i)$ for $i = 1, 2, \ldots, n$, where $F_i^{-1}$ is the inverse of the given marginal cumulative distribution function $F_i$.

Assuming each cluster in the dataset is generated from a given multivariate distribution, the generated dataset is a mixture of several samples obtained by this approach. This generator allows us to generate numerous datasets with desired features covering a wide range of dataset types, making the results of the analysis more robust.

2.2 Hierarchical clustering method and evaluation criteria

Agglomerative hierarchical cluster analysis is used in this paper. Its algorithm starts with each object in its own cluster, and at each step the nearest two clusters are combined into a higher-level cluster (Sokal and Michener, 1958).

The average linkage method was chosen for the experiment since it can be seen as a compromise between the sensitivity to outliers of the complete linkage method and the tendency of the single linkage method to form long chains (which do not correspond to the intuitive notion of clusters as compact, spherical objects). This method takes the average pairwise dissimilarity between objects in two different clusters. Let us denote by $D_{average}(C_p, C_q)$ the distance between clusters $C_p$ and $C_q$, with $n_p$ objects in the $p$-th cluster and $n_q$ objects in the $q$-th cluster. Then the dissimilarity between two clusters follows the formula:

$$D_{average}(C_p, C_q) = \frac{\sum_{x_i \in C_p} \sum_{x_j \in C_q} D(x_i, x_j)}{n_p n_q}, \tag{2.1}$$

where $D(x_i, x_j)$ is the distance between objects $x_i$ and $x_j$.

Since the analyses are performed on the generated datasets, and thus the objects' cluster memberships are known, the produced clusters can be evaluated using both internal and external evaluation criteria.

The purity is an external evaluation criterion of cluster quality. It is the percentage of the total number of objects that were classified correctly; the larger the purity, the better the clustering performance. Let us suppose there are $l$ classes (real clusters) while the clustering method generates $k$ clusters; then the purity of the clustering solution with respect to the known classes is given by the formula:

$$purity = \frac{1}{n} \sum_{q=1}^{k} \max_{p}\left(n_q^p\right), \tag{2.2}$$

where $n$ is the total number of objects and $n_q^p$ is the number of objects in the $q$-th cluster that belong to the $p$-th class $(p = 1, 2, \ldots, l)$.
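To make the purity computation concrete, here is a minimal sketch in R (the language used for the paper's calculations, cf. Section 4.1); the function name `purity` and the example labels are illustrative, not part of the original study:

```r
# Minimal sketch of the purity computation from (2.2); the function name and
# the example labels are illustrative only.
purity <- function(clusters, classes) {
  tab <- table(clusters, classes)        # contingency table of n_q^p counts
  sum(apply(tab, 1, max)) / length(clusters)
}

# Example: 10 objects, k = 2 clusters, l = 2 known classes
clusters <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
classes  <- c("A", "A", "A", "B", "B", "B", "B", "B", "A", "A")
purity(clusters, classes)                # (3 + 4) / 10 = 0.7
```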
The entropy is another external evaluation criterion. It is a measure of uncertainty, and the smaller the entropy, the better the clustering performance (Shannon, 1948). The entropy of the clustering solution with respect to the known classes follows the formula:

$$entropy = -\frac{1}{n \log_2 l} \sum_{q=1}^{k} \sum_{p=1}^{l} n_q^p \log_2 \frac{n_q^p}{n_q}, \tag{2.3}$$

where $n_q$ is the total number of objects in the $q$-th cluster $(q = 1, 2, \ldots, k)$.

The Dunn index is an internal evaluation criterion (Dunn, 1974). It is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. The Dunn index takes on values between zero and infinity and should be maximized. Let us denote a particular clustering partition $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$ of $n$ objects into $k$ disjoint clusters. Then the Dunn index is computed as:

$$Dunn\ index = \frac{\min_{C_q, C_p \in \mathcal{C};\, C_q \neq C_p} \left( \min_{x_i \in C_q,\, x_j \in C_p} D(x_i, x_j) \right)}{\max_{C_q \in \mathcal{C}} \operatorname{diam}(C_q)}, \tag{2.4}$$

where $\operatorname{diam}(C_q)$ is the maximum distance between observations in cluster $C_q$.

The silhouette coefficient is an internal evaluation criterion which is calculated as the average of each object's silhouette width (which is usually displayed as the width in the silhouette graph). The silhouette width measures the degree of confidence in the clustering assignment of a particular object (Rousseeuw, 1987). Well-clustered objects have values near 1, and poorly clustered objects have values near −1. For object $x_i$, it is defined as:

$$S(i) = \frac{b_i - a_i}{\max(b_i, a_i)}, \tag{2.5}$$

where $a_i$ is the average distance between object $x_i$ and all other objects in the same cluster, and $b_i$ is the average distance between $x_i$ and the objects in the nearest neighboring cluster, following the formula $b_i = \min_{C_k \in \mathcal{C};\, C_k \neq C(i)} \frac{\sum_{x_j \in C_k} D(x_i, x_j)}{n_k}$; here $C(i)$ is the cluster containing observation $x_i$, $D(x_i, x_j)$ is the distance between observations $x_i$ and $x_j$, and $n_k$ is the number of objects in cluster $C_k$. The average silhouette width thus lies in the interval $[-1, 1]$ and should be maximized.

2.3 Similarity and distance measures for binary data

Let us denote the data matrix $X = [x_{ic}]$, where $i = 1, 2, \ldots, n$ and $c = 1, 2, \ldots, m$; $n$ is the total number of objects and $m$ is the total number of variables. Suppose that two objects, $x_i$ and $x_j$, are represented in binary vector form. The symbols used for the numbers of variables with certain combinations of categories for objects, as shown in Table 1, are used for the definitions of binary distance measures in this paper (Dunn and Everitt, 1982). In Table 1, $a$ is the number of variables where the values of $x_i$ and $x_j$ are both equal to 1, meaning "positive matches"; $b$ is the number of variables where the values of $x_i$ and $x_j$ are $(0, 1)$, meaning "$x_i$ absence mismatches"; $c$ is the number of variables where the values of $x_i$ and $x_j$ are $(1, 0)$, meaning "$x_j$ absence mismatches"; and $d$ is the number of variables where both $x_i$ and $x_j$ are equal to 0, meaning "negative matches". Then $m = a + b + c + d$.

Table 1: Symbols used for the numbers of variables with certain combinations of categories for objects $x_i$ and $x_j$

| $x_i \backslash x_j$ | 1 (presence) | 0 (absence) |
|---|---|---|
| 1 (presence) | $a$ | $c$ |
| 0 (absence) | $b$ | $d$ |

Tables 2–4 provide an overview of the formulas of the 66 similarity measures or distance measures for binary data. The main source for the formulas was Choi et al. (2010). The second column in Tables 2–4 gives a definition of a measure using the symbols from Table 1. The third column determines whether a measure is defined as a similarity measure or a distance measure.
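As an illustration of Table 1, the following R sketch counts $a$, $b$, $c$ and $d$ for two binary vectors and evaluates one measure from the overview; the input vectors and the function name are hypothetical examples, not data from the study:

```r
# Counts a, b, c, d from Table 1 for two binary objects x_i and x_j, and the
# JACCARD similarity a / (a + b + c); the input vectors are hypothetical.
binary_counts <- function(xi, xj) {
  c(a = sum(xi == 1 & xj == 1),   # positive matches
    b = sum(xi == 0 & xj == 1),   # x_i absence mismatches
    c = sum(xi == 1 & xj == 0),   # x_j absence mismatches
    d = sum(xi == 0 & xj == 0))   # negative matches
}

xi <- c(1, 1, 0, 1, 0, 0, 1, 0)
xj <- c(1, 0, 0, 1, 1, 0, 1, 0)
n  <- binary_counts(xi, xj)                  # a = 3, b = 1, c = 1, d = 3
unname(n["a"] / (n["a"] + n["b"] + n["c"]))  # JACCARD = 3/5 = 0.6
```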
The transformation from a similarity measure $S(x_i, x_j)$ into a distance measure $D(x_i, x_j)$, which is necessary in order to be able to calculate a proximity matrix in the clustering process, is inspired by Deza and Deza (2013):

• If a similarity measure takes values from the interval $[0, 1]$, then the corresponding distance measure is obtained by the transformation $1 - S(x_i, x_j)$.
• If a similarity measure takes values from the interval $[-1, 1]$, then the corresponding distance measure is obtained by the transformation $\frac{1 - S(x_i, x_j)}{2}$.
• If a similarity measure takes values from a different interval, say $[\min, \max]$, then the similarity measure is first transformed by min-max normalization, so that its values lie in the interval $[0, 1]$. The corresponding distance measure is then obtained by subtracting the min-max normalized similarity measure from 1.
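A minimal R sketch of these three transformation rules; the function `sim_to_dist` and its arguments are illustrative, with `lo` and `hi` denoting the assumed bounds of the similarity measure's range:

```r
# Similarity-to-distance transformation following the three rules above;
# lo and hi are the bounds of the similarity measure's range (illustrative).
sim_to_dist <- function(s, lo, hi) {
  if (lo == 0 && hi == 1)  return(1 - s)        # S in [0, 1]
  if (lo == -1 && hi == 1) return((1 - s) / 2)  # S in [-1, 1]
  1 - (s - lo) / (hi - lo)                      # min-max normalization first
}

sim_to_dist(0.6, 0, 1)    # e.g., a JACCARD value      -> distance 0.4
sim_to_dist(0.5, -1, 1)   # e.g., a correlation-type S -> distance 0.25
```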
Table 2: Similarity/distance measures for binary data overview – part 1

| Measure | Formula | Type |
|---|---|---|
| JACCARD | $\frac{a}{a+b+c}$ | similarity |
| DICE | $\frac{2a}{2a+b+c}$ | similarity |
| CZEKANOWSKI | $\frac{2a}{2a+b+c}$ | similarity |
| JACCARD 3W | $\frac{3a}{3a+b+c}$ | similarity |
| NEI LI | $\frac{2a}{(a+b)+(a+c)}$ | similarity |
| SOKAL SNEATH I | $\frac{a}{a+2b+2c}$ | similarity |
| SOKAL MICHENER | $\frac{a+d}{a+b+c+d}$ | similarity |
| SOKAL SNEATH II | $\frac{2a+2d}{2a+b+c+2d}$ | similarity |
| ROGER TANIMOTO | $\frac{a+d}{a+2(b+c)+d}$ | similarity |
| FAITH | $\frac{a+0.5d}{a+b+c+d}$ | similarity |
| GOWER LEGENDRE | $\frac{a+d}{a+0.5(b+c)+d}$ | similarity |
| INTERSECTION | $a$ | similarity |
| INNERPRODUCT | $a+d$ | similarity |
| RUSSEL RAO | $\frac{a}{a+b+c+d}$ | similarity |
| HAMMING | $b+c$ | distance |
| EUCLID | $\sqrt{b+c}$ | distance |
| SQUARED EUCLID | $\sqrt{(b+c)^2}$ | distance |
| CANBERRA | $(b+c)^{2/2}$ | distance |
| MANHATTAN | $b+c$ | distance |
| MEAN MANHATTAN | $\frac{b+c}{a+b+c+d}$ | distance |
| CITY BLOCK | $b+c$ | distance |

Table 3: Similarity/distance measures for binary data overview – part 2

| Measure | Formula | Type |
|---|---|---|
| MINKOWSKI | $(b+c)^{1/1}$ | distance |
| VARI | $\frac{b+c}{4(a+b+c+d)}$ | distance |
| SIZE DIFFERENCE | $\frac{(b+c)^2}{(a+b+c+d)^2}$ | distance |
| SHAPE DIFFERENCE | $\frac{m(b+c)-(b-c)^2}{(a+b+c+d)^2}$ | distance |
| PATTERN DIFFERENCE | $\frac{4bc}{(a+b+c+d)^2}$ | distance |
| LANCE WILLIAMS | $\frac{b+c}{2a+b+c}$ | distance |
| BRAY CURTIS | $\frac{b+c}{2a+b+c}$ | distance |
| HELLINGER | $2\sqrt{1-\frac{a}{\sqrt{(a+b)(a+c)}}}$ | distance |
| CHORD | $\sqrt{2\left(1-\frac{a}{\sqrt{(a+b)(a+c)}}\right)}$ | distance |
| COSINE | $\frac{a}{\left(\sqrt{(a+b)(a+c)}\right)^2}$ | similarity |
| GILBERT WELLS | $\log a - \log m - \log\frac{a+b}{m} - \log\frac{a+c}{m}$ | similarity |
| OCHIAI I | $\frac{a}{\sqrt{(a+b)(a+c)}}$ | similarity |
| FORBESI | $\frac{ma}{(a+b)(a+c)}$ | similarity |
| FOSSUM | $\frac{m(a-0.5)^2}{(a+b)(a+c)}$ | similarity |
| SORGENFREI | $\frac{a^2}{(a+b)(a+c)}$ | similarity |
| MOUNTFORD | $\frac{a}{0.5(ab+ac)+bc}$ | similarity |
| OTSUKA | $\frac{a}{((a+b)(a+c))^{0.5}}$ | similarity |
| MCCONNAUGHEY | $\frac{a^2-bc}{(a+b)(a+c)}$ | similarity |
| TARWID | $\frac{ma-(a+b)(a+c)}{ma+(a+b)(a+c)}$ | similarity |
| KULCZYNSKI II | $\frac{\frac{a}{2}(2a+b+c)}{(a+b)(a+c)}$ | similarity |
| DRIVER KROEBER | $\frac{a}{2}\left(\frac{1}{a+b}+\frac{1}{a+c}\right)$ | similarity |
| JOHNSON | $\frac{a}{a+b}+\frac{a}{a+c}$ | similarity |
| DENNIS | $\frac{ad-bc}{\sqrt{m(a+b)(a+c)}}$ | similarity |
| SIMPSON | $\frac{a}{\min(a+b,\,a+c)}$ | similarity |
| BRAUN BANQUET | $\frac{a}{\max(a+b,\,a+c)}$ | similarity |
| FAGER MCGOWAN | $\frac{a}{\sqrt{(a+b)(a+c)}}-\frac{\max(a+b,\,a+c)}{2}$ | similarity |
| FORBES II | $\frac{ma-(a+b)(a+c)}{m\min(a+b,\,a+c)-(a+b)(a+c)}$ | similarity |

Table 4: Similarity/distance measures for binary data overview – part 3

| Measure | Formula | Type |
|---|---|---|
| SOKAL SNEATH IV | $\frac{1}{4}\left(\frac{a}{a+b}+\frac{a}{a+c}+\frac{d}{b+d}+\frac{d}{c+d}\right)$ | similarity |
| GOWER | $\frac{a+d}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| PEARSON I | $\chi^2$, where $\chi^2 = \frac{m(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}$ | similarity |
| PEARSON II | $\left(\frac{\chi^2}{m+\chi^2}\right)^{\frac{1}{2}}$ | similarity |
| PEARSON III | $\left(\frac{\rho}{m+\rho}\right)^{\frac{1}{2}}$, where $\rho = \frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| PEARSON HERON I | $\rho = \frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| PEARSON HERON II | $\cos\left(\frac{\pi\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\right)$ | similarity |
| SOKAL SNEATH III | $\frac{a+d}{b+c}$ | similarity |
| SOKAL SNEATH V | $\frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| COLE | $\frac{\sqrt{2}(ad-bc)}{\sqrt{(ad-bc)^2-(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| STILES | $\log_{10}\frac{m\left(|ad-bc|-\frac{m}{2}\right)^2}{(a+b)(a+c)(b+d)(c+d)}$ | similarity |
| OCHIAI II | $\frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$ | similarity |
| YULE Q | $\frac{2bc}{ad+bc}$ | distance |
| YULE W | $\frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}$ | similarity |
| KULCZYNSKI I | $\frac{a}{b+c}$ | similarity |
| TANIMOTO | $\frac{a}{(a+b)+(a+c)-a}$ | similarity |
| HAMANN | $\frac{(a+d)-(b+c)}{a+b+c+d}$ | similarity |
| MICHAEL | $\frac{4(ad-bc)}{(a+d)^2+(b+c)^2}$ | similarity |

3 Associations in formulas

It is possible to detect several groups of measures defined in Section 2.3 where functional relationships among the measures exist. The formulas were studied and compared in order to uncover any functional relationships among them, and various associations were discovered. These associations may provide a first categorization of similarity and distance measures for binary data. We may consider four groups of measures:

1. Euclid-based measures,
2. Pearson-based measures,
3. Hellinger-based measures,
4. other measures.

3.1 Euclid-based measures

Numerous measures for binary data proved to be dependent on the Euclid distance. Let us denote the squared Euclid distance measure as

$$eu = (b + c), \tag{3.1}$$

where $b$ and $c$ are the symbols used for the numbers of variables with certain combinations of categories for objects $x_i$ and $x_j$ described in Table 1. Table 5 shows how some of the measures presented in Section 2.3 depend on $eu$. The first column contains a measure's name; the measure is expressed in relation to the squared Euclid distance measure in the second column. If there is any restriction on the relationship between a measure from the first column and the squared Euclid distance measure, this restriction is noted in the third column.

Table 5: Association among the measures based on the Euclid distance

| Measure | Formula | Condition |
|---|---|---|
| SQUARED EUCLID | $eu$ | – |
| EUCLID | $\sqrt{eu}$ | – |
| HAMMING | $eu$ | – |
| CANBERRA | $eu$ | – |
| MANHATTAN | $eu$ | – |
| MEAN MANHATTAN | $\frac{eu}{m}$ | – |
| CITY BLOCK | $eu$ | – |
| MINKOWSKI | $eu$ | – |
| VARI | $\frac{eu}{4m}$ | – |
| SOKAL MICHENER | $1 - \frac{eu}{m}$ | – |
| INNERPRODUCT | $m - eu$ | – |
| SIZE DIFFERENCE | $\frac{eu^2}{m^2}$ | – |
| HAMANN | $\frac{m - 2eu}{m}$ | – |
| GOWER LEGENDRE | $\frac{2(m - eu)}{2m - eu}$ | – |
| SOKAL SNEATH II | $\frac{2(m - eu)}{2m - eu}$ | – |
| SOKAL SNEATH III | $\frac{m}{eu} - 1$ | – |
| ROGER TANIMOTO | $\frac{m - eu}{m + eu}$ | – |
| FAITH | $\frac{m - eu - 0.5d}{m}$ | – |
| SHAPE DIFFERENCE | $\frac{eu}{m}$ | if $b = c$ |
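The relations in Table 5 can be verified numerically; a short R sketch with hypothetical counts $a$, $b$, $c$, $d$ (not data from the experiment):

```r
# Numerical check of two rows of Table 5 on hypothetical counts.
a <- 3; b <- 1; c <- 1; d <- 3
m  <- a + b + c + d               # m = 8
eu <- b + c                       # squared Euclid distance, eu = 2

sokal_michener <- (a + d) / (a + b + c + d)        # definition from Table 2
all.equal(sokal_michener, 1 - eu / m)              # TRUE

roger_tanimoto <- (a + d) / (a + 2 * (b + c) + d)  # definition from Table 2
all.equal(roger_tanimoto, (m - eu) / (m + eu))     # TRUE
```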
3.2 Pearson-based measures

Other measures for binary data proved to be dependent on the Pearson correlation coefficient. Let us denote

$$\rho = \frac{ad - bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}, \tag{3.2}$$

where $a, b, c, d$ are the symbols used for the numbers of variables with certain combinations of categories for objects $x_i$ and $x_j$ described in Table 1. The measures that are based on the Pearson correlation coefficient can be rewritten using $\rho$ in their formulas, as shown in Table 6. Measures' names are in the first column; the measures are expressed in relation to $\rho$ in the second column.

Table 6: Association among the measures based on the Pearson correlation coefficient

| Measure | Formula |
|---|---|
| PEARSON HERON I | $\rho$ |
| PEARSON I | $m\rho^2$ |
| PEARSON II | $\left(\frac{m\rho^2}{m + m\rho^2}\right)^{\frac{1}{2}}$ |
| PEARSON III | $\left(\frac{\rho}{m + \rho}\right)^{\frac{1}{2}}$ |

3.3 Hellinger-based measures

Some distance and similarity measures for binary data are dependent on the Hellinger distance. Let us denote

$$hel = 2\sqrt{1 - \frac{a}{\sqrt{(a+b)(a+c)}}}, \tag{3.3}$$

where $a$, $b$ and $c$ are the symbols used for the numbers of variables with certain combinations of categories for objects $x_i$ and $x_j$ described in Table 1. The formulas of these measures demonstrate the dependency in Table 7. Measures' names are in the first column; the measures are expressed in relation to the Hellinger distance in the second column.

Table 7: Association among the measures based on the Hellinger distance

| Measure | Formula |
|---|---|
| HELLINGER | $hel$ |
| CHORD | $\frac{hel}{\sqrt{2}}$ |
| OCHIAI I | $1 - \left(\frac{hel}{2}\right)^2$ |
| OTSUKA | $1 - \left(\frac{hel}{2}\right)^2$ |
| FAGER MCGOWAN | $1 - \left(\frac{hel}{2}\right)^2 - \frac{\max(a+b,\,a+c)}{2}$ |
| SORGENFREI | $\left[1 - \left(\frac{hel}{2}\right)^2\right]^2$ |
| FORBESI | $\frac{m}{a}\left[1 - \left(\frac{hel}{2}\right)^2\right]^2$ |

3.4 Other measures and associations among them

Several groups of up to four similarity or distance measures from Section 2.3 show some kind of dependency among themselves.

The JACCARD similarity measure is identical to the TANIMOTO similarity measure:

$$\mathrm{JACCARD} = \frac{a}{a+b+c} = \mathrm{TANIMOTO} = \frac{a}{(a+b)+(a+c)-a}. \tag{3.4}$$

The similarity measures DICE, CZEKANOWSKI and NEI LI are identical:

$$\mathrm{DICE} = \frac{2a}{2a+b+c} = \mathrm{CZEKANOWSKI} = \frac{2a}{2a+b+c} = \mathrm{NEI\ LI} = \frac{2a}{(a+b)+(a+c)}. \tag{3.5}$$

Another pair of similarity measures that are defined by the same formula is LANCE WILLIAMS and BRAY CURTIS:

$$\mathrm{LANCE\ WILLIAMS} = \frac{b+c}{2a+b+c} = \mathrm{BRAY\ CURTIS} = \frac{b+c}{2a+b+c}. \tag{3.6}$$

Moreover, if $2a = b + c$, then all five measures DICE, CZEKANOWSKI, NEI LI, LANCE WILLIAMS and BRAY CURTIS would be equal to each other.

There is a linear dependency between the similarity measures INTERSECTION and RUSSEL RAO:

$$\mathrm{INTERSECTION} = a = m \cdot \mathrm{RUSSEL\ RAO} = m\left(\frac{a}{a+b+c+d}\right). \tag{3.7}$$

The similarity measures KULCZYNSKI II and DRIVER KROEBER are identical, and they are linearly dependent with the JOHNSON similarity measure:

$$\mathrm{KULCZYNSKI\ II} = \frac{\frac{a}{2}(2a+b+c)}{(a+b)(a+c)} = \mathrm{DRIVER\ KROEBER} = \frac{a}{2}\left(\frac{1}{a+b} + \frac{1}{a+c}\right) = 0.5\,\mathrm{JOHNSON} = 0.5\left(\frac{a}{a+b} + \frac{a}{a+c}\right). \tag{3.8}$$

The similarity measures SOKAL SNEATH V and OCHIAI II are identical:

$$\mathrm{SOKAL\ SNEATH\ V} = \frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}} = \mathrm{OCHIAI\ II} = \frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}. \tag{3.9}$$

If $d = 0$, then the measures KULCZYNSKI II and SOKAL SNEATH IV become linearly dependent (SOKAL SNEATH IV then equals one half of KULCZYNSKI II):

$$\mathrm{KULCZYNSKI\ II} = \frac{\frac{a}{2}(2a+b+c)}{(a+b)(a+c)}, \qquad \mathrm{SOKAL\ SNEATH\ IV} = \frac{1}{4}\left(\frac{a}{a+b} + \frac{a}{a+c} + \frac{d}{b+d} + \frac{d}{c+d}\right). \tag{3.10}$$

If $d = 0$, then the similarity measures KULCZYNSKI I and SOKAL SNEATH III are identical:

$$\mathrm{SOKAL\ SNEATH\ III} = \frac{a+d}{b+c} = \mathrm{KULCZYNSKI\ I} = \frac{a}{b+c}. \tag{3.11}$$
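Again, the identities above are easy to check numerically; a short R sketch with the same hypothetical counts as before:

```r
# Numerical check of identities (3.4) and (3.8) on hypothetical counts.
a <- 3; b <- 1; c <- 1; d <- 3

jaccard  <- a / (a + b + c)
tanimoto <- a / ((a + b) + (a + c) - a)
all.equal(jaccard, tanimoto)                                     # TRUE, (3.4)

kulczynski_ii <- (a / 2) * (2 * a + b + c) / ((a + b) * (a + c))
johnson       <- a / (a + b) + a / (a + c)
all.equal(kulczynski_ii, 0.5 * johnson)                          # TRUE, (3.8)
```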
4 Associations in clustering solutions

We have shown in Section 3 that some measures are functionally dependent, linearly dependent, or even identical. Therefore, we may expect that the processes of assigning objects into clusters will be somehow associated. If there is a linear dependence between measures, we may even assume that the clustering process will be identical, and thus that it will lead to the same clustering solutions and similar dendrograms. Figure 1 demonstrates the nearly identical outcome of a hierarchical clustering process (average linkage) on a sample dataset when four different (but functionally dependent) measures were used. The goal of this section is to express such a similarity of dendrograms for various measures in numerical terms.

[Figure 1: Example of four different measures used in hierarchical clustering of a sample dataset leading to the same result; panels: EUCLID, MANHATTAN, VARI, SOKAL MICHENER.]

4.1 Experiment design

The datasets were generated using the data generator described in Section 2.1. The desired features of the generated datasets were chosen with the aim to cover a wide range of possible situations that can occur. Since the paper focuses on measures suitable for binary data, all data are generated from the Bernoulli distribution, but the datasets differ in the number of objects, the number of variables, and the strength of the relationships among variables (expressed by a correlation matrix). The following features were used in the data-generating process:

1. probability distribution: Bernoulli,
2. number of variables: 5, 10, 15, 20,
3. number of clusters: 2, 3, 4, 5,
4. number of observations in a cluster: 30–60, 60–120, 30–120,
5. correlation among variables: ∈ [−1; 1], at random for each cluster.

This leads to 48 possible combinations of dataset features (numbers of variables, numbers of clusters, and numbers of observations). Each combination of features is generated ten times to make the outcomes of the experiment more robust; the final number of generated datasets is hence equal to 480. The information about the "real cluster" membership is stored for each observation of each dataset, and it is later used in the process of evaluation of the clustering solutions.

The average linkage method combined with every binary similarity or distance measure from Section 2.3 is applied to each dataset. Clustering solutions with three clusters are used to examine associations among similarity and distance measures in clustering solutions via internal evaluation criteria (average silhouette width, Dunn index), in order to mimic a real-life situation where the number of real clusters is usually unknown. The examination of associations among measures in clustering solutions via external evaluation criteria (purity, entropy) uses the information about the "real cluster" membership. The values of the four evaluation criteria from Section 2.2 are calculated for each clustering solution, and their averages and standard deviations across all datasets are compared with respect to the similarity/distance measure used.

Due to the complexity of the computations, the evaluation of the assignment of objects into clusters is performed on two-to-four-cluster clustering solutions. For each number of clusters in the clustering solution (the last three steps of the clustering process) and each pair of measures from Section 2.3, the percentage of identical objects-into-clusters assignments was calculated. The percentage gives a value between 0 and 1, where 1 means the two clustering outcomes match identically. The percentage of identical objects-into-clusters assignments is also known as the Rand index (Rand, 1971). The pairs of the last three steps of the clustering process for every two similarity measures are then compared, and measures that lead to an identical/similar clustering process are identified. All the calculations and visualizations were done in the R programming language (R Core Team, 2018).
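The following R sketch illustrates this pairwise comparison on a single randomly generated dataset; the data and the choice of the pair MANHATTAN/VARI are purely illustrative. Because VARI is a linear rescaling of MANHATTAN (Table 5), the two average-linkage dendrograms cut at $k = 3$ agree perfectly:

```r
# Sketch of the pairwise comparison from Section 4.1: cluster one dataset
# with two linearly dependent Euclid-based distances and compute the share
# of object pairs on which the two partitions agree (the Rand index).
# The dataset is random here, purely for illustration.
set.seed(42)
X <- matrix(rbinom(50 * 10, 1, 0.5), nrow = 50)  # 50 objects, 10 binary variables
m <- ncol(X)

# For 0/1 vectors the Manhattan distance equals b + c, and VARI is its
# linear rescaling (b + c) / (4m), so all pairwise orderings coincide.
d_manhattan <- dist(X, method = "manhattan")
d_vari      <- d_manhattan / (4 * m)

cl1 <- cutree(hclust(d_manhattan, method = "average"), k = 3)
cl2 <- cutree(hclust(d_vari,      method = "average"), k = 3)

# Rand index: proportion of object pairs treated the same by both partitions
rand_index <- function(c1, c2) {
  n  <- length(c1)
  p1 <- outer(c1, c1, "==")[lower.tri(matrix(0, n, n))]
  p2 <- outer(c2, c2, "==")[lower.tri(matrix(0, n, n))]
  mean(p1 == p2)
}
rand_index(cl1, cl2)   # 1: linearly dependent measures merge identically
```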
4.2 Results of the experiment

There are 66 similarity and distance measures for binary data examined in the paper and used in the clustering process. The ability of the measures to produce a quality clustering solution is measured by the four chosen evaluation criteria: purity, entropy, Dunn index, and silhouette coefficient.

Average values of the evaluation criteria (calculated as described in Section 4.1) are shown in Figure 2.

[Figure 2: Average values of the evaluation criteria (silhouette, Dunn index, purity, entropy) representing the cluster quality of the examined similarity or distance measures for binary data.]

In general, Euclid-based measures (red color in all graphs) produce clusters with the highest quality according to the evaluation criteria. The quality of Pearson-based measures (purple color in all graphs) is, on average, not very good, but it is steady across all four measures. The same applies to Hellinger-based measures (black color in all graphs) and to all pairs/groups of measures where any functional relationship was detected in Section 3. Although some measures lead to clustering solutions with low average quality, it should be kept in mind that these measures might perform very well when clustering datasets with specific features. Therefore, one should not jump to any conclusions or general recommendations for universal use.

Even though the average values of the evaluation criteria might not be beneficial for formulating general recommendations, they help reveal associations among the measures. The relationships among binary similarity/distance measures are quite evident based on the average values of the evaluation criteria and their standard deviations. Figure 3 shows the arrangement of the clusters produced by average-linkage hierarchical clustering with the Euclid distance in the dendrogram. From this graph, it is possible to see that some measures lead to more similar clustering solutions than others. All Euclid-based measures lead to clustering solutions with similar values of the evaluation criteria, except for SHAPE DIFFERENCE. That can be explained by the fact that there exists a functional relationship among all these measures, but only SHAPE DIFFERENCE has this relationship conditioned by a restriction (see Section 3.1). All identical or linearly dependent similarity and distance measures (see Section 3) lead to identical cluster solutions as well. Clearly, the Euclid-based measures lead to very similar clustering solutions in terms of cluster quality. Due to the non-linear relationships among Pearson-based measures, the Pearson-based measures are much less similar in clustering solution quality. Negative-matches exclusive measures (such as JACCARD, HELLINGER, DICE, ...) also produce clusters of similar quality.
Measures for which we have not found any association with other measures generally provide qualitatively different clustering solutions.

[Figure 3: Dendrogram representing the cluster quality of the examined similarity or distance measures for binary data.]

Figure 4 shows the average percentages of identical objects-into-clusters assignments for every pair of similarity or distance measures for binary data. The figure reflects the average similarity of dendrograms in the final three steps of the clustering process (2–4 clusters) for any two measures. The associations among all measures in the clustering process are indisputable. Almost all measures lead to more or less similar clustering solutions (at least in the last three steps of the clustering process). Especially the dendrograms for identical or linearly dependent measures are essentially all the same. At the same time, it can be seen that measures without any apparent dependency on any other measure (such as DENNIS, MICHAEL, or MOUNTFORD) produce the least similar dendrograms. Interestingly, the absence of negative matches in a measure's formula and the presence of only positive matches seem to be a common feature of the measures that produced clustering solutions of similar quality and almost identical dendrograms. These negative-matches exclusive measures produce very similar dendrograms regardless of whether they are Pearson-based or Hellinger-based.

[Figure 4: The average percentage of identical objects-into-clusters assignments for every pair of the examined measures; colour scale from 0.44 to 1.]

5 Discussion and conclusion

Many similarity measures and distance measures for binary data have been used in various fields of study. This has led to a situation where measures are duplicated or strongly dependent on each other. Despite the fact that these measures are widely known and often used, only a limited number of authors have aimed their research at this area. Even though their studies usually focused on a limited number of similarity/distance measures for binary data or were applied to only one specific dataset, they concluded that binary data measures are often linearly dependent and, thus, that they often produce the same clusters.

Not only were we able to support this claim on hundreds of generated datasets, but we also explained the reason for this behavior (which is rooted in the measures' definitions) while providing a comprehensive study of 66 selected measures for binary data. The measures were examined based on the quality of the clustering solutions they produce and based
on the percentage of identical objects-into-clusters assignments in the last three steps of the hierarchical clustering process. Based on this, we might claim that Euclid-based measures lead to identical (or very similar) clustering solutions. Another big group of measures that lead to very similar clustering results are the negative-matches exclusive measures.

Acknowledgment

This work was supported by the University of Economics, Prague under Grant IGA F4/44/2018.

References

[1] Aggarwal, C.C. and Reddy, C.K. (2013): Data Clustering: Algorithms and Applications. Boca Raton, FL: Chapman & Hall.

[2] Cario, M.C. and Nelson, B.L. (1997): Modeling and Generating Random Vectors with Arbitrary Marginal Distributions and Correlation Matrix. Technical Report. Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL.

[3] Chen, H. (2001): Initialization for NORTA: Generation of random vectors with specified marginals and correlations. INFORMS Journal on Computing, 13(4), 312–331.

[4] Choi, S.S., Cha, S.H. and Tappert, C.C. (2010): A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43–48.

[5] Cibulková, J. and Řezanková, H. (2018): Categorical data generator. In T. Löster, T. Pavelka (Eds.): International Days of Statistics and Economics 2018. Slaný: Melandrium, 288–296.

[6] Deza, M.M. and Deza, E. (2013): Encyclopedia of Distances. New York, NY: Springer.

[7] Dunn, G. and Everitt, B.S. (1982): An Introduction to Mathematical Taxonomy. Cambridge: Cambridge University Press.

[8] Dunn, J.C. (1974): Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4(1), 95–104.

[9] Forbes, S.A. (1925): Method of determining and measuring the associative relations of species. Science, 61, 5–24.

[10] Higham, N.J. (2009): Cholesky factorization. Wiley Interdisciplinary Reviews: Computational Statistics, 1(2), 251–254.

[11] Hubálek, Z. (1982): Coefficients of association and similarity, based on binary (presence-absence) data: An evaluation. Biological Reviews, 57(4), 669–689.
[12] Jaccard, P. (1901): Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société vaudoise des sciences naturelles, 37, 547–579.

[13] Jackson, D.A., Somers, K.M. and Harvey, H.H. (1989): Similarity coefficients: Measures of co-occurrence and association or simply measures of occurrence? The American Naturalist, 133(3), 436–453.

[14] R Core Team (2018): R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

[15] Rand, W.M. (1971): Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846–850.

[16] Rousseeuw, P.J. (1987): Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.

[17] Shannon, C.E. (1948): A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.

[18] Sokal, R. and Michener, C. (1958): A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38, 1409–1438.

[19] Stahl, D. and Sallis, H. (2012): Model-based cluster analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 4(4), 341–358.

[20] Todeschini, R., Consonni, V., Xiang, H., Holliday, J., Buscema, M. and Willett, P. (2012): Similarity coefficients for binary chemoinformatics data: Overview and extended comparison using simulated and real data sets. Journal of Chemical Information and Modeling, 52(11), 2884–2901.