Metodološki zvezki, Vol. 3, No. 1, 2006, 9-19

A Distance Function for Ranked Variables: A Proposal for a New Rank Correlation Coefficient

Antonio Mango¹

Abstract

Rank correlation coefficients (RCC) are usually based either on the differences between matched pairs of modalities of two related variables in ordinal scale, like Spearman's ρ, or on the number of inversions existing between couples of paired modalities, like Kendall's τ. Here a way to build an RCC based on the determinants of all second order minors of an n × 2 data matrix in ordinal scale is proposed. Some possible uses of these determinants are pointed out, and one that seems particularly interesting is considered: the properties of the resulting coefficient are shown and compared with the corresponding properties of Kendall's and Spearman's coefficients.

1 Introductory remarks

A little more than one hundred years ago T.G. Fechner (Fechner, 1897) introduced an RCC based on the number of inversions of a variable with respect to an ordered one. At the beginning of the twentieth century, C. Spearman (Spearman, 1904) introduced an RCC based on the difference between paired ranks; later on, M.G. Kendall (Kendall, 1962), starting from Fechner's idea, proposed an RCC which takes into account the number of inversions existing among all couples of paired modalities of two ordered variables. Here I propose a criterion for the construction of an RCC in which the determinants of all second order minors of an n × 2 data matrix enter as elements.

2 A distance function among vectors

Determinants of 2 × 2 matrices are already used in Statistics in the analysis of dependence among variables, for both qualitative and quantitative scales. I will show that the determinant of a 2 × 2 matrix can also be used to build a distance function.
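Given the following column vectors:

\[ \begin{pmatrix} x \\ y \end{pmatrix}, \quad \begin{pmatrix} a \\ b \end{pmatrix}, \quad \begin{pmatrix} r \\ s \end{pmatrix}, \]

we assume that the inequality

\[ \left| \det \begin{pmatrix} x & a \\ y & b \end{pmatrix} \right| \le \left| \det \begin{pmatrix} x & r \\ y & s \end{pmatrix} \right| + \left| \det \begin{pmatrix} a & r \\ b & s \end{pmatrix} \right| \tag{2.1} \]

holds for any real x, y, a, b, r and s. If it were true, the absolute value of the determinant of a 2 × 2 matrix would be a distance function between the vectors. This hypothesis is not true in general, because there exist triples of vectors that do not satisfy it:

\[ \det \begin{pmatrix} 1 & 2 \\ 5 & 3 \end{pmatrix} - \det \begin{pmatrix} 1 & 4 \\ 5 & 1 \end{pmatrix} - \det \begin{pmatrix} 2 & 4 \\ 3 & 1 \end{pmatrix} = -7 + 19 + 10 = 22, \]

so that, in absolute value, the second determinant (19) exceeds the sum of the other two (7 + 10 = 17).

It will be shown that 2.1 is true under the condition that the sum of the elements of each of the three column vectors is constant; the simplest way to obtain this is to divide the elements of each column vector by their sum. Inequality 2.1 may then be rewritten as:

\[ \left| \det \begin{pmatrix} \frac{x}{x+y} & \frac{a}{a+b} \\ \frac{y}{x+y} & \frac{b}{a+b} \end{pmatrix} \right| \le \left| \det \begin{pmatrix} \frac{x}{x+y} & \frac{r}{r+s} \\ \frac{y}{x+y} & \frac{s}{r+s} \end{pmatrix} \right| + \left| \det \begin{pmatrix} \frac{a}{a+b} & \frac{r}{r+s} \\ \frac{b}{a+b} & \frac{s}{r+s} \end{pmatrix} \right|. \]

Since each normalized column sums to one, every determinant reduces to the difference of the first-row entries, and the inequality turns into:

\[ \left| \frac{x}{x+y} - \frac{a}{a+b} \right| \le \left| \frac{x}{x+y} - \frac{r}{r+s} \right| + \left| \frac{a}{a+b} - \frac{r}{r+s} \right|, \]

which verifies the triangular property of the Euclidean metric. I propose to calculate these distances for all minors of an n × 2 matrix whose columns are permutations of the first n integers, and to sum them to obtain a distance measure for these column vectors.

¹ Dipartimento di Matematica e Statistica, Università degli Studi di Napoli Federico II, 80126 Napoli; mango@unina.it

3 Determinants of matrices as elements for indices construction

Let us consider an n × 2 data matrix at ordinal scale level, extract all second order minors, say T_j, j = 1, 2, ..., n(n-1)/2, and calculate all determinants det(T_j). We have null determinants in the case of perfect concordance of ranks, and positive and negative determinants in other cases. We can resort to these determinants as intermediate scores and use:

1) the absolute value of the sum of their algebraic values,
2) the sum of their absolute values,
3) the square root of the sum of their second powers,

to build RCCs whose variable components are:

\[ S_1 = \left| \sum_{j=1}^{n(n-1)/2} \det(T_j) \right|, \qquad S_2 = \sum_{j=1}^{n(n-1)/2} \left| \det(T_j) \right|, \qquad S_3 = \sqrt{ \sum_{j=1}^{n(n-1)/2} \left( \det(T_j) \right)^2 }. \]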
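As a small worked illustration of the three quantities (an added example, not taken from the original text), take n = 3 with x = (1, 2, 3) and y = (1, 3, 2). It assumes, as Section 2 suggests, that the columns of each minor are divided by their sums before the determinant is taken; the labels d_{12}, d_{13}, d_{23} are introduced here only to name the three determinants.

```latex
\begin{align*}
d_{12} &= \frac{x_1 y_2 - x_2 y_1}{(x_1+x_2)(y_1+y_2)}
        = \frac{1\cdot 3 - 2\cdot 1}{3\cdot 4} = \frac{1}{12}, \\
d_{13} &= \frac{x_1 y_3 - x_3 y_1}{(x_1+x_3)(y_1+y_3)}
        = \frac{1\cdot 2 - 3\cdot 1}{4\cdot 3} = -\frac{1}{12}, \\
d_{23} &= \frac{x_2 y_3 - x_3 y_2}{(x_2+x_3)(y_2+y_3)}
        = \frac{2\cdot 2 - 3\cdot 3}{5\cdot 5} = -\frac{1}{5}, \\[4pt]
S_1 &= \left| d_{12} + d_{13} + d_{23} \right| = \frac{1}{5}, \\
S_2 &= |d_{12}| + |d_{13}| + |d_{23}| = \frac{11}{30} \approx 0.367, \\
S_3 &= \sqrt{d_{12}^2 + d_{13}^2 + d_{23}^2} = \sqrt{\frac{97}{1800}} \approx 0.232.
\end{align*}
```

In this example the positive and negative determinants partly cancel in S_1, while S_2 and S_3 keep the contribution of every minor.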
The purpose of this work is to introduce an index that fulfills the fundamental property of ordering the greater part of the couples of permutations of the first n integers on the basis of their mutual degree of correlation. The presence of this property can be ensured by the metric structure of the index and by its diversity. The quantity S2 has been selected for the construction of such an index, say δ, because it gives the best results under this condition with respect to the quantities S1 and S3. As a consequence of the emphasis given here to this property, and since the metric structure has already been verified for the index δ, the concept of diversity will be examined in more depth, in order to verify that the best index under this aspect is δ with respect to ?1, ?2, ρ and τ.

4 The property of diversity

Two indices are different if they give different results when applied to the same data set. An index has a greater diversity than another if its set of results is larger than the set of the other when applied to the same data set. In the case of perfect diversity of an index, we would observe a one-to-one correspondence between the elements of a sample universe and its image in R. In the literature on rank correlation we can find a considerable number of coefficients, among which Spearman's ρ and Kendall's τ are different and do not show a perfect diversity. Generally we agree to associate 1 with the case of perfect positive rank correlation, -1 with the case of perfect negative rank correlation, and other real numbers r, where -1 < r < 1, with the other cases. But how do we distinguish between two "other" cases? For example, it is not an easy task to say at first sight which of the following three cases has the highest correlation:

    a:  2 3 4 1 5      b:  1 2 5 3 4      c:  2 3 4 5 1
        1 2 3 4 5          1 2 3 4 5          1 2 3 4 5

As a result, let us consider the values of ρ and of τ for these couples of permutations:

    index\perm     a      b      c
    ρ             0.4    0.7    0
    τ             0.4    0.6    0.2

We see that in case a the indices agree, in case b ρ > τ and, in case c, ρ < τ. This evidently means that each index follows its own definition of rank correlation. Two main aspects arise in the comparison of the universes of two different indices applied to couples of permutations of the first n integers: the first pertains to the lack of an identical progression with respect to the same sequence of couples of permutations, as seen above; the second pertains to the frequency distribution, say DF_index, whose elements are couples (index, frequency). For example, for n = 4, we have:

    DF_ρ = {{-1,1}, {-0.8,3}, {-0.6,1}, {-0.4,4}, {-0.2,2}, {0,2}, {0.2,2}, {0.4,4}, {0.6,1}, {0.8,3}, {1,1}}
    DF_τ = {{-1,1}, {-0.6,3}, {-0.3,5}, {0,6}, {0.3,5}, {0.6,3}, {1,1}}

and we observe that DF_ρ has a wider variety of indices than DF_τ. This can be seen from the frequency distributions of the frequencies for the two indices, say index(ci, fi), where ci represents the numerousness of the class, that is the number of times an index has appeared in the universe, and fi the frequency of indices that have appeared ci times:

    ρ(ci, fi) = {{1,4}, {2,3}, {3,2}, {4,2}}
    τ(ci, fi) = {{1,2}, {3,2}, {5,2}, {6,1}}.        (4.1)

It is expected that, in the case of complete diversity of an index, we should have:

    index(c, f) = {{1, n!}}

As a measure of the degree of diversity of an index, we will choose the Shannon relative entropy index:

\[ H = - \frac{ \sum_{i=1}^{k} p_i \log p_i }{ \log k }, \]

where p_i = c_i × f_i / n! and k is the number of classes ci. We observe that for the distributions in 4.1 the Shannon relative entropy index gives H = 0.71 for ρ(ci, fi) and H = 0.86 for τ(ci, fi); this means that more permutations may be distinguished by ρ than by τ.

5 Definition of the rank correlation coefficient δ and its properties

The definition of RCC δ
is bound to the quantity S2, for simplicity S, which is the sum of the absolute values of the determinants appearing in columns (4) to (6) of Table 1, where each determinant is taken on the minor formed by rows j and i, with the columns divided by their sums:

\[ \det \begin{pmatrix} \frac{x_j}{x_j+x_i} & \frac{y_j}{y_j+y_i} \\[4pt] \frac{x_i}{x_j+x_i} & \frac{y_i}{y_j+y_i} \end{pmatrix} = \frac{x_j y_i - y_j x_i}{(x_j+x_i)(y_j+y_i)}, \]

and:

\[ S = \left| \frac{x_1 y_2 - x_2 y_1}{(x_1+x_2)(y_1+y_2)} \right| + \left| \frac{x_1 y_3 - x_3 y_1}{(x_1+x_3)(y_1+y_3)} \right| + \cdots + \left| \frac{x_{n-1} y_n - x_n y_{n-1}}{(x_{n-1}+x_n)(y_{n-1}+y_n)} \right|, \tag{5.1} \]

or

\[ S = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left| \det \begin{pmatrix} \frac{x_i}{x_i+x_j} & \frac{y_i}{y_i+y_j} \\[4pt] \frac{x_j}{x_i+x_j} & \frac{y_j}{y_i+y_j} \end{pmatrix} \right|. \tag{5.2} \]

Table 1: Determinants of the second order minors with constant sum of the elements of each column from a data matrix of two ordinal variables.

    subj.  X    Y    det (rows 1, i)     ...   det (rows j, i)     ...   det (rows n-1, i)
    (1)    (2)  (3)  (4)                       (5)                       (6)
    1      x1   y1
    2      x2   y2   x1 y2 - x2 y1
    3      x3   y3   x1 y3 - x3 y1       ...   x2 y3 - x3 y2
    ...
    i      xi   yi   x1 yi - xi y1       ...   xj yi - xi yj
    ...
    n      xn   yn   x1 yn - xn y1       ...   xj yn - xn yj       ...   x(n-1) yn - xn y(n-1)

S is an absolute index of cograduation; we will normalize it, obtaining S_rel, and make it vary within the interval [-1, 1], obtaining δ. We first need to find the maximum value of S, say S_max, which is obtained by putting in Table 1:

\[ x_i = i \quad \text{and} \quad y_i = n - i + 1, \tag{5.3} \]

which represent the case of perfect discordance between the two variables, and then applying the definition of S. In this case the determinants all bear the same sign, so the sum of their absolute values equals the absolute value of their sum. A very simple formula to obtain S_max is the following:

\[ S_{max} = \sum_{i=2}^{n} \sum_{j=1}^{i-1} \frac{i-j}{i+j}. \tag{5.4} \]

When the variables show tied ranks, to get S_max it is necessary to rearrange one variable in increasing order and the other in decreasing order and then apply the definition of S. The normalized value of S, say S_rel, is then:

\[ S_{rel} = \frac{S}{S_{max}}, \tag{5.5} \]

from which δ follows:

\[ \delta = 1 - 2 S_{rel}, \]

or also:

\[ \delta = \frac{S_{max} - 2S}{S_{max}}. \]
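The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming untied ranks 1..n with n ≥ 2; the function names `s_index`, `s_max` and `delta` are hypothetical and chosen here only for readability. Exact fractions are used so that the limiting cases can be checked without rounding.

```python
from fractions import Fraction
from itertools import combinations


def s_index(x, y):
    """Sum of |det| over all pairs of rows, columns normalized as in eq. 5.1."""
    total = Fraction(0)
    for i, j in combinations(range(len(x)), 2):
        numerator = x[i] * y[j] - x[j] * y[i]
        denominator = (x[i] + x[j]) * (y[i] + y[j])
        total += abs(Fraction(numerator, denominator))
    return total


def s_max(n):
    """Closed form of eq. 5.4, valid for untied ranks 1..n (assumption)."""
    return sum(
        (Fraction(i - j, i + j) for i in range(2, n + 1) for j in range(1, i)),
        Fraction(0),
    )


def delta(x, y):
    """delta = 1 - 2 * S / S_max, following eq. 5.5 (requires n >= 2)."""
    return 1 - 2 * s_index(x, y) / s_max(len(x))


if __name__ == "__main__":
    natural = [1, 2, 3, 4, 5]
    # The three couples of permutations a, b, c discussed in Section 4.
    cases = [("a", [2, 3, 4, 1, 5]), ("b", [1, 2, 5, 3, 4]), ("c", [2, 3, 4, 5, 1])]
    for label, perm in cases:
        print(label, float(delta(perm, natural)))
```

With identical rankings every determinant vanishes, so S = 0 and δ = 1; with one ranking reversed, S computed from 5.1 coincides with the closed form 5.4 and δ = -1.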
6 Some sampling distribution properties

6.1 Triangular inequality

We have already seen that the triangular property holds for δ, while for ρ and τ it does not.

6.2 Diversity

Beginning with the frequency distribution of an index in the universe, which we will call Index and whose generic couple is (index_i, f_i), we can build the frequency distribution of the frequencies of Index, which we indicate with FI and whose generic couple is (f_i, ff_i). From here we go to the frequency distribution CN, which we shall call Numerousness of the classes of Index, or simply Numerousness of Index, whose generic couple is (c_i, fc_i), with

\[ c_i = f_i, \qquad fc_i = \frac{f_i \times ff_i}{n!} \times 100. \]

Hence c_i represents the number of times an index can be repeated in the universe, and fc_i represents the percentage of indices of the universe which are repeated a number of times equal to c_i. Numerousness is strictly related to diversity, which consists of both richness (how many) and evenness (how distributed). Through the data reported in Tables 2, 3, 4, 5 and 6, which show the CN distributions of the indices δ, ?1, ?2, ρ and τ for some values of n, we will measure the degree of diversity of these indices.

Table 2: Percent frequency of δ for classes of numerousness for some n.

           n = 4             n = 5             n = 6             n = 7             n = 8
       ci    fci         ci    fci         ci    fci         ci    fci         ci    fci
        1   41.7          1   21.7          1   9.72          1   4.50          1    8.8
        2   41.7          2   63.4          2  58.61          2  66.40          2   45.3
        4   16.6          4   10.0          3   0.83          3   0.98          3    8.6
                          6    4.9          4  22.22          4  20.87          4   19.6
                                            6   5.84          6   5.24          5    3.2
                                            8   1.11          8   2.06          6    5.2
                                           12   1.67         16   0.63          7    1.5
                                                                                8    1.9
                                                                                9    0.5
                                                                               10    0.6
                                                                            11-24    0.9

Table 3: Percent frequency of ?1 for classes of numerousness for some n.

           n = 4             n = 5             n = 6             n = 7             n = 8
       ci    fci         ci    fci         ci    fci         ci    fci         ci    fci
        1   91.7          1   55.0          1    7.5          1    0.4          1    0.1
        2    8.3          2   28.3          2   11.7          2    1.0          2    0.0
                          3   10.0          3   20.0          3    1.2          3    0.1
                          4    6.7          4   23.3          4    0.6          4    0.1
                                            5   19.4          5    1.4          5    0.1
                                            6    5.0          6    2.9          6    0.1
                                            7    3.9          7    1.7          7    0.2
                                            8    6.7          8    4.2          8    0.1
                                            9    2.5          9    5.0          9    0.1
                                                             10    6.3         10    0.3
                                                          11-32   75.3     11-153   98.8

Table 4: Percent frequency of ?2 for classes of numerousness for some n.

           n = 4             n = 5             n = 6             n = 7             n = 8
       ci    fci         ci    fci         ci    fci         ci    fci         ci    fci
        1   16.7          1    1.7          1      2          1   0.04          1   0.01
        2   25.0          3    5.0          5      2          6   0.24          7   0.03
        3   25.0          4   13.3          6      2         10   0.40         15   0.07
        4   33.3          6   35.0          9      2         14   0.56         22   0.11
                          7   11.7         12      2         26   1.03         47   0.23
                         10   33.3         14      2         29   1.15         54   0.27
                                           16      2         35   1.39         70   0.35
                                           20      4         46   1.83         94   0.47
                                           21      2         54   2.14        124   0.62
                                           23      3         55   2.18        129   0.64
                                        24-42     14     70-184  89.05   178-1066  97.21

Table 5: Percent frequency of ρ for classes of numerousness for some n.

           n = 4             n = 5             n = 6             n = 7             n = 8
       ci    fci         ci    fci         ci    fci         ci    fci         ci    fci
        1   16.7          1    1.7          1    0.3          1   0.04          1   0.01
        2   25.0          3    5.0          5    1.4          6   0.24          7   0.03
        3   25.0          4   13.0          6    1.7         10   0.40         15   0.07
        4   33.3          6   35.0          9    2.5         14   0.56         22   0.11
                          7   11.7         12    3.3         26   1.03         47   0.23
                         10   33.0         14    3.9         29   1.15         54   0.27
                                           16    4.4         35   1.39         70   0.35
                                           20   11.1         46   1.83         94   0.47
                                           21    5.8         54   2.14        124   0.62
                                           23    6.9         55   2.18        129   0.64
                                        24-42   59.2     70-184  89.05   178-1066  97.21

Table 6: Percent frequency of τ for classes of numerousness for some n.

           n = 4             n = 5             n = 6             n = 7             n = 8
       ci    fci         ci    fci         ci    fci         ci    fci         ci    fci
        1    8.3          1    1.7          1    0.3          1    0.1          1   0.01
        3   25.0          4    6.7          5    1.4          6    0.2          7   0.09
        5   41.7          9   15.0         14    3.9         20    0.8         27   0.36
        6   25.0         15   25.0         29    8.1         49    1.9         76   0.95
                         20   33.3         49   13.6         98    3.9        174   2.16
                         22   18.3         71   19.6        169    6.7        343   4.27
                                           90   25.0        259   10.3        602   2.49
                                          101   28.1        359   14.2        961  11.95
                                                            455   18.1       1415  17.60
                                                            531   21.1       1940  24.13
                                                            573   22.7  2493-3836  31.01

We can observe that, for n = 4, δ furnishes 8.8 percent of indices which are observed only once and 45.3 percent which are observed twice, while these percentages are respectively, with reference to ?1, 0.1 and 0.0; with reference to ?2, 16.7 and 25; with reference to ρ, still 16.7 and 25.0, the same as for ?2; and, with reference to τ, 8.3 and 25. This fact means that δ produces a greater diversity of indices than the others. Diversity indices must take into account both richness and evenness at the same time; for this purpose we have chosen the Shannon relative entropy index to point out the different degrees of diversity of the indices under analysis:

\[ H_{rel} = \frac{H}{\ln k} = - \frac{ \sum_{i=1}^{k} p_i \ln p_i }{ \ln k }, \]

where k is the number of classes c_i, and p_i is the probability of extracting an index from the i-th class:

\[ p_i = \frac{c_i \times f_i}{n!}. \]

In this way the value of H_rel must diminish when the number of these classes decreases and, furthermore, when the concentration of indices in the classes increases. As we can see in Tables 2 to 6, the following results are obtained:

    n      ρ      τ      δ      ?1     ?2
    4     0.57   0.65   0.37   0.06   0.60
    5     0.48   0.65   0.24   0.70   0.55
    6     0.71   0.61   0.21   0.34   0.76
    7     0.73   0.62   0.13   0.51   0.80
    8     0.78   0.65   0.12   0.69   0.85

We can see that the values of H_rel related to δ decrease as n increases, and are the lowest with respect to ?1, ?2, ρ and τ. After these results we continue our discussion without considering the indices ?1 and ?2, as we have chosen δ to compare with the most used indices ρ and τ.

6.3 Variance

The variances of the sampling distributions of ρ and of τ are:

\[ var(\rho) = \frac{1}{n-1}, \qquad var(\tau) = \frac{2(2n+5)}{9n(n-1)}; \]

it results, Kendall (1962):

\[ var(\rho) > var(\tau), \]

and specifically:

\[ \lim_{n \to \infty} \frac{var(\rho)}{var(\tau)} = 2.25. \]

Due to the difficulties of obtaining simple expressions for δ and for the parameters of its sampling distribution, comparisons with the parameters of ρ and τ will be made for definite values of n:

    n          4       5       6       7       8       ∞
    var(ρ)    0.33    0.25    0.2     0.17    0.14    0
    var(τ)    0.241   0.167   0.126   0.102   0.083   0
    var(δ)    0.064   0.045   0.035   0.029   0.024   0

Thus we observe that var(δ) is a decreasing function of n, just like var(ρ) and var(τ), and that it is always smaller than the other two.

6.4 Symmetry

The symmetry of the sampling distribution of δ
is ensured, as can be seen, by the empirical values of the odd central moments for some n and r,

\[ \mu_{\delta,r} = \frac{ \sum_{j=1}^{n!} (\delta_j - \mu_\delta)^r }{ n! }, \]

since, as n and r become larger, these parameters decrease in absolute value:

    n     μ_δ,3        μ_δ,5
    4    -0.01079    -0.00443
    5    -0.00590    -0.00194
    6    -0.00376    -0.00099
    7    -0.00266    -0.00059
    8    -0.00198    -0.00038

6.5 Normality

The parameters β_r = μ_2r / (μ_2)^r and γ_2 = 2 μ_2 / (¹S)², where ¹S is the simple mean deviation index, of the sampling distributions of δ, τ and ρ, for increasing values of n, are bounded above; in particular, those of γ_2 converge to the corresponding value of the normal distribution, as shown in the following table:

              ρ                       τ                       δ                    γ_2
    n    β2     β3     β4       β2     β3     β4       β2     β3      β4       ρ      τ      δ
    4   1.85   5.16  10.28     2.37   7.58  27.72     2.78  12.16   65.13     3.18   2.67   3.03
    5   2.07   5.40  15.86     2.53   9.11  39.6      2.82  13.74   90.40     3.12   2.83   3.01
    6   2.23   6.45  21.46     2.62  10.10  48.34     2.82  13.95   97.25     3.00   2.90   3.01
    7   2.28   7.12  25.91     2.68  10.79  54.90     2.84  14.10  100.31     3.03   2.95   3.01
    8   2.42   8.05  31.78     2.73  11.30  60.11     2.86  14.26  102.57     3.09   2.97   3.03
    ∞   3     15    105        3     15    105        3     15     105        3.14   3.14   3.14

7 Final remarks

The mathematical and statistical properties examined in the preceding sections for the different indices allow us to assume that most of these indices stand on an equal footing as representations of rank correlation. I have written this paper because, both in my professional practice and in my research, I have encountered the need to put in order couples of variables in a group and to distinguish the greatest number of them on the basis of their degree of correlation. Besides the other properties, like normality and variability, which put the index δ in a favorable position in comparison with the others, the two properties dealing with the metric structure of an index and with its degree of diversity still seem very favorable to the index δ, as has been broadly shown in this work.

References

[1] Blest, D.C. (2000): Rank Correlation - An alternative measure.
Australian and New Zealand Journal of Statistics, 42, 101-111.
[2] Fechner, T.G. (1897): Kollektivmasslehre. Leipzig: Lipps Ed.
[3] Gini, C. (1914): Di una misura della dissomiglianza tra due gruppi di quantità e delle sue applicazioni allo studio delle relazioni statistiche. Atti del Reg. Istit. Veneto delle Scien. Lett. ed Arti, 1914-15.
[4] Kendall, M.G. (1962): Rank Correlation Methods. London: Griffin.
[5] Lauro, N. (1977): Considerazioni sulla metrica degli indici di cograduazione. Giornate di lavoro AIRO 1977, Parma.
[6] Mango, A. (1997): Rank Correlation Coefficients: A New Approach. Computing Science and Statistics. Computational Statistics and Data Analysis on the Eve of the 21st Century. Proceedings of the Second World Congress of the IASC, 29, 471-476.
[7] Spearman, C. (1904): The proof and measurement of association between two things. Am. J. Psych., 15, 88.
[8] Tarsitano, A. (2005): Weighted rank correlation and hierarchical clustering. Classification and Data Analysis. In S. Zani and A. Cerioli (Eds): Book of Short Papers, MUP, Parma, 517-520.