A Comparison of the Most Commonly Used Measures of Association for Doubly Ordered Square Contingency Tables via Simulation

Atila Göktaş¹ and Öznur İşçi²

Abstract

The Spearman and Pearson correlation coefficients, the Gamma coefficient, Kendall's tau-b, Kendall's tau-c, and Somers' d are the most commonly used measures of association for doubly ordered contingency tables. So far no study has established a preference among these measures. The aim of this study is to compare them on generated square doubly ordered contingency tables of several types and sample sizes, and to determine which measures are more efficient. It is found that both the sample size and the dimension of the doubly ordered contingency table play a significant role in how these measures behave.

1 Introduction

When categorical measures have a natural order (e.g., strongly agree to strongly disagree; high, medium, low), they carry additional information compared with nominal variables. When there are two categorical variables that are both naturally ordered, a variety of effect size measures have been proposed for such ordinal data, including the Gamma coefficient, Kendall's tau-b, Kendall's tau-c, and Somers' d (Garson, 2008).

An ordinal variable is also a type of categorical variable. The only difference is that the categories of an ordinal variable have a clear ordering, whereas those of an ordinary categorical variable do not. For example, suppose you have a variable, patient's status, with three categories (worse, no difference and much better). In addition to being able to classify patients into these three categories, you can order the categories as worse, no difference and much better.
Now think of a variable like educational background (with levels such as elementary school graduate, high school graduate, some college and university graduate). These also can be ordered as elementary school, high school, some college, and university graduate. Even though the levels are ordered from lowest to highest, the distance between the levels need not be the same across the levels of the variable. Suppose we assign the scores 1, 2, 3 and 4 to the levels of educational experience and compare the difference between levels one and two with the difference between levels two and three, or with the difference between levels three and four. The difference between levels one and two (elementary and high school) is perhaps much larger than the difference between levels two and three (high school and some college). In this example we can order people by level of educational experience, but the size of the difference between levels is inconsistent (because the distance between levels one and two is larger than that between levels two and three), i.e., the level of measurement is ordinal, not interval (UCLA, 2007).

Doubly ordered categorical data, or doubly ordered contingency tables, are data with two variables that are both naturally ordered and cross tabulated. The most commonly and widely used measures of association for doubly ordered categorical data are based on differences between the probabilities of concordant and discordant pairs. Examples are Kendall's tau-b, Stuart's tau-c, Goodman-Kruskal's gamma, and Somers' d (Svensson, 2000). These measures differ in how they deal with ties.

¹ University of Mugla, Faculty of Sciences, Department of Statistics, Mugla, Turkey; gatilla@mu.edu.tr
² University of Mugla, Faculty of Sciences, Department of Statistics, Mugla, Turkey; oznur.isci@mu.edu.tr
One of the best known non-parametric measures of association is the Spearman rank correlation ρ. Another well known measure of association is Kendall's tau, which may be formulated as a Pearson product-moment correlation between signed indicators of the X's and Y's, while Spearman's rank correlation is the special case of the Pearson product-moment correlation computed on the ranks instead of the actual variates (Kruskal, 1958; Hoeffding, 1948). Kendall's tau, which does not require ranking scores to be specified for either rows or columns, and Somers' d are alternatives to Pearson's product-moment correlation coefficient and Spearman's rank-order correlation coefficient for ordinal data (Cyrus and Nitin, 1995).

2 The most commonly used measures of association

Spearman's rank-order correlation coefficient, Pearson's product-moment correlation coefficient, Goodman-Kruskal's gamma coefficient, Kendall's tau-b, Kendall's tau-c, and Somers' d are the most commonly used measures of association for doubly ordered contingency tables. This study was performed for square doubly ordered contingency tables. Here "square" means that the number of row categories equals the number of column categories.

Notation

The following notation is used throughout this study:

X_i: row variable arranged in ascending order: X_1 < X_2 < ...
$$\operatorname{cov}(X,Y) = \sum_{i=1}^{R}\sum_{j=1}^{C} X_i Y_j f_{ij} - \frac{\left(\sum_{i=1}^{R} X_i r_i\right)\left(\sum_{j=1}^{C} Y_j c_j\right)}{W} \tag{2.2}$$

S(X), presented in equation (2.3), is also called the variance of X,

$$S(X) = \sum_{i=1}^{R} X_i^2 r_i - \frac{\left(\sum_{i=1}^{R} X_i r_i\right)^2}{W} \tag{2.3}$$

and S(Y) in (2.4) is the variance of Y,

$$S(Y) = \sum_{j=1}^{C} Y_j^2 c_j - \frac{\left(\sum_{j=1}^{C} Y_j c_j\right)^2}{W} \tag{2.4}$$

The variance of r is

$$\mathrm{var}_1 = \frac{1}{T^4}\sum_{i,j} f_{ij}\left\{ T\,(X_i-\bar{X})(Y_j-\bar{Y}) - \frac{\operatorname{cov}(X,Y)}{2T}\left[(X_i-\bar{X})^2 S(Y) + (Y_j-\bar{Y})^2 S(X)\right]\right\}^2, \qquad T = \sqrt{S(X)\,S(Y)} \tag{2.5}$$

If the null hypothesis H_0: ρ = 0 is true (tested against one of the alternatives H_1: ρ ≠ 0, H_1: ρ > 0 or H_1: ρ < 0), the variance of r may be written as in (2.6),

$$\mathrm{var}_0 = \frac{\sum_{i,j} f_{ij} X_i^2 Y_j^2 - \left(\sum_{i,j} f_{ij} X_i Y_j\right)^2 / W}{\left(\sum_{i} r_i X_i^2\right)\left(\sum_{j} c_j Y_j^2\right)} \tag{2.6}$$

where $\bar{X} = \sum_{i=1}^{R} X_i r_i / W$ and $\bar{Y} = \sum_{j=1}^{C} Y_j c_j / W$ are the mean of X and the mean of Y respectively. Under the null hypothesis that there is no correlation, the statistic

$$t_{calculated} = \frac{r\sqrt{W-2}}{\sqrt{1-r^2}} \tag{2.7}$$

has a t distribution with W - 2 degrees of freedom.

2.2 Spearman rank correlation coefficient

Calculating Pearson's correlation coefficient requires the assumption that the two samples are normally distributed. If the assumption of normality is violated, Pearson's correlation coefficient will produce unreliable results. Hence a good alternative to Pearson's correlation coefficient is Spearman's rank correlation r_S, which can be calculated under the first assumption of Pearson's product-moment correlation alone (Lohninger, 1999). The second and third assumptions of Pearson's product-moment correlation need not be satisfied for the Spearman rank correlation to be used.

Dependence between ordinal variables is referred to as rank correlation, and its intensity is expressed by correlation coefficients. One of the most widely used ordinal coefficients is Spearman's correlation coefficient (Rezankova, 2009). Spearman's rank correlation coefficient r_S is computed by using rank scores R_i for X_i and rank scores C_j for Y_j.
These rank scores are defined as follows:

$$R_i = \sum_{k<i} r_k + \frac{r_i + 1}{2} \quad \text{for } i = 1, 2, \ldots, R \tag{2.8}$$

$$C_j = \sum_{h<j} c_h + \frac{c_j + 1}{2} \quad \text{for } j = 1, 2, \ldots, C \tag{2.9}$$

The formula for r_S can be obtained from the Pearson formula given in (2.1) by substituting R_i and C_j for X_i and Y_j, respectively. Its asymptotic variance under the null hypothesis of no correlation can be obtained from the formula presented in (2.6) with the same substitution.

$$r_S = \frac{\operatorname{cov}(R,C)}{\sqrt{S(R)\,S(C)}} \tag{2.10}$$

If there are no ties, a simpler formula for Spearman's rank correlation is given in (2.11):

$$r_S = 1 - \frac{6\sum_i d_i^2}{W(W^2-1)} \tag{2.11}$$

where d_i is the difference between the ranks assigned to the values of the two variables for item i of the data. When W is fairly small, the computation is very straightforward. In the case of numerically equal observations, the arithmetic average of the rank numbers associated with the ties is assigned to the tied values.

Formula (2.11) applies when there are no tied ranks. When there are tied ranks, (2.11) is not algebraically equivalent to (2.10). However, when the number of ties in the pairs of values is moderate, (2.11) is often used as a fairly good approximation.

Spearman's rank correlation coefficient may be used to test for association between both ordinal and continuous variables. The underlying relationship between the variables must be monotonic. In other words, the variables should either increase together, or one should decrease as the other increases. Some difficulties in calculating Spearman's rank correlation coefficient arise when the sample is large.
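The rank scores in (2.8) and (2.9) and the coefficient in (2.10) can be computed directly from the cell frequencies. The following is a minimal sketch, not the authors' implementation; the function names `midranks`, `table_correlation`, `spearman_from_table` and `t_statistic` are illustrative, and the table is assumed to be a list of rows of non-negative counts with at least two non-empty rows and columns.

```python
import math


def midranks(totals):
    """Rank scores as in (2.8)/(2.9): count in earlier categories + (t + 1) / 2."""
    scores, below = [], 0
    for t in totals:
        scores.append(below + (t + 1) / 2)
        below += t
    return scores


def table_correlation(f, x, y):
    """Correlation of an R x C frequency table f with row scores x, column scores y.

    Uses cov(X, Y), S(X), S(Y) in the form of equations (2.2)-(2.4).
    """
    W = sum(map(sum, f))                   # total frequency
    r = [sum(row) for row in f]            # row totals r_i
    c = [sum(col) for col in zip(*f)]      # column totals c_j
    sx = sum(xi * ri for xi, ri in zip(x, r))
    sy = sum(yj * cj for yj, cj in zip(y, c))
    cov = sum(x[i] * y[j] * f[i][j]
              for i in range(len(x)) for j in range(len(y))) - sx * sy / W
    s_x = sum(xi * xi * ri for xi, ri in zip(x, r)) - sx * sx / W
    s_y = sum(yj * yj * cj for yj, cj in zip(y, c)) - sy * sy / W
    return cov / math.sqrt(s_x * s_y)


def spearman_from_table(f):
    """Spearman's r_S: the table correlation evaluated on the rank scores."""
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    return table_correlation(f, midranks(r), midranks(c))


def t_statistic(r, W):
    """t = r * sqrt(W - 2) / sqrt(1 - r^2), referred to t with W - 2 d.f."""
    return r * math.sqrt(W - 2) / math.sqrt(1 - r * r)


table = [[2, 1], [1, 2]]
print(spearman_from_table(table))
```

Passing the integer scores 1, 2, ..., R and 1, 2, ..., C to `table_correlation` instead of the rank scores gives the Pearson coefficient for equally spaced category scores.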
For large data sets it can be hard to rank the data for both variables, and consequently the Spearman rank correlation test is time consuming to perform. Since Spearman's rank correlation coefficient is a non-parametric measure, it does not depend on the assumptions required for Pearson's product-moment correlation coefficient. Hence it is distribution free.

It can be used to test whether there is a statistically significant association between variables. The main purpose of Spearman's rank correlation coefficient is therefore to investigate whether any association exists between the underlying variables, and the null hypothesis states that there is no rank correlation between them. Under the null hypothesis that there is no correlation, the statistic

$$t_{calculated} = \frac{r_S\sqrt{W-2}}{\sqrt{1-r_S^2}} \tag{2.12}$$

has a t distribution with W - 2 degrees of freedom (Kendall and Stuart, 1973).

2.3 Goodman and Kruskal gamma (γ or G)

The Gamma (γ) statistic was proposed in a series of papers from 1954 to 1972 by Leo Goodman and William Kruskal. It is now mostly referred to simply as Gamma and is used to investigate association in a given doubly ordered contingency table. The estimator of gamma uses only the numbers of concordant and discordant pairs of observations. It ignores tied pairs, that is, pairs of observations that have equal values of X or equal values of Y. Gamma can be calculated only when both variables lie on an ordinal scale. It has the range -1 ≤ γ ≤ 1, just as Spearman's rank correlation coefficient does. If there is no association between the two variables, the estimator of gamma should be close to zero.
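Because the estimator of gamma depends only on concordant and discordant pairs, it can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the names `concordance_sums` and `gamma` are assumptions, and the per-cell sums `conc` and `disc` play the role of the quantities written C_ij and D_ij in the formulas that follow, so that P and Q come out as twice the numbers of concordant and discordant pairs.

```python
def concordance_sums(f):
    """Return (P, Q) for an R x C frequency table f given as a list of rows.

    P = sum of f_ij * C_ij and Q = sum of f_ij * D_ij, where C_ij adds the
    frequencies strictly above-left and below-right of cell (i, j), and
    D_ij adds those strictly above-right and below-left.
    """
    n_rows, n_cols = len(f), len(f[0])
    P = Q = 0
    for i in range(n_rows):
        for j in range(n_cols):
            conc = disc = 0
            for h in range(n_rows):
                for k in range(n_cols):
                    if (h < i and k < j) or (h > i and k > j):
                        conc += f[h][k]      # same order on both variables
                    elif (h < i and k > j) or (h > i and k < j):
                        disc += f[h][k]      # opposite order
            P += f[i][j] * conc
            Q += f[i][j] * disc
    return P, Q


def gamma(f):
    """Goodman-Kruskal gamma = (P - Q) / (P + Q); tied pairs are ignored."""
    P, Q = concordance_sums(f)
    return (P - Q) / (P + Q)   # undefined when every pair is tied (P + Q == 0)


print(gamma([[2, 1], [1, 2]]))
```

For a 2×2 table the result coincides with Yule's Q, which gives a quick hand check: (2·2 - 1·1) / (2·2 + 1·1) = 0.6 for the table above.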
The estimator of Gamma (γ) may be given as follows:

$$\gamma = \frac{P - Q}{P + Q} \tag{2.13}$$

where $P = \sum_i \sum_j f_{ij} C_{ij}$ is twice the number of concordant pairs, and is therefore proportional to the estimated probability that a randomly selected pair of observations is placed in the same order, and $Q = \sum_i \sum_j f_{ij} D_{ij}$ is twice the number of discordant pairs, proportional to the estimated probability that a randomly selected pair is placed in the opposite order. Here f_ij is the frequency in the i-th row and j-th column of the doubly ordered contingency table,

$$C_{ij} = \sum_{h<i}\sum_{k<j} f_{hk} + \sum_{h>i}\sum_{k>j} f_{hk}, \qquad D_{ij} = \sum_{h<i}\sum_{k>j} f_{hk} + \sum_{h>i}\sum_{k<j} f_{hk}.$$

The asymptotic standard error is

$$ASE_1 = \frac{4}{(P+Q)^2}\sqrt{\sum_{i,j} f_{ij}\,(Q\,C_{ij} - P\,D_{ij})^2} \tag{2.14}$$

Under the null hypothesis of independence or no association, the standard error becomes

$$ASE_0 = \frac{2}{P+Q}\sqrt{\sum_{i,j} f_{ij}\,(C_{ij} - D_{ij})^2 - \frac{1}{W}(P-Q)^2} \tag{2.15}$$

For 2×2 tables, gamma is equivalent to Yule's Q, which may be written as follows (Goodman and Kruskal, 1979; Agresti, 2010; Brown and Benedetti, 1977b):

$$Q = \frac{f_{11}f_{22} - f_{12}f_{21}}{f_{11}f_{22} + f_{12}f_{21}} \tag{2.16}$$

The gamma coefficient can also be calculated when a 2×2 table contains small or even zero frequencies. Suppose that gamma takes the value 0.582. It can then be inferred that knowing the independent variable reduces our errors in predicting the rank (not the value) of the dependent variable by 58.2%. Under statistical independence gamma will be zero, but gamma may also be zero in other situations, namely whenever the number of concordant pairs equals the number of discordant pairs. Moreover, according to gamma a perfect positive association is present whenever the number of discordant pairs is zero. Under the null hypothesis that there is no correlation, the statistic

$$Z_{calculated} = \frac{\gamma}{ASE_0} \tag{2.17}$$

has a standard normal distribution.

2.4 Kendall's Tau-b

Kendall's tau-b (τ_b) is similar to gamma except that tau-b uses a correction for ties. As with gamma, tau-b requires both variables to lie on an ordinal scale. Tau-b also has the range -1 ≤ τ_b ≤ 1, like both gamma and Spearman's rank correlation.
It is estimated by

$$\tau_b = \frac{P - Q}{\sqrt{D_r D_c}} \tag{2.18}$$

where $D_r = W^2 - \sum_{i=1}^{R} r_i^2$ and r_i is the total frequency of row i in the doubly ordered cross table, and $D_c = W^2 - \sum_{j=1}^{C} c_j^2$ and c_j is the total frequency of column j in the doubly ordered cross table. Its general standard error may be obtained as follows:

$$ASE_1 = \frac{1}{D_r D_c}\sqrt{\sum_{i,j} f_{ij}\left(2\sqrt{D_r D_c}\,(C_{ij} - D_{ij}) + \tau_b v_{ij}\right)^2 - W^3 \tau_b^2 (D_r + D_c)^2} \tag{2.19}$$

where v_ij is defined as r_i D_c + c_j D_r. Under the null hypothesis of independence or no association, the standard error takes the form

$$ASE_0 = 2\sqrt{\frac{\sum_{i,j} f_{ij}\,(C_{ij} - D_{ij})^2 - \frac{1}{W}(P-Q)^2}{D_r D_c}} \tag{2.20}$$

and under the null hypothesis of independence the asymptotic test statistic

$$Z_{calculated} = \frac{\tau_b}{ASE_0} \tag{2.21}$$

has a standard normal distribution. The test statistic given in (2.21) is used to test whether the degree of association in the cross tabulation is significant when both variables are measured on an ordinal scale (Kendall, 1955; Brown and Benedetti, 1977a; SAS, 2010).

Tau-b adjusts for ties and is most appropriate for square tables, that is, tables in which the number of row categories equals the number of column categories. A value of -1 is 100% negative association, or perfect inversion, whereas a value of +1 is 100% positive association, or perfect agreement. A value of zero indicates no association. If τ_b = ±1, there are no ties, and subjects from different cells form strictly concordant or strictly discordant pairs in these two extreme cases. When both τ_b = ±1 and γ = ±1, it is generally concluded that τ_b is the stronger of the two. If τ_b = 1 the table is diagonal, and if τ_b = -1 the table is skew diagonal (Tu, 2007).

2.5 Kendall's Tau-c

Stuart's tau-c (τ_c) makes an adjustment for table size as well as a correction for ties. Tau-c is also appropriate only when both variables lie on an ordinal scale. Tau-c has the range -1 ≤ τ_c ≤ 1, as do Spearman's rank correlation, Gamma and Tau-b. It is estimated by

$$\tau_c = \frac{q\,(P - Q)}{W^2 (q - 1)} \tag{2.22}$$

where q is defined as min(R, C).
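The estimators (2.18) and (2.22) differ only in their denominators, which the following minimal sketch makes explicit. It is an illustration under stated assumptions, not the authors' code: the helper names `pq`, `tau_b` and `tau_c` are hypothetical, the table is a list of rows of counts, and P and Q are computed as twice the numbers of concordant and discordant pairs.

```python
import math


def pq(f):
    """Return (P, Q): twice the numbers of concordant and discordant pairs."""
    n_rows, n_cols = len(f), len(f[0])
    P = Q = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for h in range(i + 1, n_rows):       # cells in later rows only,
                for k in range(n_cols):          # hence the factor 2 below
                    if k > j:
                        P += 2 * f[i][j] * f[h][k]
                    elif k < j:
                        Q += 2 * f[i][j] * f[h][k]
    return P, Q


def tau_b(f):
    """Kendall's tau-b = (P - Q) / sqrt(D_r * D_c), as in (2.18)."""
    P, Q = pq(f)
    W = sum(map(sum, f))
    D_r = W * W - sum(sum(row) ** 2 for row in f)        # W^2 - sum r_i^2
    D_c = W * W - sum(sum(col) ** 2 for col in zip(*f))  # W^2 - sum c_j^2
    return (P - Q) / math.sqrt(D_r * D_c)


def tau_c(f):
    """Stuart's tau-c = q (P - Q) / (W^2 (q - 1)), q = min(R, C), as in (2.22)."""
    P, Q = pq(f)
    W = sum(map(sum, f))
    q = min(len(f), len(f[0]))
    return q * (P - Q) / (W * W * (q - 1))


table = [[2, 1], [1, 2]]
print(tau_b(table), tau_c(table))
```

On a square table with equal margins, such as the one above, the two coefficients coincide; they move apart as the margins become unequal, because only tau-b uses the marginal totals.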
The general standard error of tau-c may be written as follows:

$$ASE_1 = \frac{2q}{(q-1)W^2}\sqrt{\sum_{i,j} f_{ij}\,(C_{ij} - D_{ij})^2 - \frac{1}{W}(P-Q)^2} \tag{2.23}$$

Under the null hypothesis of no association, ASE_1 is identical to ASE_0. Therefore the test statistic that may be used to investigate the degree of association between two ordinal variables under the null hypothesis of no association can be expressed as

$$Z_{calculated} = \frac{\tau_c}{ASE_0} \tag{2.24}$$

where Z_calculated has a standard normal distribution. Besides adjusting for ties, tau-c is most suitable for rectangular tables. A value of -1 is 100% negative association, or perfect inversion, whereas a value of +1 is 100% positive association, or perfect agreement. A value of zero indicates no association (Brown and Benedetti, 1977a; SAS, 2010). Kendall's tau-c, also called Stuart's tau-c or Kendall-Stuart tau-c, is a variant of tau-b for larger tables. It also adjusts for the size of the cross table (Lohninger, 1999).

2.6 Somers' d

Somers' d(C|R) and Somers' d(R|C) are asymmetric modifications of tau-b. C|R means that the row variable X is treated as the independent variable and the column variable Y as the dependent variable. Similarly, R|C denotes the reverse interpretation. Somers' d differs from tau-b in that it corrects only for pairs tied on the independent variable. Somers' d can be calculated only when both variables are ordered. It varies in the range -1 ≤ d ≤ 1. The formula for Somers' d depends on which variable is treated as independent.
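The asymmetry described above amounts to a choice of denominator. The sketch below is illustrative only: the function name `somers_d` and the returned keys are assumptions, the quantities D_r = W² - Σ r_i² and D_c = W² - Σ c_j² are the ones used for tau-b in (2.18), and the symmetric version is taken as (P - Q) divided by the mean of D_r and D_c.

```python
def somers_d(f):
    """Somers' d for an R x C frequency table f (list of rows).

    Returns a dict with
      'y_given_x': rows (X) independent   -> (P - Q) / D_r
      'x_given_y': columns (Y) independent -> (P - Q) / D_c
      'symmetric': (P - Q) / ((D_r + D_c) / 2)
    """
    n_rows, n_cols = len(f), len(f[0])
    P = Q = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for h in range(i + 1, n_rows):
                for k in range(n_cols):
                    if k > j:
                        P += 2 * f[i][j] * f[h][k]   # concordant, counted twice
                    elif k < j:
                        Q += 2 * f[i][j] * f[h][k]   # discordant, counted twice
    W = sum(map(sum, f))
    D_r = W * W - sum(sum(row) ** 2 for row in f)
    D_c = W * W - sum(sum(col) ** 2 for col in zip(*f))
    return {
        "y_given_x": (P - Q) / D_r,
        "x_given_y": (P - Q) / D_c,
        "symmetric": 2 * (P - Q) / (D_r + D_c),
    }


print(somers_d([[3, 1], [0, 2]]))
```

With unequal row and column margins, as in this example, the two asymmetric values differ, and the symmetric value lies between them.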
For instance, if the row variable X is treated as independent, then Somers' d can be calculated as

$$d_{Y|X} = \frac{P - Q}{D_r} \tag{2.25}$$

and its general standard error is defined as

$$ASE_1 = \frac{2}{D_r^2}\sqrt{\sum_{i,j} f_{ij}\left\{D_r\,(C_{ij} - D_{ij}) - (P-Q)(W - r_i)\right\}^2} \tag{2.26}$$

or, under the null hypothesis of independence, its standard error may be written as

$$ASE_0 = \frac{2}{D_r}\sqrt{\sum_{i,j} f_{ij}\,(C_{ij} - D_{ij})^2 - \frac{1}{W}(P-Q)^2} \tag{2.27}$$

By interchanging the roles of X and Y, the formulas for Somers' d with X as the dependent variable can be obtained with only a minor change in the denominator, replacing D_r with D_c. If neither variable is designated as independent or dependent, the symmetric version of Somers' d is appropriate, and it is calculated as

$$d_{symmetric} = \frac{P - Q}{\frac{1}{2}(D_c + D_r)} \tag{2.28}$$

and its standard error simplifies to

$$ASE_1 = \frac{2\sigma_{\tau_b}}{D_r + D_c}\sqrt{D_r D_c} \tag{2.29}$$

where $\sigma_{\tau_b}^2$ is the variance of Kendall's τ_b. Under the null hypothesis of no association its standard error may be obtained as

$$ASE_0 = \frac{4}{D_c + D_r}\sqrt{\sum_{i,j} f_{ij}\,(C_{ij} - D_{ij})^2 - \frac{1}{W}(P-Q)^2} \tag{2.30}$$

A Somers' d value of -1 is 100% negative association, or perfect inversion, whereas a value of +1 is 100% positive association, or perfect agreement (Somers, 1962; Goodman and Kruskal, 1963; Liebetrau, 1983; SAS, 2010). A value of zero indicates no association. Under the null hypothesis of independence, the following statistic asymptotically has a standard normal distribution:

$$Z_{calculated} = \frac{d_{symmetric}}{ASE_0} \tag{2.31}$$

3 Generation of doubly ordered contingency table

The statistical simulation literature offers many techniques for generating a doubly ordered contingency table. For instance, a doubly ordered contingency table may be generated from the uniform association model (Agresti, 2010). In our study we present a new way of generating a doubly ordered contingency table using the bivariate standard normal distribution. In the first step we generate two independent and identically distributed random variables, X_1 ~ N(0,1) and X_2 ~ N(0,1).
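The generation step can be sketched as follows. The mixing of the two independent normals follows equations (3.1) and (3.2), with weights satisfying a² + b² = 1 and 2ab = ρ; one solution of that pair of constraints is a = (√(1+ρ) + √(1-ρ))/2 and b = (√(1+ρ) - √(1-ρ))/2. How the continuous pair is cut into ordered categories is not shown in this part of the text, so the equal-probability cut points used here are purely an assumption, as are the names `mixing_weights` and `simulate_table`.

```python
import math
import random
from bisect import bisect_right
from statistics import NormalDist


def mixing_weights(rho):
    """Solve a^2 + b^2 = 1 and 2ab = rho for the weights in (3.1)-(3.2)."""
    a = (math.sqrt(1 + rho) + math.sqrt(1 - rho)) / 2
    b = (math.sqrt(1 + rho) - math.sqrt(1 - rho)) / 2
    return a, b


def simulate_table(rho, n, k, seed=None):
    """Generate one k x k doubly ordered table from n correlated normal pairs.

    ASSUMPTION: categories are formed by cutting each variable at the
    standard normal quantiles 1/k, 2/k, ..., (k-1)/k; the source's own
    categorisation rule may differ.
    """
    rng = random.Random(seed)
    a, b = mixing_weights(rho)
    cuts = [NormalDist().inv_cdf(m / k) for m in range(1, k)]
    table = [[0] * k for _ in range(k)]
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)   # independent N(0, 1)
        x = a * x1 + b * x2                         # equation (3.1)
        y = b * x1 + a * x2                         # equation (3.2)
        table[bisect_right(cuts, x)][bisect_right(cuts, y)] += 1
    return table


for row in simulate_table(rho=0.9, n=200, k=4, seed=1):
    print(row)
```

Since a² + b² = 1, both X and Y are again standard normal, and their correlation is 2ab = ρ, which is why these two constraints are sufficient.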
To generate two random variables (X and Y) from the bivariate normal distribution with a given correlation ρ for a specific sample size, we apply the following:

$$X = aX_1 + bX_2 \tag{3.1}$$

$$Y = bX_1 + aX_2 \tag{3.2}$$

where a² + b² = 1 and 2ab = ρ, and hence a and b are obtained as

$$a,\ b = \frac{\sqrt{1+\rho} \pm \sqrt{1-\rho}}{2}.$$

[Figure 2, panels a to f. Horizontal axis: Table Dimension (3×3, 4×4, 5×5, 6×6, 7×7, 8×8, 9×9); vertical axis: degree of association (ticks from 0.2 to 1); series: Pearson, Spearman, Gamma, TauB, TauC, SomerD.]

Figure 2a: Table dimension against degree of the ordinal measures of association for ρ = 0.9 and n = 50.
Figure 2b: Table dimension against degree of the ordinal measures of association for ρ = 0.9 and n = 100.
Figure 2c: Table dimension against degree of the ordinal measures of association for ρ = 0.9 and n = 150.
Figure 2d: Table dimension against degree of the ordinal measures of association for ρ = 0.9 and n = 200.
Figure 2e: Table dimension against degree of the ordinal measures of association for ρ = 0.9 and n = 250.
Figure 2f: Table dimension against degree of the ordinal measures of association for ρ = 0.9 and n = 500.
Figure 2g: Table dimension against degree of the ordinal measures of association for ρ = 0.9 and n = 750.
Figure 2h: Table dimension against degree of the ordinal measures of association for ρ = 0.9 and n = 1000.

Figure 2: Table dimension against degree of the ordinal measures of association for ρ = 0.9 and n = 50, 100, 150, 200, 250, 500, 750, 1000.

[Figure 3, panels a to g. Horizontal axis: Sample Size; vertical axis: degree of association (ticks from 0 to about 0.7); series: Pearson, Spearman, Gamma, TauB, TauC, SomerD.]

Figure 3a: Sample size against degree of the ordinal measures of association for ρ = 0.5 and 3×3 table.
Figure 3b: Sample size against degree of the ordinal measures of association for ρ = 0.5 and 4×4 table.
Figure 3c: Sample size against degree of the ordinal measures of association for ρ = 0.5 and 5×5 table.
Figure 3d: Sample size against degree of the ordinal measures of association for ρ = 0.5 and 6×6 table.
Figure 3e: Sample size against degree of the ordinal measures of association for ρ = 0.5 and 7×7 table.
Figure 3f: Sample size against degree of the ordinal measures of association for ρ = 0.5 and 8×8 table.
Figure 3g: Sample size against degree of the ordinal measures of association for ρ = 0.5 and 9×9 table.

[Further panels. Horizontal axis: Sample Size; vertical axis ticks from 0 to 1; series: Pearson, Spearman, Gamma, TauB, TauC, SomerD. Captions not present in the text.]