CONTENTS Metodoloski zvezki, Vol. 13, No. 1, 2016 Morteza Amini and S.M.T.K. MirMostafaee Interval Prediction of Order Statistics Based on Records by Employing Inter-Record Times: A Study Under Two Parameter Exponential Distribution 1 Anindita Datta, Seema Jaggi, Cini Varghese and Eldho Varghese Series of Incomplete Row-Column Designs with Two Units per Cell 17 Jose Pina-Sanchez Adjustment of Recall Errors in Duration Data Using SIMEX 27 Janez Stare and Delphine Maucort-Boulch Odds Ratio, Hazard Ratio and Relative Risk 59 Metodoloski zvezki, Vol. 13, No. 1, 2016, 1-15 Interval Prediction of Order Statistics Based on Records by Employing Inter-Record Times: A Study Under Two Parameter Exponential Distribution Morteza Amini1 and S.M.T.K. MirMostafaee2 Abstract In this note, we propose a parametric inferential procedure for predicting future order statistics, based on record values, which takes inter-record times into account. We utilize the additional information contained in inter-record times for predicting future order statistics on the basis of observed record values from an independent sample. The two parameter exponential distribution is assumed to be the underlying distribution. 1 Introduction Suppose Y1,... ,Ym are independent and identically distributed (iid) observations from an absolutely continuous cumulative distribution function (cdf) F, possessing probability density function (pdf) f. The order statistics of the sample Y1,..., Ym, represented by Yi:m < ■ ■ ■ < Ym:m, are obtained by arranging the sample in an increasing order. Order statistics have been used in a wide range of applications, including robust statistical estimation, detection of outliers, characterization of probability distributions, goodness-of-fit tests, entropy estimation, analysis of censored samples, reliability analysis, quality control and strength of materials. A useful survey of available results until 2003 is given in the book of David and Nagaraja (2003). Let X1,X2,... be a sequence of iid random variables, independent of and identically distributed to Y1. An observation Xj is called an upper (lower) record value if its value exceeds (resp. falls below) those of all the previous observations, that is the nth upper (resp. lower) record value, Un (resp. Ln), is defined as XTn, where T1 = 1, with probability 1, and Tn = min{j : j > Tn-1, Xj > XTn-1} (resp. Tn = min{j : j > Tn-1, Xj < XTn-1}), for n > 1. Throughout this paper we department of Statistics, School of Mathematics, Statistics and computer Science, College of Science, University of Tehran, P.O. Box 14155-6455, Tehran, Iran; morteza.amini@ut.ac.ir 2 Department of Statistics, Faculty of Mathematical Sciences, University of Mazandaran, P.O. Box 47416-1467, Babolsar, Iran; m.mirmostafaee@umz.ac.ir 2 Amini and MirMostafaee deal with upper record values for a predictive inference. Similar results can be obtained for the case of lower record values. The inter-record time statistic, defined as As = Ts+1 - Ts, s > 1, is the number of observations between sth and (s + 1)th record values. For more details we refer the reader to Arnold et al. (1998). Record data arise in a wide variety of practical situations including industrial stress testing, finance, meteorological analysis, hydrology, seismology, sporting and athletic events, and mining surveys. The problem of predicting future observations has been extensively studied in the literature and several parametric and non-parametric procedures are developed for prediction. In many practical data-analytic situations, one is interested in constructing a prediction interval on the basis of available observations. There are situations in which the available observations and the predictable future observation are of the same type. The prediction of future records on the basis of observed records from the same distribution and prediction of order statistics based on order statistics are studied, among others, by Dunsmore (1983), Nagaraja (1984), Chou (1988), Awad and Raqab (2000), Raqab and Balakrishnan (2008) and the references therein. Recently, Ahmadi and Balakrishnan (2010), Ahmadi and MirMostafaee (2009), Ah-madi et al. (2010) and MirMostafaee and Ahmadi (2011), discussed the prediction of future records from a Y-sequence based on the order statistics observed from an independent X-sequence, and vice versa. In predicting future order statistics on the basis of observed record statistics, sometimes the available observations also include inter-record times which can be utilized as additional information to improve the predictive inference. In other words, when both record values and the inter-record times are available, it would be nice to employ the information included in both records and record times. Feuerverger and Hall (1998) emphasized that "However, the record times and record values jointly contain considerably more information about F than the record values alone." Actually, applying the additional information about record times is not a new subject and several authors focused on inference based on both record values and record times, see for example Samaniego and Whitaker (1986), Lin et al. (2003), Doostparast (2009), Doostparast and Balakrishnan (2013), Kizilaslan and Nadar (2014) and MirMostafaee et al. (2016). In this paper, a two parameter exponential distribution, Exp(p, a), with pdf f (x; a) = 1 e-{x-^)/a, x > ^ G R, a > 0, (1.1) a is considered as the underlying distribution. We write Z ~ Exp(p, a) if the pdf of Z can be expressed as (1.1). Note that ^ and a are the location and scale parameters, respectively. Throughout this paper we assume that both parameters, ^ and a, are unknown. Now, suppose that Y1, ■ ■ ■ ,Ym constitute a future random sample from a two parameter exponential distribution, i.e. Yi, ■ ■ ■ , Ym ~ Exp(^, a) and Yi:m < ■ ■ ■ < Yj:m are the corresponding order statistics of this sample. In addition, Ym = m-1 Y:m denotes the mean of this future sample. If Y1, ■ ■ ■ ,Ym denote the times to failure of m independent units in a lifetime test, then Ym can be interpreted as the mean time on test of these failed units. We assume that the available data include the observed upper record Interval Prediction of Order Statistics... 3 values, U1, ■ ■ ■ , Un, given the inter-record times, (Ai,..., An-1). We emphasize that these record values are assumed to be extracted from a sequence of iid random variables {Xj,j = 1,2, ■ ■ ■ } where Xj ~ Exp(p, a) for j = 1,2, ■ ■ ■. Moreover, the sequence {Xj, j = 1, 2, ■ ■ ■ } and the sample {Yi, i = 1, ■ ■ ■ , m} are statistically independent. Note that n is the number of the observed record values and depends on the experiment, however, m is the sample size of the future observations and it can be considered arbitrary. In addition, n and m are unrelated. The problem of interest is to obtain conditional prediction intervals for jth future order statistic, Yj:m, as well as for the mean, Ym, in a future sample on the basis of the available data. We compare our conditional prediction intervals with the unconditional ones proposed by Ahmadi and MirMostafaee (2009) and observe an improvement over the predictive inference without inter-record times. Therefore, we consider two cases: (a) The informative data contain only the upper record values, (b) The informative data contain the upper record values and the inter-record times, and then we observe that case (b) has some predictive inferential improvement in comparison with case (a). The rest of the paper is organized as follows. Some general preliminaries are presented in Section 2. Conditional prediction intervals for the future jth order statistic, Yj:m, and the mean of the future sample, Ym, based on record values of given inter-record times for the two parameter exponential distribution are studied in Sections 3 and 4. An illustrative example and some concluding remarks are involved in Sections 5 and 6. The R codes for computing some results of the paper are given in the appendix. 2 Preliminaries In this section, we present some general preliminary results used in future sections. Given upper record values u1,..., un-1, which are observed and extracted from the sequence {Xj; j > 1}, inter-record times A1,..., An-1 are independent geometrically distributed random variables with success probabilities F(ui), i = 1,..., n — 1. Furthermore, the record values U1,... ,Un form a Markov Chain with adjacent transition pdf equal to the left truncated pdf of the underlying distribution, see Arnold et al. (1998). Thus, the joint distribution of Un = (U1,..., Un) and An = (A1,..., An-1) is fUnAn (Un, Sn) = J] f (Ui)[F (Ui)]'1-1/K), (2.1) i=1 where un = (u1,..., un) G Xn, in which X is the support of X and Sn = (51,..., 5n-1) G Nn-1, see Samaniego and Whitaker (1986) and Arnold et al. (1998) page 169. We emphasize that An contains n — 1 positive integer-valued discrete random variables and 8n is the observed vector of An. By integrating (2.1) with respect to (w.r.t.) u1,...,un, we can easily prove the following result. Lemma 1 The joint probability mass function of A1,..., An-1 is n-1 Pa„ (¿n) = Pr(An = Sn) = ^ cj (n' 5n)[(a1 (n, j, Sn) + 1)(a1(n, j, Sn) + an(n,j, Sn) + 2)]-- j=1 4 Amini and MirMostafaee where "j-2 / n-ji-1 \ n-j-2 / n-j N cj(n, ¿„) = (-1)n-j-1 u E * H E ^ jl=0 \t=n-j + 1 / j2=0 Vi=j2 + 2 ./ n-j n-1 ai(n,j, 6n) = E ^ - 1' an(n,j, ¿n) = E t=1 t=n-j+1 in which we assume for a > t=a = 0 and J}b=a ¿t = 1. In this paper, we need the conditional distribution of U1 and Un given by An = ¿n as follows. Lemma 2 The conditional pdfof U1 and Un given An = ¿n is n-1 1 fuuun\ A„ (ui,un| ¿n) = [Pa„ (¿n)]-1 ^ Cj (n, ¿„)[F(ui)]ai )/ (u„), j=1 where Cj(n, ¿n), a1(n,j, 8n), an(n,j, ¿n) and Pa„ (¿n) are as in Lemma 1. The proof of Lemma 2 is straightforward by integrating (2.1) w.r.t. u2,... ,un-1 and dividing the obtained equation by PA„ (¿n). 3 Conditional prediction intervals for order statistics In this section, the goal is to find a conditional prediction interval for Yj:m when the observed U1,... ,Un are available given An = ¿n for the two parameter exponential distribution. To this end, we consider the pivotal quantity w = (31) Note that the pivotal quantity Wj is the same as the one considered by Ahmadi and MirMostafaee (2009). This quantity is location and scale invariant namely it is free of both unknown parameters i.e. the location parameter / and the scale parameter a. It is also a simple function of both observed and future statistics, so that the future statistic can be derived from it easily. Ahmadi and MirMostafaee (2009) found the unconditional distribution of Wj while we present the conditional distribution of Wj given An = ¿n, (i.e. the inter-record times are assumed to be known and fixed) in the following theorem. Theorem 1 The conditional cdfof Wj in (3.1) given An = ¿n is for w > 0 m n-1 l ai(n,ji, Sn) a„(n,ji, Sn) Sn)^ ^a„(n,ji, l ^ Fw3\An(w^n) = EEE E E 1 lyj,^) ~ l=j ji = 1 j2=0 j3=0 J4 = 0 n' xcji (n, ¿n)[(j2 + m - l + j3 + j4 + 2)((j2 + m - l)w + j + 1)]-1, Interval Prediction of Order Statistics... 5 and for w < 0 FWj |A„ M^ = ^^^ I] I] l (— 1) j +:3+:4 p/% ) j2 l=jjl = 1 j2=0 33 = 0 34 = 0 AnV nJ xcji (n, ¿„)[(j2 + m - l + j3 + j4 + 2)(j4 + 1 - w(j3 + j4 + 2))] —1, where a1(n,j1, 6n) and an(n,j1, 6n) are defined in Lemma 1 and PAn (5n) is the joint mass function of A1,..., An—1 which is also given in Lemma 1. Proof: Letting J*^ = (Un - U1)/a, U* = (U1 - ß)/a and Yj*m = (Yj:m - ß)/a, we may write i'^O i'^O FWj | An (wl^n) = J j0 %m (vw + U)fu*,J*A | An (u, v|^n) du dv (3-2) 00 For t > 0, we have i=j Also, from Lemma 2, we obtain m (t) = £ (7) (1 - e—')l "')::- e—<)le—'. (3.3) n-1 /u.,j:i|A„(u,v| ¿n) = [Pa„(¿n)]-^ Cj(n, ¿n)[1 —e-u]ai (n'3',")[1 —e-("+v)]a"(n35n)e-(2u+v). 3=1 (3.4) Hence, by substituting (3.4) and (3.3) in (3.2) and using the binomial expansions, we have for w > 0, m n-1 l ai(n,ji, Sn) a„(n,ji, Sn) /m\/ai(n,ji, ¿„)\/a„(n,ji, S„)\f l\ (n s ) F (w| A ) V V {l){ 33 A J4 3Cji (n 0n) | An ^ ¿^ -(-1)32+33+34 Pa (S )- l=jji=132=0 33=0 34=0 V ' AnV n7 x / e-(3'2+m-l+33+34 + 2)«e-((32+m-l)w+34+1)u d^ dV 00 and therefore we naturally arrive at the desired expression. Similarly, we may attain the expression for FWj| An(w|Sn) when w < 0 after substituting (3.4) and (3.3) in (3.2) by noting that the integral w.r.t. u must be taken from — vw to to. □ Let wY(n, m, j; Sn) be the 7th conditional quantile of W3 given An = Sn, i.e. Pr(Wj < w7(n,m,j; Sn)| An = Sn) = Y- To find 100(1 — a)% two-sided conditional prediction intervals for l}:m based on record values given An = Sn, we have to find the conditional quantiles wai (n,m,j; Sn) and w1-a2 (n, m, j; Sn), for a1 + a2 = a, 0 < a < 1, i = 1, 2, numerically. Now, a 100(1 — a)% conditional prediction interval for Y}:m based on record values given An = Sn, is given by (U1 + wai(n,m,j; Sn)(Un — U1), U1 + w1-Q2(n,m,j; Sn)(Un — U1)). (3.5) 6 Amini and MirMostafaee Table 1: The values of wo.025(3, m, j), wo.975(3, m,j), wo.975(3, m, j) - wo.o25(3,m, j), wo.025(3, m, j; Sn), wo.975(3, m, j; Sn), wo.975(3, m, j; Sn) - wo.o25(3, m, j; Sn), for m = 10,20, j = 5,7,10 (for m = 10), j = 12,17,20 (for m = 20) and different values of Sn- m 10 20 j 5 7 10 12 17 20 Unconditional W0.025 -3.671 -2.814 -0.907 -3.140 -1.760 -0.380 W0.975 1.278 2.635 9.761 1.827 4.767 12.186 w0.975 — w0.025 4.949 5.449 10.668 4.967 6.527 12.566 Sn = (1, 2) W0.025 -1.097 -0.500 0.249 -0.651 0.055 0.464 Pa„ (Sn) = 0.0833 W0.975 1.766 3.502 11.868 2.459 6.108 14.586 w0.975 — w0.025 2.863 4.002 11.619 3.110 6.053 14.122 Sn = (1, 3) W0.025 -1.288 -0.652 0.201 -0.827 -0.025 0.420 Pa„ (Sn) =0.05 W0.975 1.290 2.627 9.481 1.786 4.675 11.690 W0.975 — W0.025 2.578 3.279 9.280 2.613 4.700 11.270 Sn = (1, 4) W0.025 -1.427 -0.774 0.160 -0.965 -0.098 0.386 Pa„ (Sn) = 0.0333 W0.975 1.022 2.106 7.984 1.398 3.793 9.872 W0.975 — W0.025 2.449 2.880 7.824 2.363 3.891 9.486 Sn = (2, 3) W0.025 -2.181 -1.267 0.045 -1.538 -0.320 0.324 Pa„ (Sn) =0.0167 W0.975 1.027 2.413 10.212 1.502 4.669 12.787 W0.975 — W0.025 3.208 3.680 10.167 3.040 4.989 12.463 Sn = (2,4) W0.025 -2.330 -1.409 -0.008 -1.697 -0.415 0.289 Pa„ (Sn) =0.0119 W0.975 0.823 1.976 8.880 1.193 3.896 11.163 W0.975 — W0.025 3.153 3.385 8.888 2.890 4.311 10.874 Conditionally on Sn, we get more information about the unknown parameters ^ and a, or generally more information about F, which leads to better prediction intervals for Yj:m. It is noted that conditioning on inter-record times does not decrease the length of the prediction interval necessarily and increase or decrease in the location and scale of the interval depend on the values of Sn. For the purpose of illustration, consider the conditional quantiles of Wj, which are computed and tabulated in Table 1, for a = 0.05, n = 3, m = 10, 20, j = 5,7,10 (m = 10), j = 12,17,20 (m = 20) and some values of Sn. The values of unconditional quantiles of Wj in Table 1 are taken from Ahmadi and MirMostafaee (2009), Tables 3 and 4. By comparing the entries of Table 1, one can observe that for a few cases, the conditional prediction intervals have bigger lengths, especially when we predict the biggest future order statistic, i.e. Ym:m. But note that in the most cases the conditional intervals are shorter than the unconditional ones for different values of Sn, so we may conclude that generally the conditional prediction approach leads to shorter (and hence better) prediction intervals in average for different values of Sn and this can be considered as an improvement. Interval Prediction of Order Statistics... 7 4 Conditional Prediction Intervals for the mean of future sample The problem of constructing a conditional prediction interval for Ym on the basis of observed U1,..., Un, given An = Sn, using the pivotal quantity V™ = I-! • (41) is considered for the two parameter exponential distribution in this section. Note that the pivotal quantity Vm has been also considered by Ahmadi and MirMostafaee (2009) and its unconditional distribution has been obtained by them. Moreover, Vm is also location and scale invariant and therefore is free of the unknown location and scale parameters. The following theorem presents the conditional distribution function of Vm given An = 8n. Theorem 2 The conditional distribution function of Vm in (4.1) given An = 6n is m-1 n-1 l ai(n,ji, Sn) a„(n,ji, Sn) (ai(n,ji, ¿„)\(a„(n,ji, l FV_|A.(xiin) = 1 -E jTjr 1 (jj.)« l=0 ji = 1 j2=0 js=0 j4=0 V ' AnV n> ^ Cji (n , ¿n)xj2mlr(l - j + 1)r(j2 + 1) X (m + j3 + j4 + 2)l-j2+1(mx + j4 + 1)j2+1' for x > 0, and n-1 al(n,jl, Sn) a„(n,ji, S„) ( 1)js+j4 (ai(n,ji, S„)\(a„(n,ji, S„)\ (n r ) (*) = EE E (-1) ( jS K j4 K (n' 'n) ^ j=0 j4=0 Pa„ (¿n)(2+ j3 + j4)[j4 + 1 - (2+ j3 + j4)x] m-1 n-1 l ai(n,ji, Sn) a„(n,ji, Sn) l—j2 (ai(n,ji, Sn)\ (a„(n,ji, 6„)W l - EEE E E EVjS j4 (-1)53+54+55 PA (¿n)/l 1=0 ji=1 j2=0 j3=0 j4=0 j5=0 V 7 AnV n' x Cj (n, ¿n)xj2+j5m'r(l - j2 + 1)r(j2 + j5 + 1) X j5l(m + j3 + j4 + 2)1-j2-j5 + 1 [j4 + 1 - (j3 + j4 + 2)x]j2+j5 + 1 ' for x < 0, where a1(n, j1, 6n) and an(n, j1, 6n) are given in Lemma 1 Proof: Let = (Un - Ux)/o, UJ" = (U1 - v)/a and F' = (F™ - Note that Fym\ A„ (x|^n)^ / (vx + u)/ur,j;il A„ CM^ du dV (4.2) 00 where fuj 11 An (u, v| ¿n) is given in (3.4). Since mF' ~ r(m, 1), that is for t > 0 FYm (t) = 1 - 'I: ^ (4.3) 1=0 8 Amini and MirMostafaee so by substituting (3.4) and (4.3) in (4.2) and using the binomial expansions, we get for x < 0 n-1 ai(n,ji, Sn) a„(n,ji, Sn) fa1(n,j1, S„)\fa„(n,ji, S„)\ c (n s ) rp , ,J5 n _ V^ V^ V^ ^ j3 A j4 /tj1 °n> FVm | A„ (x|Sn)~ ^ j1 = 1 j3 = 0 j4=0 n /»TO /»to x / / e-(j3+j4+2)"e-(j4+1)v du dv ./0 J-vx m-1 n-1 l ai(n,ji, Sn) a„(n,ji, 5„) /ai(n,ji, 6„)\/a„(n,ji, 5„)w l \ - ^ ^ ^ X/ X/ j3 j4 (-1)j3+j4 PA (¿Jll 1=0 ji = 1 j2=0 j3 = 0 j4 = 0 7 ' /»TO /»TO j (n, ¿„)xj2m1 / / e-(m+j3+j4+2)"e-(mx+j4+1)vu1-j2 vj2 du dv ./0 J-vx n-1 ai(n,j'i, 5„) an(n,ji, Sn) ( —1)j3+j4^ ai (n,ji, |"a„(n,ji, Cj (n £ £ £ j3 j4 ji ' n ji = 1 j3=0 j4=0 PAn (¿n)(2+ 33 + j4)[j4 + 1 - (2+ 33 + 34)*] m-1 n-1 I ai(n,ji, Sn) a„(n,ji, 5„) Ij /ai(n,j'i, ¿„)\^a„(n,ji, 6n)\/ 1 \ __ ^ ^ ^ ^ ^ V^ ^ j3 A j4 AjV ( 1=0 ji = 1 j2=0 j3=0 j4=0 j5=0 (-1)j3+j4+j5 PAn (¿n)/l x Cji (n, ¿n)xj2+j5mly(l - 32 + 1) fTO e-(j4+1-(j3+j4+2)x)vvj2+j5 dv j5l(m + 33 + 34 +2)1-j2-j5 + 1 J0 and therefore we naturally attain the desired result. Similarly, we may deduce the desired expression for FVm | A„ (x|Sn) when x > 0. □ To find conditional prediction interval for Ym based on records given An = Sn, we have to find the conditional quantiles of Vm given An = Sn, vai (n, m; Sn) and v1-a2 (n, m; Sn), for a1 + a2 = a, 0 < a < 1, i = 1, 2, numerically, where Pr(Vm < v7(n, m; Sn) | An = Sn) = YA 100(1 — a)% conditional prediction interval for Ym based on record values given An = Sn then is (U + Vai (n, m; Sn)(Un — U1),U1 + V1-a2(n, m; Sn)(Un — U1)). (4.4) An illustrative example has been presented in Section 5. 5 An illustrative example x In this section, we illustrate the proposed procedures by considering a real data set. A rock crushing machine has to be reset if, at any operation, the size of rock being crushed Interval Prediction of Order Statistics... 9 Table 2: 95% CPIs and UPIs for Yi2:20, Y>0:20 and Y20 for Example 1. CPI UPI Yi2:20 (0,24.17836) (0, 54.061745) Y20:20 (13.290315,183.67385) (0,307.85602) Y20 (0,26.233175) (0,61.32183) is larger than any that has been crushed before. The following data given by Dunsmore (1983) are the sizes dealt with up to the third time that the machine has been reset: 9.3, 0.6, 24.4, 18.1, 6.6, 9.0, 14.3, 6.6, 13.0, 2.4, 5.6, 33.8. The record values were the sizes at the operation when resetting was necessary. Dunsmore (1983) assumed that these data follow an Exp(0, a) distribution. Clearly, we have Consider a future sample of size m = 20. We want to find equi-tailed 95% conditional prediction intervals (CPIs) for Yi2:2o, Y20:20 and Y20 using (3.5) and (4.4) and compare these intervals with unconditional ones (UPIs). The results are given in Table 2. Note that some lower bounds have got negative values, which were replaced by zero. We can see that the conditional prediction intervals are shorter than the corresponding unconditional ones. 6 Concluding remarks In this paper, we found prediction intervals for the future order statistics based on record values, given record time statistics, when the underlying distribution is two parameter exponential. These intervals have the advantage of utilizing more information embedded in the observed sequence in comparison with their corresponding unconditional ones obtained by Ahmadi and MirMostafaee (2009). These ideas can be extended to the non-parametric and the Bayesian context. The conditional point predictors are also of interest. Work on these problems is currently under process and we hope to report these findings in future papers. Ui = 9.3, U2 = 24.4, U3 = 33.8, T1 = 1, T2 = 3, T3 = 12, Ai = 2, and A2 = 9. Acknowledgement We are very grateful to the respected editor and the respected referees for their insightful comments and suggestions which have led to this improved version. 10 Amini and MirMostafaee References [1] Ahmadi, J. and Balakrishnan, N. (2010): Prediction of order statistics and record values from two independent sequences, Statistics, 44, 417-430. [2] Ahmadi, J. and MirMostafaee, S.M.T.K. (2009): Prediction intervals for future records and order statistics coming from two parameter exponential distribution, Statistics and Probability Letters, 79, 977-983. [3] Ahmadi, J. and MirMostafaee, S.M.T.K. and Balakrishnan, N. (2010): Nonparamet-ric prediction intervals for future record intervals based on order statistics, Statistics and Probability Letters, 80, 1663-1672. [4] Arnold, B.C., Balakrishnan, N., and Nagaraja, H.N. (1998): Records, John Wiley & Sons, New York. [5] Awad, A.M. and Raqab, M.Z. (2000): Prediction intervals for the future record values from exponential distribution: comparative study. Journal of Statistical Computation and Simulation, 65, 325-340. [6] Chou, Youn-Min. (1988): One-sided simultaneous prediction intervals for the order statistics of l future samples from an exponential distribution. Communications in Statististics-Theory and Methods, 17, 3995-4003. [7] David, H.A. and Nagaraja, H.N. (2003): Order Statistics, Third edition, John Wiley & Sons, New York. [8] Dunsmore, I.R. (1983): The future occurrence of records. Annals of the Institute of Statistical Mathematics, 35, 276-277. [9] Doostparast, M. (2009): A note on estimation based on record data. Metrika, 69, 69-80. [10] Doostparast, M. and Balakrishnan, N. (2013): Pareto analysis based on records. Statistics, 47, 1075-1089. [11] Feuerverger, A. and Hall, P. (1998): On statistical inference based on record values. Extremes, 1, 169-190. [12] Kizilaslan, F. and Nadar, M. (2015): Estimation with the generalized exponential distribution based on record values and inter-record times. Journal of Statistical Computation and Simulation, 85, 978-999. [13] Lin, C.T., Wu, S.J.S and Balakrishnan, N. (2003): Parameter estimation for the linear hazard rate distribution based on records and inter-record times. Communications in Statististics-Theory and Methods, 32, 729-748. [14] MirMostafaee, S.M.T.K. and Ahmadi, J. (2011): Point prediction of future order statistics from exponential distribution, Statistics and Probability Letters, 81, 360370. Interval Prediction of Order Statistics... 11 [15] MirMostafaee, S.M.T.K., Amini, M. and Balakrishnan, N. (2016): Exact nonpara-metric conditional inference based on k-records, given inter k-record times. Journal of the Korean Statistical Society, Accepted. [16] Nagaraja, H.N. (1984): Asymptotic linear prediction of extreme order statistics. Annals of the Institute of Statistical Mathematics, 36, 289-299. [17] Raqab, M.Z. and Balakrishnan, N. (2008): Prediction intervals for future records. Statistics and Probability Letters, 78, 1955-1963. [18] Samaniego, F.J. and Whitaker, L.R. (1986): On estimating population characteristics from record-breaking observations. I. parametric results. Naval Research Logistics Quarterly, 33, 531-543. Appendix Here, we present the R codes for computing the conditional cumulative distribution functions of Wj, (see Theorem 1) and Vm (see Theorem 2). R functions for computing the unconditional cumulative distribution functions of Wj and Vm (see Ahmadi and MirMostafaee, 2009) are also given. ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooooooooooo cjn function oooooooooooooooooooooooooo ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo cjn=function(n,j,delta){ z=(-ir(n-j-1) z1=n-j+1 z2=j-2 z4=n-j-2 z5=n-j s=1 if(z2>=0 & z1>=0){ for(j1 in 0:z2){ z3=n-j1-1 ss=ifelse(z3>=z1,sum(delta[z1:z3]),0) s=s*ss }} t=1 if(z4>=0){ for(j2 in 0:z4){ z6=j2+2 tt=ifelse(z5>=z6,sum(delta[z6:z5]),0) t=t*tt }} return(z/t/s) 12 Amini and MirMostafaee } ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo M:n c- c- -r\ v-/--\"K -n"K-i~l-i-l-T7 r\-F n^ "I -I- -n 9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-ooooooooo Mass piUbablliiy U! Delta ooooooooooooooooooooooo ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo pdelta=functiun(n,delta){ n1=n-1 pdel=0 foi(jj in 1:n1){ nj=n-jj nj1=n-jj+1 A=cjn(n,jj,delta) a1=ifelse(nj>=1,sum(delta[1:nj]),0)-1 an=ifelse(n1>=nj1,sum(delta[nj1:n1]),0) C=(a1 + 1)* (a1+an+2) pdel=pdel+A/C } ietuin(pdel) } 9-9-9-9-9-9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'^ ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo 9-9-9-9-9-9-9-9-9- r*. -F TaT 9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9-9- ooooooooo CUnditiUnal cdf Uf W ooooooooooooooooooooooo 9-9-9-9-9-9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9'9-ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo Fw=function(n,j,m,w,delta){ n1=n-1 pw=0 foi(l in j:m){ foi(j1 in 1:n1){ for(j2 in 0:l){ nj1=n-j1+1 nj=n-j1 a1=ifelse(nj>=1,sum(delta[1:nj]),0)-1 an=ifelse(n1>=nj1,sum(delta[nj1:n1]),0) foi(j3 in 0:a1){ for(j4 in 0:an){ A=chouse(m,l)*chouse(a1,j3)*chouse(an,j4)*chouse(l,j2) *((-1)"(j2+j3+j4))*cjn(n,j1,delta)/pdelta(n,delta) B=j2+m-l+j3+j4+2 if(w<0) C=B*(j4+1-w*(j3+j4+2)) if(w>=0) C=B*(w*(j2+m-l)+j4+1) pw=pw+A/C }}}}} Interval Prediction of Order Statistics... 13 return(pw) } ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo 2-2-2-2-2-2-2-2-2- \T 2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2- ooooooooo LUnUitlUndl CU! of V ooooooooooooooooooooooo 2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2-2^ ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo Fv=functiun(n,m,v,Ueltd){ pv=0 n1=n-1 m1=m-1 if(v>=0){ fur(l in 0:m1){ fur(j1 in 1:n1){ for(j2 in 0:l){ nj1=n-j1+1 nj=n-j1 d1=ifelse(nj>=1,sum(Ueltd[1:nj]),0)-1 dn=ifelse(n1>=nj1,sum(Ueltd[nj1:n1]),0) fur(j3 in 0:a1){ for(j4 in 0:an){ A=chuuse(d1,j3)*chuuse(an,j4)*choose(l,j2)/factorial(l) /pUelta(n,Ueltd)*((-1)~(j3 + j4)) B=cjn(n,j1,Uelta)*(v"j2)*(m"l)*gammd(l-j2+1)*gammd(j2+1) /((m+j3+j4+2)~(l-j2+1))/((m*v+j4+1)~(j2+1)) pv=pv+A*B }}}}}} if(v>=0) pv=1-pv pv1=0 pv2 = 0 if(v<0){ fur(j1 in 1:n1){ nj1=n-j1+1 nj=n-j1 d1=ifelse(nj>=1,sum(Ueltd[1:nj]),0)-1 an=ifelse(n1>=nj1,sum(Ueltd[nj1:n1]),0) fur(j3 in 0:a1){ fur(j4 in 0:an){ A=((-1)"(j3+j4))*choose(a1,j3)*chuuse(an,j4)*cjn(n,j1,Ueltd) /pUelta(n,Ueltd)/(2+j3+j4)/(j4+1-v*(2+j3+j4)) pv1=pv1+A }}} fur(l in 0:m1){ fur(j1 in 1:n1){ fur(j2 in 0:l){ 14 Amini and MirMostafaee nj1=n-j1+1 nj=n-j1 a1=ifelse(nj>=1,sum(delta[1:nj]),0)-1 an=ifelse(n1>=nj1,sum(delta[nj1:n1]),0) for(j3 in 0:a1){ for(j4 in 0:an){ lj2=l-j2 for(j5 in 0:lj2){ A=choose(a1,j3)*choose(an,j4)*choose(l,j2)/factorial(l) /pdelta(n,delta)*((-1)~(j3+j4+j5)) B=cjn(n,j1,delta)*(v~(j2 + j5))*(m~l)*gamma(l-j2+1) *gamma(j2+j5+1)/factorial(j5)/((m+j3+j4+2)"(l-j2-j5+1)) /((j4+1-v*(j3+j4+2))"(j2+j5+1)) pv2=pv2+A*B }}}}}} pv=pv1-pv2 } return(pv) } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%% unconditional cdf of W %%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FwU=function(n,j,m,w){ pw=0 if(w<0) pw=(m-j + 1) * ((1-w)"(1-n))/(m+1) if(w>=0){ ss=0 j1=j-1 for(i in 0:j1){ ss=ss+choose(j1,i)*((-1)"i)*((1+w*(m-j+i+1))"(1-n)) /(m-j+i+1)/(m-j+i+2) } pw=1-j*choose(m,j)*ss } return(pw) } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%% unconditional cdf of V %%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FvU=function(n,m,v){ pv=0 if(v<0) pv=((1-v)~(1-n))/((1 + 1/m)~m) if(v>=0){ Interval Prediction of Order Statistics... 15 m1=m-1 s1=0 s2=0 for(i in 0:m1){ nn=n+i-2 s1=s1+choose(nn,i)*((1-1/(m*v+1))~i)*((1/(m*v+1))~(n-1)) * ((m/(m+1))"(m-i)) s2=s2+choose(nn,i)*((1-1/(m*v+1))~i)*((1/(m*v+1))~(n-1)) } pv=s1+1-s2 } return(pv) } Metodoloski zvezki, Vol. 13, No. 1, 17-25 Series of Incomplete Row-Column Designs with Two Units per Cell Anindita Datta1, Seema Jaggi1, Cini Varghese1 and Eldho Varghese1 Abstract Here, two series of incomplete row-column designs with two units per cell have been developed that are structurally complete, i.e. all the cells corresponding to the intersection of row and column receive two distinct treatments. Properties of these classes of designs have been studied and the methods result in designs in which the elementary contrasts of treatment effects are estimated with same variance. 1. Introduction Row-column designs are used for controlling heterogeneity in the experimental material in two directions. Most of the row-column designs developed in the literature have one unit corresponding to the intersection of row and column. Row-column designs with more than one unit per cell are used when the number of treatments is substantially large with limited number of replicates. For example, (Bailey and Monod, 2001), to conduct an experiment for comparing 4 treatments using 4 plants with leaves at 2 different heights, the following row-column design with complete rows and columns having two units per cell can be used: Plants Leaf Height 1 2 2 3 3 4 4 1 3 4 4 1 1 2 2 3 These designs are termed as semi-Latin squares in the literature. An (n x n)/k semiLatin square is an arrangement of nk symbols (treatments) in an (n x n) square array such that 1 ICAR-Indian Agricultural Statistics Research Institute, Library Avenue, New Delhi-110 012. India. Tel.: +9111-25847284, E-mail address: seema.iasri@outlook.com, seema@iasri.res.in 18 Datta et al. each row-column intersection contains k symbols and each symbol occurs once in each row and each column. Trojan squares are a special class of semi-Latin squares based on sets of mutually orthogonal superimposed Latin squares and have been shown to be maximally efficient for pair-wise treatment comparisons in the plots-within-blocks stratum (Bailey, 1992). Following is an example of a Trojan square design of size (4 x 4)/2 for 8 treatments constructed by superimposing two mutually orthogonal Latin squares of size 4, one with 1, 2, 3 and 4 treatments and the other with 5, 6, 7 and 8 treatments. Columns Rows 1 5 2 6 3 7 4 8 2 7 1 8 4 5 3 6 3 8 4 7 1 6 2 5 4 6 3 5 2 8 1 7 This arrangement could be extended for 12 treatments of size (4 x 4)/3 by superimposing the third orthogonal Latin square of size 4 but no further Trojan extension is possible, there being only three mutually orthogonal Latin squares of size 4. Complete Trojan squares of size (n x n)/k have n2 blocks of size k and require n replicates of nk treatments. Sometimes, design or cost constraints make complete Trojan squares impossible and then Incomplete Trojan squares of size [(n-1) x n]/ k or of size [n x (n-1)]/ k can be useful. Such incomplete Trojan squares can be constructed by omitting any complete row or any complete column from any Trojan design of size (n x n)/k. Trojan squares were first discussed by Harshbarger and Davis (1952) but then it was named as Latinized Near Balanced Rectangular Lattices having k = n-1. Later, Darby and Gilbert (1958) discussed the general case for k < n and introduced the name Trojan square designs where k > 2. However, all designs of the Latinized Rectangular Lattice type are now commonly described as Trojan squares for any 1 < k < n. Williams (1986) generalized the notion and called semi-Latin squares as Latinized incomplete-block designs. Andersen and Hilton (1980) called semi-Latin squares as (1, 1, k) Latin rectangles. Series of incomplete row-column designs with two units per cell 19 Preece and Freeman (1983) discussed the combinatorial properties of semi-Latin squares and related designs. Bailey (1988) discussed further construction for a range of semi-Latin and Trojan square designs. Bailey (1992) gave methods of constructing a range of semi-Latin and Trojan square designs, studied their efficiencies and showed that the Trojan squares are the optimal choice of semi-Latin squares for pair-wise comparisons of treatment means. These are particularly suitable for crop research experiments either in field or in the glasshouse. Trojan squares are normally the best choice of semi-Latin squares for crop research (Edmondson, 1998). Bedford and Whitaker (2001) have given several methods of construction of semi-Latin squares. Dharmalingam (2002) gave an application of Trojan square designs and used it to obtain partial triallel crosses. Edmondson (2002) constructed generalized incomplete Trojan square designs, denoted by (m x n)/k where m denotes the number of replicates of nk treatments, based on a set of k cyclic generators. There existed three optimal (4 x 4)/4 semi-Latin squares (Bailey and Chigbu, 1997) for sixteen treatments in blocks of size four. Since these squares do not have the same concurrences, there was a need for distinguishing one square from the others and determining the most preferred square in a given context. Chigbu (2003) obtained the best of the three optimal (4 x 4)/4 semi-Latin squares by finding and comparing the variances of elementary contrasts of treatments for the squares. Jaggi et al. (2010) defined Generalized Incomplete Trojan-Type Designs and developed method of constructing these designs. Varghese and Jaggi (2011) obtained generalized row-column designs and showed their application in obtaining mating plans. Datta et al. (2014) obtained some methods of constructing row-column designs with multiple units per cell that are structurally incomplete i.e. corresponding to the intersection of any row and column, there is at least one cell which does not contain any treatment. Datta et al. (2015) developed row-column designs with multiple units per cell with equal/ unequal cell sizes. Jaggi et al. (2016) obtained another series of generalized incomplete Trojan-type designs for number of treatments v= sm+1. Most of the methods available in the literature are for complete rows and complete columns. Here, two methods of constructing row-column designs with two units per cell in incomplete rows or columns are obtained that are balanced for estimating elementary contrasts of treatment effects. 20 Datta et al. 2. Experimental Setup and Information Matrix We consider a row-column design with v treatments arranged in m rows, n columns and in each row-column intersection (i.e. cells) there are k units or plots resulting in total mnk experimental units or observations. The following four-way classified model with treatments, rows, columns and cells as the four classifications, is considered: Y= X101 + X202 + e, where Xi = [ A' ] , X2 = [1 d; d; d; ]', 01 = (t) is the vector of parameters of interest and 02 = (^ P y n) is the vector of nuisance parameters. Y is a mnk x 1 vector of observations, p is the grand mean, 1 is the mnk x 1 vector of ones, A' is mnk x v matrix of observations versus treatments, t is a v x 1 vector of treatment effects, d' is mnk x m matrix of observations versus rows, P is m x 1 vector of row effects, D2 is mnk x n matrix of observations versus columns, y is n x 1 vector of column effects, d; is mnk x mn matrix of observations versus cells, n is mn x 1 vector of cell effects and e is mnk x 1 vector of random errors with E(e) = 0 and D(e) = o2 I. The information matrix of row-column design with multiple units per cell for treatment effects is obtained as C = Rx - {( N1K11N;+N2K21N;+N3K31N;+N^N+N2K 22N; ) + (N3K 23n;+^^3+N2K 23N;+N3K33N;)} ...(2.1) where Kn= Kp- + [KpXIMK 22M;+M2K ^MJ +MK 23M+M2K33M2 )Kp- ] K12 =-Kp-( MK + M2K 23 ) K^ = Kp- (MjK23 + M2K33 ) K22 = Ky- + Ky-M3 (Kn -M3Ky-M3 )- M3Ky- K 23 =-K y-M3 (K n - M3K y-M3 )- K33 = (Kn - M3K Y - M3 ) Series of incomplete row-column designs with two units per cell 21 Rt is the diagonal matrix of replication of treatments, Kp is the diagonal matrix of row-sizes, Ky is the diagonal matrix of column-sizes and Kr, is the diagonal matrix of cell-sizes. N1 is the incidence matrix of treatments versus rows, N2 is the incidence matrix of treatments versus columns, N3 is the incidence matrix of treatments versus cells, M1 is the incidence matrix of rows versus columns, M2 is the incidence matrix of rows versus cells and M3 is the incidence matrix of columns versus cells. The v x v matrix C is symmetric, non-negative definite with zero row and column sums. 3. Methods of Construction We present here two methods of constructing row-column designs with two units per cell. The first method is for odd number of treatments and the second is for even number. Method 3.1: For v = 2t + 1 (t > 1), obtain the following initial column having two units per cell: 1 2t + 1 2 2t 3 2t - 1 t 2t - (t - 2) Develop 2t more columns horizontally from the initial column by adding 1,2,...,2t consecutively reducing mod (2t+1). The resulting design is a row-column design with two units per cell and with m = t rows of size 2(2t+1), n = (2t+1) columns of size 2t, k = 2 and r = 2t replications. The design is complete row-wise and column-wise it is incomplete. The information matrix of the design for treatment effects is obtained from (2.1) as c = (t + 0.5) i - 0.5 j . It is seen that all the elementary treatment contrasts are estimated with same variance. This method would give designs for all odd number of treatments. 22 Datta et al. Example 3.1.1: Let t = 3, so v = 7. The contents of the initial column are obtained as follows: 1 7 2 6 3 5 Developing this column by adding 1,2,...,6 reducing mod 7 would result in the following row-column design in three rows of size 14, 7 columns of size 6 with 2 units per cell and replication of each treatment being 6: Columns Rows 1 7 2 1 3 2 4 3 5 4 6 5 7 6 2 6 3 7 4 1 5 2 6 3 7 4 1 5 3 5 4 6 5 7 6 1 7 2 1 3 2 4 The information matrix for estimating treatment effects is C=3.51 - 0.5 J. Example 3.1.2: For t = 4, the contents of the initial column for v = 9 are obtained as follows: 1 9 2 8 3 7 4 6 The row-column design in 4 rows of size 18, 9 columns of size 8 and 8 replications is constructed as given. Columns 1 9 2 1 3 2 4 3 5 4 6 5 7 6 8 7 9 8 m * 2 8 3 9 4 1 5 2 6 3 7 4 8 5 9 6 1 7 o Pi 3 7 4 8 5 9 6 1 7 2 8 3 9 4 1 5 2 6 4 6 5 7 6 8 7 9 8 1 9 2 1 3 2 4 3 5 Here, the design is complete row-wise with each treatment occurring twice and column-wise it is incomplete. Series of incomplete row-column designs with two units per cell 23 Method 3.2: For v even, obtain the following initial column with 2 units per cell: 1 v v 2 2 v-1 v-1 3 v - ( 2 -2) v v — 2 v 2 v + 1 2 v Develop — 1 more columns horizontally from the initial column by adding 1,2,..., (— -1) consecutively reducing mod v. The resulting design is a row-column design 2 with 2 units per cell in incomplete rows and complete columns. The parameters of the design are v, m = (v-1) rows of size v, n = — columns of size 2(v-1), k = 2 and r = (v-1). From (2.1), 2 the information matrix for treatment effects is obtained as v C = — I-0.5 J. 2 Thus, all the elementary contrasts of treatment effects are estimated with same variance. 24 Datta et al. Example 3.2.1: For v = 8, following is a row-column design with cells containing 2 units in 7 rows of size 8 each and 4 columns of size 14 each: Columns 1 8 2 1 3 2 4 3 8 2 1 3 2 4 3 5 ows 2 7 3 8 4 1 5 2 7 3 8 4 1 5 2 6 3 6 4 7 5 8 6 1 6 4 7 5 8 6 1 7 4 5 5 6 6 7 7 8 The canonical efficiency factor of the above two series of designs is obtained as E=H, where H=|— V^,-1 r , I v-1V 1 Xi are the eigen-values of C- matrix of the designs obtained and r is the number of replications of the treatments. It is assumed that c2 is same for the developed design and the orthogonal design to which it is compared. The canonical efficiency factor of the developed designs was worked out and was found to be fairly good. Acknowledgements The authors are grateful to the Editor and the reviewers for the constructive comments that have led to considerable improvement in the paper. References [1] Andersen, L.D. and Hilton, A.J.W. (1980): Generalized Latin rectangles I: Constructions and decomposition, Discrete Maths., 31, 125-152. [2] Bailey, R.A. (1988): Semi Latin squares, J. Statist. Plan. Inf., 18, 299-312. [3] Bailey, R.A. (1992): Efficient semi-Latin squares, Statistica Sinica, 2, 413-437. [4] Bailey, R.A. and Chigbu, P.E. (1997): Enumeration of semi-Latin squares, Discrete Maths., 167/168, 73-84. [5] Bailey, R.A. and Monod, H. (2001): Efficient semi-Latin rectangles: Designs for plant disease experiments, Scand. J. Statist., 28, 257-270. Series of incomplete row-column designs with two units per cell 25 [6] Bedford, D. and Whitaker, R.M. (2001): A new construction for efficient semi-Latin squares, J. Statist. Plan. Inf., 98, 287-292. [7] Chigbu, P.E. (2003): The "best" of the three optimal (4 x 4)/4 semi-Latin squares, Sankhya: The Indian Journal of Statistics, 65(3), 641-648. [8] Darby, LA. and Gilbert, N. (1958): The Trojan Square, Euphytica, 7, 183-188. [9] Datta, A., Jaggi, S., Varghese, C. and Varghese, E. (2014): Structurally incomplete row-column designs with multiple units per cell. Statistics and Applications, 12, (1&2): 7179. [10] Datta, A., Jaggi, S., Varghese, C. and Varghese, E. (2015): Some series of row-column designs with multiple units per cell. Calcutta Statistical Association Bulletin, 67, (265266), 89-99. [11] Dharmalingam, M. (2002): Construction of partial triallel crosses based on Trojan square design, J. App. Statist., 29(5), 675-702. [12] Edmondson, R.N. (1998): Trojan square and incomplete Trojan square design for crop research, J. Agric. Sci. 131, 135-142. [13] Edmondson, R.N. (2002): Generalized incomplete Trojan designs, Biometrika, 89(4), 877-891. [14] Harshbarger, B. and Davis, L.L. (1952): Latinized rectangular lattices, Biometrics, 8, 73-84. [14] Jaggi, Seema, Varghese, Cini, Varghese, Eldho and Sharma, V.K. (2010): Generalized incomplete Trojan-type designs, Statistics and Probability Letters, 80, 706-710. [15] Jaggi, Seema, Varghese, Cini, and Varghese Eldho (2016): A series of generalized incomplete Trojan-type designs. Journal of Combinatorics, Information and System Sciences: American Journal, 40(1-4), 53-60. [16] Preece, D.A. and Freeman, G.H. (1983): Semi-Latin squares and related designs, J. R. Statist. Soc. B 45, 267-277. [17] Varghese, Cini and Jaggi, Seema (2011): Mating designs using generalized incomplete Trojan-type designs, Journal of Statistics and Applications, 6(3-4), 85-93. [18] Williams, E.R. (1986): Row and column designs with contiguous replicates, Australian J. Statist., 28, 154-163. Metodoloski zvezki, Vol. 13, No. 1, 2016, 27-58 Adjustment of Recall Errors in Duration Data Using SIMEX Jose Pina-Sánchez1 Abstract It is widely accepted that due to memory failures retrospective survey questions tend to be prone to measurement error. However, the proportion of studies using such data that attempt to adjust for the measurement problem is shockingly low. Arguably, to a great extent this is due to both the complexity of the methods available and the need to access a subsample containing either a gold standard or replicated values. Here I suggest the implementation of a version of SIMEX capable of adjusting for the types of multiplicative measurement errors associated with memory failures in the retrospective report of durations of life-course events. SIMEX is a method relatively simple to implement and it does not require the use of replicated or validation data so long as the error process can be adequately specified. To assess the effectiveness of the method I use simulated data. I create twelve scenarios based on the combinations of three outcome models (linear, logit and Poisson) and four types of multiplicative errors (non-systematic, systematic negative, systematic positive and heteroscedastic) affecting one of the explanatory variables. I show that SIMEX can be satisfactorily implemented in each of these scenarios. Furthermore, the method can also achieve partial adjustments even in scenarios where the actual distribution and prevalence of the measurement error differs substantially from what is assumed in the adjustment, which makes it an interesting sensitivity tool in those cases where all that is known about the error process is reduced to an educated guess. 1 Introduction Applied quantitative searchers commonly assume that variables included in their models are measured perfectly. This often implicit assumption is, however, difficult to maintain when using survey data as interviewer effects, interviewee fatigue, social desirability bias, lack of cooperation, or plain deceit inevitably introduce measurement error (ME). This is especially true for surveys using a retrospective design, which collect information about past events 1 School of Law, University of Leeds, J.PinaSanchez@leeds.ac.uk Adjustment of Recall Errors in Duration Data Using SIMEX 28 from a single contact with respondents. The advantages of retrospective designs, in comparison with prospective studies2, are well known: a) immune to problems of attrition; b) cheaper to administer; and c) more capable of detecting transitions occurring in short periods. Retrospective questions are however prone to ME as they require respondents to both interpret the question correctly and recall events that took place in the past. The consequences of using data affected by ME are both difficult to estimate and potentially disastrous (Nugent, Graycheck & Basham, 2000; and Vardeman et al, 2010)3. Unfortunately, the latter is rarely acknowledged, and in certain cases it is directly misunderstood For example, Carroll et al (2006) point at the widespread belief that ME affecting an explanatory variable will only attenuate the regression estimate of that variable4. Even amongst researchers that acknowledge the potential consequences of ME very little is done to tackle the problem besides mentioning it as a caveat. There are two reasons for this: the requirement of additional data and the complexity of the adjustment methods available. Generally, methods for the adjustment of ME need to be informed about the true unobserved values using additional data. For example, multiple imputation (Rubin, 1987, and Cole, Chu & Greenland, 2006) requires access to a validation subsample where the true values are observed. Regression calibration (Carroll & Stefanski, 1990; and Glesjer, 1990) needs at least repeated measurements, while two stage least squares (Theil, 1953) requires instrumental variables. However, researchers' access to this type of data tends to be the exception rather than the norm. In addition, these three methods belong to the family of functional methods - i.e. those that do not make any assumptions about the distribution of the true values. A second group of methods known as structural methods are technically more complex, amongst other things because they require specifying the probability function of the unobserved true values. Examples of structural methods are likelihood based adjustments, either Bayesian or Frequentist. These methods account directly for the ME mechanism in place, which tend to involve ad hoc specifications, in turn increasing the complexity of the adjustment. 2 See Solga (2001) for a comparison of data quality derived from prospective and retrospective questions. 3 "Measurement error is, to borrow a metaphor, a gremlin hiding in the details of our research that can contaminate the entire set of estimated regression parameters. " (Nugent, et al. 2000: 60). "Even the most elementary statistical methods have their practical effectiveness limited by measurement variation. " (Vardeman et al., 2010: 46). 4 "Despite admonitions of Fuller (1987) and others to the contrary, it is a common perception that the effect of ME is always to attenuate the line. In fact, attenuation depends critically on the classical additive ME model. " (Carroll et al., 2006: 46). 29 Jose Pina-Sanchez In this paper I will use simulated data to study the effectiveness of multiplicative Simulation Extrapolation Method (SIMEX) (Carroll et al. 2006; and Biewen, Nolte & Rosemann, 2008). This is an extension of the standard SIMEX method (Cook & Stefanski, 1994) capable of adjusting for the recall errors that are typically observed in the retrospective reports of life-course events. SIMEX implementation is relatively simple in that only requires an estimate of the reliability of the variable affected by ME. This is normally obtained using a subsample of replicated data. Here I will assume that such information is not available to the researcher, as it is often the case. Instead I will use this technique to show its potential to carry out sensitivity analysis when the reliability ratios have to be assumed. That is, I will demonstrate how the problem of recall errors so ubiquitous in retrospective data can be effectively dealt with by researchers who do not have neither the technical background to carry out complex adjustments, nor access to additional sources of data. In so doing my ultimate goal is to encourage a wider audience of survey researchers both to reflect about the implications of relying variables affected by ME and to consider the possibility of assessing the robustness of their findings. In the following section I review the theory regarding the types of errors that can be expected from retrospective questions and the models that have been normally used to specify them. In Section 3, I present the simulated data that will be used in the analysis and illustrate the implications of using an explanatory variable affected by multiplicative errors in different outcome models. Section 4 lays out the functioning of the standard SIMEX and the extension considered to accommodate multiplicative ME. In Section 5 the results of the analysis are presented, and in Section 6 I conclude with a discussion of the relevance of the main findings. 2 Modelling Memory Failures in Retrospective Questions on Life-Course Events Most studies aiming to assess the implications of ME or to adjust for them assume a simple error mechanism known as the classical ME model. This model was first formally defined by Novick (1966) as follows, Adjustment of Recall Errors in Duration Data Using SIMEX 30 where X* is the observed variable, equal to the true variable X, plus the ME term V. which fulfils six important assumptions: 1. Null expectancy refers to the assumption that the error term is non-systematic, or in other words, the expected value of the error term is zero, E(V) = 0. 2. The assumption of homoscedasticity indicates that the variance of the error term is assumed to remain constant across subjects, VarCVi) = Var(V) = <7y. 3. In addition to having an expectation of zero and constant variance the error term is normally distributed, V~N (0, 0 lO, 7 < 0 and as a count variable, Yco, The four simulated ME scenarios are represented by the variables, X\ X?. X£ and X*. In each of these scenarios X is subject to normally distributed classical multiplicative ME. I choose to simulate normal instead of log-normal errors (as explained in equation 2.6) to ensure that they are perfectly symmetric around their mean (the latter are skewed to the right to a certain extent). Figure 1 shows the probability and mass functions for each of the variables simulated, while the specific code used in R is shown in Appendix I. Adjustment of Recall Errors in Duration Data Using SIMEX 34 Figure 1: Probability Density and Mass Functions of the Simulated Variables In the first ME scenario I simulate non-systematic errors distributed as a .25). The multiplicative effect of these errors results in a new variable X* with a reliability ratio (RR) of .816. In the second scenario I explore the effect of heteroscedastic ME by changing the distribution of the errors from N(l, .15) to Af(l, .35) when Z > 0. This is a type of ME that could take place when different survey modes are used. For example, Roberts (2007) - after reviewing the literature - concluded that telephone interviews place a higher cognitive demand on the interviewee than face-to-face interviews, which tend to make them more prone to measurement error. In the third scenario I study the effect of systematically underreported durations by simulating errors distributed as N(.9, .25). These are the types of errors that could be expected in the presence of forward telescoping bias (e.g. Golub et al., 2000, and Johnson and Schultz, 2005, found evidence of these types of errors in reports of onset of drug usage and smoking, respectively), but also in the report of durations of socially undesirable events (e.g. Pina-Sánchez, Koskinen & Plewis, 2013 and Pina-Sánchez, Koskinen & Plewis, 2014, found an increased tendency to underreport the longer spells of unemployment). Lastly, I explore the opposite scenario, one where the errors are distributed 35 Jose Pina-Sanchez as /V(l.l, .25) to reflect overreported durations, which could be expected in the presence of backward telescoping or in reports of socially desirable events. The effects of this four types of simulated ME are shown using scatter-plots in Figure 2. Notice that the top-right plot uses Z instead of X.in the y-axis. Figure 2: Scatterplots of the effect of the different types of measurement error considered To assess the impact that these types of errors have on the regression coefficient of a linear, a logit, and a Poisson model, I compare the results from each of these models when X * is used (the naïve model) instead of X (the true model). Specifically, I focus on the bias in the regression coefficients, BIAS = ß„ ßt (3-2) where the subscript n stands for the naïve model and t for the true model. In addition, to compare the impact of ME across models and across regression coefficients using different scales, I calculate a relative measure of the bias as follows, Adjustment of Recall Errors in Duration Data Using SIMEX 36 Results for the different models studied and the impact generated by the different types of ME are presented in Table 1. In all of the scenarios studied the effect of ME was reflected in a downward bias for pL (the coefficient for the variable X"), and in upward biases for j30 and p2 (the coefficients for the constant and Z). In addition to the observed differences in the direction of the biases across coefficients there are also strong differences in their intensity. The size of the bias for ¡i2 is about twice as large as the bias for ¡30 and ft, reaching levels as alarming as 94.8% for the logit model with heteroscedastic ME, although the average size of the bias across all the scenarios is 39.5%. Table 1: Impact of Measurement Error in the Regression Estimates Linear Logit Poisson Coef SE Bias R.Bias Coef SE Bias R.Bias Coef SE Bias R.Bias "ü ßo -1.297 .035 -5.997 .388 -1.362 .069 o a ßi .150 .003 .768 .050 .092 .004 § s- H ßi .111 .016 .210 .099 .082 .038 ßo -1.013 .038 .284 21.9% -3.810 .258 2.187 36.5% -1.198 .065 .284 20.8% Naïve: multi. ßl .118 .004 -.032 21.2% .494 .034 -.275 37.5% .075 .004 -.032 34.6% ßl .156 .019 .045 40.9% .275 .084 .065 30.9% .157 .037 .045 55.1% ßo -.998 .039 .372 28.7% -3.649 .248 2.372 39.5% -1.178 .065 .372 27.4% s- ßl .116 .004 -.043 28.8% .471 .032 -.299 38.9% .074 .004 -.043 47.0% £ % ßl .169 .020 .067 60.7% .377 .085 .199 94.8% .141 .037 .067 81.7% ßo -.937 .038 .325 25.1% -3.441 .233 1.962 32.7% -1.054 .059 .325 23.9% Naïve: under. ßl ßl .121 .166 .004 .020 -.024 .052 15.9% 46.9% .490 .321 .033 .084 -.183 .124 23.8% 58.9% .069 .149 .004 .037 -.024 .052 26.1% 63.2% ßo -1.028 .038 .245 18.9% -4.029 .266 1.842 30.7% -1.110 .061 .245 18.0% S ^ 3 £ ßl .108 .003 -.039 26.1% .470 .031 -.284 37.0% .061 .003 -.039 42.7% £ ° ßl .150 .019 .053 48.0% .284 .087 .152 72.4% .134 .037 .053 64.6% While the different ME scenarios clearly show attenuated coefficients, none of the coefficients actually became statistically non-significant or changed their sign in comparison to the naïve models. This is partly due to the small effect that ME had on the standard errors, which were underestimated by a third of their size in the true logit model, and only slightly underestimated and overestimated when using a Poisson and a linear model, respectively. These results are consistent with Biewen et al. (2008) who, in a simulated probit model with one predictor, find an upward bias in the constant and a downward bias in the slope 37 Jose Pina-Sanchez induced by classical multiplicative ME. These results obtained here serve to reinforce these findings. In the presence of a type of ME different than classical additive or for a model different than simple linear regression the direction of the bias is not always towards the null. The difficulty to anticipate the direction and size of these biases - even in scenarios with moderate prevalence of ME - makes the implementation of adjustment methods an indispensable part analysing survey data prone to these types of ME. 4 Standard SIMEX and Extensions to Account for Classical Multiplicative Measurement Error The study of the adjustment of multiplicative errors dates back to the decade of the 80s. Fuller (1984) and Hwang (1986) developed a method-of-moments correction for multiplicative ME in the explanatory variables of a linear model. This method assumes that the value of ME variance is known - or that it can be estimated - and is limited to applications where the ME mechanism is affecting one of the explanatory variables only in the context of a linear model. Lyles and Kupper (1997) compared the effectiveness of this method with others such as regression calibration, and a quasi-likelihood approach, which could be applied to other non-linear outcome models. These methods, as mentioned above, are however of limited use to applied researchers in that they either require additional data in the form of replicated measures or validation subsamples, or are complex to implement. Regression calibration requires additional data in the form of replicated measures or a validation subsample. Quasi-likelihood approaches only need an estimate of the variance of the ME, and much like those relying on Bayesian statistics can be applied when a full likelihood approach is not feasible due to computational intractability. However, their implementation is relatively complex, starting from the need to use specialised software (such as WinBUGS when considering Bayesian adjustments), which discourages many analysts from attempting the implementation of the necessary adjustment. Due to only requiring an estimate of the variance of the ME, the simplicity of its application, and its generalizability to any other outcome model regardless of its complexity6, SIMEX represents a very convenient alternative. SIMEX was first presented by Cook and Stefanski (1994) and refined in the following years by Stefanski and Cook (1995) and 6 See for example He et al. (2007) who applied SIMEX to an Accelerated Failure Time models with one of the explanatory variable affected by classical ME, or Battauz et al. (2008) who adjusted for a similar type of ME problem but for an ordinal probit model as the outcome model. Adjustment of Recall Errors in Duration Data Using SIMEX 38 Carroll, Kuchenhoff, Lombard, and Stefanski (1996). "The key idea underlying SIMEX is the fact that the effect of measurement error on an estimator can be determined experimentally via simulation " (Carroll et al., 2006: 98). In particular, SIMEX exploits the relationship between the size of the ME affecting a variable and the size of the bias in the regression estimates in the outcome model. Following Fuller (1987) we know that the unadjusted estimator of the slope, /?i, does not converge asymptotically to the parameter but to: where erf and can be calculated by extrapolating G (filk,Ak^ to G Ak = —lj. Note that from equation 4.4 when Ak — —1 the bias is cancelled out. Figure 3 represents the SIMEX process graphically. The solid line denotes the part of the extrapolation function that can be approximately observed through the regression estimates resulting after the outcome model is specified using simulated predictors with increasing 7 This is the number of iterations used by default in the SIMEX packages in STATA and R. Adjustment of Recall Errors in Duration Data Using SIMEX 40 levels of ME, and the dashed line represents the extrapolation to the case of no ME, which gives the adjusted estimate. Figure 3: Extrapolation function o C\l - o fii " -a o o 0.0 0.5 1.0 15 2 0 2.5 3.0 (1+A*) Figure 3 also shows some of the limitations of SIMEX. The entire extrapolation function cannot be observed, hence, it is hard to assess the quality of the adjustment. In addition, the extrapolation function needs to be approximated using a simple functional form. Therefore adjustments are only approximated, and their effectiveness depends on how well the extrapolation function is estimated, for which the choice of the right functional form is crucial. In the case depicted by Figure 3 it makes sense to think of the quadratic function as the better approximation, but it might not always be so clear. Another cause of concern stems from the accuracy of the estimate of S O '-9 ß1 .141 .002 -.009 26.8% .617 .012 -.151 54.9% .090 .002 -.001 7.3% *Ö3 O $ ß2 .121 .004 .010 23.2% .207 .016 -.003 4.4% .119 .005 .037 48.9% i u Underestimated ß0 -1.339 .018 -.043 15.1% -5.148 .095 .849 38.8% -1.436 .027 -.074 44.9% ß1 .160 .002 .010 32.1% .681 .014 -.087 31.6% .101 .003 .010 58.1% ß2 .095 .006 -.016 35.5% .174 .020 -.035 54.6% .096 .009 .014 18.2% 1 -a nd et ß0 -1.361 .018 -.064 22.7% -5.134 .076 .863 39.4% -1.451 .030 -.089 54.4% 3 a ry mit ß1 .163 .002 .014 42.7% .681 .011 -.087 31.6% .103 .003 .012 71.0% e ts > » ß2 .090 .007 -.020 44.7% .175 .020 -.034 53.1% .092 .009 .010 13.2% ß0 -1.265 .019 .032 10.6% -4.790 .111 1.207 51.4% -1.378 .034 -.016 8.6% e g ß1 .149 .002 <.001 1.1% .629 .016 -.140 46.9% .096 .003 .004 22.8% O ß2 .122 .005 .011 18.8% .324 .020 .114 68.1% .082 .009 <.001 0.8% d r- et ß0 -1.182 .014 .114 38.3% -4.481 .093 1.515 64.5% -1.317 .027 .045 24.3% o re ta > S O '-9 ß1 .139 .002 -.011 32.3% .585 .013 -.183 61.6% .089 .003 -.003 15.4% SO T3 Hi O se ß2 .137 .004 .026 44.4% .340 .014 .130 77.4% .098 .006 .016 26.3% o fe Underestimated ß0 -1.320 .019 -.023 7.8% -4.925 .090 1.071 45.6% -1.414 .034 -.052 28.5% Hi a: ß1 .157 .002 .007 21.2% .650 .013 -.118 39.8% .100 .003 .008 47.7% ß2 .112 .007 .002 3.0% .315 .020 .105 62.8% .073 .010 -.009 15.9% rde d nd et ß0 -1.342 .018 -.045 15.1% -4.923 .083 1.073 45.7% -1.427 .035 -.065 35.5% 3 a ry mit ß1 .160 .002 .011 31.1% .651 .011 -.117 39.3% .102 .003 .010 58.3% e ts > Ö ß2 .108 .008 -.002 3.5% .314 .019 .104 62.1% .069 .009 -.013 22.6% ß0 -1.201 .019 .096 26.6% -4.565 .085 1.432 56.0% -1.214 .031 .148 47.9% e g ß1 .153 .003 .003 11.4% .649 .013 -.119 42.9% .086 .003 -.006 25.2% O ß2 .115 .005 .004 7.7% .243 .017 .033 29.9% .093 .009 .011 16.5% d r- et ß0 -1.106 .013 .191 53.0% -4.207 .072 1.790 70.1% -1.157 .020 .205 66.6% > "o3 eg er ta > S O '-9 ß1 .144 .002 -.005 18.6% .606 .011 -.163 58.5% .081 .002 -.010 44.5% £ se ß2 .134 .004 .023 41.5% .265 .012 .055 49.5% .111 .007 .029 43.3% 1 Underestimated ß0 -1.236 .022 .060 16.7% -4.640 .094 1.357 53.1% -1.235 .031 .127 41.3% ß1 .164 .003 .014 48.1% .677 .015 -.091 32.9% .092 .003 <.001 0.9% Cfl ß2 .109 .006 -.002 3.6% .239 .020 .029 26.1% .087 .010 .005 7.2% r-de d nd et ß0 -1.257 .018 .040 11.1% -4.644 .091 1.352 52.9% -1.247 .031 .115 37.3% 3 a ry mit ß1 .167 .003 .017 60.5% .680 .015 -.088 31.8% .094 .004 .002 9.6% e ts > » ß2 .105 .007 -.006 10.7% .239 .017 .029 26.1% .083 .010 .001 1.0% 8 ß0 -1.278 .019 .018 6.9% -5.252 .097 .744 37.8% -1.263 .033 .099 39.2% e ë ß1 .141 .002 -.009 21.9% .632 .012 -.136 45.7% .079 .003 -.013 41.4% O ß2 .102 .005 -.008 20.3% .199 .018 -.011 14.4% .080 .009 -.002 4.0% d r- et ß0 -1.209 .014 .087 32.4% -4.959 .087 1.038 52.7% -1.221 .017 .141 56.1% .5» er ta > S O "-P ß1 .129 .002 -.021 50.4% .584 .011 -.185 61.9% .072 .002 -.019 63.4% O OH O se ß2 .116 .004 .005 13.7% .217 .014 .007 8.9% .093 .007 .011 21.8% "o3 S3 Underestimated ß0 -1.355 .023 -.058 21.5% -5.450 .110 .547 27.8% -1.307 .038 .055 21.7% .S3 £ ß1 .146 .003 -.004 8.7% .649 .014 -.119 40.0% .082 .003 -.010 32.7% ß2 .089 .006 -.022 55.2% .188 .022 -.022 29.2% .066 .010 -.016 30.3% r-de d nd et ß0 -1.378 .020 -.082 30.4% -5.444 .102 .553 28.1% -1.322 .034 .040 16.1% 3 a ry mit ß1 .149 .002 .000 0.7% .650 .013 -.118 39.6% .084 .003 -.008 26.6% re ts V se ß2 .084 .008 -.027 66.5% .189 .019 -.021 27.6% .062 .011 -.021 39.7% 45 Jose Pina-Sanchez A first point to notice is that compared to results from the true models presented in Table 1, the standard errors are underestimated by a half. This might be due to the small size of the true standard errors (expressed in the second or third decimal point), but it also illustrates that the variance of ¡3simex using bootstrap can only be approximated. Regarding the adjustment in terms of the reduction of the biases found in the naïve analyses we can observe varying levels of success. The effectiveness of the adjustment ranged from being able to reduce it to . 8% of its size (for jS2 in the Poisson model affected by heteroscedastic ME and using the correct estimate of