Metodoloski zvezki, Vol. 13, No. 1, 2016, 27-58

Adjustment of Recall Errors in Duration Data Using SIMEX

Jose Pina-Sánchez1

Abstract

It is widely accepted that, due to memory failures, retrospective survey questions tend to be prone to measurement error. However, the proportion of studies using such data that attempt to adjust for the measurement problem is shockingly low. Arguably, to a great extent this is due to both the complexity of the methods available and the need to access a subsample containing either a gold standard or replicated values. Here I suggest the implementation of a version of SIMEX capable of adjusting for the types of multiplicative measurement errors associated with memory failures in the retrospective report of durations of life-course events. SIMEX is relatively simple to implement and does not require replicated or validation data so long as the error process can be adequately specified. To assess the effectiveness of the method I use simulated data. I create twelve scenarios based on the combinations of three outcome models (linear, logit and Poisson) and four types of multiplicative errors (non-systematic, systematic negative, systematic positive and heteroscedastic) affecting one of the explanatory variables. I show that SIMEX can be satisfactorily implemented in each of these scenarios. Furthermore, the method can also achieve partial adjustments even in scenarios where the actual distribution and prevalence of the measurement error differ substantially from what is assumed in the adjustment, which makes it an interesting sensitivity tool in those cases where all that is known about the error process is reduced to an educated guess.

1 Introduction

Applied quantitative researchers commonly assume that variables included in their models are measured perfectly.
This often implicit assumption is, however, difficult to maintain when using survey data, as interviewer effects, interviewee fatigue, social desirability bias, lack of cooperation, or plain deceit inevitably introduce measurement error (ME). This is especially true for surveys using a retrospective design, which collect information about past events from a single contact with respondents. The advantages of retrospective designs, in comparison with prospective studies2, are well known: they are a) immune to problems of attrition; b) cheaper to administer; and c) more capable of detecting transitions occurring in short periods. Retrospective questions are, however, prone to ME as they require respondents to both interpret the question correctly and recall events that took place in the past.

1 School of Law, University of Leeds, J.PinaSanchez@leeds.ac.uk

The consequences of using data affected by ME are both difficult to estimate and potentially disastrous (Nugent, Graycheck & Basham, 2000; and Vardeman et al, 2010)3. Unfortunately, the latter is rarely acknowledged, and in certain cases it is directly misunderstood. For example, Carroll et al (2006) point at the widespread belief that ME affecting an explanatory variable will only attenuate the regression estimate of that variable4. Even amongst researchers who acknowledge the potential consequences of ME, very little is done to tackle the problem besides mentioning it as a caveat. There are two reasons for this: the requirement of additional data and the complexity of the adjustment methods available. Generally, methods for the adjustment of ME need to be informed about the true unobserved values using additional data. For example, multiple imputation (Rubin, 1987, and Cole, Chu & Greenland, 2006) requires access to a validation subsample where the true values are observed.
Regression calibration (Carroll & Stefanski, 1990; and Gleser, 1990) needs at least repeated measurements, while two stage least squares (Theil, 1953) requires instrumental variables. However, researchers' access to this type of data tends to be the exception rather than the norm. In addition, these three methods belong to the family of functional methods - i.e. those that do not make any assumptions about the distribution of the true values. A second group of methods, known as structural methods, are technically more complex, amongst other things because they require specifying the probability function of the unobserved true values. Examples of structural methods are likelihood based adjustments, either Bayesian or Frequentist. These methods account directly for the ME mechanism in place, which tends to involve ad hoc specifications, in turn increasing the complexity of the adjustment.

2 See Solga (2001) for a comparison of data quality derived from prospective and retrospective questions.

3 "Measurement error is, to borrow a metaphor, a gremlin hiding in the details of our research that can contaminate the entire set of estimated regression parameters." (Nugent, et al. 2000: 60). "Even the most elementary statistical methods have their practical effectiveness limited by measurement variation." (Vardeman et al., 2010: 46).

4 "Despite admonitions of Fuller (1987) and others to the contrary, it is a common perception that the effect of ME is always to attenuate the line. In fact, attenuation depends critically on the classical additive ME model." (Carroll et al., 2006: 46).

In this paper I will use simulated data to study the effectiveness of the multiplicative Simulation Extrapolation method (SIMEX) (Carroll et al. 2006; and Biewen, Nolte & Rosemann, 2008).
This is an extension of the standard SIMEX method (Cook & Stefanski, 1994) capable of adjusting for the recall errors that are typically observed in the retrospective reports of life-course events. SIMEX implementation is relatively simple in that it only requires an estimate of the reliability of the variable affected by ME. This is normally obtained using a subsample of replicated data. Here I will assume that such information is not available to the researcher, as is often the case. Instead I will use this technique to show its potential to carry out sensitivity analysis when the reliability ratios have to be assumed. That is, I will demonstrate how the problem of recall errors, so ubiquitous in retrospective data, can be effectively dealt with by researchers who have neither the technical background to carry out complex adjustments nor access to additional sources of data. In so doing, my ultimate goal is to encourage a wider audience of survey researchers both to reflect on the implications of relying on variables affected by ME and to consider the possibility of assessing the robustness of their findings.

In the following section I review the theory regarding the types of errors that can be expected from retrospective questions and the models that have normally been used to specify them. In Section 3, I present the simulated data that will be used in the analysis and illustrate the implications of using an explanatory variable affected by multiplicative errors in different outcome models. Section 4 lays out the functioning of the standard SIMEX and the extension considered to accommodate multiplicative ME. In Section 5 the results of the analysis are presented, and in Section 6 I conclude with a discussion of the relevance of the main findings.
2 Modelling Memory Failures in Retrospective Questions on Life-Course Events

Most studies aiming to assess the implications of ME or to adjust for them assume a simple error mechanism known as the classical ME model. This model was first formally defined by Novick (1966) as follows,

\[ X^{*} = X + V \qquad (2.1) \]

where X* is the observed variable, equal to the true variable X plus the ME term V, which fulfils six important assumptions:

1. Null expectancy refers to the assumption that the error term is non-systematic, or in other words, the expected value of the error term is zero, E(V) = 0.
2. The assumption of homoscedasticity indicates that the variance of the error term is assumed to remain constant across subjects, Var(V_i) = Var(V) = σ_V².
3. In addition to having an expectation of zero and constant variance, the error term is normally distributed, V ~ N(0, σ_V²).

[...]

[...] and as a count variable, Y_CO. The four simulated ME scenarios are represented by four error-prone versions of X (denoted X* below). In each of these scenarios X is subject to normally distributed classical multiplicative ME. I choose to simulate normal instead of log-normal errors (as explained in equation 2.6) to ensure that they are perfectly symmetric around their mean (the latter are skewed to the right to a certain extent). Figure 1 shows the probability density and mass functions for each of the variables simulated, while the specific code used in R is shown in Appendix I.

Figure 1: Probability Density and Mass Functions of the Simulated Variables

In the first ME scenario I simulate non-systematic errors distributed as N(1, .25). The multiplicative effect of these errors results in a new variable X* with a reliability ratio (RR) of .816. In the second scenario I explore the effect of heteroscedastic ME by changing the distribution of the errors from N(1, .15) to N(1, .35) when Z > 0.
This is a type of ME that could take place when different survey modes are used. For example, Roberts (2007) - after reviewing the literature - concluded that telephone interviews place a higher cognitive demand on the interviewee than face-to-face interviews, which tends to make them more prone to measurement error. In the third scenario I study the effect of systematically underreported durations by simulating errors distributed as N(.9, .25). These are the types of errors that could be expected in the presence of forward telescoping bias (e.g. Golub et al., 2000, and Johnson and Schultz, 2005, found evidence of these types of errors in reports of onset of drug usage and smoking, respectively), but also in the report of durations of socially undesirable events (e.g. Pina-Sánchez, Koskinen & Plewis, 2013 and Pina-Sánchez, Koskinen & Plewis, 2014, found an increased tendency to underreport the longer spells of unemployment). Lastly, I explore the opposite scenario, one where the errors are distributed as N(1.1, .25) to reflect overreported durations, which could be expected in the presence of backward telescoping or in reports of socially desirable events. The effects of these four types of simulated ME are shown using scatterplots in Figure 2. Notice that the top-right plot uses Z instead of X on the y-axis.

Figure 2: Scatterplots of the effect of the different types of measurement error considered

To assess the impact that these types of errors have on the regression coefficients of a linear, a logit, and a Poisson model, I compare the results from each of these models when X* is used (the naïve model) instead of X (the true model). Specifically, I focus on the bias in the regression coefficients,

\[ \mathrm{BIAS} = \beta_{n} - \beta_{t} \qquad (3.2) \]

where the subscript n stands for the naïve model and t for the true model.
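The four error scenarios and the bias measure just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R code from Appendix I: the distribution of X, the sample size, the seed and the outcome coefficients are hypothetical values chosen for the example, the second parameter of each error distribution is treated as a standard deviation, and the slope comes from a simple regression of a linear outcome on X alone (Z is simulated independently of X, so leaving it out does not distort the slope).

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_biases(n=5000, seed=1):
    """Return BIAS = naive slope - true slope for each error scenario."""
    rng = random.Random(seed)
    # Hypothetical data-generating values (assumptions, not the paper's).
    x = [rng.gauss(10.0, 4.0) for _ in range(n)]
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [-1.3 + 0.15 * xi + 0.11 * zi + rng.gauss(0.0, 1.0)
         for xi, zi in zip(x, z)]

    def error(kind, zi):
        # Multiplicative error U, so that X* = X * U.
        if kind == "non-systematic":
            return rng.gauss(1.0, 0.25)
        if kind == "heteroscedastic":
            return rng.gauss(1.0, 0.35 if zi > 0 else 0.15)
        if kind == "underreported":
            return rng.gauss(0.9, 0.25)
        return rng.gauss(1.1, 0.25)  # overreported

    beta_true = ols_slope(x, y)
    biases = {}
    for kind in ("non-systematic", "heteroscedastic",
                 "underreported", "overreported"):
        x_star = [xi * error(kind, zi) for xi, zi in zip(x, z)]
        biases[kind] = ols_slope(x_star, y) - beta_true
    return biases


biases = simulate_biases()
for kind, bias in biases.items():
    print(f"{kind:16s} BIAS = {bias:+.3f}")
```

Under these assumptions every scenario should give a negative bias for the slope of the error-prone variable, in line with the downward bias reported for that coefficient.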
In addition, to compare the impact of ME across models and across regression coefficients using different scales, I calculate a relative measure of the bias as follows,

\[ \mathrm{R.BIAS} = \left| \frac{\beta_{n} - \beta_{t}}{\beta_{t}} \right| \times 100 \qquad (3.3) \]

Results for the different models studied and the impact generated by the different types of ME are presented in Table 1. In all of the scenarios studied the effect of ME was reflected in a downward bias for β1 (the coefficient for the variable X*), and in upward biases for β0 and β2 (the coefficients for the constant and Z). In addition to the observed differences in the direction of the biases across coefficients, there are also strong differences in their intensity. The size of the bias for β2 is about twice as large as the bias for β0 and β1, reaching levels as alarming as 94.8% for the logit model with heteroscedastic ME, although the average size of the bias across all the scenarios is 39.5%.

Table 1: Impact of Measurement Error in the Regression Estimates

                             | Linear                        | Logit                         | Poisson
                             | Coef    SE    Bias   R.Bias   | Coef    SE    Bias   R.Bias   | Coef    SE    Bias   R.Bias
True                     β0  | -1.297  .035                  | -5.997  .388                  | -1.362  .069
                         β1  |   .150  .003                  |   .768  .050                  |   .092  .004
                         β2  |   .111  .016                  |   .210  .099                  |   .082  .038
Naïve: multi.            β0  | -1.013  .038   .284  21.9%    | -3.810  .258  2.187  36.5%    | -1.198  .065   .284  20.8%
                         β1  |   .118  .004  -.032  21.2%    |   .494  .034  -.275  37.5%    |   .075  .004  -.032  34.6%
                         β2  |   .156  .019   .045  40.9%    |   .275  .084   .065  30.9%    |   .157  .037   .045  55.1%
Naïve: heteroscedastic   β0  |  -.998  .039   .372  28.7%    | -3.649  .248  2.372  39.5%    | -1.178  .065   .372  27.4%
                         β1  |   .116  .004  -.043  28.8%    |   .471  .032  -.299  38.9%    |   .074  .004  -.043  47.0%
                         β2  |   .169  .020   .067  60.7%    |   .377  .085   .199  94.8%    |   .141  .037   .067  81.7%
Naïve: under.            β0  |  -.937  .038   .325  25.1%    | -3.441  .233  1.962  32.7%    | -1.054  .059   .325  23.9%
                         β1  |   .121  .004  -.024  15.9%    |   .490  .033  -.183  23.8%    |   .069  .004  -.024  26.1%
                         β2  |   .166  .020   .052  46.9%    |   .321  .084   .124  58.9%    |   .149  .037   .052  63.2%
Naïve: over.             β0  | -1.028  .038   .245  18.9%    | -4.029  .266  1.842  30.7%    | -1.110  .061   .245  18.0%
                         β1  |   .108  .003  -.039  26.1%    |   .470  .031  -.284  37.0%    |   .061  .003  -.039  42.7%
                         β2  |   .150  .019   .053  48.0%    |   .284  .087   .152  72.4%    |   .134  .037   .053  64.6%

While the different ME scenarios clearly show attenuated coefficients, none of the coefficients actually became statistically non-significant or changed their sign in comparison to the naïve models. This is partly due to the small effect that ME had on the standard errors, which were underestimated by a third of their size in the true logit model, and only slightly underestimated and overestimated when using a Poisson and a linear model, respectively.

These results are consistent with Biewen et al. (2008) who, in a simulated probit model with one predictor, find an upward bias in the constant and a downward bias in the slope induced by classical multiplicative ME. The results obtained here serve to reinforce these findings. In the presence of a type of ME different from classical additive, or for a model different from simple linear regression, the direction of the bias is not always towards the null. The difficulty of anticipating the direction and size of these biases - even in scenarios with moderate prevalence of ME - makes the implementation of adjustment methods an indispensable part of analysing survey data prone to these types of ME.

4 Standard SIMEX and Extensions to Account for Classical Multiplicative Measurement Error

The study of the adjustment of multiplicative errors dates back to the 1980s. Fuller (1984) and Hwang (1986) developed a method-of-moments correction for multiplicative ME in the explanatory variables of a linear model.
This method assumes that the value of the ME variance is known - or that it can be estimated - and is limited to applications where the ME mechanism affects only one of the explanatory variables in the context of a linear model. Lyles and Kupper (1997) compared the effectiveness of this method with others such as regression calibration and a quasi-likelihood approach, which could be applied to other non-linear outcome models. These methods, as mentioned above, are however of limited use to applied researchers in that they either require additional data in the form of replicated measures or validation subsamples, or are complex to implement. Regression calibration requires additional data in the form of replicated measures or a validation subsample. Quasi-likelihood approaches only need an estimate of the variance of the ME and, much like those relying on Bayesian statistics, can be applied when a full likelihood approach is not feasible due to computational intractability. However, their implementation is relatively complex, starting from the need to use specialised software (such as WinBUGS when considering Bayesian adjustments), which discourages many analysts from attempting the implementation of the necessary adjustment.

Because it only requires an estimate of the variance of the ME, is simple to apply, and generalises to any other outcome model regardless of its complexity6, SIMEX represents a very convenient alternative. SIMEX was first presented by Cook and Stefanski (1994) and refined in the following years by Stefanski and Cook (1995) and Carroll, Küchenhoff, Lombard, and Stefanski (1996). "The key idea underlying SIMEX is the fact that the effect of measurement error on an estimator can be determined experimentally via simulation" (Carroll et al., 2006: 98). In particular, SIMEX exploits the relationship between the size of the ME affecting a variable and the size of the bias in the regression estimates in the outcome model. Following Fuller (1987) we know that the unadjusted estimator of the slope, β̂1, does not converge asymptotically to the parameter but to:

[...]

[...] can be calculated by extrapolating G(β̂_{1,k}, λ_k) to G(λ_k = -1). Note that from equation 4.4 when λ_k = -1 the bias is cancelled out.

6 See for example He et al. (2007), who applied SIMEX to an Accelerated Failure Time model with one of the explanatory variables affected by classical ME, or Battauz et al. (2008), who adjusted for a similar type of ME problem but with an ordinal probit model as the outcome model.

7 This is the number of iterations used by default in the SIMEX packages in STATA and R.

Figure 3 represents the SIMEX process graphically. The solid line denotes the part of the extrapolation function that can be approximately observed through the regression estimates resulting after the outcome model is specified using simulated predictors with increasing levels of ME, and the dashed line represents the extrapolation to the case of no ME, which gives the adjusted estimate.

Figure 3: Extrapolation function (horizontal axis: 1 + λ_k, from 0.0 to 3.0; vertical axis: β̂1)

Figure 3 also shows some of the limitations of SIMEX. The entire extrapolation function cannot be observed; hence, it is hard to assess the quality of the adjustment. In addition, the extrapolation function needs to be approximated using a simple functional form. Therefore adjustments are only approximated, and their effectiveness depends on how well the extrapolation function is estimated, for which the choice of the right functional form is crucial. In the case depicted by Figure 3 it makes sense to think of the quadratic function as the better approximation, but it might not always be so clear.
Another cause of concern stems from the accuracy of the estimate of [...]

                      | Linear                       | Logit                        | Poisson
                      | Coef    SE    Bias    %      | Coef    SE    Bias    %      | Coef    SE    Bias    %
(label illegible)
                  β1  |   .141  .002  -.009  26.8%   |   .617  .012  -.151  54.9%   |   .090  .002  -.001   7.3%
                  β2  |   .121  .004   .010  23.2%   |   .207  .016  -.003   4.4%   |   .119  .005   .037  48.9%
Underestimated
                  β0  | -1.339  .018  -.043  15.1%   | -5.148  .095   .849  38.8%   | -1.436  .027  -.074  44.9%
                  β1  |   .160  .002   .010  32.1%   |   .681  .014  -.087  31.6%   |   .101  .003   .010  58.1%
                  β2  |   .095  .006  -.016  35.5%   |   .174  .020  -.035  54.6%   |   .096  .009   .014  18.2%
(label illegible)
                  β0  | -1.361  .018  -.064  22.7%   | -5.134  .076   .863  39.4%   | -1.451  .030  -.089  54.4%
                  β1  |   .163  .002   .014  42.7%   |   .681  .011  -.087  31.6%   |   .103  .003   .012  71.0%
                  β2  |   .090  .007  -.020  44.7%   |   .175  .020  -.034  53.1%   |   .092  .009   .010  13.2%
(label illegible)
                  β0  | -1.265  .019   .032  10.6%   | -4.790  .111  1.207  51.4%   | -1.378  .034  -.016   8.6%
                  β1  |   .149  .002  <.001   1.1%   |   .629  .016  -.140  46.9%   |   .096  .003   .004  22.8%
                  β2  |   .122  .005   .011  18.8%   |   .324  .020   .114  68.1%   |   .082  .009  <.001   0.8%
(label illegible)
                  β0  | -1.182  .014   .114  38.3%   | -4.481  .093  1.515  64.5%   | -1.317  .027   .045  24.3%
                  β1  |   .139  .002  -.011  32.3%   |   .585  .013  -.183  61.6%   |   .089  .003  -.003  15.4%
                  β2  |   .137  .004   .026  44.4%   |   .340  .014   .130  77.4%   |   .098  .006   .016  26.3%
Underestimated
                  β0  | -1.320  .019  -.023   7.8%   | -4.925  .090  1.071  45.6%   | -1.414  .034  -.052  28.5%
                  β1  |   .157  .002   .007  21.2%   |   .650  .013  -.118  39.8%   |   .100  .003   .008  47.7%
                  β2  |   .112  .007   .002   3.0%   |   .315  .020   .105  62.8%   |   .073  .010  -.009  15.9%
(label illegible)
                  β0  | -1.342  .018  -.045  15.1%   | -4.923  .083  1.073  45.7%   | -1.427  .035  -.065  35.5%
                  β1  |   .160  .002   .011  31.1%   |   .651  .011  -.117  39.3%   |   .102  .003   .010  58.3%
                  β2  |   .108  .008  -.002   3.5%   |   .314  .019   .104  62.1%   |   .069  .009  -.013  22.6%
(label illegible)
                  β0  | -1.201  .019   .096  26.6%   | -4.565  .085  1.432  56.0%   | -1.214  .031   .148  47.9%
                  β1  |   .153  .003   .003  11.4%   |   .649  .013  -.119  42.9%   |   .086  .003  -.006  25.2%
                  β2  |   .115  .005   .004   7.7%   |   .243  .017   .033  29.9%   |   .093  .009   .011  16.5%
(label illegible)
                  β0  | -1.106  .013   .191  53.0%   | -4.207  .072  1.790  70.1%   | -1.157  .020   .205  66.6%
                  β1  |   .144  .002  -.005  18.6%   |   .606  .011  -.163  58.5%   |   .081  .002  -.010  44.5%
                  β2  |   .134  .004   .023  41.5%   |   .265  .012   .055  49.5%   |   .111  .007   .029  43.3%
Underestimated
                  β0  | -1.236  .022   .060  16.7%   | -4.640  .094  1.357  53.1%   | -1.235  .031   .127  41.3%
                  β1  |   .164  .003   .014  48.1%   |   .677  .015  -.091  32.9%   |   .092  .003  <.001   0.9%
                  β2  |   .109  .006  -.002   3.6%   |   .239  .020   .029  26.1%   |   .087  .010   .005   7.2%
(label illegible)
                  β0  | -1.257  .018   .040  11.1%   | -4.644  .091  1.352  52.9%   | -1.247  .031   .115  37.3%
                  β1  |   .167  .003   .017  60.5%   |   .680  .015  -.088  31.8%   |   .094  .004   .002   9.6%
                  β2  |   .105  .007  -.006  10.7%   |   .239  .017   .029  26.1%   |   .083  .010   .001   1.0%
(label illegible)
                  β0  | -1.278  .019   .018   6.9%   | -5.252  .097   .744  37.8%   | -1.263  .033   .099  39.2%
                  β1  |   .141  .002  -.009  21.9%   |   .632  .012  -.136  45.7%   |   .079  .003  -.013  41.4%
                  β2  |   .102  .005  -.008  20.3%   |   .199  .018  -.011  14.4%   |   .080  .009  -.002   4.0%
(label illegible)
                  β0  | -1.209  .014   .087  32.4%   | -4.959  .087  1.038  52.7%   | -1.221  .017   .141  56.1%
                  β1  |   .129  .002  -.021  50.4%   |   .584  .011  -.185  61.9%   |   .072  .002  -.019  63.4%
                  β2  |   .116  .004   .005  13.7%   |   .217  .014   .007   8.9%   |   .093  .007   .011  21.8%
Underestimated
                  β0  | -1.355  .023  -.058  21.5%   | -5.450  .110   .547  27.8%   | -1.307  .038   .055  21.7%
                  β1  |   .146  .003  -.004   8.7%   |   .649  .014  -.119  40.0%   |   .082  .003  -.010  32.7%
                  β2  |   .089  .006  -.022  55.2%   |   .188  .022  -.022  29.2%   |   .066  .010  -.016  30.3%
(label illegible)
                  β0  | -1.378  .020  -.082  30.4%   | -5.444  .102   .553  28.1%   | -1.322  .034   .040  16.1%
                  β1  |   .149  .002   .000   0.7%   |   .650  .013  -.118  39.6%   |   .084  .003  -.008  26.6%
                  β2  |   .084  .008  -.027  66.5%   |   .189  .019  -.021  27.6%   |   .062  .011  -.021  39.7%

A first point to notice is that, compared to results from the true models presented in Table 1, the standard errors are underestimated by a half. This might be due to the small size of the true standard errors (expressed in the second or third decimal point), but it also illustrates that the variance of β_SIMEX using bootstrap can only be approximated. Regarding the adjustment in terms of the reduction of the biases found in the naïve analyses, we can observe varying levels of success.
The effectiveness of the adjustment ranged from being able to reduce it to .8% of its size (for β2 in the Poisson model affected by heteroscedastic ME and using the correct estimate of