Image Anal Stereol 2005;24:159-168 Original Research Paper A CASE STUDY ON POINT PROCESS MODELLING IN DISEASE MAPPING V iktor B eneš1-2, K arel B odläk1, J esper M0ller3 and R asmus W aagepetersen3 Charles University in Prague, Faculty of Mathematics and Physics, Sokolovska´ 83, 18675 Prague 8, Czech Republic, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod Voda´renskou vez¡´i 4, 18208 Prague 8, Czech Republic, 3Aalborg University, Department of Mathematical Sciences, Fredrik Bajersvej 7G, DK-9220 Aalborg, Denmark e-mail: benesv@karlin.mff.cuni.cz,jm@math.aau.dk,rw@math.aau.dk (Accepted September 5, 2005) ABSTRACT We consider a data set of locations where people in Central Bohemia have been infected by tick-borne encephalitis (TBE), and where population census data and covariates concerning vegetation and altitude are available. The aims are to estimate the risk map of the disease and to study the dependence of the risk on the covariates. Instead of using the common area level approaches we base the analysis on a Bayesian approach for a log Gaussian Cox point process with covariates. Posterior characteristics for a discretized version of the log Gaussian Cox process are computed using Markov chain Monte Carlo methods. A particular problem which is thoroughly discussed is to determine a model for the background population density. The risk map shows a clear dependency with the population intensity models and the basic model which is adopted for the population intensity determines what covariates influence the risk of TBE. Model validation is based on the posterior predictive distribution of various summary statistics. Keywords: background intensity, Bayesian estimation, L-function, log Gaussian Cox spatial point process. INTRODUCTION The aims of statistical disease mapping are to characterize the spatial variation of cases of a disease and to investigate connections with possible covariates. In particular it is of interest to identify areas with an elevated disease risk. The data may be a point pattern showing e.g., home residences of diseased people or locations where people have acquired an infection. Often, the data are aggregated so that only counts of diseased people within subregions of the study region are available. Indeed, most statistical analyses reported in the literature are based on a so-called area level approach, where a model for aggregated data is used after an initial aggregation. However, if a spatial point pattern is available, it is more natural to use a spatial point process model. Recent surveys of both the area level approach and point process modelling in disease mapping are given by Diggle (2000), Richardson (2003), and the accompanying discussion by Knorr-Held (2003) and Moller (2003). For a discussion of statistical analysis of disease mapping in general, see Lawson et al. (2001), Lawson (2001), and the references therein. In this paper, we consider a point process approach to the analysis of a data set of positions of locations where people in Central Bohemia have been infected by tick-borne encephalitis (TBE). Specifically we consider a log Gaussian Cox point process (LGCP), where covariates concerning occurrence of different forest types, altitude, and the population density are used in the modelling of the spatially varying intensity of TBE infections. LGCPs were independently introduced in astronomy by Coles and Jones (1991) and in statistics by Moller et al. (1998); see also Moller and Waagepetersen (2002). To the best of our knowledge we here for the first time consider a Bayesian approach to inference for LGCPs with non-aggregated data. A particular problem is the determination of the ’background intensity’ of humans being at risk, cf. Diggle (2000) and Lawson (2001). Raw geographical population data connects population numbers to home locations, but typically people get infected during visits to more or less distant surroundings of their homes. This is an additional complication compared with spatial analysis of chronic diseases like cancer, where the objective may be to study association between disease incidence and risk factors at the home locations. We consider various approaches to smoothing of population data, where the smoothing to some extent is connected to the movement of people around their homes. The data and previous analyses in Zeman (1997) and other papers are described in more detail in Section Data and background. Section Bayesian analysis using LGCP considers estimation of the background 159 Bene¡ V: Point processes in disease mapping intensity, modelling of the risk function in terms of a LGCP depending on covariates, and our Bayesian approach to inference using Markov chain Monte Carlo methods. The results for different models of the background intensity are discussed in the last Section. DATA AND BACKGROUND DESCRIPTIONOFDATA AND PROBLEM SETTING TBE is an infectious debilitating illness which is transmitted by parasitic ticks and which occasionally afflicts humans. Epidemiologists and medical practitioners making decisions on prophylactic measures deal with the problem of estimating the risk that a human gets infected by TBE at a specific location, cf. Zeman (1997). Since field studies of potential animal hosts are expensive, usually the data for statistical analysis consist of case locations and a population map. Moreover, explanatory variables of geographical nature which may influence the risk of infection are often available from geographical information systems. Fig. 1. Locations of infection of 446 cases of tick-borne encephalitis in Central Bohemia. For each distinct location the number of cases associated with the location is shown (a plus corresponds to one case). a) b) c) d) Fig. 2. a) Locations of small forests (10-50 ha) (independent of the forest type). b) Locations of medium forests (50-150 ha). c) Three types of forest. Conifer: black; mixed: dark grey; foliate: light grey. d) Map of altitudes (in metres). 160 Image Anal Stereol 2005;24:159-168 Fig. 1 shows the point pattern of locations of 446 reported cases of TBE in Central Bohemia during 1971-93. The empty space in the middle of the figure corresponds to the capital Prague, and the total area of Central Bohemia is about 11860 km2. This data set was first studied in Zeman (1997). Only 255 distinct points are visible due to ties in the data caused by positional error where several cases in an area have been associated with a common representative point. The distinct points in Fig. 1 are marked with the number of cases associated with each point. Information concerning the magnitude of the positional error is not available. Different covariates are shown in Fig. 2a–d. Fig. 2a and Fig. 2b shows the locations of forests of areas between 10-50 and 50-150 ha, respectively. This covariate information is possibly relevant since ticks can be transmitted by deers and other animals living in small forest areas. Fig. 2c shows the subareas of three different forest types (conifer, foliate, and mixed forest). Fig. 2b and Fig. 2c are obtained from satellite images of LANDSAT-5 MSS with resolution power of 80×80 m2. Finally, Fig. 2d shows a map of altitudes obtained from the Institute of Military Topography, Dobruska. Some other candidates, e.g. a covariate indicating the vicinity of a river, were not included in the analysis. Fig. 3. Population at 3582 administrative units in Central Bohemia represented by discs (see the text in the subsection Description of data and problem setting). Finally, population data from the National Census Bureau, Prague, are available. For the Central Bohemia they consist of the number of inhabitants in 3582 administrative units. In Fig. 3 each unit is represented by a disc with center at a census point and radius given by 0.02?#inhabitants in the unit km (chosen to get good resolution outside cities). Clusters of discs correspond to larger towns and cities. The total number of inhabitants is 1,112,717 and the largest city has about 74,000 inhabitants. PREVIOUS DATA ANALYSIS Zeman (1997) considers both the point pattern of TBE cases and another point pattern of cases for a related disease, Lyme borreliosis (LB). The LB data consist of paired data: 866 reported locations of infection during 1987-91 in Central Bohemia and the home location of each infected person. Zeman (1997) uses the distances between cases of infection and home location to obtain a kernel for smoothing the population map. Apart from this smoothed population map no other covariates are included in Zeman’s analysis where the intensity functions of TBE and LB cases (each considered as a realization of a point process) are estimated by kernel methods. For each disease, Zeman (1997) obtains a risk map by the ratio of the estimated intensity function and the smoothed population map (Bithell, 1990). A similar ratio estimator of the risk map is suggested by Krejc¡´i¡r (2000) where again both the TBE and the LB data are analyzed, using only the population data as an explanatory variable. He assumes that each point pattern of cases is a realization of an inhomogeneous Poisson point process with an intensity function constructed from beta splines, where the coefficients are estimated by a maximum likelihood method. Incorporating the other explanatory variables in the model has so far only been studied in connection to two area level approaches for the TBE data. Mas¡ata (1999) divides Central Bohemia into 41 irregular subregions, and he includes three covariates (the area in percentage of conifer, mixed, and foliate forests in each subregion) in a Bayesian Gaussian-Gaussian model (Stern and Cressie, 1999). Jirus¡e et al. (2004) use a subdivison of 141 squares of 10×10 km2, and include the same types of covariates as Mas¡ata (1999) together with the mean altitude and the proportion covered by small forests in each square. They use first a generalized linear model and the Akaike Information Criterion to optimize the set of parameters, and second an empirical Bayesian approach to estimate the risk. Jirus¡e et al. (2004) compare the credibility intervals for risk estimators obtained by their method with that of Mas¡ata (1999), and conclude that rather similar results are obtained in subregions with a large risk for infection, although it is only the model in Mas¡ata (1999) which incorporates spatial dependence. The results of the above-mentioned papers are further discussed in final Section. 161 Bene¡ V: Point processes in disease mapping BAYESIAN ANALYSIS USING LOG GAUSSIAN COX PROCESSES Let S denote the region of Central Bohemia and x the locations of observed tick cases. Below we describe a hierarchical model. At the first level, x is assumed to be a realization of a Poisson process X with an intensity function which is a product of a background intensity and a risk function as described in the first subsection of this section. Estimation of the background intensity is discussed in the second subsection. At the next level a log linear model for the risk function is proposed in the third subsection, incorporating the covariate information and a Gaussian process so that the uncertainty of the estimated background intensity is taken into account. At the final stage hyper priors on the unknown parameters for the covariates and the Gaussian field are imposed, whereby a posterior is obtained in the last subsection of this section. For computational reasons certain approximations of the posterior are required. Strictly speaking, the multiple points in x can not occur under the proposed model. However, our approximate approach only utilizes counts of locations within certain small cells and this makes the results less sensitive to the presence of ties in x. A SIMPLIFIED MODEL Our modelling of the TBE data is motivated by the following simplifying considerations, which are similar to one of the steps in the construction of a Neyman-Scott process (Neyman and Scott, 1958; Diggle, 1983). In the observation period 1971-93 a number m = 1,112,717 of persons are living at home locations h1,...,hm ? S, and the ith person makes a number Ni of visits to the surroundings of hi, i = 1,2,...,m. The Ni’s are assumed to be independent and Poisson distributed with mean ? > 0 independent of i. Given the Ni, the location of each visit of the ith person is distributed according to some density ghi , and the locations of visits of all persons are assumed to be independent. For a visit to a location s ? S, there is associated a probability ?(s) for getting an infection during the visit. The random set of locations where persons have been infected is then a Poisson process with intensity function of the form A(s)=p(s)7z(s),seS (1) where ?(s) =??mi= 1 ghi (s) is the background intensity of humans visiting s. ESTIMATION OF BACKGROUND INTENSITY The background intensity p(s) is a crucial component of the modelling. As it is unknown, we discuss below how it may be estimated. For the LB data both locations of infection and home are available. Under the crude assumption that the densities gh. are of the form gh.(s) = g(\\s-hi\\) one may as in Zeman (1997) try to estimate g from the LB data. Recall that if f denotes the density of ||Z|| for a two-dimensional random variable Z with isotropic density gh{z) = g(\\z-h\\), then g and f are related by g(l|z||)=f(lzll)/(2tf||z||), zeR2. (2) Zeman (1997) fits a power regression to a histogram for the log distances between home and place of infection. He then obtains an expression f(h) = ahb for the density of the distances and uses this as a kernel for smoothing of the population data. Strictly speaking f ~ is not a proper density on R+, and apparently Zeman (1997) is not using the correct transformation Eq. 2 to obtain a density on R2. We try another approach where we fit a non-parametric kernel density estimate f ^ to the distances between home and place of infection in the LB data. The density f ^ is subsequently transformed by Eq. 2 into a density g. The kernel estimate of the background intensity is finally p(s) = Xj U Kjg(\\s-uj\\) (3) where U is the index set of the administrative units, K j is the number of persons associated with the jth unit, and uj is the census point of the unit, cf. Fig. 3. Here A is for the moment left unspecified as it is absorbed into another parameter introduced in the next subsection. Note that we are ignoring the fact that people in the jth unit live in the vicinity of uj and not exactly at uj. The kernel estimate and alternative models for the background intensity are further discussed in the last subsection of this Section. PRIOR DISTRIBUTIONS AND LIKELIHOOD USING A LGCP We model n in Eq. 1 by a log linear model, n(s)=exp{ßT d(s) + Y(s)) (4) where Y(s) is a zero-mean Gaussian process, ß = (ß0,...,ß6)T is a regression parameter, and d(s) = (1,d 1 s),...,d 6 s))T. Here /30 is an intercept, and d1(s),...,d6(s) are the 6 covariates associated with the location s G S, where the index corresponds to the following: 162 Image Anal Stereol 2005;24:159-168 1 r^j forest 10-50 ha, 2 r^j forest 50-150 ha, 3 r^j conifer, 4 r^j mixed forest, 5 r^j foliate, 6 r^j altitude. That means di(s),...,d5 (s) are zero-one variables corresponding to absence-presence of the attribute at location s. The role of exp(Y(s)) is partly to model deviations of p(s)/p(s) from one. Therefore we do not constrain Eq. 4 to be less than one. Actually, in the previous section, X is absorbed in exp(/30) and we replace the unknown p by the estimate Eq. 3 with X = 1. Then n(s) is more precisely a relative risk function, since for shs2 G S, n{si)/n(s2) is the ratio of risk functions at the locations s\ and s2. We assume that Y is second-order stationary and isotropic with exponential covariance function, i.e., Cov(Y(s1),Y(s2)) c(\\s1-s2\\;a2,a) = o2exp(-\\sl-s2\\/a) (5) where o2 > 0 is the variance and a > 0 is the correlation parameter. A log Gaussian Cox process (LGCP) is then obtained by assuming that conditionally on Y = (Y(s))seS and 0 = (ß,a,a), the TBE cases form a Poisson process X with intensity function p(s)n(s). We view the Gaussian distribution for Y as a prior and the conditional distribution of X given (Y, 0) as the likelihood. Furthermore, a hyper prior density p(6) for 0 is imposed; specific hyper priors are considered in the next Section. Notice that the likelihood depends on 0 only through ß, and it has density p(x|Y,ß) (|S|- p(s)exp(ßT d(s) + Y(s))ds)x V S ) x \\p{^)exp{ßT d{^) + Y{^)) (6) L,ex with respect to the unit rate Poisson process on S where | • | denotes area. POSTERIOR AND DISCRETIZATION The posterior, that is, the conditional distribution of (Y, 0) given X = x, can be specified as follows. Suppose that p(6) is proper and let Ee denote expectation conditionally on 0. For n> 1 and pairwise distinct sh...,sn G S, let f{s1,...,sn)(-|0) denote the conditional density of (Y(si),...,Y(sn)) given 0. The posterior density of (Y(sl),...,Y(sn),6) given X is defined by f(s1,...,sn)(yu---,yn,e|x)~ Ee[p(x|Y,ß)Y(sl) =yi,...,Y(sn)=yn]x | f(s1,...,sn)(yi,---,yn|d)p(d). (7) The posterior of the process (Y,6) given X = x is then given by the consistent set of finite-dimensional posterior distributions with densities of the form Eq. 7 for n > 1 and pairwise distinct sh...,sneS. If p(6) is improper we define the posterior similarly provided it is well-defined; i.e., provided [Ee[p(x|Y,ß)]p(d)dd<°°. The integral in Eq. 6 depends on the continuous random field Y which cannot be represented on a computer. In practice the integral is approximated by a Riemann sum. The region S is appropriately translated and embedded in a rectangular region, say a square [0, b}2 of side length b > 0. For an integer M the square is divided into a lattice of M2 subsquares Cj, 17 G IM = {1,... ,M2}, and in each subsquare indexed by 17 the covariate value is constant, represented by an average value dM{i}) = jCMd(s)ds/|C^|. In Waagepetersen (2003) the approximate posterior based on the discretization is described and it is proved that under certain conditions, expectations computed with respect to the approximate posterior converges to the corresponding expectations with respect to Eq. 7 when M tends to infinity. RESULTS In this section we discuss the results for the TBE data obtained by the Bayesian approach described above. Details concerning estimation of population intensity are given in the first subsection together with specification of priors, and posterior results are discussed in the second subsection. Model selection is adressed in the third subsection. The estimated relative risk map is discussed in the fourth subsection. Concluding remarks are given in last subsection of this section. SPECIFICATION OF POPULATION INTENSITY MODEL AND PRIOR DISTRIBUTIONS For the discretized LGCP, S is rescaled and embedded in a unit square which is divided into a grid m „ G IM where M = 65 Thereby of square cells CM, rj x _ _ 163 Bene¡ V: Point processes in disease mapping Central Bohemia is covered by 2166 cells, and S occupies about 51% of the unit square. Other values M = 17 and M = 33 considered yielded too coarse discretizations while M = 129 was computationally too demanding. The following models of the population intensity are considered, setting A = 1. - Model W (constant p): pconst(s = ZjeUKj/\S\ is constant. - Model P (kernel based on paired data): ppair(s) = HjGUKjg ^hs - uj\\) is estimated as in Eq. 3 using the paired LB data. - Models B,D, and E (Gaussian kernel): pT(s) = Lj€UKjg(\\s-uj\\;z) where g(;t) : R+ -»¦ R+ is given by g(h; t) =
°° for a Gaussian kernel. Fig. 4 shows a selection of the different kernels g(-). Note that the tail for P falls between the tails of D and E.
For all the population intensity models we use independent hyper priors for ß, a, and a given by
p 1(j8)oc1, ßeR7,
p2(c7)°cexp(-10-6/