Optimal design for epidemiological studies subject to designed missingness
Michele Morara · Louise Ryan · Andres Houseman · Warren Strauss
Lifetime Data Anal (2007) 13:583–605
DOI 10.1007/s10985-007-9068-7
Received: 12 October 2007 / Accepted: 25 October 2007 / Published online: 14 December 2007
© Springer Science+Business Media, LLC 2007
M. Morara · W. Strauss
Battelle Memorial Institute, 505 King Avenue, Columbus, OH 43201, USA
L. Ryan (✉) · A. Houseman
Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115, USA
e-mail: [email protected]
Abstract In large epidemiological studies, budgetary or logistical constraints will typically preclude study investigators from measuring all exposures, covariates and outcomes of interest on all study subjects. We develop a flexible theoretical framework that incorporates a number of familiar designs such as case-control and cohort studies, as well as multistage sampling designs. Our framework also allows for designed missingness and includes the option for outcome dependent designs. Our formulation is based on maximum likelihood and generalizes well known results for inference with missing data to the multistage setting. A variety of techniques are applied to streamline the computation of the Hessian matrix for these designs, facilitating the development of an efficient software tool to implement a wide variety of designs.
Keywords Stage-sampling design · Validation sampling · Constrained optimization
1 Introduction
In large epidemiological studies, budgetary or logistical constraints will typically preclude study investigators from measuring all exposures, covariates and outcomes of interest on all study subjects. There is a rich literature on innovative study designs that address this issue. Perhaps the most well known is the familiar case-control study, which involves selecting a sample of subjects who have experienced the event of interest, along with a sample of controls. It is well known that a carefully designed case-control study will be substantially more powerful than a comparably sized cohort study when the endpoint of interest is rare (Cornfield 1951; Breslow 1996). There has been great interest in extensions of case-control methodologies in recent years, for example, to handle two- and multistage sampling (Breslow and Cain 1988; Wacholder and Weinberg 1994; Hu and Lawless 1997) and stratified sampling (Langholz and Goldstein 2001; Breslow and Chatterjee 1999). There has also been considerable work on understanding and extending the analytic framework for analyzing case-control studies, including viewing them as a special type of outcome-dependent design (Zhou and Weaver 2001; Zhou et al. 2002a,b), developing semi-parametric efficient estimation (Breslow et al. 2000, 2003) or establishing a theoretical basis for ignoring the sampling mechanism (Carroll et al. 1995b; Scott and Wild 2001a,b; Lawless et al. 1999). Other authors discuss case-control studies from the missing data perspective (Chatterjee and Carroll 2005; Spinka et al. 2005). Clayton et al. (1998) discuss the use of multistage designs in longitudinal studies. Innovative approaches such as rotating panel designs (Duncan and Kalton 1987), accelerated longitudinal designs (Harezlak et al. 2005) and other strategies for intentionally incomplete designs (Helms 1992) have also been suggested in the context of longitudinal studies where there might be a desire to restrict the number of follow-up measurements taken on each individual in order to lower cost or respondent burden. Other authors, for example Verbeke and Lesaffre (1999) and Verbeke and Molenberghs (2000), discuss practical ways to assess the power of incomplete designs.
With one or two exceptions (Spiegelman and Gray 1991a,b), the focus of much of this literature has been on innovative analysis methods. For example, consistent parameter estimates under a risk-set sampling design can be obtained by maximizing a likelihood that has the same structure as for a standard Cox proportional hazards model. With the exception of case-control designs, where quite a lot has been written about issues such as the power gains from increasing the number of controls sampled for each case, relatively little has been written about more general design strategies. For example, an investigator in the planning stage of a study might want to know whether the power gains associated with a case-control design warrant the extra risk of introducing bias. In this paper, we describe a recently developed and sophisticated software tool that implements optimal design guidelines for a wide variety of study designs. By conceptualizing all these designs as various special cases of designed missingness, we are able to exploit the considerable literature on missing data. Key to our approach is a logistic regression model that characterizes the probability of missing data as a function of various subject characteristics and outcomes. Optimal designs are obtained by optimizing an appropriate objective function with respect to the parameters in the missingness model. Our framework is quite flexible, allowing for multiple stages of sampling and accommodating binary or continuous variables. Because designed missingness will always lead to data that are at least Missing at Random (MAR), if not Missing Completely at Random (MCAR), we are able to use a maximum likelihood approach that does not require explicit modeling of the missingness mechanism.
There is a rich statistical literature on the topic of missing data in general, as well as missing covariates in particular. Ibrahim (1990) proposed the EM by the method of weights to adjust for missing covariates in a generalized linear model. He takes a maximum likelihood approach, but assumes that the missing covariates are categorical in order to streamline the problem from a computational perspective. Reilly and Pepe (1995) take a non-parametric approach. Many authors have discussed the use of surrogate or auxiliary variables to help boost power in settings where covariates are missing on some subjects (Horton and Laird 1999). Several authors have discussed the analysis of case-control studies from the missing data perspective (Chatterjee and Carroll 2005; Spinka et al. 2005). We have found the missing data perspective to be particularly helpful in terms of extensions to the longitudinal setting. We have developed design strategies based on maximum likelihood.
The work was motivated by the need to plan for the National Children's Study, a projected $2.7 billion longitudinal public health study, whose goal is to enroll 100,000 women in early stages of pregnancy or prior to conception and to follow them and their children through early adulthood in an effort to investigate environmental, social, behavioral and genetic influences on children's health and development, including understanding any environmental exposures that may cause or exacerbate health impacts. Due to the large sample size and longitudinal nature of the study, unique statistical issues arise that must be addressed before a cost-effective sampling design can be developed to recruit the study population and gather information on health outcomes and environmental/personal exposure data.
In Sect. 2, we introduce our modeling framework. Section 3 presents special cases of the model. Section 4 describes some of the strategies and tools used in the computation. In Sect. 5, we use the tool to evaluate and compare a variety of different design strategies. Section 6 concludes with a discussion.
2 Statistical notation and modeling framework
Consider a sample of N subjects who have been identified for participation in a large cohort study. In the National Children's Study which motivates our work, for example, N will be 100,000. Let Y be the outcome variable of interest measured on an arbitrarily chosen subject. Depending on the context, Y could be univariate (in the NCS, for example, Y might be an indicator of whether or not the child develops autism by age 5) or it could be a vector of longitudinally measured responses (for example, a set of indicators of the presence of wheeze at each follow-up time for the child). Let X be an exposure variable of primary interest (for example, the level of a pesticide measured at the family home at the time of the child's birth). To allow the possibility of additional covariates, let W be a 1 × p vector with first element 1, second element X and remaining elements any additional covariates that might be important to consider when modeling the relationship between X and Y. For example, W could include maternal age and medical history, socioeconomic status and so on. We will be considering models for the settings where the outcome of interest is continuous as well as when it is binary. For the former, and assuming for now a univariate outcome, we start by considering linear models of the form:
Y = Wβ + ε,   (1)
where the vector β includes an intercept as well as the coefficient of X and the coefficients of any additional covariates in the model. For many of our design scenarios, we consider the very simple model
Y = β_0 + β_X X + ε.   (2)
However, we will also be interested in assessing design implications of having an effect modifier (suppose this is characterized by a variable M), so that we are interested in variations on the following simple model:
Y = β_0 + β_X X + β_M M + β_{XM} XM + ε.   (3)
The parameter of interest in model (3) is β_{XM}, since it describes how the coefficient of X changes according to the value of M. In all these models, we assume that the error terms ε are normally distributed with mean 0 and variance σ. We will also be interested in design considerations for binary univariate outcomes, in which case model (1) will be replaced by
logit[Pr(Y = 1)] = Wβ.
We consider the setting where the variable X is too expensive or difficult to measure for all N subjects. While one option is to select a random sample of the full cohort on whom to measure the exposure and perform the analysis of interest, there are a variety of options available that can lead to more study power. Often, there may be additional variables available (we denote these by Z) that are easy and/or inexpensive to measure and which are predictive of the exposure of interest. If X represents a particular pesticide metabolite, for example, such variables might include information gathered from questionnaires, such as indicators of whether pesticides have been recently used in the home or in the garden. To facilitate our discussion, we refer to these Z-variables as surrogates for X. The extensive literature on statistical adjustments for measurement error can be used to identify methods that can use the surrogate variables along with the measured X's in the analysis. It may make sense to use these easily measured variables to identify a subset of the full cohort. For example, one might want to ensure a wide range of observed X-values by selecting individuals based on the predicted values of the true X-values, given the observed values of the surrogates, Z. As discussed earlier, for example, it may be appropriate to consider a type of case-control design which involves using the observed values of Y to identify the appropriate subset for sampling.
2.1 Multi-stage sampling
To allow the most generality, we suppose that sampling proceeds in stages, with stage 0 corresponding to the measurement of the outcome Y and any covariates in W, stages 1 to m − 1 corresponding to the measurement of the set of surrogates or predictors Z of the exposure of interest, and the very last stage m corresponding to the measurement of the true exposure of interest X, for a total of m + 1 stages. We assume a multi-stage sampling paradigm where a subject can only be sampled at stage j = 1,…, m if they were sampled at stage j − 1. In many cases, it will make sense to assume that Y, W and Z are measured on everyone. In the example of autism and pesticides in the NCS, all study participants are likely to fill out questionnaires at the time of enrollment regarding whether pesticides have been used in the house or garden over the past year. While certainly not perfect predictors, responses to such questionnaires should have a moderate correlation with more accurate exposure measures such as urinary metabolites. In our subsequent development, we will let π_j, j = 1,…, m denote the conditional probability that a subject is sampled at stage j, given that they were sampled at stage j − 1. We define π_0 as the unconditional probability of being selected at stage 0.
The conditional probabilities π_0,…, π_m are the quantities that characterize an m-stage sampling design. The formulation is flexible and encompasses a broad range of designs. For example, setting π_1 = π_2 = ··· = π_m = 1 implies a study design where a random sample of a proportion π_0 of the full cohort has measurements taken on Y, Z_1, Z_2,…, Z_{m−1} and X. A case-control type design would correspond to one where π_m, the probability of being measured on X, depends on the observed value of Y. We use a logistic regression modeling framework to characterize the possible designs. In particular, we write the probability of being sampled at stage j as
logit(π_j) = γ_{j0} + γ_{j1} h_0(Y) + \sum_{k=1}^{j-1} γ_{j(k+1)} h_k(z_k),   (4)
where h_k(z_k) is a transformation of z_k. For example, if Z_k is continuous, we might assume that it will be dichotomized, for example high versus low. In this case, the function h_k(z_k) will be 1 if Z_k > C and 0 otherwise, where C is an appropriate predetermined cutoff value such as the median. Alternatively, Z_k might be trichotomized into low, medium and high levels. If Z_k is binary, then h_k(z_k) = z_k will just be the identity function. Consider for example a design with 3 stages. If logit(π_0) equals a constant γ_00, then each subject has probability exp(γ_00)/[1 + exp(γ_00)] of being sampled at stage 0. We might put π_1 = 1 (technically this would correspond to the degenerate case where γ_10 is infinite) so that the surrogate Z_1 is measured on everyone selected at stage 0. If we model π_2 as
logit(π_2) = γ_20 + γ_21 Y,
then the probability of being sampled at the last stage is exp(γ_20)/(1 + exp(γ_20)) for subjects with Y = 0 and exp(γ_20 + γ_21)/(1 + exp(γ_20 + γ_21)) for subjects with Y = 1. If γ_21 is large (so that the sampling probabilities are much higher for subjects who have experienced the event of interest), this would yield what effectively corresponds to a nested case-control design. A class of interesting cases considers the setting where the true exposure of interest X is not available (see the Application section, "No gold standard"). This is simply realized by setting π_m = 0.
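As a concrete illustration of the sampling model (4), the following is a minimal sketch evaluating a stage-specific selection probability for the nested case-control flavour just described; the γ values used in the example are purely illustrative and not taken from any fitted design.

```python
import numpy as np

def stage_probability(gamma, y, z_transformed):
    """Conditional sampling probability pi_j from the logistic model (4).

    gamma         : (gamma_j0, gamma_j1, ..., gamma_j(j)) for stage j
    y             : value of h_0(Y) for this subject
    z_transformed : values h_k(z_k) for the surrogates measured before stage j
    """
    eta = gamma[0] + gamma[1] * y + np.dot(gamma[2:], z_transformed)
    return 1.0 / (1.0 + np.exp(-eta))

# Stage-2 sampling depending on Y only: with gamma_21 large, cases (Y = 1)
# are sampled at the last stage far more often than controls (Y = 0).
gamma2 = np.array([-3.0, 4.0])          # (gamma_20, gamma_21), illustrative values
p_control = stage_probability(gamma2, y=0, z_transformed=[])
p_case = stage_probability(gamma2, y=1, z_transformed=[])
print(p_control, p_case)                # approx 0.047 vs 0.731
```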
Finding an optimal design is in principle quite straightforward: one simply has to find the sampling design (equivalently, the set of γ's that define the selection probabilities) that leads to the smallest variance for the parameter of interest (say, the coefficient of X in the regression on Y), subject to the constraint imposed by cost or logistical considerations. The specifics of finding the optimal design depend, of course, on the method used to fit the dose-response model of interest. While there are a number of different options available, including weighted estimating equations or pseudo-likelihood, we have taken a maximum likelihood approach. This means that once probability models have been specified on all random quantities, the variance of the parameter of interest may be obtained by selecting the appropriate element from the inverse of the negative expectation of the derivative of the likelihood score.
2.2 Likelihood formulation
To develop our framework, consider a setting where we are considering a possible design with m + 1 stages including stage 0. Basically this corresponds to the case where, in addition to the exposure of interest X, there are m − 1 available surrogates Z that might be useful in predicting X when it cannot be measured directly. We formalize this with some notation. Let f_{Z|X}(z|x; θ_Z) denote the conditional distribution of the observed surrogates or predictors Z, given the true exposure X, where θ_Z are the parameters necessary to characterize this distribution. Let f_X(x; θ_X) denote the marginal distribution of X, with θ_X denoting the parameters that characterize this distribution. Finally, let f_{Y|X}(y|x; θ_Y) denote the conditional distribution of the outcome, given the exposure X, with θ_Y denoting the regression parameters and any other necessary parameters (e.g. the error variance in the normal case). We will generally assume that, conditional on the true exposure X, the outcome Y and surrogate variables Z are independent. This means that the likelihood contribution for a subject measured at stage m (i.e. all possible variables are measured on this subject) can be expressed as:
f_{Y,X,Z}(y, x, z; θ) = f_{Y|X}(y|x; θ_Y) \prod_{j=1}^{m-1} f_{Z_j|X}(z_j|x; θ_{Z_j}) \, f_X(x; θ_X),   (5)
where we made the assumptions
f_{Y|X,Z} = f_{Y|X},   f_{Z|X} = \prod_{j=1}^{m-1} f_{Z_j|X}.
These conditional independence assumptions are well accepted in the structural equations literature (see Sanchez et al. 2005, for a recent review), and will be familiar to many statisticians as non-differential measurement error (see Carroll et al. 1995a). The likelihood contribution for a subject sampled at stage j, but not at stages j + 1 through m, can be obtained by integrating the full likelihood (5) over the data from stages j + 1 through m. It will be convenient to introduce the vector Q to indicate all the data, with Q_0 denoting the outcome variable Y, Q_j the surrogate variables Z_j, j = 1,…, m − 1, and Q_m the exposure of interest X. We use Q̄_j to denote the random variables to be measured in all stages up to and including stage j, namely Q̄_j = (Q_0,…, Q_j), and Q̃_j to denote the random variables to be measured from stages j + 1 to m, namely Q̃_j = (Q_{j+1},…, Q_m). Let θ be the expanded parameter vector that includes all the parameters necessary to characterize the distributions in (5), that is, θ_0 ≡ θ_Y, θ_j ≡ θ_{Z_j}, j = 1,…, m − 1, and θ_m ≡ θ_X. We use the notation L(q; θ) to denote the complete data likelihood contribution for a single subject, defined at (5), re-expressing it as:
L(q; θ) = \prod_{j=0}^{m-1} f_{Q_j|Q_m}(q_j|q_m; θ_j) \, f_{Q_m}(q_m; θ_m).   (6)
The likelihood contribution for a subject who is not sampled at stages j + 1 through m can be obtained by integrating over the unobserved data:
L_j(q̄_j; θ) = \int_{q̃_j} L(q; θ).   (7)
Note that we use somewhat loose notation for integration, for example, with \int_x f(x) denoting \int f(x) dx if X is continuous and \sum_x f(x) if X is discrete. When the argument of integration is clear from the context, we will sometimes simply write \int_x f. Some straightforward algebra establishes the following recursive relationship between the likelihoods based on data collected on stages 1 through (j + 1) versus stages 1 through j, respectively:
L_j(q̄_j; θ) = \int_{q_{j+1}} L_{j+1}(q̄_{j+1}; θ).   (8)
Note that this concise formulation applies only for sequential sampling schemes where successive stages are nested, in the sense that being observed at stage j implies also being observed at stages 0 through (j − 1). For example, this formulation would not apply to rotating panel designs. We denote by U ≡ U(q; θ) the complete data score, obtained by taking the derivative with respect to θ of log L(q; θ), defined at (6). Let P_j be an indicator that a subject has data on Q̄_j (i.e. sampled at stages 0 through j), but not on Q̃_j (i.e. not sampled at stages j + 1 through m). Then, straightforward algebra establishes that the likelihood score for a randomly selected individual can be expressed as:
\sum_{j=0}^{m} P_j U_j,   (9)
where
U_j = \frac{\int_{q̃_j} L U}{\int_{q̃_j} L}   (10)
is the expected value of the complete data score U with respect to the conditional distribution of Q̃_j given Q̄_j, and where L is defined at (6). For ease of exposition, the dependence of L and U on q as well as on θ has been suppressed in our notation.
Expression (9) represents the multistage generalization of the well known results of Louis (1982), who derived an expression for the likelihood score in settings where one variable (possibly multivariate) was missing for some subjects. The variance-covariance matrix of the estimates of the various parameters that characterize the distributions of Y, X and Z is given by the inverse of the Hessian H, the negative expected value of the second derivative of the log-likelihood. While there are a number of different equivalent ways to express the Hessian matrix, we have found the following succinct expression to be particularly useful from a computational perspective (see Appendix A for a derivation of the formula):
H = \int_q L \sum_{j=0}^{m} α_j U_j U_j^T,   (11)
where α_j is the probability of being measured at stage j but not at stages j + 1,…, m. The α_j's can be computed from the conditional sampling probabilities π_j, defined at (4), as follows. Suppose we define δ_j to be the marginal probability of being measured at stage j, which can be expressed in terms of the conditional probabilities as
δ_j = \prod_{k=0}^{j} π_k.   (12)
Then, the α_j's in expression (11) can be defined as:
α_m = δ_m,   α_j = δ_j − δ_{j+1},   j = 0,…, m − 1.   (13)
Note that expression (11) is the multistage generalization of the result of Meilijson (1989), Breslow (1996), Breslow and Cain (1988), Breslow et al. (2000, 2003).
Specific expressions for the Hessian matrix H will vary according to the assumptions regarding the relationships between X, Y and Z, and we will examine some of these presently. In general, numerical approximations will be needed to compute H. Our chosen approach, Gaussian quadrature, will be discussed in more detail presently.
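To make the bookkeeping in (12) and (13) concrete, here is a minimal sketch of the computation for a single subject; the three-stage probabilities used at the end are hypothetical.

```python
import numpy as np

def stage_weights(pi):
    """Marginal probabilities delta_j (Eq. 12) and weights alpha_j (Eq. 13)
    from the conditional stage probabilities pi_0, ..., pi_m of one subject."""
    pi = np.asarray(pi, dtype=float)
    delta = np.cumprod(pi)                 # delta_j = prod_{k<=j} pi_k
    alpha = np.empty_like(delta)
    alpha[-1] = delta[-1]                  # alpha_m = delta_m
    alpha[:-1] = delta[:-1] - delta[1:]    # alpha_j = delta_j - delta_{j+1}
    return delta, alpha

# Three-stage example: everyone measured at stages 0 and 1, half at stage 2.
delta, alpha = stage_weights([1.0, 1.0, 0.5])
# alpha = [0.0, 0.5, 0.5]; the alphas sum to delta_0, the probability of
# being selected at stage 0 at all.
```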
2.3 Objective function
Getting back to the question of optimal design, the idea is to find the design that minimizes the diagonal element of H^{-1} corresponding to the coefficient of X in the regression model for Y, or, more generally, a linear combination of the diagonal elements of H^{-1} if the regression model for Y depends on covariates other than X. We refer to this quantity as the objective function and denote it by g(γ; θ), indicating that it is to be minimized with respect to the design parameters γ, but that it is also a function of the model parameters θ. The objective function g(γ; θ) should be minimized under an appropriate constraint c(γ; θ) = 0. For example, if the total budget available is B, then the constraint would be
N \sum_{j=0}^{m} C_j E[δ_j] − B = 0,
where N is the size of the full cohort and C_j is the measurement cost at stage j.
We use Lagrange multipliers to minimize the desired objective function g(γ; θ) subject to the constraint c(γ; θ) = 0; that is, the constrained minima of g(γ; θ) will correspond to the minimum of the Lagrange function
ℒ(γ, λ; θ) = g(γ; θ) − λ c(γ; θ),
where the Lagrange multiplier λ is treated as an extra variable. To improve numerical stability the augmented Lagrangian ℒ + ½ μ c^2, with penalty parameter μ, can be used (see Nocedal and Wright 2000, Chap. 17, for details on numerical constrained optimization).
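The constrained minimization can be sketched along the following lines, in the spirit of the augmented Lagrangian schemes of Nocedal and Wright (2000, Chap. 17). The objective g and constraint c below are toy placeholders standing in for g(γ; θ) and the budget constraint; the penalty value, the Nelder–Mead inner solver and the multiplier update are illustrative choices, not the algorithm actually implemented in the software.

```python
import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(g, c, gamma0, mu=10.0, outer_iters=20):
    """Minimize g(gamma) subject to c(gamma) = 0 with a simple augmented
    Lagrangian loop: inner unconstrained solves plus a first-order
    multiplier update."""
    lam, gamma = 0.0, np.asarray(gamma0, dtype=float)
    for _ in range(outer_iters):
        L_A = lambda x: g(x) - lam * c(x) + 0.5 * mu * c(x) ** 2
        gamma = minimize(L_A, gamma, method="Nelder-Mead").x
        lam -= mu * c(gamma)              # multiplier update
        if abs(c(gamma)) < 1e-8:
            break
    return gamma, lam

# Toy illustration: a quadratic "variance" with a linear "budget" constraint.
g = lambda x: (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2
c = lambda x: x[0] + x[1] - 3.0
gamma_opt, lam = augmented_lagrangian(g, c, gamma0=[0.0, 0.0])
# gamma_opt is close to (3, 0), the constrained minimizer of this toy problem
```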
3 Special cases
In this section, we work through the assumptions for some important special cases and derive the corresponding score equations.
3.1 Cross-sectional case
We will consider several broad classes of univariate or cross-sectional dose-response modeling relationships, corresponding to whether the outcome of interest is continuous or binary and whether a main effects or an effect modifier model is of interest.
For our assumed main effects models, the outcome Y is modeled simply as a function of X and interest lies in the coefficient of X. If Y is continuous, we assume a normal linear regression:
Y = β_0 + β_X X + ε,   (14)
where ε is assumed to be normally distributed with mean 0 and variance σ. This is model (2) described earlier. If Y is binary, we assume a logistic regression model:
logit[Pr(Y = 1)] = β_0 + β_X X.   (15)
For the continuous case, broader classes of models can be generated by taking appropriate transformations. For example, a log-normal model can be easily accommodated by setting Y to be the log of the outcome of interest.
We first consider the case where Y, X, and Z are continuous. We assume X ∼ N(μ_X, σ_XX). For each component Z_j we can assume the classical measurement error model
Z_j = X + ε_j,   (16)
where ε_j ∼ N(0, σ_j) or, more generally, we can allow for the possibility of a bias and a different scale,
Z_j = η_{j0} + X η_{j1} + ε_j.   (17)
To simplify the score equations it is convenient to reparameterize the variances in terms of their precisions, namely τ = 1/σ, τ_j = 1/σ_j, and τ_XX = 1/σ_XX. Setting W equal to a 1 × 2 vector with elements 1, X, the log likelihood can easily be written as
ℓ = −\frac{m + 1}{2} \log(2π) + \frac{1}{2} \log(τ) − \frac{τ}{2}(Y − Wβ)^2 + \sum_{j=1}^{m-1}\left[\frac{1}{2} \log(τ_j) − \frac{τ_j}{2}(Z_j − Wη_j)^2\right] + \frac{1}{2} \log(τ_{XX}) − \frac{τ_{XX}}{2}(X − μ_X)^2,   (18)
where η_j = (0, 1)^T for those components of Z for which we assume the classical measurement error model (16).
The components of the complete data score are thus:
\frac{∂ℓ}{∂β} = τ W^T (Y − Wβ),   (19)
\frac{∂ℓ}{∂τ} = \frac{1}{2}\left[\frac{1}{τ} − (Y − Wβ)^2\right],   (20)
\frac{∂ℓ}{∂η_j} = τ_j W^T (Z_j − Wη_j),   (21)
\frac{∂ℓ}{∂τ_j} = \frac{1}{2}\left[\frac{1}{τ_j} − (Z_j − Wη_j)^2\right],   (22)
\frac{∂ℓ}{∂μ_X} = τ_{XX} (X − μ_X),   (23)
\frac{∂ℓ}{∂τ_{XX}} = \frac{1}{2}\left[\frac{1}{τ_{XX}} − (X − μ_X)^2\right].   (24)
Clearly, (21) applies only to those components of Z for which we do not assume the classical measurement error model (16). The Hessian matrix can now be calculated using formulas (11) and (10). In the case where Y, X, and Z are normally distributed, some analytical simplifications will be possible. However, numerical integration using Gaussian quadrature also works very well and allows for generalizations to non-normal settings.
We are also interested in more general models than (14), for example an effect modifier model where interest lies in assessing an interaction between the exposure of interest X and another variable, for example, a measure of socioeconomic status. This setting is easily accommodated through appropriate specification of W. For example, suppose that W is a 1 × 4 vector with elements 1, X, M, XM, where M is an effect modifier. The form of the score equations for the distributions of Y and Z remains unchanged. We only need to add the score equations for the conditional distribution of M given X.
Now we consider the binary cases. If X is binary, the contribution of the distribution of X to the log likelihood (18) becomes:
μ_X^X (1 − μ_X)^{1−X},   (25)
where μ_X now corresponds to the prevalence of the binary exposure X in the population. Equations 23 and 24 are replaced by one equation:
\frac{∂ℓ}{∂μ_X} = \frac{X}{μ_X} − \frac{1 − X}{1 − μ_X}.   (26)
If Y follows a logistic model, the contribution of the conditional distribution of Y given X to the log likelihood becomes:
μ_Y^Y (1 − μ_Y)^{1−Y},   (27)
where μ_Y = logit^{-1}(Wβ) corresponds to the probability Pr(Y = 1) given W. The score for β is similar to (19), except that there is no longer any scale parameter τ. Equations 19 and 20 are replaced by
\frac{∂ℓ}{∂β} = W^T (Y − μ_Y).   (28)
Similarly, if Z_j follows a logistic model, Eqs. 21 and 22 are replaced by
\frac{∂ℓ}{∂η_j} = W^T (Z_j − μ_{Z_j}),   (29)
where μ_{Z_j} = logit^{-1}(Wη_j).
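Pulling the continuous-case pieces of this section together, a minimal sketch of the complete data log likelihood (18) for a single subject might look as follows; all parameter values in the example call are hypothetical, and eta[j] = (0, 1) encodes the classical measurement error model (16).

```python
import numpy as np

def complete_data_loglik(y, x, z, beta, tau, eta, tau_z, mu_x, tau_xx):
    """Complete data log likelihood (18) for one subject, continuous case.

    z, eta and tau_z are lists over the m-1 surrogates; precisions follow
    the reparameterization tau = 1/sigma used in the text."""
    w = np.array([1.0, x])                 # W = (1, X)
    m = len(z) + 1                         # surrogates occupy stages 1..m-1
    ll = -0.5 * (m + 1) * np.log(2 * np.pi)
    ll += 0.5 * np.log(tau) - 0.5 * tau * (y - w @ beta) ** 2
    for zj, eta_j, tau_j in zip(z, eta, tau_z):
        ll += 0.5 * np.log(tau_j) - 0.5 * tau_j * (zj - w @ np.asarray(eta_j)) ** 2
    ll += 0.5 * np.log(tau_xx) - 0.5 * tau_xx * (x - mu_x) ** 2
    return ll

# Hypothetical values with one surrogate following the classical model (16).
print(complete_data_loglik(y=1.2, x=0.4, z=[0.7], beta=np.array([0.0, 1.0]),
                           tau=1.0, eta=[(0.0, 1.0)], tau_z=[2.0],
                           mu_x=0.0, tau_xx=1.0))
```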
3.2 Longitudinal case
As in the univariate setting, we consider both the continuous and binary case. In the interest of space, we consider only a very simple scenario where the covariate of interest, X, is measured only at baseline and does not change over time. With this formulation, the optimal design considerations developed for the cross-sectional setting apply fairly readily. Extensions to more complicated settings where covariates change over time are possible, but beyond the scope of the present paper. The following sections develop the score contributions that are needed in order to compute the expected Hessian given at Eq. 11.
Suppose Y_j denotes the outcome of interest measured on an arbitrary child at follow-up time t_j, j = 1,…, r. Let age_j denote the age of the child at the time of the jth measurement and, as before, let X be an exposure of interest (e.g. pesticide exposure) measured at baseline. Let age represent the vector of ages at which the child is observed. We assume that the set of observation times is fixed in advance and is the same for all children. This means that age is a fixed vector that does not need to be modeled. This assumption can be easily relaxed if desired.
If the outcome is a continuous random variable, we assume that interest lies in fitting the longitudinal growth curve model,
Y_j = β_0 + β_X X + β_2 age_j + β_3 X age_j + b_0 + b_1 age_j + ε_j,   (30)
where ε_j are independent random error terms assumed to be distributed as N(0, σ), and (b_0, b_1) are random effects assumed to be jointly distributed as N(0, Σ), with Σ indicating a 2 × 2 covariance matrix. Model (30) implies that the mean outcome increases linearly with age, but that the rate of change is modified by exposure. The model also implies that each child has their own individual intercept and slope. It is convenient to introduce some additional notation defining the vector and matrix versions of the observable data. Specifically, let Y denote the r × 1 vector of outcomes, W an r × p design matrix for the fixed effects, and V an r × 2 design matrix for the random effects. For model (30) the jth row of W has elements 1, X, age_j and X age_j, while the jth row of V has elements 1 and age_j. Model (30) can be expressed using the standard general formulation of a mixed model (Diggle and Lophaven 2006):
Y = Wβ + Vb + ε,   (31)
where ε ∼ N(0, σI), I being the r × r identity matrix. After integrating out the random effects, we can write the distribution of Y conditional on the fixed effects W as
Y ∼ N(Wβ, Ω),   (32)
where
Ω = σI + V Σ V^T.   (33)
As in the cross-sectional case, to obtain the simplest form of the score it is convenient to work with the precision matrix, defined as the inverse of the covariance matrix (33), and to choose a suitable parameterization. In particular, defining the change of coordinates
A = −[V^T V + σ Σ^{-1}]^{-1},   τ = 1/σ,   (34)
we can rewrite the precision matrix as
Ω^{-1} = τ T,   (35)
where
T = I + V A V^T.   (36)
Noticing that the map (σ, Σ) → (τ, A) defined in (34) is a bijection, we can parameterize the precision matrix (35) as a function of τ and the independent elements of A. The contribution to the log-likelihood based on the conditional distribution of Y given W is then easily written, up to an additive constant, as
ℓ = \frac{r}{2} \log(τ) + \frac{1}{2} \log(|T|) − \frac{τ}{2} (Y − Wβ)^T T (Y − Wβ).   (37)
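A quick numerical check of the parameterization (33)–(36) can be sketched as follows: under hypothetical follow-up ages and variance components, τT should reproduce the inverse of the marginal covariance Ω.

```python
import numpy as np

# Hypothetical follow-up ages and variance components.
age = np.array([0.5, 1.0, 2.0, 3.0])
r = len(age)
V = np.column_stack([np.ones(r), age])        # random-effects design (1, age_j)
sigma = 0.6                                   # error variance
Sigma = np.array([[0.8, 0.1], [0.1, 0.3]])    # random-effects covariance

Omega = sigma * np.eye(r) + V @ Sigma @ V.T   # marginal covariance (33)

tau = 1.0 / sigma                             # precision of the error term
A = -np.linalg.inv(V.T @ V + sigma * np.linalg.inv(Sigma))   # (34)
T = np.eye(r) + V @ A @ V.T                   # (36)

# tau * T should equal the inverse of Omega, as in (35)
print(np.allclose(tau * T, np.linalg.inv(Omega)))            # True
```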
We use the vector differential calculus of Wand (2002) to derive the following score equations:
\frac{∂ℓ}{∂β} = τ W^T T (Y − Wβ),   (38)
\frac{∂ℓ}{∂τ} = \frac{1}{2}\left[\frac{r}{τ} − (Y − Wβ)^T T (Y − Wβ)\right],   (39)
\frac{∂ℓ}{∂a_{ij}} = \frac{1}{2}\left[\operatorname{tr}\left(T^{-1} \frac{∂T}{∂a_{ij}}\right) − τ (Y − Wβ)^T \frac{∂T}{∂a_{ij}} (Y − Wβ)\right],   i, j = 1, 2,  i ≤ j.   (40)
From Eq. 36 it is immediate to see that:
\frac{∂T}{∂a_{11}} = V [1 0; 0 0] V^T = 1 1^T,   \frac{∂T}{∂a_{12}} = V [0 1; 1 0] V^T = 1 age^T + age 1^T,   \frac{∂T}{∂a_{22}} = V [0 0; 0 1] V^T = age age^T,
where 1 represents a constant vector of ones.
When Y is binary, suppose that the observed data satisfy the following longitudinal growth curve model,
logit(μ_{Y_j}) = β_0 + β_X X + β_2 age_j + β_3 X age_j + b_0 + b_1 age_j,   (41)
where μ_{Y_j} corresponds to the probability Pr(Y_j = 1) and, as before, (b_0, b_1) ∼ N(0, Σ) are random effects. Just as for model (30), model (41) implies that the logit of the response probability changes linearly with age and that the rate of change is modified by exposure.
The contribution to the log-likelihood based on the conditional distribution of Y given X is
ℓ = \log \int_b ψ(β, b),   (42)
where
ψ(β, b) = \prod_{j=1}^{r} μ_{Y_j}^{Y_j} (1 − μ_{Y_j})^{1−Y_j} \, f(b)   (43)
is the conditional distribution of the Y_j's given the random effects b, times the density of the b's.
Using the same W and V defined in the continuous outcome case, model (41) can be succinctly rewritten as:
logit(μ_Y) = Wβ + Vb.   (44)
Indicating with S the precision matrix of the random effects, S = Σ^{-1}, the score equations can be written as:
\frac{∂ℓ}{∂β} = \frac{\int_b W^T (Y − μ_Y) ψ(β, b)}{\int_b ψ(β, b)},   (45)
\frac{∂ℓ}{∂s_{ij}} = \frac{1}{2} \, \frac{\int_b \left[\operatorname{tr}\left(S^{-1} \frac{∂S}{∂s_{ij}}\right) − b^T \frac{∂S}{∂s_{ij}} b\right] ψ(β, b)}{\int_b ψ(β, b)},   i, j = 1, 2,  i ≤ j.   (46)
The specific contributions for s_11, s_12 and s_22 are obtained by substituting
\frac{∂S}{∂s_{11}} = [1 0; 0 0],   \frac{∂S}{∂s_{12}} = [0 1; 1 0],   \frac{∂S}{∂s_{22}} = [0 0; 0 1].
Setting up the selection probabilities for the longitudinal case is potentially more complicated than for the cross-sectional case. We consider several strategies: instead of a univariate Y appearing in the outcome-specific designs, we use a univariate function g(Y) of the multivariate vector Y. If the outcome Y is binary, one very simple approach is to put g(Y) = max{Y_1,…, Y_r}. This would correspond to allowing the selection probability to depend on whether or not a subject has experienced at least one event during the course of the study. Another option might be to put g(Y) = Ȳ, so that the selection probability depends on the average response rate during the study.
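The two summary choices just described are straightforward to implement; the response vector in the example below is hypothetical.

```python
import numpy as np

def g_max(y):
    """1 if the subject experienced at least one event during follow-up."""
    return np.max(y)

def g_mean(y):
    """Average response rate over the follow-up visits."""
    return np.mean(y)

y = np.array([0, 0, 1, 0])       # hypothetical binary responses at r = 4 visits
print(g_max(y), g_mean(y))       # 1, 0.25
# Either summary can be plugged into h_0(.) in the sampling model (4),
# e.g. logit(pi_2) = gamma_20 + gamma_21 * g_max(Y).
```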
4 Developing computational software
Having established our basic modeling framework, the next goal was to develop interactive software that users could apply in order to determine an optimal design for their setting.
The first step was to allow for the maximum flexibility in building the likelihood and the sampling probabilities. Following formula (6), we see that the likelihood is expressed as a product of conditional probabilities, each of them associated with a sampling stage. This feature allows specification of the model hierarchically, that is, the user is asked to define the distributions of interest stage by stage. Notice that the presence of an effect modifier and/or a longitudinal Y increases the number of variables in Eq. 6, but does not change its form. The distributions forming the likelihood can be either continuous (normal or log-normal) or binary. Furthermore, for continuous surrogate variables there is the option of choosing a measurement error model (it decreases the number of parameters to be estimated). Once the likelihood is specified, the sampling probabilities are implemented by specifying the zeros in the γ matrix and by specifying which π_j's are equal to one. As described in Sect. 2, setting an element γ_{(j+1)k} equal to zero means no sampling dependency between variables z_j and z_k (the first column of the γ matrix corresponds to random sampling and cannot be set to zero), while setting π_{j+1} equal to 1 forces all the individuals sampled in stage j to be sampled in stage j + 1.
Given the likelihood and the sampling probabilities, the second step was to calculate the Hessian matrix. We first computed the stage-like score densities U_j given by Eq. 10, and then we computed the Hessian given by formula (11). In those equations all the integrals involve the likelihood. Since the likelihood is normal for each continuous variable (log-normal cases are transformed into normal cases), the continuous integrals were solved by using Gauss–Hermite quadrature. For the discrete variables the integrals become sums, and the corresponding computation was straightforward.
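The quadrature step can be sketched as follows: a Gauss–Hermite rule approximating an expectation with respect to a normal latent variable, using twelve nodes, the default adopted by the software. The integrand in the example is a placeholder, not one of the score or likelihood integrands described above.

```python
import numpy as np

def gauss_hermite_expectation(func, mu_x, sigma_xx, n_nodes=12):
    """Approximate E[func(X)] for X ~ N(mu_x, sigma_xx) by Gauss-Hermite
    quadrature; sigma_xx is a variance, matching the paper's notation."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Change of variables: x = mu_x + sqrt(2 * sigma_xx) * t
    x = mu_x + np.sqrt(2.0 * sigma_xx) * nodes
    return np.sum(weights * func(x)) / np.sqrt(np.pi)

# Placeholder integrand: E[X^2] for X ~ N(0, 1) should be 1.
print(gauss_hermite_expectation(lambda x: x ** 2, mu_x=0.0, sigma_xx=1.0))
```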
Regarding the interface to the design software, we needed to provide flexibility for the user in terms of how parameters are expressed. In the context of specifying the relationship between a surrogate and a true exposure variable, for example, rather than specifying the measurement error variance directly, many users will be more comfortable with specifying the correlation between Z_j and X. Suppose the parameter ρ_j characterizes the correlation between Z_j and X. Under the classical measurement error model, ρ_j^2 = σ_XX/(σ_XX + σ_j); hence it follows that the measurement error variance for Z_j can be expressed as σ_XX(1 − ρ_j^2)/ρ_j^2.
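This conversion is a one-liner; the values in the example call correspond to the questionnaire-like surrogate of the application (correlation 0.3 with a unit-variance log exposure) and are otherwise illustrative.

```python
def measurement_error_variance(sigma_xx, rho_j):
    """Error variance of surrogate Z_j implied by its correlation rho_j with X
    under the classical measurement error model (16)."""
    return sigma_xx * (1.0 - rho_j ** 2) / rho_j ** 2

print(measurement_error_variance(sigma_xx=1.0, rho_j=0.3))   # approx 10.1
```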
5 Application
In this section, we describe optimal design considerations for two different settings. For the first setting, we assume that there is a gold standard available, meaning that it is possible, albeit expensive, to measure the exposure of interest, X, in a sub-sample of the study population. Such a setting might apply, for example, when a subset of study participants are subject to aggregate exposure assessment, which involves very detailed biological measurements and assessments such as air, dust, soil and dietary sampling, blood or urine testing, etc. For the second setting, we assume that while there is no true, ideal gold standard exposure measurement that is feasible to implement, some subjects will receive replicate measurements or be assessed with several different exposure instruments, and the analysis will be based on a latent variable approach.
5.1 Gold standard available
This example is motivated by hypothetical planning for an assessment of the association between pesticide exposure and autism for the National Children's Study. Suppose that study investigators are planning to undertake intensive aggregate exposure sampling for a subset of study participants, but will consider using one of two simpler and cheaper assessment tools for the majority of participants (either a simple questionnaire, or a dietary sample). Because autism is a rare disease, we assume that the marginal risk (averaged over all levels of exposure) is 3/1,000, and that it costs $20 on average (across all study participants) to properly diagnose autism. Suppose that the natural log of the true exposure can be assumed to follow a standard normal distribution, that the cost of measuring this true exposure via aggregate exposure sampling is $1,000, and that a simple logistic regression model that predicts the probability of autism as a function of true pesticide exposure yields an odds ratio of 2 for a one-unit positive shift in ln(true exposure). Implementing a simple questionnaire is very cost effective ($10), and has a correlation of 0.3 with the true exposure. Dietary assessment is more expensive ($200), but is more predictive, with a correlation of 0.9 with the true exposure. We applied the design software to determine the optimal designs to ensure 80% power, assuming a two-sided test at significance level .05. Table 1 shows the optimal selection probabilities (in terms of the γ parameters) for three scenarios:
[Table 1. Hypothetical validation sampling designs for assessing the relationship between pesticide exposure and autism. For each of four two-stage designs (Design 1: questionnaire surrogate Z1 with gold standard X; Design 2: dietary sample surrogate Z1 with gold standard X; Design 3: dietary sample surrogate with the exposure-assessment cost partitioned across stages; Design 4: latent variable design with questionnaire Z1 and dietary sample Z2 and no gold standard), the table reports the variables sampled at each stage and their per-stage costs, the fitted sampling equations logit(π_0), logit(π_1), logit(π_2) under random, covariate dependent and outcome dependent sampling, the number of subjects sampled at each stage, and the total cost of the design.]
random sampling, covariate dependent sampling and outcome dependent sampling. Each of the scenarios was applied to a two-stage sampling design, in which the health outcome Y and the surrogate exposure measure Z_1 are assessed jointly in the first stage, and true exposure X is measured in the second stage. Design 1 corresponds to a design in which the simple questionnaire information is used as the surrogate measure of exposure, and Design 2 utilizes the more expensive and more informative dietary sample as the surrogate measure of exposure. The table also shows the number of subjects that need to be sampled at each design stage, as well as the total cost of each design. The results show that only a small number of subjects need to be measured on X in order to obtain good power. For comparison, we also considered the sample size and cost of a study based on doing an analysis solely in a set of subjects for whom the gold standard X is measured. In order to obtain 80% power in this setting, it would be necessary to measure 5,497 subjects, for a total cost of $5,606,940. These results illustrate the potential of careful subsampling to reduce study cost, with covariate dependent sampling providing modest cost savings and outcome dependent sampling providing more substantive cost savings relative to the random sample design.
The simple examples described above assume that the full costs of assessing the surrogate and true measures of exposure occur within their respective stages, and that the costs associated with these measures are not shared. However, there may be cases in which part of the cost of exposure assessment would occur at an earlier stage of the design. For example, when using a covariate or outcome dependent design in which the first stage of the design entails assessment of the dietary component, the collection of the remaining physical samples (air, dust, soil, etc.) to support the aggregate exposure assessment would necessarily occur in the first stage of sampling. These physical samples would be archived for future chemical analysis, with the outcome or covariate dependent sampling equations dictating the probability that these samples are analyzed for each study subject. In addition, the dietary sample is actually a subcomponent of the aggregate exposure measure, and thus the additional cost associated with aggregate exposure assessment (above and beyond what is already assessed in Stage 1 of the design) is $800. Design 3 in Table 1 assumes that $200 of the cost of aggregate exposure assessment is related to physical sample collection for air, dust, and soil, and that the remaining analytical costs for these three samples is $600. Design 3 suggests that when the costs of each measure are appropriately partitioned between the various stages of the design, the cost efficiencies associated with the covariate or outcome dependent sampling schemes may very well disappear.
5.2 No gold standard
We extend the hypothetical example a bit further by exploring a design in which no study participants undergo aggregate exposure assessment. In Table 1, Design 4 corresponds to a two-stage validation sampling design in which the health outcome Y and the questionnaire-based surrogate exposure measure Z_1 are assessed jointly in the first stage, and the dietary-based surrogate exposure measure Z_2 is assessed in the second stage. This approach relies on previous knowledge of the relationship between each surrogate measure and true exposure, and assumes a latent variable model will be used in the subsequent analysis of data from this design. Similar to the previous examples in Table 1, the latent variable design also demonstrates the potential of careful subsampling to reduce study cost.
6 Discussion
This paper has presented a unified framework for optimal design in settings where cost and logistical constraints preclude measuring all the exposure and response variables of interest in an epidemiological study. Our framework builds on the considerable existing literature related to innovative epidemiological design as well as missing data and measurement error problems. Tackling the problem of optimal design in such settings is particularly important, given the increasing trends towards the use of sophisticated biomarkers of exposure and response in epidemiological studies. While there are many articles in the literature that address the problem of statistical analysis in the presence of data that are missing by design, relatively few papers provide clear user-friendly guidelines for design.
Development of efficient and flexible computational software for finding optimal designs is complicated by the potential complexity and diversity of the various modeling situations that need to be considered. Exploiting a number of elegant properties of the normal distribution allowed us to develop software that minimized computational burden. As a result, we were able to produce interactive software that prompts users via a structured interview to specify the anticipated distribution properties of their outcome variable of interest, the true exposure of interest, as well as (theoretically) any number of surrogate exposure measurements. The software allows the user to optimize over a main effects parameter for both longitudinal and univariate settings, as well as an interaction parameter for the univariate setting. In this section, we describe some of the specific computational simplifications that were used.
We have already seen that the likelihood expressions could be simplified by considering the joint distributions of all variables as a product of conditional distributions. Likelihood expressions were further simplified through the critical assumption that, conditional on the true exposure, a surrogate exposure variable provides no additional information regarding the distribution of the outcome variable, Y. Such an assumption is also typically made in the measurement error literature, where it goes under the name of non-differential measurement error. Conditional independence assumptions are also typical in structural equations modeling. Extensions to relax this assumption would be of interest.
The software was developed with several important broad principles in mind. First, the principle of modularity was used throughout. The use of object oriented programming means that all functions and expressions are defined as independently as possible. As a result, a change made in one section of the code has little impact on other parts of the code. Modularity is particularly important in terms of allowing the capacity for later expansions and extensions. An object oriented approach is also very well suited to implementing hierarchical statistical models, with each layer of the hierarchy represented by an instance of an object. The architecture of the software's numerical core includes the following components (a schematic sketch follows the list):
(1) Statistical distribution objects which represent the stage-like statistical distributions defining the statistical model;
(2) Likelihood object which contains a collection of statistical distribution objects and is responsible for the calculation of the likelihood and the score values;
(3) Sampling probabilities object which contains the information about the γ-dependencies and the π-constraints, and is responsible for calculating the sampling probabilities;
(4) Sampling design object which contains a likelihood object, a sampling probabilities object, and design information (cost, number of participants, etc.), and is responsible for the calculation of the objective function; and
(5) Sampling design optimizer object which contains a sampling design object and information about the constraints and the optimization strategy, and is responsible for finding the optimal design.
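The component list above might translate into class skeletons along the following lines; the class and method names are illustrative only and do not reproduce the actual implementation.

```python
class StatisticalDistribution:
    """One stage-specific conditional distribution, e.g. f(Z_j | X)."""
    def logpdf(self, data, theta): ...

class Likelihood:
    """Collection of stage distributions; evaluates L(q; theta) and the score U."""
    def __init__(self, distributions): self.distributions = distributions
    def value(self, q, theta): ...
    def score(self, q, theta): ...

class SamplingProbabilities:
    """Holds the gamma dependencies and pi constraints; returns pi_j and alpha_j."""
    def __init__(self, gamma_zero_pattern, pi_fixed_to_one): ...
    def alphas(self, gamma, q): ...

class SamplingDesign:
    """Combines likelihood, sampling probabilities and cost information to
    evaluate the design objective g(gamma; theta)."""
    def __init__(self, likelihood, probabilities, costs, cohort_size): ...
    def objective(self, gamma, theta): ...

class SamplingDesignOptimizer:
    """Finds the gamma minimizing the objective subject to the budget constraint."""
    def __init__(self, design, constraint): ...
    def optimize(self): ...
```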
Although the objective function needed for optimal design required multiple integrations for each stage, the software accomplishes this quite quickly through the use of Gauss–Hermite quadrature and a certain amount of pre-processing. The Gauss–Hermite quadrature approach to numerical integration is specially adapted to work well in settings where the integrand can be expressed as some function times a normal density function. The software uses a default of twelve quadrature points, yet allows the user to modify this value if increased precision is required, though doing so will reduce the speed of the software. During the optimization process, where the software tries to find the optimal sampling design strategy, the objective function needs to be calculated a great number of times. Each calculation involves several multiple integrals, and it needs to be done for different values of γ. However, many of the steps involved in the integrations do not depend on γ, and this allowed a dramatic improvement in the performance of the optimization process. A certain amount of preprocessing is done before the optimization starts. In particular, values of the likelihood L and stage-like scores U_j are precalculated over the multiple integration lattice, while only the α_j are updated during the optimization.
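In outline, the caching strategy might look like the following; the function names and arguments are hypothetical placeholders standing in for the software's internal routines.

```python
import numpy as np

def precompute_lattice(lattice, likelihood, score, theta):
    """Evaluate L and the stage scores U_j once over the integration lattice.

    lattice    : iterable of quadrature points q (weights are kept separately)
    likelihood : function (q, theta) -> L(q; theta)
    score      : function (q, theta) -> list of stage scores U_j(q; theta)
    """
    L_vals = np.array([likelihood(q, theta) for q in lattice])
    U_vals = [score(q, theta) for q in lattice]
    return L_vals, U_vals

def hessian_for_design(gamma, lattice, weights, L_vals, U_vals, alphas):
    """Quadrature approximation of H in (11); only alphas depends on gamma."""
    p = U_vals[0][0].shape[0]
    H = np.zeros((p, p))
    for w, L, U_list, q in zip(weights, L_vals, U_vals, lattice):
        a = alphas(gamma, q)                     # alpha_j(gamma) at this point
        for a_j, U_j in zip(a, U_list):
            H += w * L * a_j * np.outer(U_j, U_j)
    return H
```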
There are a number of natural extensions that could be considered. For example, although we have considered linear and logistic regression models, extension to other generalized linear models would be straightforward. Also, although the software currently offers only limited functional forms for the effect of covariates on the selection probabilities, extensions to allow for more general h functions would be of value, and relatively straightforward. There are a number of extensions that could be developed in the longitudinal setting. The most obvious one is to allow for time-varying covariates. At present, the software handles only the case where the covariates are fixed at baseline. It would also be of value to consider more complex selection processes. At present, the software allows the selection probability to depend on either the maximum of the Y observations observed on each individual, or the mean. In the case of a binary outcome, having the selection probability determined by the maximum results in a longitudinal version of a case-control study wherein subjects have a higher chance of being selected if they have had at least one event.
It would also be of value to develop design guidelines for the setting where the outcome of interest Y is not one that was planned to be measured on the full cohort.
For example, an investigator may want to initiate a sub-study that involves measurement of a more specialized outcome, such as an MRI scan, that is too expensive to measure on the entire cohort. In some settings, there may be an inexpensive surrogate outcome that can be measured on the entire cohort and calibrated against the more expensive gold standard that is assessed in a smaller subset. In a study of childhood physical activity, for example, activity diaries may be used to assess activity levels of all children, but a subset may be more intensively assessed using accelerometers.
Our approach has been based on maximum likelihood. Thus, our optimal design guidelines are implicitly conditioned on knowing the true distributions of all random variables of interest. When these assumptions are violated, it is likely that the resulting designs are suboptimal, though the extent to which this is true is unclear.
In practice, there are a variety of other approaches that are used, including weighted estimating equations to handle missing data (Robins et al. 1995). A Bayesian formulation may also be used to integrate any uncertainty in the values of the parameters defining the statistical model. Extensions to consider optimality for estimation based on these and other approaches would be useful.
Appendix A: Multistage formula for the Hessian
The multistage Hessian H is defined as
H = −\sum_{j=0}^{m} \int_{q̄_j} α_j L_j ∇∇^T \ln L_j.   (A-1)
Straightforward algebra yields
H = −\sum_{j=0}^{m} \int_{q̄_j} α_j \left[\left(L_j ∇∇^T \ln L_j + (∇L_j)(∇^T \ln L_j)\right) − L_j U_j U_j^T\right]
  = −\sum_{j=0}^{m} \int_{q̄_j} α_j \left[∇∇^T L_j − L_j U_j U_j^T\right]
  = −∇∇^T \sum_{j=0}^{m} \int_{q̄_j} L_j α_j + \sum_{j=0}^{m} \int_{q̄_j} L_j α_j U_j U_j^T.   (A-2)
From the definition of L_j = \int_{q̃_j} L, and since α_j and U_j do not depend on q̃_j, for each j = 0,…, m we can write
\int_{q̄_j} L_j α_j = \int_q L α_j,   (A-3)
\int_{q̄_j} L_j α_j U_j U_j^T = \int_q L α_j U_j U_j^T.   (A-4)
Substituting in (A-2) we have
H = −∇∇^T \sum_{j=0}^{m} \int_q L α_j + \sum_{j=0}^{m} \int_q L α_j U_j U_j^T
  = −∇∇^T \int_q L \sum_{j=0}^{m} α_j + \int_q L \sum_{j=0}^{m} α_j U_j U_j^T
  = 0 + \int_q L \sum_{j=0}^{m} α_j U_j U_j^T
  = \int_q L \sum_{j=0}^{m} α_j U_j U_j^T,   (A-5)
where the first term vanishes because \int_q L \sum_{j=0}^{m} α_j does not depend on θ. This is formula (11).
References
Breslow N (1996) Statistics in epidemiology. J Am Stat Assoc 91:14–28
Breslow N, Cain K (1988) Logistic regression for 2-stage case-control data. Biometrika 75:11–20
Breslow N, Chatterjee N (1999) Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. Appl Stat 48:457–468
Breslow N, McNeney B, Wellner J (2003) Large sample theory for semiparametric regression models with two-phase, outcome dependent sampling. Ann Stat 31:1110–1139
Breslow N, Robins J, Wellner J (2000) On the semi-parametric efficiency of logistic regression under case-control sampling. Bernoulli 5:447–455
Carroll R, Ruppert D, Stefanski L (1995a) Measurement error in nonlinear models. Chapman and Hall
Carroll R, Wang S, Wang C (1995b) Prospective analysis of logistic case-control studies. J Am Stat Assoc 90:157–159
Chatterjee N, Carroll R (2005) Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92:399–418
Clayton D, Spiegelhalter D, Dunn G, Pickles A (1998) Analysis of longitudinal binary data from multiphase sampling. J Roy Stat Soc Series B (Stat Methodol) 60:71–87
Cornfield J (1951) A method of estimating comparative rates from clinical data: applications to cancer of lung, breast and cervix. J NCI 11:1269–1275
Diggle P, Lophaven S (2006) Bayesian geostatistical design. Scand J Stat 33(1):53–64
Duncan G, Kalton G (1987) Issues of design and analysis of surveys across time. Int Stat Rev 55:97–117
Harezlak J, Ryan L, Giedd J, Lange N (2005) Individual and population penalized regression splines for accelerated longitudinal designs. Biometrics 61:1037–1048
Helms R (1992) Intentionally incomplete longitudinal designs: I. Methodology and comparison of some full span designs. Stat Med 11:1889–1913
Horton N, Laird N (1999) Maximum likelihood analysis of generalized linear models with missing covariates. Stat Meth Med Res 8:37
Hu X, Lawless J (1997) Pseudolikelihood estimation in a class of problems with response-related missing covariates. Canadian J Stat 25:125–142
Ibrahim J (1990) Incomplete data in generalized linear models. J Am Stat Assoc 85:765–769
Langholz B, Goldstein L (2001) Conditional logistic analysis of case-control studies with complex sampling. Biostatistics 2:63–84
Lawless J, Kalbfleisch J, Wild C (1999) Semiparametric methods for response-selective and missing data problems in regression. J Royal Stat Soc Series B 61:413–438
Louis T (1982) Finding the observed information matrix when using the EM algorithm. J Royal Stat Soc Series B 44:226–233
Meilijson I (1989) A fast improvement to the EM algorithm on its own terms. J Royal Stat Soc Series B 51:127–138
Nocedal J, Wright SJ (2000) Numerical optimization. Springer Series in Operations Research. Springer-Verlag, New York
Reilly M, Pepe M (1995) A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82:299–314
Robins J, Rotnitzky A, Zhao L (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 90:106–121
Sanchez B, Budtz-Jørgensen E, Ryan L, Hu H (2005) Structural equation models: a review with applications to environmental epidemiology. J Am Stat Assoc 100:1443–1455
Scott A, Wild C (2001a) Case-control studies with complex sampling. J Roy Stat Soc Series C 50(Part 3):389–401
Scott A, Wild C (2001b) Maximum likelihood for generalised case-control studies. J Stat Plan Infer 96:3–27
Spiegelman D, Gray R (1991a) Cost-efficient study designs for binary response data with Gaussian covariate measurement error. Biometrics 47:851–869
Spiegelman D, Gray R (1991b) The design of cohort studies in which relative risks are corrected for exposure measurement error. Am J Epidemiol 134:736–737
Spinka C, Carroll R, Chatterjee N (2005) Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genet Epidemiol 29:108–127
Verbeke G, Lesaffre E (1999) The effect of drop-out on the efficiency of longitudinal experiments. Appl Stat 48:363–375
Verbeke G, Molenberghs G (2000) Linear mixed models for longitudinal data. Springer
Wacholder S, Weinberg C (1994) Flexible maximum likelihood methods for assessing joint effects in case-control studies with complex sampling. Biometrics 50:350–357
Wand M (2002) Vector differential calculus in statistics. Am Stat 56:55–62
Zhou H, Weaver M (2001) Outcome dependent selection models. Encyclopedia Environ 3:1499–1502
Zhou H, Weaver M, Qin J, Wang M (2002a) A semiparametric empirical likelihood method for data from an outcome-dependent sampling design with a continuous outcome. Biometrics 58:413–421
Zhou W, Liu G, Thurston S, Xu L, Miller D, Wain J, Lynch T, Su L, Christiani D (2002b) Genetic polymorphisms of N-acetyltransferase-2 and microsomal epoxide hydrolase and cumulative cigarette smoking in lung cancer. Cancer Epidemiol Biomark Prevention 11:15–21