1. Introduction
This work proposes a general variable selection method for length-biased and interval-censored failure time data under the classical proportional hazards (PH) model. Interval-censored data arise when each failure time of interest cannot be measured exactly but is only known to lie in a certain time interval formed by periodical follow-ups [1]. Such data are frequently encountered in many scientific studies, including clinical trials and epidemiological surveys, and their regression analysis has been discussed extensively in the literature; see [2,3,4,5,6,7,8] for details. Specifically, Zeng et al. [4], Wang et al. [6] and Zeng et al. [7] investigated inference procedures for the additive hazards, PH and transformation models, respectively.
In addition to interval censoring, left truncation is also frequently encountered in prospective cohort studies, inducing non-randomly selected samples from the target population. A typical example of left truncation occurs in the PLCO study where individuals with any of the PLCO cancers at the onset of the study were not enrolled [6,9]. In particular, when the truncation times follow the uniform distribution (also known as the length-biased or stationarity assumption), the left-truncated data reduce to the length-biased data discussed by many authors, including but not limited to Wang [10], Shen et al. [11] and Ning et al. [12].
The analysis of length-biased data under right censoring has been investigated extensively in the literature [11,13,14,15,16]. To name a few examples, Shen et al. [11] presented unbiased estimating equation approaches for the transformation and accelerated failure time models. Qin and Shen [13] developed an inverse weighted estimating equation approach for the PH model. Qin et al. [14] developed new expectation-maximization (EM) algorithms to estimate the survival function of the failure time. For the length-biased and interval-censored data, Gao and Chan [15] developed an EM algorithm for the PH model via two-stage data augmentation. Further, Shen et al. [16] considered the mixture PH model with a nonsusceptible or cured fraction.
In many practical applications, one may collect a large number of candidate covariates, but in general, only a few covariates are useful for modeling the failure time of interest. In such cases, penalized variable selection provides a useful tool to eliminate irrelevant variables and further enhance the estimation accuracy. Popular penalty functions include LASSO [17], SCAD [18], adaptive LASSO (ALASSO) [19], SICA [20], SELO [21], MCP [22] and BAR [23,24]. In particular, Fan et al. [25] provided a comprehensive review of variable selection methods and the corresponding algorithms. In recent years, machine learning-based methods have also gained considerable attention due to their ability to identify relevant features. To name a few examples, Garavand et al. [26] used clinical examination features and compared different machine learning algorithms in developing a model for the early diagnosis of coronary artery disease. Hosseini et al. [27] used blood microscopic images and a convolutional neural network algorithm for detecting and classifying B-acute lymphoblastic leukemia. Garavand et al. [28] conducted a systematic review of advanced techniques to facilitate the rapid diagnosis of coronary artery disease. Ghaderzadeh and Aria [29] conducted a systematic review of Artificial Intelligence techniques for COVID-19 detection.
Regarding the left-truncated failure time data, a number of variable selection methods have been proposed. In particular, Chen [30] considered the right-censored data and developed a variable selection method for the additive hazards model with covariate measurement errors. He et al. [31] also considered the right-censored data and performed variable selection with penalized estimating equations for the accelerated failure time model. Li et al. [32] developed a conditional likelihood-based variable selection method for left-truncated and interval-censored data under the PH model. However, it is worth noting that the work of Li et al. [32] only involved the ALASSO penalty and their method can be anticipated to lose some efficiency due to ignoring the distribution information of the truncation times.
In this paper, we offer an efficient penalized likelihood method to achieve variable selection in the PH model with length-biased and interval-censored data. Compared with the traditional conditional likelihood method in Li et al. [32], the proposed method yields an efficiency gain via fully taking into account the distribution information of the truncation times. In particular, to optimize the penalized likelihood function with an intractable form, we develop a penalized EM algorithm by introducing pseudo-left-truncated data and Poisson random variables. The proposed method is easy to implement and computationally stable and has desirable advantages over the variable selection method based on the penalized conditional likelihood [32]. An application to a real data set arising from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial demonstrates the practical usefulness of the proposed method.
The PLCO cancer screening trial is a large-scale multicenter trial conducted to screen the PLCO cancers and investigate cancer-related mortality. To date, motivated by the rich data structure in the PLCO database, various statistical methods have already been proposed in the literature. To name a few examples, Wang et al. [6] developed an EM algorithm to estimate the spline-based PH model with interval-censored data. Sun et al. [33] considered variable selection in a semiparametric nonmixture cure model with interval-censored data. Li and Peng [34] investigated instrumental variable estimation of complier causal treatment effect with interval-censored data. Withana Gamage et al. [35] considered the estimation of the PH model with left-truncated and arbitrarily interval-censored data.
The remainder of this paper is organized as follows. In Section 2, we first introduce the notation, assumption and corresponding likelihood. Section 3 presents the proposed penalized EM algorithm, and Section 4 establishes the oracle property of the proposed estimators. In Section 5, a simulation study is conducted to assess the variable selection performance and the estimation accuracy of the proposed method, followed by an application in Section 6. Some discussions and conclusions are given in Section 7. Section 8 provides several potential future research directions.
2. Notation, Model and Penalized Likelihood
For the target population, let $\tilde{T}$, $\tilde{A}$ and $\tilde{Z}$ denote the failure time of interest (e.g., the time to the onset of the failure event), the truncation time (e.g., the time to the study enrollment) and the p-dimensional vector of covariates, respectively. Given $\tilde{Z}$, the PH model specifies that the conditional cumulative hazard function of $\tilde{T}$ takes the form

$$\Lambda(t \mid \tilde{Z}) = \Lambda_0(t) \exp(\tilde{Z}^\top \beta), \qquad (1)$$

where $\beta$ is the p-dimensional vector of the unknown regression coefficients and $\Lambda_0(\cdot)$ is an unknown increasing cumulative baseline hazard function. Let d denote the number of nonzero components in $\beta$, and let $\beta_{(1)}$ and $\beta_{(2)}$ denote the vectors of the d nonzero components and the remaining zero components, respectively.

Under the left-truncation scheme, only individuals with $\tilde{T} \ge \tilde{A}$ are enrolled in the study, and the failure time, truncation time and covariate vector of an enrolled individual are denoted by T, A and Z, respectively. Then, we know that $(T, A, Z)$ has the same joint distribution as $(\tilde{T}, \tilde{A}, \tilde{Z})$ given $\tilde{T} \ge \tilde{A}$ [36]. As mentioned above, if the truncation time is further assumed to follow the uniform distribution on $[0, \tau]$, where $[0, \tau]$ is the support of $\tilde{A}$ (also known as the length-biased or stationarity assumption), we have the length-biased sampling mechanism [10,14]. Let $f(t \mid Z)$ and $S(t \mid Z)$ denote the density and survival functions of $\tilde{T}$ given $\tilde{Z} = Z$, respectively. Under the assumption that $\tilde{A}$ is independent of $\tilde{T}$ given $\tilde{Z}$, the joint density function of $(T, A)$ given Z evaluated at $(t, a)$ is

$$p(t, a \mid Z) = \frac{f(t \mid Z)\, g(a \mid Z)}{\Pr(\tilde{T} \ge \tilde{A} \mid \tilde{Z} = Z)}, \qquad t \ge a > 0, \qquad (2)$$
where $g(a \mid Z)$ denotes the density function of $\tilde{A}$ given $\tilde{Z} = Z$ at time a and equals $1/\tau$ under the length-biased sampling scheme.

Consider a failure time study that recruits n subjects, where each failure time suffers from interval censoring due to the periodical examinations for the occurrence of the failure event. For $i = 1, \ldots, n$, denote by $T_i$, $A_i$ and $Z_i$ the failure time, truncation time and covariate vector of the ith subject in the study, respectively. We assume that there exists a sequence of examination times $A_i = U_{i0} < U_{i1} < \cdots < U_{iK_i}$ for subject i, where $K_i$ is a random positive integer, and define $U_{i,K_i+1} = \infty$. Let $L_i$ and $R_i$ denote the endpoints of the smallest interval $(L_i, R_i]$ formed by the examination times that brackets $T_i$; that is, $L_i = \max\{U_{ik} : U_{ik} < T_i\}$ and $R_i = \min\{U_{ik} : U_{ik} \ge T_i\}$. Clearly, $T_i$ is left-censored if $L_i = A_i$, and $T_i$ is right-censored if $R_i = \infty$. Then, the observed data consist of $\{O_i = (A_i, L_i, R_i, Z_i) : i = 1, \ldots, n\}$. Under the assumption that the examination times are independent of the failure and truncation times given the covariates, the likelihood function based on the observed data can be written as

$$L_n(\beta, \Lambda_0) = \prod_{i=1}^{n} \frac{S(L_i \mid Z_i) - S(R_i \mid Z_i)}{\mu(Z_i)}, \qquad (3)$$

where $\mu(Z_i) = \int_0^{\tau} S(t \mid Z_i)\, dt$. Essentially, the likelihood (3) is the product of the marginal likelihood and the conditional likelihood, that is, $L_n = L_M \times L_C$, where

$$L_M = \prod_{i=1}^{n} \frac{S(A_i \mid Z_i)}{\mu(Z_i)} \quad \text{and} \quad L_C = \prod_{i=1}^{n} \frac{S(L_i \mid Z_i) - S(R_i \mid Z_i)}{S(A_i \mid Z_i)}.$$

In the above, $L_M$ is the marginal likelihood of the $A_i$'s given the $Z_i$'s, and $L_C$ is the conditional likelihood of the $(L_i, R_i)$'s given the $A_i$'s. Notably, the commonly used conditional likelihood method only utilizes $L_C$ for inference, which can be anticipated to lose some estimation efficiency because $L_M$ also involves the parameters in model (1).
For the nuisance function $\Lambda_0$, we propose to approximate it with a step function that has non-negative jumps at the unique examination times. Specifically, let $0 < t_1 < \cdots < t_K$ denote the ordered unique values of $\{L_i, R_i : R_i < \infty,\ i = 1, \ldots, n\}$, where K is an integer determined by the observed data. For $k = 1, \ldots, K$, denote by $\lambda_k$ the non-negative jump size of $\Lambda_0$ at $t_k$. Then, we have $\Lambda_0(t) = \sum_{k: t_k \le t} \lambda_k$, and the likelihood function (3) can be rewritten as

$$L_n(\beta, \lambda) = \prod_{i=1}^{n} \frac{\exp\{-\Lambda_0(L_i)\, e^{Z_i^\top \beta}\} - \exp\{-\Lambda_0(R_i)\, e^{Z_i^\top \beta}\}}{\mu(Z_i)}, \qquad (4)$$

where $\lambda = (\lambda_1, \ldots, \lambda_K)^\top$ and $\mu(Z_i) = \int_0^{\tau} \exp\{-\Lambda_0(t)\, e^{Z_i^\top \beta}\}\, dt$.

To accomplish variable selection and estimate the nonzero parameters simultaneously, we propose to maximize the following penalized log-likelihood:

$$\ell_P(\beta, \lambda) = \log L_n(\beta, \lambda) - n \sum_{j=1}^{p} p_{\gamma}(|\beta_j|), \qquad (5)$$

where $p_{\gamma}(\cdot)$ denotes a penalty function that depends on the tuning parameter $\gamma$. In what follows, we provide a general maximization procedure for (5) under various commonly adopted penalty functions, such as LASSO, ALASSO, SCAD, SELO, SICA, MCP and BAR [17,18,19,20,21,22,23,24]. Because the penalized log-likelihood (5) has an intractable form, performing direct maximization with existing software is extremely difficult and unstable; this is the case even without the penalty term, as shown in Gao and Chan [15]. In the next section, we propose a reliable and stable penalized EM algorithm to overcome this computational challenge.

3. Estimation Procedure
The proposed penalized EM algorithm involves two layers of data augmentation, which aim at simplifying the form of (4) and obtaining a tractable objective function. In the first stage of data augmentation, for the ith subject, we introduce a set of independent pseudo-truncated failure times, also referred to as "ghost data" [37], whose random number follows a negative binomial distribution with parameters determined by the model.

Given the number of ghost data for subject i, each pseudo-truncated failure time falls into one of the intervals determined by the jump points, so that the counts jointly follow a multinomial distribution whose cell probabilities are functions of the model parameters. In the above, $\tau$ is the finite upper bound of the support of $\tilde{A}$, and its value can be specified from the observed data in practice [14]. After deleting some constants that are irrelevant to the parameters to be estimated, the augmented likelihood function based on the observed and pseudo-data is
(6)
In the second stage, for the ith subject, we introduce independent latent Poisson random variables, one attached to each jump point $t_k$, with means proportional to the jump sizes $\lambda_k$. Then, the likelihood function (6) can be re-expressed in terms of these Poisson variables, and by treating the latent variables as observable, we obtain a complete-data likelihood of a simple product form, in which the latent counts are constrained to be consistent with the observed censoring interval $(L_i, R_i]$ of each subject.

Let $\theta = (\beta^\top, \lambda^\top)^\top$, and let $\theta^{(m)}$ be the update of $\theta$ at the mth iteration of the algorithm. Based on the complete-data likelihood, we can present the expectation step (E-step) and maximization step (M-step) of the proposed algorithm. In the E-step, we calculate the conditional expectations of the latent variables given the observed data and $\theta^{(m)}$. This step yields
(7)
In particular, at the mth iteration of the algorithm, the conditional expectations in the E-step have closed-form expressions; for notational simplicity, we omit their conditioning arguments, including the observed data and the current estimates of the parameters. In the M-step of the algorithm, by setting the derivative of the expected complete-data log-likelihood with respect to each $\lambda_k$ to zero, we have a closed-form expression for the update of $\lambda_k$, which is given by
(8)
Next, by plugging (8) into (7), we obtain an objective function that only involves the unknown parameter $\beta$. To obtain the sparse estimator of $\beta$, we propose to minimize the following penalized objective function:

(9)
For LASSO and ALASSO, the modified shooting algorithm given in Zhang and Lu [38] and others can be adopted to minimize (9). For BAR, a closed-form solution for the update of $\beta$ is available [39]. For the other penalties, after applying a local linear approximation to the penalty function [40], one can also adopt the modified shooting algorithm to minimize (9).
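To make this step concrete, the sketch below combines a local linear approximation (LLA) of the penalty with coordinate-wise "shooting" updates. It uses a least-squares surrogate in place of the paper's objective (9), the conventional SCAD constant a = 3.7, and illustrative names (`lla_shooting`, `scad_deriv`); none of these are the paper's actual implementation.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty at |t| (a = 3.7 is the usual default)."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def soft_threshold(z, w):
    """Soft-thresholding: argmin_b 0.5*(b - z)^2 + w*|b|, up to scaling."""
    return np.sign(z) * np.maximum(np.abs(z) - w, 0.0)

def lla_shooting(X, y, lam, n_sweeps=50, tol=1e-6):
    """LLA + shooting on a least-squares surrogate: each sweep freezes
    SCAD-derivative weights at the current iterate (the LLA step), then
    updates one coefficient at a time by soft-thresholding."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)           # per-coordinate curvature
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        w = n * scad_deriv(beta, lam)       # LLA weights for this sweep
        r = y - X @ beta
        for j in range(p):
            r += X[:, j] * beta[j]          # remove coordinate j's fit
            zj = X[:, j] @ r                # partial residual correlation
            beta[j] = soft_threshold(zj, w[j]) / col_ss[j]
            r -= X[:, j] * beta[j]
        if np.abs(beta - beta_old).sum() < tol:
            break
    return beta
```

Because the SCAD derivative vanishes for large coefficients, the LLA weights leave strong signals essentially unpenalized after the first sweep, which is what drives the oracle-type behavior discussed later.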
In summary, for a given tuning parameter $\gamma$ and initial estimator $\theta^{(0)}$, we repeat the E-step and M-step until the convergence criterion is satisfied, rendering the sparse estimators of the regression parameters. It is worth pointing out that the proposed algorithm is insensitive to the choice of the initial value $\theta^{(0)}$. In practice, one can simply set the initial value of each component of $\beta$ to 0 and the initial value of each $\lambda_k$ to a common small positive value, for $k = 1, \ldots, K$. The proposed algorithm is declared to have converged when the sum of the absolute differences in the estimates between two successive iterations is less than a small prespecified positive number.
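The outer loop just described — alternate E- and M-steps until the sum of absolute parameter changes falls below a small threshold — can be sketched as a generic driver. The toy E/M steps in the illustration (imputing one missing observation by the current mean) are purely illustrative, not the paper's updates.

```python
def run_em(init, e_step, m_step, tol=1e-4, max_iter=1000):
    """Generic EM driver using the stopping rule described above: stop
    when the sum of absolute differences between successive parameter
    vectors is below `tol`."""
    theta = list(init)
    for _ in range(max_iter):
        theta_new = m_step(e_step(theta))
        if sum(abs(a - b) for a, b in zip(theta_new, theta)) < tol:
            return theta_new
        theta = theta_new
    return theta

# Toy illustration (not the paper's E/M steps): estimate a mean when one
# of three observations is missing, imputing it in the E-step.
impute = lambda theta: [1.0, 2.0, theta[0]]
average = lambda xs: [sum(xs) / len(xs)]
theta_hat = run_em([0.0], impute, average)   # converges to 1.5
```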
To select the optimal $\gamma$, we follow Li et al. [39] and others and adopt the BIC criterion, which is defined as

$$\mathrm{BIC}(\gamma) = -2\, \ell_n(\hat{\beta}, \hat{\lambda}) + q \log n,$$

where $\hat{\beta}$ is the final estimator of $\beta$, $\hat{\lambda}$ is the final estimator of $\lambda$, $\ell_n$ denotes the logarithm of (4) and q is the total number of the nonzero estimates in $\hat{\beta}$ and $\hat{\lambda}$. For a given set of candidate values of $\gamma$, the optimal $\gamma$ can be set as the one that yields the smallest BIC.

4. Asymptotic Properties
Without loss of generality, we write $\beta = (\beta_{(1)}^\top, \beta_{(2)}^\top)^\top$, where $\beta_{(1)}$ includes the first d components of $\beta$ that are nonzero and $\beta_{(2)}$ consists of the remaining zero components. Denote the true value of $\beta$ by $\beta_0 = (\beta_{0(1)}^\top, \beta_{0(2)}^\top)^\top$, where $\beta_{0(1)}$ is the true value of $\beta_{(1)}$ and $\beta_{0(2)}$ is the true value of $\beta_{(2)}$. Let $\hat{\beta} = (\hat{\beta}_{(1)}^\top, \hat{\beta}_{(2)}^\top)^\top$ be the estimator of $\beta$ obtained from the method proposed above, where $\hat{\beta}_{(1)}$ denotes the estimate of $\beta_{(1)}$ and $\hat{\beta}_{(2)}$ denotes the estimate of $\beta_{(2)}$. In what follows, we establish the asymptotic properties of $\hat{\beta}$.

For any penalty function $p_{\gamma}(\cdot)$ with tuning parameter $\gamma$, we assume that the suitably rescaled penalty belongs to the function class considered in Lv and Fan [20]. This class is quite general and includes the penalty functions considered in this work. To establish the asymptotic properties of $\hat{\beta}$, we need the following regularity conditions.
(C1). The true regression parameter $\beta_0$ lies in a compact set of $\mathbb{R}^p$, and the true cumulative baseline hazard function $\Lambda_0$ is continuously differentiable and positive over the union of the supports of L and R, with $\Lambda_0(\tau) < \infty$.

(C2). The covariate vector Z is bounded with probability one, and the covariance matrix of Z is positive definite.

(C3). The number of examination times, M, is positive and bounded. Additionally, there exists a positive constant $\eta$ such that any two adjacent examination times are separated by at least $\eta$. Furthermore, there exists a probability measure $\mu^*$ such that the bivariate distribution function of (L, R) conditional on Z is dominated by $\mu^*$, and its Radon–Nikodym derivative is positive and has twice-continuous derivatives with respect to its two time arguments.
Conditions (C1) and (C2) are standard in failure time data analysis [7]. Condition (C3) pertains to the joint distribution of the examination times and ensures that two adjacent examination times are separated by at least a positive constant; otherwise, the data may contain exactly observed failure times, which would require a different theoretical treatment. Conditions (C1)–(C3) are used to establish the root-n consistency of the unpenalized maximum likelihood estimator of the regression vector [7], which is required for handling the penalty term in the penalized likelihood. These conditions also ensure that the log profile likelihood admits a quadratic expansion around the true parameter value [7].
Theorem 1 (root-n consistency). Under conditions (C1) to (C3), if the tuning parameter $\gamma_n$ tends to zero sufficiently fast, then $\|\hat{\beta} - \beta_0\| = O_p(n^{-1/2})$, where $\|\cdot\|$ denotes the Euclidean norm for a given vector.

Theorem 2 (oracle property). Under conditions (C1) to (C3) and appropriate rate conditions on the tuning parameter $\gamma_n$, $\hat{\beta}$ has the following properties:

1. (Sparsity) $\Pr(\hat{\beta}_{(2)} = 0) \to 1$ as $n \to \infty$;

2. (Asymptotic normality) $\sqrt{n}(\hat{\beta}_{(1)} - \beta_{0(1)})$ converges in distribution to a mean-zero normal distribution whose covariance matrix is the inverse of the upper-left $d \times d$ sub-matrix of the efficient Fisher information matrix for β.
Theorem 1 indicates that $\hat{\beta}$ is consistent, and Theorem 2 (i) implies that $\hat{\beta}$ is sparse and has the selection consistency property, that is, $\Pr(\hat{\beta}_{(2)} = 0) \to 1$. Theorem 2 (ii) implies that the estimators of the nonzero regression parameters are semiparametrically efficient. The detailed proofs of Theorems 1 and 2 under the ALASSO penalty are given in Appendix A. For other penalty functions in the considered class, one can prove the above two theorems with analogous techniques, which are omitted in this paper.
5. A Simulation Study
We conducted a simulation study to evaluate the finite-sample performance of the proposed penalized EM algorithm. We first assumed that there exist 10 covariates following the marginal standard normal distribution with a common pairwise correlation. We set the true value of β, denoted by $\beta_0$, to have three nonzero components, taking either large values (large effects) or small values (weak effects), with the remaining components set to zero. The truncation time followed the uniform distribution on $[0, \tau]$ for a fixed τ. The failure time of interest was generated from model (1) with a prespecified baseline cumulative hazard function. Because we considered length-biased sampling, only pairs satisfying $\tilde{T} \ge \tilde{A}$ were kept in the simulated data, denoted as $(T_i, A_i)$.
To construct interval censoring, for subject i, we generated a series of potential examination times following the truncation time $A_i$. Then, $(L_i, R_i]$ was defined as the smallest interval formed by the examination times that brackets $T_i$. On average, we had about 30–44% left-censored observations and 19–30% right-censored ones. We considered some classical penalty functions, including LASSO, ALASSO, SCAD, SELO, SICA, MCP and BAR [17,18,20,21,22,24]. To find the optimal $\gamma$ for each penalty, we considered 20 equally spaced points over an interval $[a, b]$ and selected the one that minimizes the BIC; in particular, b was chosen to guarantee that all the regression parameter estimates were penalized to zero, while a was chosen to ensure that all the covariates were selected. The following results are based on two sample sizes (the larger being n = 400) and 100 replications.
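A minimal sketch of this data-generating scheme is given below. The unit-exponential baseline hazard, τ = 5, the number of examinations and the gap distribution are all assumptions chosen for illustration — the paper's exact specifications were not recoverable from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def gen_length_biased_ic(n, beta, tau=5.0, n_exams=8, gap=0.5):
    """Sketch of the simulation design: uniform truncation times on
    [0, tau], PH failure times with an assumed unit-exponential baseline,
    length-biased acceptance (keep only T >= A), and periodic
    examinations after study entry that bracket T in (L, R]."""
    p = len(beta)
    data = []
    while len(data) < n:
        z = rng.standard_normal(p)
        a = rng.uniform(0.0, tau)                  # truncation time A
        # PH with Lambda_0(t) = t: T is exponential with rate exp(z'beta)
        t = rng.exponential(1.0 / np.exp(z @ beta))
        if t < a:                                  # length-biased sampling:
            continue                               # discard T < A
        exams = a + gap * rng.uniform(0.5, 1.5, n_exams).cumsum()
        l, r = a, np.inf                           # bracket T by the exams
        for u in exams:
            if u < t:
                l = u
            else:
                r = u
                break
        data.append((a, l, r, z))
    return data

sample = gen_length_biased_ic(500, beta=np.array([0.8, -0.8, 0.0]))
right_censored = np.mean([np.isinf(r) for _, _, r, _ in sample])
```

Note that the failure time itself is discarded after the bracket is formed, so the generated records contain only the observable quantities $(A_i, L_i, R_i, Z_i)$.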
To assess the variable selection performance, we calculated the false positive (FP) and true positive (TP) counts, defined as the average number of selected covariates whose true coefficients are zero and the average number of selected covariates whose true coefficients are nonzero, respectively. To measure the estimation accuracy, we reported the median of the mean squared errors (MMSE) and the standard deviation of the mean squared errors (SD), where the mean squared error is defined as $(\hat{\beta} - \beta_0)^\top \Sigma (\hat{\beta} - \beta_0)$ and Σ denotes the population covariance matrix of the covariate vector Z.
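As a small illustration, the helpers below compute this covariance-weighted squared error; the compound-symmetric form of Σ mirrors the stated common pairwise correlation, with the correlation value itself left as a user-supplied assumption since it was not recoverable from the text.

```python
import numpy as np

def compound_sym_sigma(p, rho):
    """Compound-symmetric covariance: unit variances and a common
    pairwise correlation rho (the paper's rho value is assumed)."""
    return rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p)

def mse(beta_hat, beta0, sigma):
    """Covariance-weighted squared error summarized by the MMSE:
    (beta_hat - beta0)' Sigma (beta_hat - beta0)."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta0, dtype=float)
    return float(d @ sigma @ d)
```

The MMSE reported in the tables is then the median of `mse(...)` across the 100 replications, and the SD is the corresponding standard deviation.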
Table 1 and Table 2 present the results obtained by the proposed method under large and weak covariate effects, respectively. In both tables, we also present the results of the oracle estimation and of the analysis method without variable selection. The results in Table 1 and Table 2 show that almost all the penalty functions gave a similar variable selection performance; the only exception is LASSO, which yielded a slightly larger FP. This observation is expected because LASSO often selects more noise variables than the other penalty functions [39]. In addition, one can see from the tables that, except for LASSO, the MMSEs yielded by the other penalty functions were close to those of the oracle estimation and smaller than those of the analysis method without variable selection. As the sample size increased, the variable selection performance and estimation accuracy improved for all the penalty functions.
For comparison, we also include in Table 1 and Table 2 the results obtained by the variable selection method based on the penalized conditional likelihood (PCL). The detailed implementation of the PCL method is given in the Appendix B. Notably, Li et al. [32] also maximized the PCL to conduct variable selection but only considered the ALASSO penalty. It is clear that, compared with the PCL approach, the proposed method yielded smaller MMSEs, implying a more accurate estimation performance. Furthermore, the SDs obtained by the proposed method are smaller than those of the PCL method, because the PCL method ignores the distribution information of the truncation times and thus loses some estimation efficiency.
In this study, we also considered other simulation settings with larger numbers of covariates, up to p = 50. Specifically, we kept three components of $\beta_0$ nonzero, set the remaining components to zero and let the other simulation specifications be the same as above. The simulation results are presented in Table 3 and Table 4 and show similar conclusions as above. In particular, the proposed method with ALASSO, SCAD, SELO, SICA, MCP and BAR yielded much smaller MMSEs than the analysis method without variable selection as p increased. This clearly demonstrates the necessity of conducting variable selection in the presence of a large number of covariates.
6. An Application
6.1. Background and Analysis Methods
We applied the proposed method to a set of real data arising from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial [9,41]. Sponsored by the National Cancer Institute, the PLCO cancer screening trial was initiated in 1993 and recruited participants who had not previously taken part in any other cancer screening trials at ten screening centers nationwide. The recruited participants were aged from 55 to 74. In particular, the participants who were randomly assigned to the screening group received the Prostate-Specific Antigen (PSA) test periodically over 13 years. If abnormally high PSA levels were found, a prostate biopsy was conducted to determine the occurrence status of prostate cancer. In this study, we focused on the prostate cancer screening data in the screening group and aimed at identifying the important risk factors of the development of prostate cancer. This is because prostate cancer in general causes no signs or symptoms in the early stages, but as the disease progresses, it can cause serious complications, such as urination problems and anemia. Therefore, exploring the risk factors of prostate cancer exhibits a pressing need and is also beneficial to conduct early prevention for males. To this end, the failure time of interest was defined as the age at onset of prostate cancer. Because the participants were only examined intermittently, only interval-censored observations could be obtained for the onset of the prostate cancer. In addition, because the study excluded individuals who had already developed prostate cancer at the study recruitment, the age at the onset of prostate cancer suffered from left truncation with the truncation time being the age the individual enrolled in the study.
We considered seven potential risk factors: Race (1 for African American and 0 otherwise), Education (1 for at least college and 0 otherwise), Cancer (1 for having an immediate family member with any PLCO cancer and 0 otherwise), ProsCancer (1 for having an immediate family member with prostate cancer and 0 otherwise), Diabetes (1 for having diabetes and 0 otherwise), Stroke (1 for having had a stroke and 0 otherwise) and Gallblad (1 for having gall bladder stones and 0 otherwise). The sample size was n = 32,897, and the observations were subject to both left and right censoring.
To achieve variable selection, we implemented the proposed method with LASSO, ALASSO, SCAD, SELO, SICA, MCP and BAR, as in the simulation study. To select the optimal $\gamma$ for each penalty, we utilized a two-step method. In the first step, we examined a coarse range of points to roughly identify a narrower interval containing the optimal tuning parameter; here, as in the simulation study, the lower endpoint was selected to ensure that all the covariates were selected, while the upper endpoint was chosen to ensure that all the regression parameter estimates were penalized to zero. Next, we considered 20 equally spaced points within the narrower interval and selected the optimal $\gamma$ that minimizes the BIC. In addition, we employed the BIC to select the best penalty among all the penalties considered for the data, and the PH model with SCAD and MCP yielded the smallest BIC value. To calculate the standard errors, we used the nonparametric bootstrap with 100 bootstrap samples. For comparison, we also considered the variable selection method based on the penalized conditional likelihood (PCL).
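The two-step tuning search can be sketched as follows; `fit_bic` is a hypothetical callable mapping a candidate tuning value to its BIC (in practice, one full run of the penalized EM algorithm), and the coarse-grid size is an assumption.

```python
import numpy as np

def two_step_gamma(fit_bic, a, b, n_coarse=10, n_fine=20):
    """Two-step tuning-parameter search described above: a coarse scan
    over [a, b] locates a promising neighborhood, then a fine grid of
    `n_fine` points inside it is searched for the smallest BIC."""
    coarse = np.linspace(a, b, n_coarse)
    scores = [fit_bic(g) for g in coarse]
    k = int(np.argmin(scores))
    lo = coarse[max(k - 1, 0)]              # bracket the coarse winner
    hi = coarse[min(k + 1, n_coarse - 1)]
    fine = np.linspace(lo, hi, n_fine)
    return min(fine, key=fit_bic)
```

The coarse pass keeps the number of expensive model fits manageable, which matters here since each BIC evaluation requires running the penalized EM algorithm to convergence on more than 30,000 subjects.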
6.2. Results
We summarize in Table 5 the results obtained by the proposed and PCL methods. The results indicate that, except for LASSO, the proposed method with the other penalties identified Race, ProsCancer and Diabetes as significant risk factors for prostate cancer. Specifically, being African American and having an immediate family member with prostate cancer increased the risk of developing prostate cancer, while having diabetes was associated with a lower risk of developing prostate cancer. These findings are in accordance with the conclusions of Meister [42], Pierce [43] and others. In addition, the results in Table 5 show that the PCL method yielded relatively larger standard error estimates than the proposed method. This finding again demonstrates the efficiency gain of the proposed method from taking into account the distribution information of the truncation times in the inference procedure.
7. Discussion and Conclusions
In this article, we considered length-biased and interval-censored data and developed a penalized analysis procedure to choose important variables among a large number of covariates in the PH model. The main contribution of this work is the development of a novel penalized EM algorithm via two-stage data augmentation, which greatly simplifies the penalized nonparametric maximum likelihood estimation. Specifically, by introducing pseudo-truncated data and Poisson random variables, the possibly high-dimensional parameters involved in $\Lambda_0$ have explicit solutions, making the proposed algorithm simple and computationally stable. In contrast to the work of Li et al. [32], which only involved the ALASSO penalty, we proposed to jointly utilize the local linear approximation and the modified shooting algorithm, yielding sparse estimators of the regression parameters under various popular penalty functions. Thus, the proposed method offers flexible options for the data analyst. The numerical results from the simulation study showed the satisfactory performance and desirable advantages of the proposed method in finite samples. Moreover, by legitimately taking into account the distribution information of the truncation times, the proposed method is more efficient than the traditional penalized conditional likelihood approach (e.g., Li et al.'s method [32]).
Notably, the findings of our prostate cancer data analysis may have certain public health implications. Specifically, African Americans and individuals who have immediate family members with prostate cancer constitute specific population groups that may benefit from early prevention efforts (e.g., cancer screening) to reduce the risk of developing prostate cancer.
8. Suggestions for Future Work
Notably, the proposed method only investigated variable selection in the setting where p is smaller than n. In some practical applications, such as gene expression studies, p is usually much larger than n, and future efforts will be devoted to extending the proposed method to handle this high-dimensional case. In addition, generalizations of the proposed method to other regression models (e.g., transformation and additive hazards models [7,44]) and to multivariate interval censoring [45] warrant further research.
Methodology, F.F.; Writing—review & editing, G.C.; Supervision, J.S. All authors have read and agreed to the published version of the manuscript.
The data used in the paper are not publicly available but can be requested from
The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
EM | expectation-maximization |
LASSO | the least absolute shrinkage and selection operator penalty |
ALASSO | the adaptive LASSO penalty |
SCAD | the smoothly clipped absolute deviation penalty |
SELO | the seamless-L0 penalty |
SICA | the smooth integration of counting and absolute deviation penalty |
MCP | the minimax concave penalty |
BAR | the broken adaptive ridge penalty |
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Simulation results for the variable selection and estimation accuracy with large effects.
Method | Penalty | TP | FP | MMSE (SD) | TP | FP | MMSE (SD)
---|---|---|---|---|---|---|---
Proposed method | LASSO | 3 | 1.27 | 0.163 (0.107) | 3 | 1.15 | 0.092 (0.060) |
ALASSO | 3 | 0.12 | 0.051 (0.058) | 3 | 0.11 | 0.025 (0.022) | |
SCAD | 3 | 0.09 | 0.025 (0.046) | 3 | 0.07 | 0.012 (0.014) | |
SELO | 3 | 0.14 | 0.030 (0.042) | 3 | 0.07 | 0.013 (0.016) | |
SICA | 3 | 0.16 | 0.030 (0.042) | 3 | 0.07 | 0.013 (0.015) | |
MCP | 3 | 0.16 | 0.025 (0.042) | 3 | 0.07 | 0.012 (0.015) | |
BAR | 3 | 0.13 | 0.032 (0.040) | 3 | 0.11 | 0.014 (0.016) | |
Oracle | - | - | 0.024 (0.041) | - | - | 0.011 (0.014) | |
Without VS | - | - | 0.057 (0.055) | - | - | 0.026 (0.020) | |
PCL method | LASSO | 3 | 1.42 | 0.163 (0.162) | 3 | 1.29 | 0.089 (0.074) |
ALASSO | 3 | 0.20 | 0.081 (0.108) | 3 | 0.13 | 0.033 (0.041) | |
SCAD | 3 | 0.12 | 0.056 (0.116) | 3 | 0.08 | 0.025 (0.041) | |
SELO | 3 | 0.18 | 0.065 (0.087) | 3 | 0.10 | 0.021 (0.038) | |
SICA | 3 | 0.18 | 0.066 (0.087) | 3 | 0.10 | 0.021 (0.037) | |
MCP | 3 | 0.13 | 0.054 (0.116) | 3 | 0.10 | 0.025 (0.043) | |
BAR | 3 | 0.19 | 0.065 (0.089) | 3 | 0.13 | 0.023 (0.035) |
“Proposed method” denotes the proposed penalized EM algorithm; “Without VS” denotes the analysis method without conducting variable selection; and “PCL method” denotes the variable selection method based on the penalized conditional likelihood.
Simulation results for the variable selection and estimation accuracy with weak covariate effects.
Method | Penalty | TP | FP | MMSE (SD) | TP | FP | MMSE (SD)
---|---|---|---|---|---|---|---
Proposed method | LASSO | 3 | 0.80 | 0.047 (0.040) | 3 | 0.73 | 0.024 (0.024) |
ALASSO | 3 | 0.25 | 0.027 (0.024) | 3 | 0.12 | 0.009 (0.010) | |
SCAD | 3 | 0.36 | 0.030 (0.028) | 3 | 0.17 | 0.010 (0.011) | |
SELO | 3 | 0.23 | 0.017 (0.019) | 3 | 0.10 | 0.008 (0.009) | |
SICA | 2.99 | 0.22 | 0.017 (0.021) | 3 | 0.07 | 0.008 (0.008) | |
MCP | 2.98 | 0.25 | 0.018 (0.025) | 3 | 0.10 | 0.007 (0.009) | |
BAR | 3 | 0.22 | 0.015 (0.018) | 3 | 0.12 | 0.008 (0.009) | |
Oracle | - | - | 0.013 (0.014) | - | - | 0.007 (0.007) | |
Without VS | - | - | 0.043 (0.031) | - | - | 0.020 (0.013) | |
PCL method | LASSO | 2.93 | 1.18 | 0.065 (0.075) | 3 | 0.72 | 0.039 (0.040) |
ALASSO | 2.85 | 0.50 | 0.060 (0.062) | 2.98 | 0.05 | 0.015 (0.028) | |
SCAD | 2.86 | 0.55 | 0.076 (0.063) | 2.99 | 0.15 | 0.017 (0.026) | |
SELO | 2.87 | 0.47 | 0.053 (0.060) | 2.98 | 0.06 | 0.018 (0.028) | |
SICA | 2.88 | 0.48 | 0.053 (0.059) | 2.98 | 0.07 | 0.015 (0.027) | |
MCP | 2.84 | 0.43 | 0.072 (0.068) | 2.98 | 0.05 | 0.015 (0.028) | |
BAR | 2.87 | 0.37 | 0.052 (0.059) | 2.97 | 0.08 | 0.014 (0.028) |
“Proposed method” denotes the proposed penalized EM algorithm; “Without VS” denotes the analysis method without conducting variable selection; and “PCL method” denotes the variable selection method based on the penalized conditional likelihood.
Simulation results for the variable selection and estimation accuracy with a larger number of covariates.
| Method | Penalty | TP | FP | MMSE (SD) | TP | FP | MMSE (SD) |
|---|---|---|---|---|---|---|---|
| Proposed method | LASSO | 3 | 1.59 | 0.307 (0.143) | 3 | 1.55 | 0.184 (0.082) |
| | ALASSO | 3 | 0.45 | 0.085 (0.075) | 3 | 0.13 | 0.028 (0.030) |
| | SCAD | 3 | 0.26 | 0.041 (0.073) | 3 | 0.15 | 0.011 (0.017) |
| | SELO | 3 | 0.35 | 0.046 (0.051) | 3 | 0.12 | 0.014 (0.019) |
| | SICA | 3 | 0.31 | 0.045 (0.051) | 3 | 0.12 | 0.015 (0.019) |
| | MCP | 3 | 0.14 | 0.030 (0.043) | 3 | 0.13 | 0.012 (0.017) |
| | BAR | 3 | 0.46 | 0.042 (0.044) | 3 | 0.23 | 0.015 (0.018) |
| Oracle | - | - | - | 0.026 (0.035) | - | - | 0.010 (0.017) |
| Without VS | - | - | - | 0.216 (0.148) | - | - | 0.087 (0.043) |
| PCL method | LASSO | 3 | 1.87 | 0.362 (0.223) | 3 | 1.33 | 0.220 (0.129) |
| | ALASSO | 3 | 0.85 | 0.114 (0.124) | 3 | 0.24 | 0.027 (0.046) |
| | SCAD | 2.97 | 0.41 | 0.080 (0.134) | 3 | 0.14 | 0.028 (0.038) |
| | SELO | 2.99 | 0.80 | 0.094 (0.106) | 3 | 0.20 | 0.025 (0.035) |
| | SICA | 3 | 0.50 | 0.091 (0.112) | 3 | 0.18 | 0.024 (0.035) |
| | MCP | 2.98 | 0.44 | 0.073 (0.146) | 3 | 0.15 | 0.021 (0.044) |
| | BAR | 2.99 | 0.77 | 0.100 (0.136) | 3 | 0.37 | 0.026 (0.037) |
“Proposed method” denotes the proposed penalized EM algorithm; “Without VS” denotes the analysis method without conducting variable selection; and “PCL method” denotes the variable selection method based on the penalized conditional likelihood.
Simulation results for the variable selection and estimation accuracy with
| Method | Penalty | TP | FP | MMSE (SD) | TP | FP | MMSE (SD) |
|---|---|---|---|---|---|---|---|
| Proposed method | LASSO | 3 | 1.54 | 0.394 (0.117) | 3 | 1.33 | 0.269 (0.091) |
| | ALASSO | 3 | 0.45 | 0.130 (0.076) | 3 | 0.21 | 0.037 (0.033) |
| | SCAD | 3 | 0.29 | 0.038 (0.061) | 3 | 0.03 | 0.014 (0.023) |
| | SELO | 3 | 0.27 | 0.049 (0.041) | 3 | 0.14 | 0.018 (0.022) |
| | SICA | 3 | 0.27 | 0.048 (0.047) | 3 | 0.20 | 0.016 (0.020) |
| | MCP | 3 | 0.14 | 0.028 (0.046) | 3 | 0.06 | 0.014 (0.017) |
| | BAR | 3 | 0.58 | 0.045 (0.044) | 3 | 0.37 | 0.017 (0.020) |
| Oracle | - | - | - | 0.021 (0.030) | - | - | 0.011 (0.016) |
| Without VS | - | - | - | 0.637 (0.453) | - | - | 0.169 (0.070) |
| PCL method | LASSO | 3 | 1.74 | 0.469 (0.263) | 3 | 1.58 | 0.297 (0.139) |
| | ALASSO | 2.99 | 0.82 | 0.155 (0.160) | 3 | 0.38 | 0.037 (0.045) |
| | SCAD | 2.99 | 0.61 | 0.101 (0.152) | 3 | 0.20 | 0.024 (0.035) |
| | SELO | 2.99 | 0.70 | 0.088 (0.105) | 3 | 0.54 | 0.023 (0.051) |
| | SICA | 2.99 | 0.65 | 0.089 (0.103) | 3 | 0.57 | 0.022 (0.051) |
| | MCP | 2.99 | 0.56 | 0.093 (0.157) | 3 | 0.29 | 0.025 (0.041) |
| | BAR | 3 | 1.02 | 0.097 (0.164) | 3 | 0.62 | 0.036 (0.042) |
“Proposed method” denotes the proposed penalized EM algorithm; “Without VS” denotes the analysis method without conducting variable selection; and “PCL method” denotes the variable selection method based on the penalized conditional likelihood.
Analysis results of the prostate cancer screening data.
| Method | Covariate | LASSO | ALASSO | SCAD | SELO | SICA | MCP | BAR | Without VS |
|---|---|---|---|---|---|---|---|---|---|
| Proposed method | Race | 0.332 (0.084) | 0.364 * (0.088) | 0.444 * (0.072) | 0.408 * (0.082) | 0.408 * (0.083) | 0.444 * (0.072) | 0.412 * (0.081) | 0.458 * (0.064) |
| | Education | 0.047 (0.027) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0.070 * (0.031) |
| | Cancer | 0.032 (0.029) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0.047 (0.032) |
| | ProsCancer | 0.346 * (0.053) | 0.377 * (0.058) | 0.421 * (0.049) | 0.397 * (0.053) | 0.398 * (0.056) | 0.421 * (0.049) | 0.405 * (0.052) | 0.394 * (0.049) |
| | Diabetes | −0.310 * (0.060) | −0.336 * (0.068) | −0.416 * (0.059) | −0.373 * (0.065) | −0.374 * (0.068) | −0.416 * (0.059) | −0.384 * (0.062) | −0.402 * (0.064) |
| | Stroke | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | −0.244 * (0.109) |
| | Gallblad | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | −0.064 (0.058) |
| PCL method | Race | 0.307 * (0.087) | 0.399 * (0.114) | 0.420 * (0.102) | 0.385 * (0.102) | 0.394 * (0.105) | 0.420 * (0.102) | 0.377 * (0.114) | 0.430 * (0.067) |
| | Education | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | −0.005 (0.032) |
| | Cancer | 0.075 * (0.033) | 0.065 (0.049) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0.089 * (0.033) |
| | ProsCancer | 0.360 * (0.062) | 0.407 * (0.062) | 0.453 * (0.062) | 0.436 * (0.063) | 0.441 * (0.064) | 0.453 * (0.062) | 0.437 * (0.063) | 0.407 * (0.051) |
| | Diabetes | −0.216 * (0.073) | −0.286 * (0.078) | −0.316 * (0.077) | −0.266 * (0.089) | −0.279 * (0.093) | −0.316 * (0.077) | −0.266 * (0.087) | −0.320 * (0.065) |
| | Stroke | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | −0.011 (0.109) |
| | Gallblad | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0.090 (0.060) |
“Proposed method” denotes the proposed penalized EM algorithm; “PCL method” denotes the variable selection method based on the penalized conditional likelihood; “Without VS” denotes the analysis method without conducting variable selection; and “*” indicates that the covariate effect is significant at the level of 0.05.
Appendix A. Proofs of The Asymptotic Results
Appendix A.1. Proof of Theorem 1
The penalized log-likelihood function can be written as
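The displayed equation did not survive extraction. As a generic illustration only (the notation here is assumed, not taken verbatim from the paper), a penalized log-likelihood with penalty function $p_{\lambda}(\cdot)$ typically takes the form

```latex
\ell_{p}(\boldsymbol{\beta}, \Lambda)
  \;=\; \ell_{n}(\boldsymbol{\beta}, \Lambda)
  \;-\; n \sum_{j=1}^{p} p_{\lambda}\bigl(|\beta_{j}|\bigr),
```

where $\ell_{n}(\boldsymbol{\beta}, \Lambda)$ denotes the observed-data log-likelihood under the PH model, $\Lambda$ is the cumulative baseline hazard, and $\lambda$ is a tuning parameter.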
According to Proposition 3.1 and the discussion in Section 4.4 of Huang and Wellner [47],
To prove (A3), we have
Appendix A.2. Proof of Theorem 2
(i) According to Theorem 1, we have
For any
(ii) We next prove the asymptotic normality of
Let
Because
On the other hand, because
Appendix B. The Penalized Conditional Likelihood Method
In this section, we provide the EM algorithm for implementing the maximum conditional likelihood approach. The conditional likelihood function can be written as
To present the EM algorithm, for the ith subject, we introduce a set of new independent latent variables
In the E-step of the EM algorithm, we need to determine the conditional expectation of the log-likelihood function
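As a small illustration of the kind of conditional expectation the E-step requires: in Poisson-based data augmentation schemes for interval-censored PH models, a latent count $W \sim \text{Poisson}(\mu)$ with $\mu = \Lambda(t)\exp(\boldsymbol{\beta}^{\top}\mathbf{x})$ is often observed only through the event $\{W \geq 1\}$, and its conditional mean has the closed form $E[W \mid W \geq 1] = \mu / (1 - e^{-\mu})$. The sketch below (function names and the specific rate are ours, not the paper's) computes this E-step weight and verifies it by simulation:

```python
import math
import random

def expected_positive_poisson(mu: float) -> float:
    """Conditional mean E[W | W >= 1] for a latent W ~ Poisson(mu).

    P(W >= 1) = 1 - exp(-mu), and E[W * 1{W >= 1}] = E[W] = mu because
    the W = 0 term contributes nothing; the ratio gives the E-step weight.
    """
    if mu <= 0.0:
        raise ValueError("mu must be positive")
    return mu / (1.0 - math.exp(-mu))

def mc_check(mu: float, n: int = 200_000, seed: int = 1) -> float:
    """Monte Carlo estimate of E[W | W >= 1] to sanity-check the closed form."""
    rng = random.Random(seed)
    total = kept = 0
    for _ in range(n):
        # Knuth's method for a Poisson(mu) draw (adequate for small mu).
        limit, k, p = math.exp(-mu), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                break
            k += 1
        if k >= 1:  # keep only draws satisfying the conditioning event
            total += k
            kept += 1
    return total / kept
```

For an assumed rate such as mu = 1.3, the Monte Carlo estimate agrees with the closed form to within simulation error; as mu grows, the conditioning event becomes almost sure and the weight approaches mu itself.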
References
1. Sun, J. The Statistical Analysis of Interval-Censored Failure Time Data; Springer: New York, NY, USA, 2006.
2. Huang, J. Efficient estimation for the proportional hazards model with interval censoring. Ann. Stat.; 1996; 24, pp. 540-568. [DOI: https://dx.doi.org/10.1214/aos/1032894452]
3. Shen, X. Proportional odds regression and sieve maximum likelihood estimation. Biometrika; 1998; 85, pp. 165-177. [DOI: https://dx.doi.org/10.1093/biomet/85.1.165]
4. Zeng, D.; Cai, J.; Shen, Y. Semiparametric additive risks model for interval-censored data. Stat. Sin.; 2006; 16, pp. 287-302.
5. Zhang, Y.; Hua, L.; Huang, J. A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scand. J. Stat.; 2010; 37, pp. 338-354. [DOI: https://dx.doi.org/10.1111/j.1467-9469.2009.00680.x]
6. Wang, L.; McMahan, C.S.; Hudgens, M.G.; Qureshi, Z.P. A flexible, computationally efficient method for fitting the proportional hazards model to interval-censored data. Biometrics; 2016; 72, pp. 222-231. [DOI: https://dx.doi.org/10.1111/biom.12389] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26393917]
7. Zeng, D.; Mao, L.; Lin, D.Y. Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika; 2016; 103, pp. 253-271. [DOI: https://dx.doi.org/10.1093/biomet/asw013]
8. Zhou, Q.; Hu, T.; Sun, J. A sieve semiparametric maximum likelihood approach for regression analysis of bivariate interval-censored failure time data. J. Am. Stat. Assoc.; 2017; 112, pp. 664-672. [DOI: https://dx.doi.org/10.1080/01621459.2016.1158113]
9. Prorok, P.C.; Andriole, G.L.; Bresalier, R.S.; Buys, S.S.; Chia, D.; Crawford, E.D.; Fogel, R.; Gelmann, E.P.; Gilbert, F.; Gohagan, J.K. Design of the prostate, lung, colorectal and ovarian (PLCO) cancer screening trial. Control. Clin. Trials; 2000; 21, pp. 273S-309S. [DOI: https://dx.doi.org/10.1016/S0197-2456(00)00098-2]
10. Wang, M.C. Nonparametric estimation from cross-sectional survival data. J. Am. Stat. Assoc.; 1991; 86, pp. 130-143. [DOI: https://dx.doi.org/10.1080/01621459.1991.10475011]
11. Shen, Y.; Ning, J.; Qin, J. Analyzing length-biased data with semiparametric transformation and accelerated failure time models. J. Am. Stat. Assoc.; 2009; 104, pp. 1192-1202. [DOI: https://dx.doi.org/10.1198/jasa.2009.tm08614]
12. Ning, J.; Qin, J.; Shen, Y. Semiparametric accelerated failure time model for length-biased data with application to dementia study. Stat. Sin.; 2014; 24, pp. 313-333. [DOI: https://dx.doi.org/10.5705/ss.2011.197] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24478570]
13. Qin, J.; Shen, Y. Statistical methods for analyzing right-censored length-biased data under Cox model. Biometrics; 2010; 66, pp. 382-392. [DOI: https://dx.doi.org/10.1111/j.1541-0420.2009.01287.x] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/19522872]
14. Qin, J.; Ning, J.; Liu, H.; Shen, Y. Maximum likelihood estimations and EM algorithms with length-biased data. J. Am. Stat. Assoc.; 2011; 106, pp. 1434-1449. [DOI: https://dx.doi.org/10.1198/jasa.2011.tm10156] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22323840]
15. Gao, F.; Chan, K.C.G. Semiparametric regression analysis of length-biased interval-censored data. Biometrics; 2019; 75, pp. 121-132. [DOI: https://dx.doi.org/10.1111/biom.12970]
16. Shen, P.S.; Peng, Y.; Chen, H.J.; Chen, C.M. Maximum likelihood estimation for length-biased and interval-censored data with a nonsusceptible fraction. Lifetime Data Anal.; 2022; 28, pp. 68-88. [DOI: https://dx.doi.org/10.1007/s10985-021-09536-2]
17. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.); 1996; 58, pp. 267-288. [DOI: https://dx.doi.org/10.1111/j.2517-6161.1996.tb02080.x]
18. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc.; 2001; 96, pp. 1348-1360. [DOI: https://dx.doi.org/10.1198/016214501753382273]
19. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc.; 2006; 101, pp. 1418-1429. [DOI: https://dx.doi.org/10.1198/016214506000000735]
20. Lv, J.; Fan, Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann. Stat.; 2009; 37, pp. 3498-3528. [DOI: https://dx.doi.org/10.1214/09-AOS683]
21. Dicker, L.; Huang, B.; Lin, X. Variable selection and estimation with the seamless-L0 penalty. Stat. Sin.; 2013; 23, pp. 929-962.
22. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat.; 2010; 38, pp. 894-942. [DOI: https://dx.doi.org/10.1214/09-AOS729]
23. Liu, Z.; Li, G. Efficient regularized regression with L0 penalty for variable selection and network construction. Comput. Math. Methods Med.; 2016; 2016, 3456153. [DOI: https://dx.doi.org/10.1155/2016/3456153]
24. Dai, L.; Chen, K.; Sun, Z.; Liu, Z.; Li, G. Broken adaptive ridge regression and its asymptotic properties. J. Multivar. Anal.; 2018; 168, pp. 334-351. [DOI: https://dx.doi.org/10.1016/j.jmva.2018.08.007] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30911202]
25. Fan, J.; Li, R.; Zhang, C.H.; Zou, H. Statistical Foundations of Data Science; Chapman and Hall/CRC: New York, NY, USA, 2020.
26. Garavand, A.; Salehnasab, C.; Behmanesh, A.; Aslani, N.; Zadeh, A.; Ghaderzadeh, M. Efficient Model for Coronary Artery Disease Diagnosis: A Comparative Study of Several Machine Learning Algorithms. J. Healthc. Eng.; 2022; 2022, 5359540. [DOI: https://dx.doi.org/10.1155/2022/5359540] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36304749]
27. Hosseini, A.; Eshraghi, M.A.; Taami, T.; Sadeghsalehi, H.; Hoseinzadeh, Z.; Ghaderzadeh, M.; Rafiee, M. A mobile application based on efficient lightweight CNN model for classification of B-ALL cancer from non-cancerous cells: A design and implementation study. Inform. Med. Unlocked; 2023; 39, 101244. [DOI: https://dx.doi.org/10.1016/j.imu.2023.101244]
28. Garavand, A.; Behmanesh, A.; Aslani, N.; Sadeghsalehi, H.; Ghaderzadeh, M. Towards Diagnostic Aided Systems in Coronary Artery Disease Detection: A Comprehensive Multiview Survey of the State of the Art. Int. J. Intell. Syst.; 2023; 2023, 6442756. [DOI: https://dx.doi.org/10.1155/2023/6442756]
29. Ghaderzadeh, M.; Aria, M. Management of Covid-19 Detection Using Artificial Intelligence in 2020 Pandemic. Proceedings of the ICMHI ’21: 5th International Conference on Medical and Health Informatics; Kyoto, Japan, 14–16 May 2021; pp. 32-38.
30. Chen, L.P. Variable selection and estimation for the additive hazards model subject to left-truncation, right-censoring and measurement error in covariates. J. Stat. Comput. Simul.; 2020; 90, pp. 3261-3300. [DOI: https://dx.doi.org/10.1080/00949655.2020.1800705]
31. He, D.; Zhou, Y.; Zou, H. High-dimensional variable selection with right-censored length-biased data. Stat. Sin.; 2020; 30, pp. 193-215. [DOI: https://dx.doi.org/10.5705/ss.202018.0089]
32. Li, C.; Pak, D.; Todem, D. Adaptive lasso for the Cox regression with interval censored and possibly left truncated data. Stat. Methods Med. Res.; 2020; 29, pp. 1243-1255. [DOI: https://dx.doi.org/10.1177/0962280219856238]
33. Withana Gamage, P.; McMahan, C.; Wang, L. Variable selection in semiparametric nonmixture cure model with interval-censored failure time data. Stat. Med.; 2019; 38, pp. 3026-3039.
34. Li, S.; Peng, L. Instrumental Variable Estimation of Complier Causal Treatment Effect with Interval-Censored Data. Biometrics; 2023; 79, pp. 253-263. [DOI: https://dx.doi.org/10.1111/biom.13565]
35. Withana Gamage, P.; McMahan, C.; Wang, L. A flexible parametric approach for analyzing arbitrarily censored data that are potentially subject to left truncation under the proportional hazards model. Lifetime Data Anal.; 2023; 29, pp. 188-212. [DOI: https://dx.doi.org/10.1007/s10985-022-09579-z]
36. Huang, C.Y.; Qin, J. Semiparametric estimation for the additive hazards model with left-truncated and right-censored data. Biometrika; 2013; 100, pp. 877-888. [DOI: https://dx.doi.org/10.1093/biomet/ast039]
37. Turnbull, B.W. The empirical distribution function with arbitrarily grouped, censored and truncated data. J. R. Stat. Soc. Ser. B (Methodol.); 1976; 38, pp. 290-295. [DOI: https://dx.doi.org/10.1111/j.2517-6161.1976.tb01597.x]
38. Zhang, H.H.; Lu, W. Adaptive Lasso for Cox’s proportional hazards model. Biometrika; 2007; 94, pp. 691-703. [DOI: https://dx.doi.org/10.1093/biomet/asm037]
39. Li, S.; Wu, Q.; Sun, J. Penalized estimation of semiparametric transformation models with interval-censored data and application to Alzheimer’s disease. Stat. Methods Med. Res.; 2020; 29, pp. 2151-2166. [DOI: https://dx.doi.org/10.1177/0962280219884720] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31718478]
40. Zou, H.; Li, R. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat.; 2008; 36, pp. 1509-1533.
41. Andriole, G.L.; Crawford, E.D.; Grubb, R.L.; Buys, S.S.; Chia, D.; Church, T.R.; Fouad, M.N.; Isaacs, C.; Prorok, P. Prostate cancer screening in the randomized prostate, lung, colorectal, and ovarian cancer screening trial: Mortality results after 13 years of follow-up. J. Natl. Cancer Inst.; 2012; 104, pp. 125-132. [DOI: https://dx.doi.org/10.1093/jnci/djr500] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22228146]
42. Meister, K. Risk Factors for Prostate Cancer; American Council on Science and Health: New York, NY, USA, 2002.
43. Pierce, B.L. Why are diabetics at reduced risk for prostate cancer? A review of the epidemiologic evidence. Urol. Oncol. Semin. Orig. Investig.; 2012; 30, pp. 735-743. [DOI: https://dx.doi.org/10.1016/j.urolonc.2012.07.008]
44. Lu, T.; Li, S.; Sun, L. Combined estimating equation approaches for the additive hazards model with left-truncated and interval-censored data. Lifetime Data Anal.; 2023; 29, pp. 672-697. [DOI: https://dx.doi.org/10.1007/s10985-023-09596-6]
45. Sun, L.; Li, S.; Wang, L.; Song, X.; Sui, X. Simultaneous variable selection in regression analysis of multivariate interval-censored data. Biometrics; 2022; 78, pp. 1402-1413. [DOI: https://dx.doi.org/10.1111/biom.13548] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34407218]
46. Murphy, S.A.; Van Der Vaart, A.W. On profile likelihood. J. Am. Stat. Assoc.; 2000; 95, pp. 449-465. [DOI: https://dx.doi.org/10.1080/01621459.2000.10474219]
47. Huang, J.; Wellner, J.A. Interval censored survival data: A review of recent progress. Proceedings of the First Seattle Symposium in Biostatistics; Lin, D.Y.; Fleming, T.R. Springer: New York, NY, USA, 1997; pp. 123-169.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Length-biased failure time data arise frequently in biomedical fields, including clinical trials, epidemiological cohort studies and genome-wide association studies, and their analysis has attracted a surge of interest. In practical applications, because one may collect a large number of candidate covariates for the failure event of interest, variable selection becomes a useful tool for identifying the important risk factors and enhancing estimation accuracy. In this paper, we consider Cox’s proportional hazards model and develop a penalized variable selection technique with various popular penalty functions for length-biased data in which the failure event of interest suffers from interval censoring. Specifically, a computationally stable and reliable penalized expectation-maximization (EM) algorithm based on two-stage data augmentation is developed to overcome the challenge of maximizing the intractable penalized likelihood. We establish the oracle property of the proposed method and present simulation results suggesting that it outperforms the traditional variable selection method based on the conditional likelihood. The proposed method is then applied to real data from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial. The analysis shows that being African American and having immediate family members with prostate cancer significantly increase the risk of developing prostate cancer, while having diabetes is associated with a significantly lower risk.
Details
1 School of Mathematics, Jilin University, Changchun 130012, China;
2 Guangzhou Institute of International Finance, Guangzhou University, Guangzhou 510006, China
3 Department of Statistics, University of Missouri, Columbia, MO 65211, USA;