Introduction
Modeling medical costs is a central issue in health economics and insurance, particularly for identifying risk factors and informing cost-effective healthcare interventions. However, the presence of severe skewness, and complex nonlinear effects in medical cost data poses significant challenges for accurate modeling and analysis.
To address longitudinal medical cost data, various statistical methods have been developed. For example, marginal models describe relationships among repeated observations [1], Markov models analyze transitions between health states over time [2], and random effects models capture relationships between medical expenditures and terminal events such as death [3]. Recently, a growing number of models have been proposed to address high-dimensional covariates in medical cost data. Prominent examples include generalized partially linear models [4, 5], single-index models [6–8], and partially nonlinear single-index models [9–11]. These frameworks offer flexibility for analyzing skewed medical cost data but often assume a known link function, which reduces their adaptability in practical settings. Furthermore, studies relying on a subset of covariates may overlook critical information inherent in high-dimensional data.
To overcome the “curse of dimensionality," sufficient dimension reduction techniques have emerged as effective tools aiming to reduce the dimensionality of the data while retaining all relevant information for prediction or inference. Key methodological advancements include techniques such as minimum average variance estimation [12], inverse regression methods [13–15], and semiparametric approaches [16]. However, these methods can fail in the presence of redundant variables. To address this issue, Chiaromonte et al. (2002) proposed partial sufficient dimension reduction for discrete variables [17], and Feng et al. (2013) extended this idea to continuous variables through the Partial Discretization Expectation Estimation (PDEE) method [18]. Partial sufficient dimension reduction effectively eliminates redundancy while preserving essential information, making it particularly suitable for high-dimensional and complex datasets.
Following dimension reduction, additive models have been widely applied for post-reduction modeling. Notable methods include backfitting algorithms [19], average regression estimation [20], and fast instrumental variable pilot estimation [21, 22]. While computationally efficient, these approaches face challenges in capturing multivariate nonlinear relationships and ensuring theoretical consistency under high-dimensional conditions.
To address these limitations, this paper proposes a Partially Nonlinear Index Model (PNIM) that integrates partial sufficient dimension reduction with fast instrumental variable pilot estimation. Unlike most existing models, the PNIM does not assume a known link function, providing greater flexibility in capturing complex relationships. By combining partial sufficient dimension reduction with fast instrumental variable pilot estimation, our method effectively handles high-dimensional covariates while maintaining computational efficiency. Simulation studies and real-world applications demonstrate the proposed model’s superior performance in identifying structural dimensions, accurately estimating relationships, and addressing challenges in medical cost analysis.
The remainder of this paper is organized as follows: Section 2 introduces the model framework; Section 3 outlines the estimation procedure; Section 4 discusses asymptotic properties; Section 5 presents simulation results; Section 6 applies the model to MEPS data; and Section 7 concludes with key findings and policy implications.
Model Specification
Suppose that Y is a response variable and X is a p–dimensional covariate vector. The variation of the response can be explained by covariates with the general function . Additionally, there may exist some nonlinear covariate effect, e.g., age on medical costs [4, 23]. Combining these two parts, we can construct the model below:
(1)
However, is difficult to estimate accurately and challenging to implement in practice due to the “curse of dimensionality" when p is large [24]. To reduce dimensionality while explaining the variation in the response, we use an index model. Specifically, we aim to find such that can be approximated by a lower-dimensional function . Thus, our partially nonlinear index model (PNIM) is:
(2)
where is the regression intercept, is a p-dimensional covariate vector, is another covariate of interest which has a nonlinear effect on the outcome, for , and is an unspecified q–variate regression function. We assume error term to be independently identically distributed with mean 0 and finite variance , and is independent of . Ideally, if , we achieve sufficient dimension reduction, as extensively discussed in literature [25].
Model (2) is a very general form, including many existing semiparametric and nonparametric models as special cases. For instance, model (2) reduces to a nonparametric model if [26]. If the index function and q = 1, model (2) becomes a partially linear model discussed by [27]. When the nonlinear function , model (2) reduces to a multi-index model, and the single-index model as its special case with q = 1 [28].
Remark 1: In model (2), and are unspecified functions to be estimated, is an unknown parameter vector. For identifiability, as is common in additive models, we assume that (see Fan et al. (1998) [20]). Unlike Xia (2008) [24], we impose no restrictions on the parameter when applying the partial sufficient dimension reduction using the sliced inverse regression method [13], as this approach does not require a specific form of the regression function.
Estimation for PNIM
We employ a two-step procedure for estimation in model (2). In the first step, we search for a lower q–dimensional covariate vector using partially sufficient dimension reduction [18]. Once we obtain the estimated basis direction and the structural dimension q, the second step involves estimating the nonparametric component via the fast instrumental variable pilot estimation [21], which is based on the standard additive model framework.
Partial Sufficient Dimension Reduction
Dimension reduction plays a crucial role in addressing high-dimensional covariates in regression analysis. Traditional methods, such as minimum mean variance estimation [13], inverse regression methods [29, 30], and semiparametric estimation [25], provide effective tools for reducing dimensionality. However, these methods often fail to account for redundant variables in high-dimensional settings.
To address this limitation, we employ partial sufficient dimension reduction (PSDR), which extends the concept of sufficient dimension reduction by eliminating redundancy while preserving essential structural information. Specifically, we introduce the partial central subspace , defined as the low-dimensional subspace of X that retains all information about Y, conditioned on T. Formally, satisfies:
(3)
where denotes statistical independence, and is the orthogonal projection operator onto subspace in the standard inner product space. This implies that Y is independent of X given the projection and T.
The concept of the partial central subspace was first introduced by Chiaromonte et al. (2002) [17] for categorical variables T. However, when T is continuous, directly identifying becomes challenging due to the complexity of conditional distributions. To overcome this difficulty, we adopt the Partial Discretization Expectation Estimation (PDEE) method proposed by Feng et al., (2013) [18], which simplifies the estimation process through discretization.The PDEE method consists of two main steps: 1. Partial discretization. The continous variable T is transformed into a set of dichotomous variables. This step simplifies the estimation process by creating discrete strata based on T, making it easier to compute conditional expectations. 2. Subspace estimation. Using the transformed variables and the original predictors X, the partial central subspace is estimated by finding the basis directions that retain the most information about Y . The specific steps of the operation are as follows:
Step 1: Partial Discretization
Discretize the continuous variable T into a set of binary variables. Let , where s is a sample observation value from a random variable S with support , and IA denotes the indicator function of event A. This creates two sample subsets:
* , with sample size n1, corresponding to ,
* , with sample size n2, corresponding to .
Step 2: Subspace Estimation
Let denote the partial central subspace for . To estimate , we define as a positive semidefinite matrix, for any given s, there exists Span and Span, where . The estimating procedures for M(s) and M are detailed as follows:
* Step 2-A: For any fixed , we can gain Mn(s), a consistent estimate of M(s), using available partial dimension reduction methods such as the partial sliced inverse regression (SIR) estimation [17], partial sliced average variance estimation (SAVE) [31], or partial directional regression (DR) [14].
In this subsection, we utilize SIR method apply on the samples and to obtain the estimates of the basis and , respectively. Then we have a matrix , where .
* Step 2-B: M, the expected value of M(s), can be calculated by considering all possible values of s. Here, we choose an easy way where s is an independent copy of T. Let be a set of l independent copies of T. For each si, we gain M(si), then M can be estimated by(4)
* Step 2-C: Perform the spectral decomposition of M. Arrange the obtained feature vectors in descending order base on their eigenvalues. Assuming the structural dimension of the partially reduced subspace is d, then is the the estimate of the partially central subspace.
Existing methods in literature can be used to determine the structural dimension d, such as the sequential test method and the weighted sequential test method [13, 32]. In this paper, we apply the modified BIC-type criterion of Feng et al. (2013) [18] to estimate the structural dimension:
(5)
where are the eigenvalues of matrix Mn, is the number of free parameters, and Cn is a penalty constant. In our simulation studies, we will follow Feng et al. (2013) [18] to choose .
Fast instrumental variable pilot estimation
One can easily find that model (2) simplifies to a standard additive model if the basis direction and the structural dimension q of dimension reduction subspace are estimated. We will discuss the estimation methods of the index function and the nonlinear function in the additive model as follows:
(6)
where for , and is an estimator of shown in Subsection 3.1. For the standard additive model (6), average regression estimation is a popular consideration, such as average regression surface [20, 33], and modified average regression estimation [22]. In this paper, we extend the modified average regression method of Cheng et al. (2011) [22] to our models including a univariate index function or a multiple index function.
Model (6) can be rewritten as
(7)
where , , and .
We can estimate the multivariate function and the univariate function using a method similar to Cheng et al. (2011) [22]. First, we define
(8)
Let and be the density functions of Z and T, respectively, and the joint density function of (Z,T) is denoted by f(z,t). We further define a joint function
(9)
One can easily find that this function has the following two attractive characteristics:
(10)
Multiplying each side of (7) by , and further taking the expectation conditional on Z = z, we can remove the other nonparametric component :
(11)
This implies that an estimator of can be obtained through the following equation
(12)
where the estimators of and are provided in (13) and (14) below. This estimator of is a modified average regression estimator.
In the next step, we will give the estimators of and , respectively. For the independent and identically distributed sample , we have from the identifiability condition and (7). Define , thus we can estimate by
(13)
where , , and . The parameter h0 denotes a bandwidth (with the same definition for h1-h6 in the following). is a univariate kernel function. The choice of bandwidth h is critical in determining the smoothing level of the kernel estimator. In this work, we select h using cross-validation(CV), following the approach described by Cheng et al., 2011) [22], which aims to minimize the asymptotic mean squared error (AMSE). For The kernel function, we employ the standard normal density. This kernel is widely used due to its smoothness and well-behaved properties, which making it particularly effective for capturing the underlying structure of the data (see Cheng et al., 2011) [22]).
Following the similar results of Kim et al. (1999) [21], we can obtain an estimator of as follows:
(14)
where with . and are kernel smoothers of the corresponding densities fT(t) and f(z,t) using the functions and , respectively.
Similarly, we can give an estimator of as follows:
(15)
with given in (13). can be similarly given below
(16)
where . and are the kernel smoothers of the corresponding densities fZ(z) and f(z,t), which are defined similarly in (14).
Remark 2: The modified average regression estimation proposed by Cheng et al. (2011) [22] extends the fast instrumental variable pilot estimation method of Kim et al. (1999) [21] as a practical alternative for the quantile regression additive model. The key difference between the two methods lies in the treatment of the response variable: Cheng et al. (2011) [22] replace the observed Y in Kim et al. (1999) [21] with the estimated (see Equations 13, (14), and (16)).
Oracle efficient estimator
The variance of the estimator in the first step might be inflated according to the asymptotic properties of kim et al. (1999) [21]. Therefore, an oracle efficient estimate of model (7) is introduced herein, derived similar to Linton et al. (1997) [34] and kim et al. (1999) [21]. We show that it is asymptotically normally distributed, with the same variance as if the other additive components were known. The oracle estimate is obtained through the sequential fitting of univariate locally polynomial regression for each of the additive components of (7), with the other additive components replaced by their estimators from the first step.
Now we can discuss the oracle estimator for and . Assume that and are d times continuously differentiable in the neighborhood of z0 and t0, respectively. Denote with nonnegative integers , , , and . We have the local linear approximation of near a fixed point z0 as follows:
(17)
where z lies in a neighborhood of z0, and .
Furthermore, define
(18)
where and are the first step estimators defined in (14) and (16), respectively. Then, the oracle efficient estimator of and , denoted as and , can be obtained by minimizing the following terms, respectively:
(19)
and
(20)
where the kernel function is a q–dimensional kernel function, which can be written as , and is a 1–dimensional kernel function with being a univariate kernel function.
Asymptotic Properties
The asymptotic properties of the estimated basis direction and the structural dimension q of the partial sufficient dimension reduction can be derived in a manner similar to that in Feng et al. (2013) [18]. The asymptotic results of the estimators for the nonparametric components and can be established similarly by using the asymptotic results of Kim et al. (1999) [21]. Here, we focus on the asymptotic results and .
To establish the asymptotic properties of the proposed estimators, we need some regularity conditions where kernel functions , and are used to estimate the joint density fZ,T(z,t), fZ(z) and fT(t), respectively. These conditions are similar to those of kim et al. (1999) [21], but with some modification for our setting.
A1: The kernel function defined in Section 2.3 is symmetric about zero and of order d with support in [–1,1], implying that . Furthermore, K(u) is assumed to be bounded and Lipschitz continuous, i.e., there exists a constant c such that for any u,v.
A2: Function and joint density function of fZ,T(z,t) are d–times continuously differentiable in each direction, where .
A3: The joint density function fZ,T(z,t) is bounded away from both zero and infinity within its compact support.
A4: Conditional variances and are Lipschitz continuous, and bounded away from zero and infinity.
In the following, we only display the asymptotic results of and , where the asymptotic properties of and can be obtained similarly.
Theorem 1. Suppose that conditions A1–A4 hold, and bandwidth . Then we have
(21)
where
with , , m(z,t) is defined in Equation 8 and . .
If and in (18)-(20) are replaced by their true values, and is replaced by , then the estimators defined in (19) and (20) are referred to as oracle estimators, say and , respectively. The asymptotic properties of the second-step estimators defined in (19) and (20) will be given below, implying that and behave asymptotically like the oracle estimators.
Theorem 2. Suppose that conditions A1–A4 are satisfied, the bandwidth , and for some a0>0. Then, for all , we have
(22)
with probability one, as , where with .
The proof of Theorem 1 and Theorem 5 directly follows that of kim et al. (1999) [21]. However, the covariate should be replaced by in our setting. We will rewrite the asymptotic properties of as follows. The other asymptotic results can be obtained similarly.
Theorem 3. Suppose that conditions A1–A4 hold, and bandwidth . Then we have
(23)
where
with , and . , and is the estimated basis direction in Section 3.1.
Proof: The proof can be completed directly from Theorem 1 and method with the following fact:
(24)
◻
Simulation Results
In this section, we consider two examples involving different regression functions to evaluate the finite sample performance of the proposed method. The model and parameter settings are as follows.
* (I) The structural dimension is q = 1, and , assuming that data are generated from the model:(25)
* where constant is given below, Xi is from the two scenarios (A) and (B) below, respectively, and Ti is independently generated from a uniform [0,1]. This implies that and . The error is independent and identically distributed. Additionally, assume that (X,T) is independent of . Now we consider the following cases about covariates to assess the linear condition for partial sliced inverse regression:
* (A) Covariate is a 20-dimensional vector, where each component Xk is generated independently from a Bernoulli distribution with . , implying that .
* (B) We take , and covariate is a 10-dimensional vector, where each component Xk follows a standard normal distribution N(0,1), and
* (II) The structural dimension is q = 2; and ; . We assume that the sample is from the model(26)
We keep all the other assumptions the same as in setting (I), but the covariates Xi and Ti are assumed to be correlated. Let , where , with matrix elements and if . We consider the case of , which implies low correlation between covariates. Then, it is easy to find that and with the standard normal distribution random variable , since is an odd function of t.
To assess the estimation accuracy of using our proposed method, we take the rule of Fent et al. (2013) [18] by using the squared trace correlation coefficient [35], for the pair of generic random vectors and . The squared trace correlation coefficient is defined as , where . Here, and are the variance matrices of U and V, respectively, and is the covariance matrix between U and V. A squared trace correlation coefficient closer to unity indicates higher estimation efficiency. For further details, see Li and Dong (2009) [35] and references therein.
Table 1 illustrates the effectiveness of the proposed method in dimension reduction. It presents the means and the standard errors of , computed from 1000 replicates for various sample sizes and slice numbers(). The number of slices K refers to dividing the range of the response variable Y into K intervals (slices) of roughly equal size, with each slice used to estimate the relationship between Y and the covariates X. The slicing process is a key step in partial sliced inverse regression, helping to reduce data complexity while retaining essential statistical information. reflects the alignment between the true and estimated central subspaces, with values closer to 1 indicating better dimension reduction performance. The results demonstrate that higher sample sizes improve estimation accuracy, with values approaching unity. The proportion of correct structural dimension estimates by BIC is also examined in this simulation across 1000 repeated experiments for various scenarios. All experiments were conducted on a MacBook Air (M220222) equipped with an 8-core CPU, 8-core GPU, and 24GB of unified memory. We also recorded the time required for 1000 replicates. From the perspective of computation time, the proposed method exhibits reasonable computational performance.
[Figure omitted. See PDF.]
Based on the obtained structural dimension and central subspace, fast instrumental variable pilot estimation is employed to estimate the nonparametric components and . It is challenging to directly from the 3-dimensional picture for the estimated bivariate curve in a plane, and hence we only present the estimated curves of and of case I in the first four pictures of Figure 1, and for case (II) in the fifth picture of Figure 1.
[Figure omitted. See PDF.]
a: Estimated g(x) for case (A, I). b: Estimated f(t) for case (A, I). c: Estimated g(x) for case (B, I). d: Estimated f(t) for case (B, I). e: Estimated f(t) for case II, = 0.2. f: Estimated f(t) for case II, = 0.6. Note: The nonparametric estimation plots are based on simulations with the sample size n = 1000 and the slice number K = 5 are chosen to reflect typical scenarios in dimension reduction tasks. In Case II, only is plotted to highlight the specific behavior of the response variable under these conditions. We present the estimation of for the weak correlation case ( = 0.2) and the high correlation case ( = 0.6).
From Table 1 and Fig 1 , it is evident that the estimated parameters and nonparametric components closely approximate their respective true values across the majority of conditions. Nonetheless, performance is subject to variation across different scenarios as a result of factors including covariate correlation, structural dimension, and sample size. As the correlation among covariates increases, the estimation accuracy for basis directions decreases, for instance, r2 drops from 0.9317 at to 0.9303 at , and further to 0.8267 at for n = 1000. This demonstrates that higher correlations among covariates can distort the estimation of the central subspace. Conversely, the method performs well under low correlation or independence. Larger sample sizes significantly enhance estimation precision. For instance, at , r2 increases from 0.9082 for n = 500 to 0.9589 for n = 2000 when K = 5. Similarly, for , r2 improves from 0.9010 for n = 500 to 0.9363 for n = 2000. However, in cases with high correlation as , the improvement is less significant, with r2 decreasing from 0.8405 for n = 500 to 0.7917 for n = 2000, indicating that high correlation poses challenges for accurate subspace estimation. The choice of slice numbers (K = 5 or K = 10) has a relatively minor impact on results, as r2 values remain consistent across most cases. For example, at and n = 2000, the r2 values for K = 5 and K = 10 are 0.9363 and 0.9409, respectively. This suggests that the proposed method is robust to the choice of slice numbers, simplifying its practical implementation. Finally, the simulation results indicate that the proportion of correct structural dimension estimates using BIC is consistently 100% across all scenarios and sample sizes. This underscores the robustness of the proposed method in accurately determining structural dimensions, even under conditions with higher covariate dimensions and correlations. Comparing CaseI and CaseII, CaseII is slightly worse than CaseI due to the higher structural dimension (d = 2) and correlated covariates(). These factors increase the complexity of estimating the central subspace, leading to slightly reduced values.
* (III) We compare our proposed estimators with the ‘PLSI’ method by Xia and Härdle (2008) [36], which was also evaluated by Feng et al. (2013) [18], using the model they proposed:(27)
Here X,W and are independent and follow N(0,Ip), N(0,1) and N(0,1), respectively; and . To investigate the effect of different dimensions of , we consider two specific cases under Model III: Case III-A: The dimension of X is 6, and Case III-B: The dimension of X is 10 p = 10,
Since the model and method proposed by Feng et al. (2013) [18] differ significantly from those presented in our paper, we make the following adjustments for the comparison. In our model, we treat the term as , while the component is considered as a whole and represented by . Therefore, in the simulation study, we focus on comparing the performance of PLSI model for estimating and the whole part of . The sample size ranges from 500 to 2000, and the number of slices is fixed at K = 5. We report the squared trace correlation coefficient r2 between and its estimators.
As shown in Table 2, for estimating in the model, our method performs similarly to the PDEE-SIR method proposed by Feng et al. (2013) [18]. As shown in Figure 2, when treating as a whole, the resulting plot closely matches the one obtained by estimating first and then multiplying by W as done in Feng et al. (2013) [18], demonstrating the effectiveness of our estimation. Thus, for robustness, our method is recommendable, as it effectively handles the nonlinear component of our model, which poses identification challenges in the approach proposed by Feng et al. (2013) [18]. We also compared the computational efficiency of the two methods and found that our method is similar to the one proposed by Feng et al. (2013) [18], but slightly slower.
[Figure omitted. See PDF.]
a: case (III-A) with n = 1000. b: CaseIII-B with n = 1000.
[Figure omitted. See PDF.]
Application
The Medical Expenditure Panel Survey (MEPS) is a nationally representative survey of healthcare utilization and expenditures for the civilian non-institutionalized population in the United States [37]. This study analyzes a subset from the 2010 Full Year Consolidated Data File of the Household Component, which includes detailed information on health conditions, medical expenditures, insurance coverage, and demographic characteristics.
This article examines medical data from elderly households surveyed by MEPS in 2010, where all members were aged 65 or older. The dataset includes a total of 2139 individuals aged between 65 and 84 years. Chen et al. (2014) [7]analyzed this dataset using a partial prior marginal model and highlighted that medical expense data is often highly skewed to the right, with a large mean $9235 but a small median $3955.
Among the sample, 41.1% of males had hospitalization experience, with a mortality rate of 15.0%. Additionally, 79.5% of white individuals with hospitalization experience had a mortality rate of 1.4%. Key medical conditions such as cardiovascular diseases (82.7%), physical dyskinesia (76.6%), cancer (28.9%), diabetes (21.9%), and respiratory diseases (16.9%) were found to significantly influence medical expenses. Based on these findings, Chen et al. (2014) [7] selected these covariates for their analysis.
However, MEPS data contains a vast number of covariates, and selecting only a subset of them may not provide a comprehensive understanding of medical expenses. To address this, this study applies the partially nonlinear index model proposed earlier to analyze MEPS data. This model not only accommodates high-dimensional covariates but also captures the nonlinear effects of time-related covariates on medical expenses.
In addition to age, which is modeled as a continuous covariate, the following 14 variables are also included in the analysis:
* White Race (X1): X1 = 1 for white individuals, X1 = 0 for non-white individuals.
* Gender (X2): X2 = 1 for male, X2 = 0 for female.
* Mortality (X3): X3 = 1 if deceased, X3 = 0 otherwise.
* Hospitalization (X4): X4 = 1 if hospitalized, X4 = 0 otherwise.
* Cardiovascular Diseases (HBV Disease, X5): X5 = 1 if diagnosed, X5 = 0 otherwise.
* Respiratory Diseases (X6): X6 = 1 if diagnosed, X6 = 0 otherwise.
* Body Movement Disorders (X7): X7 = 1 if diagnosed, X7 = 0 otherwise.
* Cancer (X8): X8 = 1 if diagnosed, X8 = 0 otherwise.
* Diabetes (X9): X9 = 1 if diagnosed, X9 = 0 otherwise.
* Family Size (X10): X10 = 1 for households with at least two members, X10 = 0 for individuals living alone.
* Supplemental Insurance (X11): X11 = 1 if the individual has additional insurance apart from Medicare, X11 = 0 otherwise.
* Family Income Levels (X12): X12 = 1 for poor, X12 = 2 for near poor, X12 = 3 for low income, X12 = 4 for middle income, and X12 = 5 for high income.
* Medicare Coverage (X13): X13 = 1 if covered by Medicare, X13 = 0 otherwise.
* Years of Education (X14): X14 = 0 for no education, X14 = 1 to 8 for elementary education, X14 = 9 to 12 for high school education, X14 = 13 to 16 for college education, and X14 = 17 or higher for postgraduate education.
For convenience, in the process of partial sufficient reduction, we use the standardized variable X to obtain the basis direction , and then the true basis direction B is defined as , where is the covariance matrix of X. The estimated basis direction B, based on the partially sufficient dimension reduction with K = 5, is displayed in Table 2. The estimated structural dimension q = 2.
From Table 3, we can see that hospitalization is largely reflected in both central subspace directions, while heart and blood vessel disease, body movement disorder, having additional insurance, and having Medicare coverage are only reflected in the second direction. Hospitalization plays the most influential role in the first central subspace directions, and heart and blood vessel disease plays the most influential role in the second central subspace directions. The important factors or variables are correctly identified, and the results are consistent with those of Chen et al. (2016) [4].
[Figure omitted. See PDF.]
After getting the estimated structural dimension and the basis direction, a standard additive model can be established. The estimated nonlinear age effect in is shown in Figure 3. We can see that there is a small dip at first, then the age effect increases from 66 to 69. There are some fluctuations between 69 and 79, followed by a significant dip at 82. After reaching age 82, the medical cost exhibits a clearly increasing pattern. These nonlinear age effects reveal important age-specific variations in medical costs, offering insights into healthcare policy. The significant increase in costs from ages 66 to 69 indicates the necessity for targeted preventive care and chronic disease management to address early-onset risks in this demographic. The fluctuating trends between 69 and 79 emphasize the importance of personalized healthcare strategies to address diverse needs in this age group, ensuring resources are allocated effectively. Meanwhile, the steep increase after age 82 underscores the necessity of robust elderly care systems, including long-term care insurance and community-based support services, to manage the escalating financial burden on healthcare systems. These findings align with those of Chen et al. (2016) [4], further validating the proposed model’s capacity to uncover meaningful patterns in medical cost data.
[Figure omitted. See PDF.]
Concluding Remarks
This paper proposes a partially nonlinear index model (PNIM) that effectively addresses the challenges of high-dimensional covariates and nonlinear covariate effects. The partially sufficient dimension reduction framework estimates structural dimensions and basis directions, providing computational efficiency and robustness. Next, Using the fast instrumental variable pilot estimation, the nonparametric components were estimated within the standard additive model framework. We also established the asymptotic properties of the estimators. The simulation results and the analysis of the MEPS dataset demonstrated the model’s practical utility, particularly in identifying key factors that influence medical costs and their nonlinear relationships.
Our analysis of the MEPS data uncovered several significant findings that could inform policy decisions. Hospitalization was identified as the most influential factor in the first central subspace direction, highlighting its crucial role in determining medical costs. This indicates that policy interventions focused on reducing avoidable hospital admissions—such as enhancing outpatient care, preventive health programs, and post-discharge management—could greatly ease the financial burden on the healthcare system. The nonlinear age effects, as illustrated by the regression curve, highlight critical age-specific variations in medical costs. The significant rise in costs between ages 66 and 69 highlights the necessity for targeted chronic disease prevention programs for this demographic. In contrast, the swift increase in costs after age 82 necessitates enhanced long-term care policies, which include better access to long-term care insurance and community-based support systems for seniors. It also highlights the role of chronic conditions and supplemental insurance. Variables such as heart disease, blood vessel disease, and supplemental insurance coverage significantly influenced the second central subspace direction. These findings highlight the importance of policies that emphasize targeted prevention and management of chronic diseases while ensuring access to affordable supplemental insurance for at-risk populations. These results highlight the potential of PNIM as a strong analytical framework for understanding the complex interplay of factors influencing medical costs, offering actionable insights to guide evidence-based healthcare policy.
Future research will address some limitations of the current framework. One notable challenge is that medical cost data often exhibit inherent skewness (long tails to the right) and heteroscedasticity (variance that changes across different levels of covariates), which may not be fully captured by the current model’s emphasis on mean regression and presumption of homoscedasticity [7]. Skewness can lead to biased estimates when the mean is dominated by extreme values, while heteroscedasticity can affect the efficiency and consistency of parameter estimates. These characteristics are particularly prominent in medical cost data due to the high variability in healthcare utilization among different demographic and socioeconomic groups.
A promising direction involves extending the partially sufficient dimension reduction framework to account for these distributional properties. Specifically, adopting the MAVE approach [24] within the framework of quantile regression, as suggested by Kong and Xia (2014) [38], could allow for a more robust analysis. Quantile regression not only addresses skewness by focusing on conditional quantiles rather than the mean but also naturally accommodates heteroscedasticity by estimating covariate effects across the entire distribution of the response variable. This would enable a more comprehensive understanding of cost drivers across different percentiles, such as the lower quantiles (representing low-cost groups) and upper quantiles (representing high-cost outliers).
By analyzing the entire distribution of medical costs, this extension would provide deeper insights into the varying effects of covariates on healthcare expenditures. For example, certain predictors may have a greater impact on high-cost outliers than on average patients, which has significant implications for designing targeted healthcare policies. This approach would enhance the utility of the model in addressing the complexities of medical cost data and guide the development of evidence-based policies that account for variability across population subgroups.
Additionally, another limitation is that our current model assumes complete observations and does not explicitly address missing data issues, which are common in real-world medical datasets. Standard sufficient dimension reduction methods, including our proposed approach, generally require complete data, limiting their applicability in the presence of missing values. Future extensions could incorporate methods such as multiple imputations or inverse probability weighting to effectively handle missing data. Recent studies, such as Xue and Zhang (2020) [39], have developed empirical likelihood-based approaches for partially linear single-index models with missing observations, demonstrating how bias correction and imputation techniques can enhance model robustness. In the future, we can refer to Xue and Zhang’s (2020) [39] expansion of empirical likelihood in our model to better adapt to scenarios with missing data and enhance the model’s applicability in empirical medical research.
Acknowledgments
Special thanks to Professor Lei Liu from Washington University in St. Louis for providing the MEPS data.
References
1. 1. Chen Y, Ning J, Cai C. Regression analysis of longitudinal data with irregular and informative observation times. Biostatistics 2015;16(4):727–39. pmid:25813646
* View Article
* PubMed/NCBI
* Google Scholar
2. 2. Castelli C, Combescure C, Foucher Y, Daures J-P. Cost-effectiveness analysis in colorectal cancer using a semi-Markov model. Stat Med 2007;26(30):5557–71. pmid:18058847
* View Article
* PubMed/NCBI
* Google Scholar
3. 3. Liu L, Huang X, O’Quigley J. Analysis of longitudinal data in the presence of informative observational times and a dependent terminal event, with application to medical cost data. Biometrics 2008;64(3):950–8. pmid:18162110
* View Article
* PubMed/NCBI
* Google Scholar
4. 4. Chen J, Liu L, Shih Y-CT, Zhang D, Severini TA. A flexible model for correlated medical costs with application to medical expenditure panel survey data. Stat Med 2016;35(6):883–94. pmid:26403805
* View Article
* PubMed/NCBI
* Google Scholar
5. 5. Tian Y, Feng Y. Transfer learning under high-dimensional generalized linear models. J Am Stat Assoc 2023;118(544):2684–97. pmid:38562655
* View Article
* PubMed/NCBI
* Google Scholar
6. 6. Zhou X-H, Liang H. Semiparametric single-index two-part regression models. Comput Stat Data Anal 2006;50(5):1378–90. pmid:20191094
* View Article
* PubMed/NCBI
* Google Scholar
7. 7. Chen J, Kim I, Terrell GR, Liu L. Generalised partial linear single-index mixed models for repeated measures data. J Nonparametr Statist 2014;26(2):291–303.
* View Article
* Google Scholar
8. 8. Fan J, Yang Z, Yu M. Understanding implicit regularization in over-parameterized single index model. J Am Stat Assoc 2023;118(544):2315–28. pmid:38550788
* View Article
* PubMed/NCBI
* Google Scholar
9. 9. Wahba, G. Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics (SIAM); 1990.
10. 10. Liang H. Second-order asymptotic efficiency of PMLE in generalized linear models. Stat Probabil Lett 1995;24(3):271–9.
* View Article
* Google Scholar
11. 11. Song L, Zhao Y, Wang X. Sieve least squares estimation for partially nonlinear models. Stat Probabil Lett. 2010;80(17-18):1271–83.
12. 12. Xia Y, Tong H, Li WK, Zhu L-X. An adaptive estimation of dimension reduction space. J R Stat Soc Ser B Stat Methodol 2002;64(3):363–410.
* View Article
* Google Scholar
13. 13. Li K-C. Sliced inverse regression for dimension reduction. J Am Stat Assoc 1991;86(414):316–27.
* View Article
* Google Scholar
14. 14. Li B, Wang S. On directional regression for dimension reduction. J Am Stat Assoc. 2007;102(479):997–1008.
* View Article
* Google Scholar
15. 15. Asrir N, Mkhadri A. Sliced inverse regression via natural canonical thresholding. Commun Stat Theory Methods. 2025;54(7):2158–77.
* View Article
* Google Scholar
16. 16. Ma Y, Zhu L. A semiparametric approach to dimension reduction. J Am Stat Assoc 2012;107(497):168–79. pmid:23828688
* View Article
* PubMed/NCBI
* Google Scholar
17. 17. Chiaromonte F, Cook RD, Li B. Sufficient dimensions reduction in regressions with categorical predictors. Ann Statist. 2002;30(2):475–97.
* View Article
* Google Scholar
18. 18. Feng Z, Wen XM, Yu Z, Zhu L. On partial sufficient dimension reduction with applications to partially linear multi-index models. J Am Stat Assoc 2013;108(501):237–46.
* View Article
* Google Scholar
19. 19. Opsomer JD, Ruppert D. Fitting a bivariate additive model by local polynomial regression. Ann Statist 1997;25(1):186–211.
* View Article
* Google Scholar
20. 20. Fan J, Härdle W, Mammen E. Direct estimation of low-dimensional components in additive models. Ann Statist 1998;26(3):943–71.
* View Article
* Google Scholar
21. 21. Kim W, Linton OB, Hengartner NW. A computationally efficient oracle estimator for additive nonparametric regression with bootstrap confidence intervals. J Comput Graph Stat. 1999;8(2):278–97.
* View Article
* Google Scholar
22. 22. Cheng Y, De Gooijer JG, Zerom D. Efficient estimation of an additive quantile regression model. Scand J Stat 2011;38(1):46–62.
* View Article
* Google Scholar
23. 23. Chen J, Liu L, Zhang D, Shih Y-CT. A flexible model for the mean and variance functions, with application to medical cost data. Stat Med 2013;32(24):4306–18. pmid:23670952
* View Article
* PubMed/NCBI
* Google Scholar
24. 24. Xia Y. A multiple-index model and dimension reduction. J Am Stat Assoc 2008;103(484):1631–40.
* View Article
* Google Scholar
25. 25. Ma Y, Zhu L. A review on dimension reduction. Int Stat Rev 2013;81(1):134–50. pmid:23794782
* View Article
* PubMed/NCBI
* Google Scholar
26. 26. Härdle W. Applied nonparametric regression. Cambridge University Press; 1990.
27. 27. Härdle W, Mori Y, Vieu P, Härdle W, Liang H. Partially linear models. In: Statistical methods for biostatistics and related fields. Berlin, Heidelberg: Springer; 2007, pp. 87–103. https://doi.org/10.1007/978-3-540-32691-5_5
28. 28. Chen D, Hall P, Müller HG. Single and multiple index functional regression models with nonparametric link. Ann Statist. 2011;39(3):1720–47.
* View Article
* Google Scholar
29. 29. Cook RD. On the interpretation of regression plots. J Am Stat Assoc 1994;89(425):177–89.
* View Article
* Google Scholar
30. 30. Cook RD. Regression graphics: ideas for studying regressions through graphics John Wiley & Sons; 2009.
31. 31. Shao Y, Cook R, Weisberg S. Partial central subspace and sliced average variance estimation. J Stat Plann Inference 2009;139(3):952–61.
* View Article
* Google Scholar
32. 32. Bura E, Cook RD. Extending sliced inverse regression. J Am Stat Assoc 2001;96(455):996–1003.
* View Article
* Google Scholar
33. 33. Cai Z, Fan J. Average regression surface for dependent data. J Multivar Anal. 2000;75(1):112–42.
* View Article
* Google Scholar
34. 34. Linton OB, Chen R, Wang N, Härdle W. An analysis of transformation for additive nonparametric regression. J Am Stat Assoc 1997;92(440):1512–21.
* View Article
* Google Scholar
35. 35. Dong Y, Li B. Dimension reduction for non-elliptically distributed predictors: second-order methods. Biometrika 2010;97(2):279–94.
* View Article
* Google Scholar
36. 36. Xia Y, Härdle W. Semi-parametric estimation of partially linear single-index models. J Multivar Anal 2006;97(5):1162–84.
* View Article
* Google Scholar
37. 37. Aizcorbe A, Liebman E, Pack S, Cutler DM, Chernew ME, Rosen AB. Measuring health care costs of individuals with employer-sponsored health insurance in the U.S.: a comparison of survey and claims data. Stat J IAOS. 2012;28(1–2):43–51. https://doi.org/10.3233/SJI-2012-0743 pmid:26146526
38. 38. Kong E, Xia Y. An adaptive composite quantile approach to dimension reduction. Ann Statist 2014;42(4):1657–88.
* View Article
* Google Scholar
39. 39. Xue L, Zhang J. Empirical likelihood for partially linear single-index models with missing observations. Comput. Stat. Data Anal. 2020;144:106877.
* View Article
* Google Scholar
Citation: Zhao X, Xia Y, Xu X (2025) Sufficient dimension reduction on partially nonlinear index models with applications to medical costs analysis. PLoS One 20(5): e0321796. https://doi.org/10.1371/journal.pone.0321796
About the Authors:
Xiaobing Zhao
Roles: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Supervision
Affiliation: School of Data Sciences, Zhejiang University of Finance and Economics, Hangzhou, Zhejiang Province, China
Yufeng Xia
Roles: Writing – review & editing
Affiliation: School of Data Sciences, Zhejiang University of Finance and Economics, Hangzhou, Zhejiang Province, China
Xuan Xu
Roles: Funding acquisition, Investigation, Methodology, Software, Writing – original draft
E-mail: [email protected]
Affiliation: Asset and Laboratory Management Office, Zhejiang University of Finance and Economics, Hangzhou, Zhejiang Province, China
ORICD: https://orcid.org/0009-0006-4687-0409
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
1. Chen Y, Ning J, Cai C. Regression analysis of longitudinal data with irregular and informative observation times. Biostatistics 2015;16(4):727–39. pmid:25813646
2. Castelli C, Combescure C, Foucher Y, Daures J-P. Cost-effectiveness analysis in colorectal cancer using a semi-Markov model. Stat Med 2007;26(30):5557–71. pmid:18058847
3. Liu L, Huang X, O’Quigley J. Analysis of longitudinal data in the presence of informative observational times and a dependent terminal event, with application to medical cost data. Biometrics 2008;64(3):950–8. pmid:18162110
4. Chen J, Liu L, Shih Y-CT, Zhang D, Severini TA. A flexible model for correlated medical costs with application to medical expenditure panel survey data. Stat Med 2016;35(6):883–94. pmid:26403805
5. Tian Y, Feng Y. Transfer learning under high-dimensional generalized linear models. J Am Stat Assoc 2023;118(544):2684–97. pmid:38562655
6. Zhou X-H, Liang H. Semiparametric single-index two-part regression models. Comput Stat Data Anal 2006;50(5):1378–90. pmid:20191094
7. Chen J, Kim I, Terrell GR, Liu L. Generalised partial linear single-index mixed models for repeated measures data. J Nonparametr Statist 2014;26(2):291–303.
8. Fan J, Yang Z, Yu M. Understanding implicit regularization in over-parameterized single index model. J Am Stat Assoc 2023;118(544):2315–28. pmid:38550788
9. Wahba, G. Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics (SIAM); 1990.
10. Liang H. Second-order asymptotic efficiency of PMLE in generalized linear models. Stat Probabil Lett 1995;24(3):271–9.
11. Song L, Zhao Y, Wang X. Sieve least squares estimation for partially nonlinear models. Stat Probabil Lett. 2010;80(17-18):1271–83.
12. Xia Y, Tong H, Li WK, Zhu L-X. An adaptive estimation of dimension reduction space. J R Stat Soc Ser B Stat Methodol 2002;64(3):363–410.
13. Li K-C. Sliced inverse regression for dimension reduction. J Am Stat Assoc 1991;86(414):316–27.
14. Li B, Wang S. On directional regression for dimension reduction. J Am Stat Assoc. 2007;102(479):997–1008.
15. Asrir N, Mkhadri A. Sliced inverse regression via natural canonical thresholding. Commun Stat Theory Methods. 2025;54(7):2158–77.
16. Ma Y, Zhu L. A semiparametric approach to dimension reduction. J Am Stat Assoc 2012;107(497):168–79. pmid:23828688
17. Chiaromonte F, Cook RD, Li B. Sufficient dimensions reduction in regressions with categorical predictors. Ann Statist. 2002;30(2):475–97.
18. Feng Z, Wen XM, Yu Z, Zhu L. On partial sufficient dimension reduction with applications to partially linear multi-index models. J Am Stat Assoc 2013;108(501):237–46.
19. Opsomer JD, Ruppert D. Fitting a bivariate additive model by local polynomial regression. Ann Statist 1997;25(1):186–211.
20. Fan J, Härdle W, Mammen E. Direct estimation of low-dimensional components in additive models. Ann Statist 1998;26(3):943–71.
21. Kim W, Linton OB, Hengartner NW. A computationally efficient oracle estimator for additive nonparametric regression with bootstrap confidence intervals. J Comput Graph Stat. 1999;8(2):278–97.
22. Cheng Y, De Gooijer JG, Zerom D. Efficient estimation of an additive quantile regression model. Scand J Stat 2011;38(1):46–62.
23. Chen J, Liu L, Zhang D, Shih Y-CT. A flexible model for the mean and variance functions, with application to medical cost data. Stat Med 2013;32(24):4306–18. pmid:23670952
24. Xia Y. A multiple-index model and dimension reduction. J Am Stat Assoc 2008;103(484):1631–40.
25. Ma Y, Zhu L. A review on dimension reduction. Int Stat Rev 2013;81(1):134–50. pmid:23794782
26. Härdle W. Applied nonparametric regression. Cambridge University Press; 1990.
27. Härdle W, Mori Y, Vieu P, Härdle W, Liang H. Partially linear models. In: Statistical methods for biostatistics and related fields. Berlin, Heidelberg: Springer; 2007, pp. 87–103. https://doi.org/10.1007/978-3-540-32691-5_5
28. Chen D, Hall P, Müller HG. Single and multiple index functional regression models with nonparametric link. Ann Statist. 2011;39(3):1720–47.
29. Cook RD. On the interpretation of regression plots. J Am Stat Assoc 1994;89(425):177–89.
30. Cook RD. Regression graphics: ideas for studying regressions through graphics John Wiley & Sons; 2009.
31. Shao Y, Cook R, Weisberg S. Partial central subspace and sliced average variance estimation. J Stat Plann Inference 2009;139(3):952–61.
32. Bura E, Cook RD. Extending sliced inverse regression. J Am Stat Assoc 2001;96(455):996–1003.
33. Cai Z, Fan J. Average regression surface for dependent data. J Multivar Anal. 2000;75(1):112–42.
34. Linton OB, Chen R, Wang N, Härdle W. An analysis of transformation for additive nonparametric regression. J Am Stat Assoc 1997;92(440):1512–21.
35. Dong Y, Li B. Dimension reduction for non-elliptically distributed predictors: second-order methods. Biometrika 2010;97(2):279–94.
36. Xia Y, Härdle W. Semi-parametric estimation of partially linear single-index models. J Multivar Anal 2006;97(5):1162–84.
37. Aizcorbe A, Liebman E, Pack S, Cutler DM, Chernew ME, Rosen AB. Measuring health care costs of individuals with employer-sponsored health insurance in the U.S.: a comparison of survey and claims data. Stat J IAOS. 2012;28(1–2):43–51. https://doi.org/10.3233/SJI-2012-0743 pmid:26146526
38. Kong E, Xia Y. An adaptive composite quantile approach to dimension reduction. Ann Statist 2014;42(4):1657–88.
39. Xue L, Zhang J. Empirical likelihood for partially linear single-index models with missing observations. Comput. Stat. Data Anal. 2020;144:106877.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2025 Zhao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Modeling medical costs is a crucial task in health economics, especially when high-dimensional covariates and nonlinear effects are present. In this study, we propose a partially nonlinear index model (PNIM) that integrates partially sufficient dimension reduction with a rapid instrumental variable pilot estimation method. Through simulations, we demonstrate that the proposed model excels at capturing significant nonlinear relationships. When applying the model to the Medical Expenditure Panel Survey (MEPS) dataset, we identify important nonlinear age effects on medical costs and highlight key factors such as hospitalization, cardiovascular diseases, and supplemental insurance coverage. These findings provide valuable insights for healthcare policy, including targeted interventions for specific age groups and enhanced management of chronic conditions. Overall, the proposed method offers a flexible and computationally efficient framework for analyzing complex medical cost data, with broad applicability in health economics.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer