This research compared the efficiency of several adjusted missing value imputation methods in multiple regression analysis. The four imputation methods were the following: regression-ratio quartile1,3 (R-RQ1,3) imputation of Al-Omari, Jemain, and Ibrahim; adjusted regression-chain ratio quartile1,3 (AR-CRQ1,3) imputation of Kadilar and Cingi; adjusted regression-multivariate ratio quartile1,3 (AR-MRQ1,3) imputation of Feng, Ni, and Zou; and adjusted regression-multivariate chain ratio quartile1,3 (AR-MCRQ1,3) imputation of Lu, each under simple random sampling (SRS) and rank set sampling (RSS). The performance measures were mean square error (MSE) and mean absolute percentage error (MAPE). The study showed that, with SRS, the AR-MRQ1 method provided the minimum mean square error for a small error variance, whereas AR-MCRQ3 provided the minimum mean square error for a large error variance; considering all error variances, AR-MCRQ1 provided the minimum mean absolute percentage error. With RSS, the AR-MRQ1 method provided the minimum mean square error for a small error variance, whereas AR-MCRQ3 provided the minimum mean square error for medium and large error variances; regarding the mean absolute percentage error, AR-MRQ1 provided the minimum value for a small error variance, whereas AR-MCRQ1 provided the minimum value for medium and large error variances. For both SRS and RSS, AR-MCRQ1 was the best method for missing value imputation in multiple regression analysis, followed by AR-MCRQ3. Moreover, the RSS estimators provided smaller MSE and MAPE than the SRS estimators; therefore, the RSS estimators were more efficient than the SRS estimators.
1. Introduction
Multiple regression analysis is a study of the relationship between many independent variables and a dependent variable to determine which independent variables can estimate or explain the variation of the dependent variable. The regression model can be written in matrix form as Y = Xβ + ε. Estimation of the parameters often causes problems. Information or data can be incomplete, and significant data values may be missing, resulting from erratic data storage or transfer tools or limitations of the technology, leading to an inability to fully utilize the data [1]. To use multiple regression analysis effectively, the data must be complete. For this reason, some researchers may exclude the data of some variables and analyze only a smaller set of complete data, which may yield biased estimates or high standard errors rather than a meaningful result. The discarded data can differ significantly from the remaining data [2], making the estimation unreliable. Therefore, a reasonable estimation of missing values is fundamental.
Simple random sampling (SRS) is a sampling of a population of size N in which each possible sample of size n has the same probability of being selected; the sample obtained by such sampling is called a simple random sample [2]. Rank set sampling (RSS) was introduced by McIntyre (1952) for estimating mean pasture and forage yields as a more efficient and cost-effective method than the commonly used simple random sampling in situations where visual ordering of the sample units can be done quickly but exact measurement of the units is difficult and expensive [3]. Takahasi and Wakimoto (1968) provided the necessary mathematical theory of RSS [4]. Samawi and Muttlak (1996) studied using RSS to estimate the population ratio and suggested the RSS estimator of the population ratio [5]. The data sampled by SRS and RSS may be quite different, and the various proposed missing value imputation methods investigated in this study may give different results for the data obtained by the two sampling methods.
Three assumptions are discussed in the context of missing imputation, which also forms the basis of later simulations: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [6–8].
It is essential to understand why the data are missing. Graham et al. (2003) [7] described that missing data can informally be thought of as being caused by some combination of three ways: random processes, processes that are measured, and processes that are not measured. Modern missing data methods generally work well for the first two causes, but not for the last. More formally, missing data assumptions are commonly described as falling into one of three categories, defined by [9]. First, data can be “Missing Completely at Random”, or MCAR. When data are MCAR, missing cases are no different from non-missing cases in the analysis. Thus, these cases can be thought of as randomly missing from the data, and the only real penalty for failing to account for missing data is loss of power. Second, data can be “Missing at Random”, or MAR. In this case, missing data depend on known values and thus are described entirely by variables observed in the data set. Accounting for the values that “cause” the missing data will produce unbiased results in an analysis. Third, data can be missing in an unmeasured fashion, termed “nonignorable” (also called “Missing Not at Random” (MNAR) and “Not Missing at Random” (NMAR)). Since the missing data depend on events or items that the researcher has not measured, they cannot be adequately accounted for without measuring some related variables.
Missing data can frequently occur in a longitudinal data analysis, and many imputation methods have been proposed in the literature to handle this issue. Complete case (CC), mean substitution (MS), last observation carried forward (LOCF), and multiple imputation (MI) are the four most frequently used methods in practice. In a real-world data analysis, the missing data can be MCAR, MAR, or MNAR, depending on why the data were missing. In [10], simulations under various situations (including missing rates and slope sizes) were conducted to evaluate the performance of these four methods, using bias, RMSE, and 95% coverage probability as evaluation measures. The results showed that LOCF had the most significant bias and the poorest 95% coverage probability in most cases under both MAR and MCAR assumptions; hence, LOCF should not be used in a longitudinal data analysis. Under the MCAR assumption, the CC and MI methods performed equally well. Under the MAR assumption, MI had the smallest bias, smallest RMSE, and best 95% coverage probability. Therefore, the CC or MI method is appropriate under MCAR, whereas the MI method is a more reliable and better-grounded statistical method under MAR [10].
In 1996, rank set sampling (RSS), which was suggested by [3] and developed by [4], was used by [5] to examine the ratio estimator. It was proved that the RSS estimators were more efficient than the simple random sampling (SRS) estimators. Later, in 2003, the chain ratio (CR) estimator with SRS was examined [11]. The efficiency comparison used the mean squared error (MSE). It was proved that the CR estimator was more efficient than the traditional ratio estimator under certain conditions.
A 2006 study investigated missing value estimation in multiple regression analysis. Three methods were used to estimate the missing values of a dependent variable: regression, the EM algorithm, and pairwise deletion. The sample sizes were 20, 100, and 150; the missing value percentages were 10%, 30%, and 50%; and the error variances were 25, 50, 75, and 100. The evaluation measure was MSE. The study found that the regression method had the lowest MSE over all 500 iterations, outperforming the other methods in many situations [12].
In 2009, modified ratio estimators of the population mean of a variable of interest were suggested, involving the first or third quartiles of an auxiliary variable correlated with the variable of interest. The newly suggested estimators were investigated using SRS and RSS methods. The efficiency was compared in terms of MSE. These estimators were unbiased, and the RSS estimators were more efficient than SRS estimators for the same quartile with the same correlation coefficient and sample size. Also, when the two quartiles were compared, an estimator based on the regression-ratio quartile3 (R-RQ3) imputation of Al-Omari, Jemain, and Ibrahim was more efficient than an estimator based on the regression-ratio quartile1 (R-RQ1) imputation of Al-Omari, Jemain, and Ibrahim [13].
In 2012, imputation methods were proposed for the case where data on the dependent variable were missing in multiple linear regression analysis. The proposed methods of estimation were called the ratio-Q1 imputation method (RQ1), ratio-Q3 imputation method (RQ3), regression-ratio-Q1 imputation method (R-RQ1), and regression-ratio-Q3 imputation method (R-RQ3). The efficiency of the proposed methods was compared to the mean and regression imputation methods in various simulation situations. The evaluation measures were the estimated root mean square error (RMSE) and mean absolute percentage error (MAPE). In each tested situation, linear regression models with two independent variables were considered under the assumption that the error was normally distributed, and the missing values were missing at random. The results showed that the R-RQ1 imputation of Al-Omari, Jemain, and Ibrahim was more efficient in the following situations: (1) a small sample size (n = 20), a large percentage of missing values (20%), and a large value of variance (σ2 = 1.5, 2); (2) a medium sample size (n = 40, 60), a small percentage of missing values (10%), and a small value of variance (σ2 = 0.5, 1); (3) a medium sample size (n = 40, 60), a medium percentage of missing values (15%), and a medium value of variance (σ2 = 1, 1.5); and (4) a large sample size (n = 100), a medium percentage of missing values (15%), and a medium value of variance (σ2 = 1.5). The R-RQ3 imputation of Al-Omari, Jemain, and Ibrahim was efficient for the situation of a large sample size (n = 100), a medium percentage of missing values (15%), and a medium value of variance (σ2 = 1) [1].
In a later year, the study in [15], building on [14], investigated a chain ratio (CR) estimator and a regression estimator with a linear combination of two auxiliary variables that used the supplementary data to increase the accuracy of the estimator. That study proposed a multivariate chain ratio (MCR) estimator and a regression estimator that used a linear combination of two auxiliary variables and compared them with a traditional multivariate ratio (MR) estimator [15], a regression estimator that used the information of two auxiliary variables [15], and a chain ratio (CR) estimator that used one auxiliary variable [11]. The evaluation measure was MSE. It was found that the proposed MCR estimator and the proposed regression estimator that used a linear combination of two auxiliary variables were equally highly effective, followed by the traditional MR estimator and the regression estimator that used the data of the two auxiliary variables [15].
Multiple imputation, maximum likelihood, and fully Bayesian methods are the three most used model-based approaches to missing data problems. Although it is easy to show that the complete case analysis is unbiased and efficient when the assumption is missing at random (MAR), the methods above are still commonly used in practice in this setting. To examine the performance of and relationships between these three methods in this setting, [16] derived and investigated small-sample and asymptotic expressions of the estimates and standard errors and thoroughly examined how the estimates are related across the three approaches in the linear regression model when the assumption is MAR. They showed that, when the assumption is MAR in the linear model, the estimates of the regression coefficients using these three methods are asymptotically equivalent to the complete case estimates under general conditions. A simulation based on a real data set from a liver cancer clinical trial was conducted to compare the properties of these methods when the assumption was MAR [16].
In 2015, a paper proposed imputation estimators when there were missing observations on a dependent variable in multiple linear regression analysis. The proposed estimators were called ratio-Q1 (RQ1), ratio-Q3 (RQ3), regression-ratio-Q1 (R-RQ1) and regression-ratio-Q3 (R-RQ3). In various simulation situations, the efficiencies of the proposed estimators were compared to two existing methods, mean imputation and regression imputation. Mean absolute percentage error (MAPE) was the evaluation measure. For each situation, a multiple linear regression model with two independent variables was considered under the assumption that the error was normally distributed. Variances, sample sizes, and percentages of missing values were varied. Findings revealed that the MAPE from every estimator increased as either the percentage of missing values or the variance of error increased. Moreover, for all situations, R-RQ1, R-RQ3, and regression imputation performed better than the others [17].
In 2017, a study compared the ratio and regression estimators empirically based on bias and coefficient of variation. The author conducted simulation studies based on the sampling rate, population size, heterogeneity of the auxiliary variable X, deviation from linearity, and model misspecification. The study showed that the ratio estimator was better than the regression estimator when the regression line was close to the origin. Ratio and regression estimators worked even if there was a weak linear relationship between X and Y and if there was minimal model misspecification. When the relationship between the target and auxiliary variables was weak, bootstrap estimates yielded lower bias. The regression estimator was generally more efficient than the ratio estimator [18]. In regression analysis, missing covariate data has been a common problem. Many researchers use ad hoc methods to overcome this problem due to the ease of implementation. However, these methods require assumptions about the data that rarely hold in practice. Model-based methods such as maximum likelihood (ML) using the expectation maximization (EM) algorithm and multiple imputation (MI) are more promising when dealing with difficulties caused by missing data. Then again, inappropriate methods of missing value imputation can lead to serious bias that severely affects the parameter estimates. A simulation study was performed to assess the effects of different missing data techniques on the performance of a regression model. The covariate data were generated using an underlying multivariate normal distribution, and the dependent variable was generated using a combination of explanatory variables. Missing values in covariates were simulated under the assumption of missing at random (MAR). Four levels of missingness (10%, 20%, 30%, and 40%) were assessed. A linear regression model was fitted, and the model performance was measured by MSE and R-squared. The study showed that MI was superior in handling missing data, with the highest R-squared and the lowest MSE, when the percentage of missingness was less than 30%. Neither method could handle a level of missingness larger than 30% [19].
In 2021, Little reviewed assumptions about missing data mechanisms that underlie methods for the statistical analysis of data with missing values. Little explained Rubin’s original definition of missing at random (MAR), its motivation and criticisms, and his sufficient conditions for ignoring the missingness mechanism for likelihood-based, Bayesian, and frequentist inference. Related definitions, including missing completely at random (MCAR), missing at random (MAR), and partially MAR, were also covered. Little presented a formal argument for weakening Rubin’s sufficient conditions for frequentist maximum likelihood and for inference with precision based on the observed information. Some simple examples of MAR were described, with an example where the missingness mechanism can be ignored even though MAR does not hold. Alternative approaches to statistical inference based on the likelihood function were reviewed, along with non-likelihood frequentist approaches, including weighted generalized estimating equations. Connections with the causal inference literature were also discussed [20].
Missing data have been a common issue in many domains of study; if this issue is disregarded, an erroneous conclusion may ensue. In 2022, a study developed new imputation methods and compared the efficiency of eight imputation methods: hot deck imputation (HD), k-nearest neighbors imputation (KNN), stochastic regression imputation (SR), predictive mean matching imputation (PMM), random forest imputation (RF), stochastic regression random forest with equivalent weight imputation (SREW), k-nearest random forest with equivalent weight imputation (KREW), and k-nearest stochastic regression and random forest with equivalent weight imputation (KSREW). Simulations were run using various sample sizes (30, 60, 100, and 150) and missing percentages (10%, 20%, 30%, and 40%). Average mean square error (AMSE) was used to compare the efficiencies. The proposed composite approaches outperformed the single ones. On the other hand, increasing the number of components to a four-component method did not affect the imputation performance [21].
In 2022, Isabella Sayers et al. compared five missing value imputation methods in a multiple regression model: multiple regression, regression-ratio-Q1 (R-RQ1), regression-ratio-Q3 (R-RQ3), stochastic regression, and k-nearest stochastic regression with equivalent weighting, under missing completely at random (MCAR). The sample size, error variance, and missing value percentage were varied. The evaluation measure was MSE. The small sample sizes were 20 and 40; the medium sample sizes were 60 and 80; and the large sample sizes were 100 and 120. The percentages of missing values were 5, 10, 15, and 20. The missing values were constructed by a missing-at-random method. The variances of error used in that study were 1, 3, 5, 7, and 9. The research results showed that, for all percentages of missing values, all variances of error, and all sample sizes, the R-RQ1 method mostly had the maximum efficiency, followed by the multiple regression method [22].
In 2024, Song and Guo presented a fully informative multiple imputation (fiMI) method. It was based on a linear regression model with a missing response variable, utilizing all observable data to obtain estimates of the regression coefficients and, thereby, the predicted values of the missing response variable. This provided a good explanation of the relationship between the response variable and the respective explanatory variables and effectively enhanced the imputation accuracy of the response variable. The stability and sensitivity of the fiMI method were evaluated through a simulation study. Subsequently, the proposed fiMI method was applied to two real data sets: the admission prediction data set and the goalkeeper data set [23].
The studies above inspired the authors to study the use of a ratio estimator that utilizes the 1st and 3rd quartiles [13]. The estimator was used to estimate the missing values in multiple regression analysis with simple random sampling (SRS). An efficient missing value imputation method of this kind is the regression-ratio quartile1,3 imputation with SRS of Al-Omari, Jemain, and Ibrahim, which was shown to be highly efficient [1, 22]. The authors propose three new adjusted methods for comparison: the adjusted regression-chain ratio quartile1,3 imputation with SRS of Kadilar and Cingi from the research in [11]; the adjusted regression-multivariate ratio quartile1,3 imputation with SRS of Feng, Ni, and Zou from the research in [14], which provided high efficiency; and the adjusted regression-multivariate chain ratio quartile1,3 imputation with SRS of Lu from the research in [15], which provided the highest efficiency.
In addition, the research in [13] compared a ratio estimator utilizing the 1st and 3rd quartiles analyzed with rank set sampling (RSS) (referenced in [18]) versus simple random sampling. It was found that the efficiencies of the ratio estimators utilizing the 1st and 3rd quartiles analyzed with RSS were higher than those analyzed with SRS. To develop this further, the authors propose four new adjusted methods for estimating missing values in regression analysis: the adjusted regression-ratio quartile1,3 imputation with RSS of Al-Omari, Jemain, and Ibrahim from the proposed ratio estimator [13]; the adjusted regression-chain ratio quartile1,3 imputation with RSS of Kadilar and Cingi from the proposed chain ratio estimator [11]; the adjusted regression-multivariate ratio quartile1,3 imputation with RSS of Feng, Ni, and Zou from the proposed multivariate ratio estimator [14]; and the adjusted regression-multivariate chain ratio quartile1,3 imputation with RSS of Lu from the proposed multivariate chain ratio estimator [15]. The efficiency measures were mean square error (MSE) and mean absolute percentage error (MAPE). The optimal method was identified so that a proper missing value imputation method can be selected under different sample sizes, error variances, and percentages of missing values. Moreover, the research also compared the performance of the SRS and RSS estimators with respect to MSE and MAPE.
Our contributions to the field of statistics are tables suggesting the best missing value imputation method for multiple regression analysis based on the sampling method, sample size, missing value percentage, and error variance. In addition, a table suggests the better sampling method between the SRS and RSS estimators based on sample size and error variance.
2. Materials and methods
This research compared new adjusted missing value imputation methods in multiple regression with simple random sampling and rank set sampling methods. The measures for comparing missing value imputation were mean square error and mean absolute percentage error. The study comprised the following steps.
2.1 The population data of the independent variables (X1, X2) and the error (ϵ) were constructed with a size of 100,000 values. The data had a normal distribution with the probability density function,
f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)), −∞ < x < ∞, (1)
where expected value was E (X)=μ, and variance was Var (X)=σ2. The independent variable X1 had parameters μ=3,σ2=2.25. The independent variable X2 had parameters μ = 5, σ2=4, and the error had parameters μ=0,σ2=1,3,5,7,9 [1].
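As a minimal illustrative sketch (not the authors' original script), the population of Sect 2.1 can be generated in R as follows; the object names and the seed are placeholders chosen only for this example.

```r
# Sketch: generate the population of Sect 2.1 with the stated parameters.
set.seed(1)                                  # seed chosen only for reproducibility
N   <- 100000                                # population size
X1  <- rnorm(N, mean = 3, sd = sqrt(2.25))   # X1 ~ N(3, 2.25)
X2  <- rnorm(N, mean = 5, sd = sqrt(4))      # X2 ~ N(5, 4)
eps <- rnorm(N, mean = 0, sd = sqrt(1))      # error, variance 1 (also 3, 5, 7, 9)
```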
2.2 The above independent variables and the error were sampled with simple random sampling (SRS) and rank set sampling (RSS) methods for small sample sizes of 20 and 40; middle sample sizes of 60 and 80; large sample sizes of 100 and 120; and very large sample sizes of 200 and 500. Each sampling was repeated 1,000 times [1].
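Continuing the sketch above, one simple random sample of size n can be drawn from that population as follows; in the study each such draw was repeated 1,000 times (an RSS sketch is given in Sect 2.10.5.2).

```r
# Sketch: one SRS of size n drawn without replacement from the population.
n   <- 60                                    # e.g., a middle sample size
idx <- sample(N, n)                          # SRS without replacement
x1  <- X1[idx]; x2 <- X2[idx]; e <- eps[idx]
```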
2.3 Multicollinearity between the independent variables was tested using the Pearson correlation coefficient to check whether the independent variables were correlated. If the correlation coefficient was greater than or equal to 0.6, the sampling in Sect 2.2 was carried out again. The multicollinearity was tested, and the variables were resampled repeatedly, until the correlation coefficient was less than 0.6, indicating that the independent variables were not correlated [24, 25]. The construction of the dependent variable (Y) is described in Sect 2.5. Multicollinearity can be tested based on the VIF or the Pearson correlation coefficient; in this research, we used the Pearson correlation coefficient because it can be computed from the values of the independent variables alone, without needing to determine the dependent variable's value first.
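A minimal sketch of the resampling rule in Sect 2.3, continuing the running example: the draw is repeated until the Pearson correlation between the sampled X1 and X2 falls below 0.6.

```r
# Sketch: resample until the sampled X1 and X2 are only weakly correlated.
repeat {
  idx <- sample(N, n)
  x1  <- X1[idx]; x2 <- X2[idx]; e <- eps[idx]
  if (abs(cor(x1, x2)) < 0.6) break          # accept only weakly correlated draws
}
```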
2.4 The parameter of the y-intercept was set to β0=0.5, and the regression coefficients were set to β1=1 and β2=−0.3 to create the dependent variable in Sect 2.5.
2.5 The dependent variable Y was linearly related to the independent variables X1 and X2. It was constructed from the y-intercept β0, the regression coefficients β1 and β2, and the error ϵ using the following linear relationship model,
Yi = β0 + β1X1i + β2X2i + ϵi, (2)
where Yi was the dependent variable; X1i and X2i were the 1st and 2nd independent variables; β0 was the y-intercept, or the value of Y when X1i = 0 and X2i = 0; β1 and β2 were the regression coefficients, or slopes of the straight line; and ϵi was the error with μ = 0 and σ2 = 1, 3, 5, 7, and 9 [1].
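Continuing the sketch, the dependent variable of Eq (2) can be constructed with the parameter values of Sect 2.4 as follows.

```r
# Sketch: construct Y from the linear model of Eq (2) with the stated parameters.
b0 <- 0.5; b1 <- 1; b2 <- -0.3
y  <- b0 + b1 * x1 + b2 * x2 + e
```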
2.6 The independent variable X2 had missing values at percentages of 5, 10, 15, 20, 30, and 40%, missing completely at random (MCAR) [6, 9].
2.7 The number of missing values was calculated, and the positions of missing values were randomized.
2.7.1 The number of missing values of the independent variable X2 was calculated as follows:
Number of missing values = Sample size × Missing value percentage
2.7.2 The positions of the missing values of the independent variable X2 were randomized, so the values were missing completely at random (MCAR); MCAR was assumed in this simulation study. The missing values in X2 were not related to X1; in the simulation, the two variables were simulated independently of each other. After the positions of the missing values in X2 were randomized, the missing values were replaced with the series mean of the entire data set in X2 (Sect 2.8). Missing value percentages of 5, 10, 15, 20, 30, and 40% are shown in Table 1.
[Figure omitted. See PDF.]
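A minimal sketch of the MCAR mechanism of Sects 2.6 and 2.7, continuing the running example: a fixed percentage of X2 values is set to missing at randomized positions.

```r
# Sketch: make a given percentage of X2 missing completely at random (MCAR).
miss_pct <- 0.15                             # e.g., 15% missing
n_miss   <- round(n * miss_pct)              # number of missing values (Sect 2.7.1)
pos      <- sample(n, n_miss)                # randomized missing positions (Sect 2.7.2)
x2_miss  <- x2
x2_miss[pos] <- NA
```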
There were eight sample sizes, six proportions of missing values, and eight missing value estimation methods, giving quite a broad scope for multiple regression analysis with two covariates under MCAR. There were 8 × 6 × 8 = 384 scenarios for each error variance and sampling method.
2.8 The missing values of the independent variable X2 were calculated using the series mean, which replaced the missing values with the mean of the entire data set in X2 as follows [26]:
x̄ = (1/k) ∑ Xi, i = 1, …, k, (3)
where Xi was a data value among the available values; k was the number of available values; and x̄ was the estimate of the missing value predicted from the mean of the available values.
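Continuing the sketch, the series-mean imputation of Eq (3) can be written as follows.

```r
# Sketch: series-mean imputation of Eq (3) - replace every missing X2 value
# with the mean of the available X2 values.
x2_new <- x2_miss
x2_new[is.na(x2_new)] <- mean(x2_miss, na.rm = TRUE)
```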
2.9 In Sect 2.8, the estimates of the missing values of the independent variable X2 were placed in the positions of the missing values.
2.10 The regression-ratio imputations then replaced the missing values in all eight methods as follows:
2.10.1 Regression-ratio quartile1,3 imputation with SRS of Al-Omari, Jemain, and Ibrahim (R-RQ1,3SRS)
Regression-ratio quartile1 imputation (R-RQ1) is a ratio imputation estimation method in which the regression estimate replaces ȳ. This method takes advantage of the 1st quartile of variables X1 and X2 [1]. Regression-ratio quartile3 imputation (R-RQ3) is described in the same way.
2.10.1.1 Multiple regression imputation mechanisms of MCAR
Regression-ratio quartile1,3 imputation with SRS of Al-Omari, Jemain, and Ibrahim relies on imputing the multiple regression estimate in place of ȳ in the missing value estimator presented in Sect 2.10.1.4, as follows:
1) The data X2,r+1, ..., X2n, the missing values of the independent variable, were replaced with X2,r+1,new, ..., X2n,new. The estimate of the dependent variable was then calculated by the missing value estimation with multiple regression analysis. 2) The independent variable X1i with complete information and the independent variable X2i,new with the replaced missing values provided the estimate of the dependent variable as follows: ŷi = β̂0 + β̂1X1i + β̂2X2i,new.
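Continuing the sketch, the multiple regression imputation step described above can be written as follows; the fitted values then stand in for ȳ in the ratio-quartile estimators.

```r
# Sketch: fit the multiple regression of Y on X1 and the mean-imputed X2,
# then take the fitted values as the regression estimates of Y (Sect 2.10.1.1).
fit   <- lm(y ~ x1 + x2_new)
y_hat <- fitted(fit)                         # regression estimates used in place of y-bar
```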
2.10.1.2 Ratio estimator with simple random sampling
The estimators (10) and (11) that we used in this study were ultimately derived from the ratio estimator with simple random sampling (R-SRS). Here, we show the derivation steps from the R-SRS equation to (10) and (11) in the subsections below. Simple random sampling is a sampling of a population of size N in which each possible sample of size n has the same probability of being selected; the sample obtained by such sampling is called a simple random sample [2].
The ratio estimator is a method for estimating μY based on the relationship between the auxiliary variable X (through its sample mean) and the dependent variable Y (through its sample mean ȳ), under the condition that the population mean μX is a known parameter. We defined it under simple random sampling as follows [13]:
ŷR = (ȳ / x̄2) μX, (4)
where ŷR was the ratio estimate, and x̄2 and ȳ were the sample means of the independent variable X2 and the dependent variable Y, respectively.
2.10.1.3 Ratio estimator using quartile1,3 (RQ1,3)
Al-Omari et al. (2009) proposed the following ratio estimator [13],
(5)
In addition, they also proposed a ratio estimator for estimating the population mean of the dependent variable by taking advantage of the 1st and 3rd quartiles of the independent variables X1 and X2. The calculation formula is the following,
ŷq1 = ȳ (μX + q1) / (x̄2 + q1), (6)
and
ŷq3 = ȳ (μX + q3) / (x̄2 + q3), (7)
where ŷq1 was an estimate of the mean using quartile1 and ŷq3 was an estimate of the mean using quartile3; x̄2 and ȳ were the sample means of the independent variable X2 and the dependent variable Y, respectively; and q1 and q3 were quartiles 1 and 3 of the population of the independent variable X2, respectively.
2.10.1.4 Missing value estimator
In (6) and (7), Jomprapan (2012) [1] substituted a sample mean for μX since it is difficult to obtain the population mean. Therefore, the sample mean x̄2 of the complete data was used instead of the population mean μX, and the sample mean x̄2r of the data without the missing values was used instead of the sample mean x̄2. Two estimators of missing values were obtained as (8) and (9), which are called ratio quartile1 imputation with simple random sampling (RQ1SRS) and ratio quartile3 imputation with simple random sampling (RQ3SRS),
ŷRQ1 = ȳ (x̄2 + q1) / (x̄2r + q1), (8)
and
ŷRQ3 = ȳ (x̄2 + q3) / (x̄2r + q3), (9)
where ŷRQ1 was the ratio quartile1 estimate; ŷRQ3 was the ratio quartile3 estimate; x̄2 was the sample mean of the independent variable X2 when the data were complete; x̄2r was the sample mean of the independent variable X2 without accounting for the missing values; ȳ was the mean of the complete dependent variable Y; and q1, q3 were quartiles 1 and 3 of the sample of the random variable X2, respectively [1, 17].
However, using only one estimator (ȳ) for all missing values may not be appropriate because using a constant for all missing values causes the estimator’s variance to be underestimated. Therefore, she took the estimate ŷi from the multiple regression analysis and substituted it for ȳ in (8) and (9), resulting in (10) and (11), which are called regression-ratio quartile1 imputation with a simple random sampling of Al-Omari, Jemain, and Ibrahim (R-RQ1SRS) and regression-ratio quartile3 imputation with a simple random sampling of Al-Omari, Jemain, and Ibrahim (R-RQ3SRS),
ŷR-RQ1,i = ŷi (x̄2 + q1) / (x̄2r + q1), (10)
and
ŷR-RQ3,i = ŷi (x̄2 + q3) / (x̄2r + q3), (11)
where ŷR-RQ1,i was the regression-ratio quartile1 imputation estimate with a simple random sampling; ŷR-RQ3,i was the regression-ratio quartile3 imputation estimate with a simple random sampling; and ŷi was the value of the dependent variable Y estimated from the multiple regression analysis.
The estimates ŷR-RQ1,i and ŷR-RQ3,i were calculated from (10) and (11), respectively, and were substituted at the same positions as the missing values of the independent variable X2 for the estimated dependent variable in Sect 2.10.1.1.
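For illustration only, and assuming the reconstructed forms of Eqs (8)-(11) above, the quantities they use and the R-RQ1 imputation of Eq (10) could be computed as in this sketch, which continues the running example.

```r
# Sketch: ingredients of Eqs (8)-(11) and the R-RQ1 imputation of Eq (10),
# applied at the missing positions of X2 (see the MCAR sketch above).
q1 <- quantile(x2_miss, 0.25, na.rm = TRUE)   # 1st sample quartile of X2
q3 <- quantile(x2_miss, 0.75, na.rm = TRUE)   # 3rd sample quartile of X2
x2_bar_complete <- mean(x2)                   # mean of X2 when the data are complete
x2_bar_obs      <- mean(x2_miss, na.rm = TRUE)# mean of X2 ignoring the missing values
y_rrq1 <- y_hat                               # start from the regression estimates
y_rrq1[pos] <- y_hat[pos] * (x2_bar_complete + q1) / (x2_bar_obs + q1)  # Eq (10)
```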
The authors propose three adaptation methods based on the ideas of [11, 14, 15], taking advantage of the 1st and 3rd quartiles and using the estimate ŷ obtained by imputation with multiple regression analysis to replace ȳ under simple random sampling, as follows in Sects 2.10.2–2.10.4.
2.10.2 Adjusted regression-chain ratio quartile1,3 imputation with SRS of Kadilar and Cingi (AR-CRQ1,3SRS)
Kadilar and Cingi (2003) proposed the following chain ratio estimator [11],
(12)
In addition to the two methods proposed in Sect 2.10.1.4 by [1], the authors propose another technique, a new adaptation of the chain ratio method [11], for estimating missing values that takes advantage of the 1st and 3rd quartiles of variable X2. The technique is given by the following equations,
(13)
and
(14)
The authors also propose using ŷ to replace ȳ in (13) and (14). Hence, μX was substituted by the sample mean of the complete data since obtaining the population mean was difficult. Also, the sample mean of the data without the missing values was used instead of the sample mean. Two estimators of missing values were obtained as (15) and (16). These are called the adjusted regression-chain ratio quartile1 imputation with a simple random sampling of Kadilar and Cingi (AR-CRQ1SRS) and the adjusted regression-chain ratio quartile3 imputation with a simple random sampling of Kadilar and Cingi (AR-CRQ3SRS),
(15)
and
(16)
where the two expressions were the adjusted regression-chain ratio quartile1 and quartile3 imputation estimates with a simple random sampling, respectively, and α was any constant value. In this case, we set α = 0.10, 0.30, 0.50, 0.70, 0.90. The experiment showed that α = 0.90 provided the best estimation performance.
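A hedged sketch of how the constant α could be selected by grid search, continuing the running example; impute_ar_crq1 is a hypothetical placeholder for the AR-CRQ1 imputation of Eq (15), which is not reproduced here.

```r
# Sketch: grid search over alpha, selecting by the mean square error of the fit.
alphas <- c(0.10, 0.30, 0.50, 0.70, 0.90)
mse_by_alpha <- sapply(alphas, function(a) {
  x2_imp <- impute_ar_crq1(x2_miss, y_hat, alpha = a)  # hypothetical imputation step (Eq 15)
  fit_a  <- lm(y ~ x1 + x2_imp)
  mean((y - fitted(fit_a))^2)
})
best_alpha <- alphas[which.min(mse_by_alpha)]
```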
2.10.3 Adjusted regression-multivariate ratio quartile1,3 imputation with SRS of Feng, Ni, and Zou (AR-MRQ1,3SRS)
Feng et al. (1998) proposed a multivariate ratio estimator [14],
(17)
The authors propose a new adaptation of the multivariate ratio method of [14]. The adapted methods estimated missing values by taking advantage of the 1st and 3rd quartiles of variables X1 and X2,
(18)
and
(19)
In addition, the authors also propose using ŷ to replace ȳ in (18) and (19) and replacing μX with the sample mean. Also, the sample mean of the data without the missing values was used instead of the sample mean, resulting in two missing value estimators defined by (20) and (21). These are called the adjusted regression-multivariate ratio quartile1 imputation with a simple random sampling of Feng, Ni, and Zou (AR-MRQ1SRS) and the adjusted regression-multivariate ratio quartile3 imputation with a simple random sampling of Feng, Ni, and Zou (AR-MRQ3SRS),
(20)
and
(21)
where the two expressions were the adjusted regression-multivariate ratio quartile1 and quartile3 imputation estimates with a simple random sampling, respectively, and ϵ1, ϵ2 were the weighted values, with ϵ1 + ϵ2 = 1. In this case, we set ϵ1 = 0.10, 0.30, 0.50, 0.70, 0.90 and ϵ2 = 0.10, 0.30, 0.50, 0.70, 0.90. By trial and error, ϵ1 = 0.90 and ϵ2 = 0.10 were found to give the best estimation performance.
2.10.4 Adjusted regression-multivariate chain ratio quartile1,3 imputation with SRS of Lu (AR-MCRQ1,3SRS)
Lu (2013) proposed a multivariate chain ratio estimator [15],
(22)
The authors propose a new adaptation of the multivariate chain ratio method [15]. This adaptation estimated the missing values by taking advantage of the 1st and 3rd quartiles of variables X1 and X2,
(23)
and
(24)
In addition, the authors also propose using ŷ to replace ȳ in (23) and (24), resulting in two missing value estimators defined by (25) and (26). These are called the adjusted regression-multivariate chain ratio quartile1 imputation with a simple random sampling of Lu (AR-MCRQ1SRS) and the adjusted regression-multivariate chain ratio quartile3 imputation with a simple random sampling of Lu (AR-MCRQ3SRS),
(25)
and
(26)
where the two expressions were the adjusted regression-multivariate chain ratio quartile1 and quartile3 imputation estimates with a simple random sampling, respectively; α was any constant value; and ω1, ω2 were the weighted values, with ω1 + ω2 = 1. We set α = 0.10, 0.30, 0.50, 0.70, 0.90; ω1 = 0.10, 0.30, 0.50, 0.70, 0.90; and ω2 = 0.10, 0.30, 0.50, 0.70, 0.90. The experiment showed that α = 0.10, ω1 = 0.10, and ω2 = 0.90 gave the best estimation performance.
According to the research of [13], a ratio estimator was proposed by taking advantage of the 1st and 3rd quartiles with a rank set sampling (RSS) (referenced in [18]) and was compared with a simple random sampling (SRS). It was found that the ratio estimator using the 1st and 3rd quartiles with RSS had higher efficiency than with SRS. Therefore, the authors adopted this concept and propose four additional adjusted methods for estimating missing values in regression analysis, based on [11, 13–15], by utilizing the 1st and 3rd quartiles and using ŷ, obtained by imputation with the regression method, to replace ȳ under RSS, as follows:
2.10.5 Adjusted regression-ratio quartile1,3 imputation with RSS of Al-Omari, Jemain, and Ibrahim (AR-RQ1,3RSS)
2.10.5.1 Multiple regression imputation
This imputation was similar to Sect 2.10.1.1: regression-ratio quartile1,3 imputation with RSS of Al-Omari, Jemain, and Ibrahim relies on imputing the multiple regression estimate in place of ȳ in the missing value estimator presented in Sect 2.10.5.3.
2.10.5.2 Ratio estimator with a rank set sampling (R-RSS)
A rank set sampling method followed the steps below [13, 18]; an illustrative sketch in R is given after the list.
1) The first data set of size n units was randomized. Then, the smallest value of the independent variable X, which was correlated to the value of the dependent variable Y, was selected.
2) The second data set of size n units was randomized. Then, the second smallest value of the independent variable, which was correlated to the value of the dependent variable, was selected, and so on.
3) The nth data set of size n units was randomized. Then, the nth smallest (i.e., the largest) value of the independent variable, which was correlated to the value of the dependent variable, was selected.
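A minimal sketch of the three RSS steps above, assuming sampling from the simulated population of Sect 2.1; the function name is a placeholder, and ranking on X2 is an illustrative choice of auxiliary variable.

```r
# Sketch of the RSS steps: for the i-th of n randomized sets of size n,
# keep the unit whose auxiliary value X has rank i.
rss_indices <- function(X_pop, n) {
  keep <- integer(n)
  for (i in seq_len(n)) {
    set_idx <- sample(length(X_pop), n)        # i-th randomized set of n units
    ranked  <- set_idx[order(X_pop[set_idx])]  # order the set by the auxiliary variable
    keep[i] <- ranked[i]                       # keep the unit with rank i
  }
  keep
}
idx_rss <- rss_indices(X2, n)                  # rank on X2 as the auxiliary variable
x1_rss  <- X1[idx_rss]; x2_rss <- X2[idx_rss]; e_rss <- eps[idx_rss]
```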
The selected data sets were used to estimate the missing values. A rank set sampling estimator of μY was based on the relationship between the auxiliary variable X and the dependent variable Y, assuming that the population mean μX was a known parameter. We could define it under rank set sampling as follows [13],
(27)
where was a ratio estimate; and were the sample mean of the independent variable X2 and the dependent variable Y, respectively.
2.10.5.3 Proposed missing value estimator
Al-Omari et al. (2009) proposed the following ratio estimator [13],
(28)
In addition, they also proposed a ratio estimator to estimate the population mean of the dependent variable by taking advantage of the 1st and 3rd quartiles of the variables X1 and X2. The calculation formula is the following,
(29)
and
(30)
where the two expressions were estimates of the mean using quartile1 and quartile3, respectively; the sample means were those of the independent variable X2 and the dependent variable Y under rank set sampling, respectively; and q1 and q3 were quartiles 1 and 3 of the population of the independent variable X2, respectively.
The authors propose a new adjusted ratio method of [13] using ŷ, obtained by imputation with the multiple regression method, to replace ȳ in (29) and (30). The population mean (μX) was substituted by the sample mean since obtaining the population mean was difficult, and the sample mean of the data without the missing values was substituted for the sample mean, resulting in two missing value estimators defined in (31) and (32) below. We called these estimators the adjusted regression-ratio quartile1 imputation with a rank set sampling of Al-Omari, Jemain, and Ibrahim (AR-RQ1RSS) and the adjusted regression-ratio quartile3 imputation with a rank set sampling of Al-Omari, Jemain, and Ibrahim (AR-RQ3RSS),
(31)
and
(32)
where the two expressions were the adjusted regression-ratio quartile1 and quartile3 imputation estimates with a rank set sampling, respectively, and ŷ was the value of the dependent variable Y estimated from the multiple regression analysis.
2.10.6 Adjusted regression-chain ratio quartile1,3 imputation with RSS of Kadilar and Cingi (AR-CRQ1,3RSS)
Kadilar and Cingi (2003) proposed the following chain ratio estimator [11],
(33)
The authors propose a new adaptation of the chain ratio method of [11] for estimating missing values that takes advantage of the 1st and 3rd quartiles of variables X1 and X2. The technique is given by the following equations,
(34)
and
(35)
In addition, the authors also propose using ŷ to replace ȳ in (34) and (35). Hence, the sample mean substituted μX since it was difficult to obtain the population mean. Also, the sample mean of the data without the missing values was used instead of the sample mean. Two estimators of missing values were obtained as (36) and (37). These are called the adjusted regression-chain ratio quartile1 imputation with a rank set sampling of Kadilar and Cingi (AR-CRQ1RSS) and the adjusted regression-chain ratio quartile3 imputation with a rank set sampling of Kadilar and Cingi (AR-CRQ3RSS),
(36)
and
(37)
where the two expressions were the adjusted regression-chain ratio quartile1 and quartile3 imputation estimates with a rank set sampling, respectively, and α was any constant value. In this case, we set α = 0.10, 0.30, 0.50, 0.70, 0.90. The experiment showed that α = 0.90 provided the best estimation performance.
2.10.7 Adjusted regression-multivariate ratio quartile1,3 imputation with RSS of Feng, Ni, and Zou (AR-MRQ1,3RSS)
Feng et al. (1998) proposed a multivariate ratio estimator [14],
(38)
The authors propose a new adaptation of the multivariate ratio method [14]. The adapted methods estimated missing values by taking advantage of the 1st and 3rd quartiles of variables X1 and X2,
(39)
and
(40)
In addition, the authors also propose using ŷ to replace ȳ in (39) and (40) and replacing μX with the sample mean. Also, the sample mean of the data without the missing values was used instead of the sample mean, resulting in two missing value estimators defined by (41) and (42). These are called the adjusted regression-multivariate ratio quartile1 imputation with a rank set sampling of Feng, Ni, and Zou (AR-MRQ1RSS) and the adjusted regression-multivariate ratio quartile3 imputation with a rank set sampling of Feng, Ni, and Zou (AR-MRQ3RSS),
(41)
and
(42)
where the two expressions were the adjusted regression-multivariate ratio quartile1 and quartile3 imputation estimates with a rank set sampling, respectively, and ϵ1, ϵ2 were the weighted values, with ϵ1 + ϵ2 = 1. In this case, we set ϵ1 = 0.10, 0.30, 0.50, 0.70, 0.90 and ϵ2 = 0.10, 0.30, 0.50, 0.70, 0.90. By trial and error, ϵ1 = 0.90 and ϵ2 = 0.10 were found to give the best estimation performance.
2.10.8 Adjusted regression-multivariate chain ratio quartile1,3 imputation with RSS of Lu (AR-MCRQ1,3RSS)
Lu (2013) proposed a multivariate chain ratio estimator [15],
(43)
The authors propose a new adaptation of the multivariate chain ratio method [15]. This adaptation estimated the missing values by taking advantage of the 1st and 3rd quartiles of variables X1 and X2,
(44)
and
(45)
In addition, the authors also propose using ŷ to replace ȳ in (44) and (45), resulting in two missing value estimators defined by (46) and (47). These are called the adjusted regression-multivariate chain ratio quartile1 imputation with a rank set sampling of Lu (AR-MCRQ1RSS) and the adjusted regression-multivariate chain ratio quartile3 imputation with a rank set sampling of Lu (AR-MCRQ3RSS),
(46)
and
(47)
where the two expressions were the adjusted regression-multivariate chain ratio quartile1 and quartile3 imputation estimates with a rank set sampling, respectively; α was any constant value; and ω1, ω2 were the weighted values, with ω1 + ω2 = 1. We set α = 0.10, 0.30, 0.50, 0.70, 0.90; ω1 = 0.10, 0.30, 0.50, 0.70, 0.90; and ω2 = 0.10, 0.30, 0.50, 0.70, 0.90. The experiment showed that α = 0.10, ω1 = 0.10, and ω2 = 0.90 gave the best estimation performance.
2.11 The true value of the dependent variable (y) generated in Sect 2.5 and the estimated value (or predicted value) of the dependent variable (ŷ) obtained in Sect 2.10 were used to calculate the mean squared error and the mean absolute percentage error.
2.12 Comparison of the eight regression-ratio imputation methods was based on mean squared error and mean absolute percentage error.
2.12.1 Mean Square Error (MSE) [1, 27–29].
MSE = (1/n) ∑ (yi − ŷi)², i = 1, …, n, (48)
where yi was the actual value of the i-th dependent variable; ŷi was the i-th predicted value of the dependent variable; and MSE was the mean square error of the predicted values of the dependent variable.
2.12.2 Mean Absolute Percentage Error (MAPE) [1, 17].
MAPE = (100%/n) ∑ |(yi − ŷi) / yi|, i = 1, …, n, (49)
where MAPE was the mean absolute percentage error of the predicted value of the dependent variable.
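Continuing the running example, the two evaluation measures of Eqs (48) and (49) can be computed as follows for any vector of predicted values.

```r
# Sketch: evaluation measures of Eqs (48) and (49) for predicted values y_pred
# (for example, the regression estimates y_hat from the earlier sketch).
y_pred <- y_hat
mse  <- mean((y - y_pred)^2)                  # mean square error, Eq (48)
mape <- mean(abs((y - y_pred) / y)) * 100     # mean absolute percentage error (%), Eq (49)
```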
2.13 R Studio Version 4.2.1 program was used to simulate each experimental scenario. For each sampling method, the simulation was repeated for 1,000 cycles [1, 17].
2.14 The actual data collected for multiple regression analysis are the following.
Two precious metal loss data sets, gold and platinum, were collected. The loss occurred in the production process of a jewelry company. The three independent variables were the total weight of the specimen (X1), the total output weight of the specimen (X2), and the recovery value (X3). The dependent variable was the loss of precious metals (Y). The sample size used was 12 sample units.
3. Results
3.1. Results of a study on simulated data
Mean square errors for the various simulation scenarios (error variances of 1, 3, 5, 7, and 9; sample sizes of 20, 40, 60, 80, 100, 120, 200, and 500; and missing value percentages of 5, 10, 15, 20, 30, and 40%) with SRS for the missing value estimation methods are shown in Tables 2, 3, 4, 5, and 6.
[Tables omitted. See PDF.]
As shown in Tables 2 and 3, for error variances of 1 and 3 and for most sample sizes and missing value percentages, the AR-MRQ1 method achieved the minimum mean square error, followed by AR-MRQ3. On the other hand, as can be observed in Tables 4, 5, and 6, for error variances of 5, 7, and 9 and for every sample size and missing value percentage, the AR-MCRQ3 method achieved the minimum mean square error, followed by AR-MCRQ1.
Mean absolute percentage errors for the various simulation scenarios (error variances of 1, 3, 5, 7, and 9; sample sizes of 20, 40, 60, 80, 100, 120, 200, and 500; and missing value percentages of 5, 10, 15, 20, 30, and 40%) with SRS for the missing value estimation methods are shown in Tables 7, 8, 9, 10, and 11. As shown in Tables 7, 8, 9, 10, and 11, for error variances of 1, 3, 5, 7, and 9 and for every sample size and missing value percentage, the AR-MCRQ1 method achieved the minimum mean absolute percentage error, followed by AR-MCRQ3.
Mean square errors for the various simulation scenarios (error variances of 1, 3, 5, 7, and 9; sample sizes of 20, 40, 60, 80, 100, 120, 200, and 500; and missing value percentages of 5, 10, 15, 20, 30, and 40%) with RSS for the missing value estimation methods are shown in Tables 12, 13, 14, 15, and 16.
[Tables omitted. See PDF.]
As shown in Table 12, for an error variance of 1 and for every sample size and missing value percentage, the AR-MRQ1 method achieved the minimum mean square error, followed by AR-MRQ3. On the other hand, as shown in Tables 13, 14, and 15, for error variances of 3, 5, and 7 and for every sample size and missing value percentage, the AR-MCRQ3 method achieved the minimum mean square error, followed by AR-MCRQ1. As shown in Table 16, for an error variance of 9 and most sample sizes and missing value percentages, the AR-MCRQ1 method achieved the minimum mean square error, followed by AR-MCRQ3.
Mean absolute percentage errors for the various simulation scenarios (error variances of 1, 3, 5, 7, and 9; sample sizes of 20, 40, 60, 80, 100, 120, 200, and 500; and missing value percentages of 5, 10, 15, 20, 30, and 40%) with RSS for the missing value estimation methods are shown in Tables 17, 18, 19, 20, and 21.
As shown in Table 17, for an error variance of 1 and for most sample sizes and missing value percentages, the AR-MRQ1 method achieved the minimum mean absolute percentage error, followed by AR-MRQ3. On the other hand, as shown in Tables 19, 20, and 21, for error variances of 5, 7, and 9 and for most sample sizes and missing value percentages, the AR-MCRQ1 method achieved the minimum mean absolute percentage error, followed by AR-MCRQ3. As shown in Table 18, for an error variance of 3 and most sample sizes and missing value percentages, the AR-MCRQ3 method achieved the minimum mean absolute percentage error, followed by AR-MCRQ1.
[Tables omitted. See PDF.]
3.2. Results of a study on actual data
Mean square errors and mean absolute percentage errors of gold loss for the scenarios where the error variances were 1, 3, 5, 7, and 9; the sample size was 48; and the missing value percentages were 5, 10, 15, 20, 30, and 40%, with SRS and RSS, are shown in Tables 22, 23, 24, and 25.
Table 22 shows that for all error variances and missing value percentages of gold loss, the AR-MRQ3 method achieved the minimum mean square error with SRS.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Table 23 shows that for all error variances and missing value percentages of gold loss, the AR-MCRQ1 method achieved the minimum mean absolute percentage error with SRS.
[Figure omitted. See PDF.]
Table 24 shows that for all error variances and lower missing value percentages of gold loss with RSS, the AR-RQ1 method achieved the minimum mean square error. For all error variances and higher missing value percentages, the AR-MRQ1 and AR-MRQ3 methods achieved the same minimum mean square error.
[Figure omitted. See PDF.]
Table 25 shows that for all error variances and missing value percentages of gold loss with RSS, the AR-MCRQ1 method achieved the minimum mean absolute percentage error. The AR-CRQ1 and AR-MCRQ3 methods achieved the minimum mean absolute percentage error for all error variances and higher missing value percentages.
Mean square errors and mean absolute percentage errors of platinum loss for the scenarios where error variances were 1, 3, 5, 7, and 9; a sample size of 12; and missing value percentages of 5, 10, 15, and 20% with SRS and RSS are shown in Tables 26, 27, 28, and 29.
[Figure omitted. See PDF.]
Table 26 shows that, for all error variances and missing value percentages of platinum loss with SRS, the AR-MRQ3 method achieved the minimum mean square error.
[Figure omitted. See PDF.]
Table 27 shows that, for all error variances and missing value percentages of platinum loss with SRS, the AR-MCRQ1 method achieved the minimum mean absolute percentage error.
[Figure omitted. See PDF.]
Table 28 shows that, for all error variances and missing value percentages of platinum loss of 10, and 15% with RSS, the AR-RQ1 method achieved the minimum mean square error. For all error variances and missing value percentages of platinum loss of 20%, the AR-RQ3 method achieved the minimum mean square error. For all error variances and missing value percentage of platinum loss of 5%, the AR-MCRQ3 method achieved the minimum mean square error. Finally, the AR-MRQ3 method achieved the minimum mean square error for all error variances and higher missing value percentages.
[Figure omitted. See PDF.]
Table 29 shows that, for all error variances and missing value percentages of platinum loss with RSS, the AR-MCRQ1 method achieved the minimum mean absolute percentage error.
[Figure omitted. See PDF.]
Table 30 is a summary of the comparative results between the four methods. The best method, which provided the minimum MSE and MAPE for each kind of error variance and sample size, is tabulated in two columns, SRS and RSS.
AR-MRQ1 with SRS achieved the minimum mean square error for a small error variance. However, AR-MCRQ3 achieved the minimum mean square error for a large error variance. AR-MCRQ1 achieved the minimum mean absolute percentage error for all kinds of error variances.
AR-MRQ1 with RSS achieved the minimum mean square error for a small error variance. However, AR-MCRQ3 and AR-MCRQ1 achieved the minimum mean square error for a larger error variance. AR-MRQ1 achieved the minimum mean absolute percentage error for a small error variance. However, AR-MCRQ1 achieved the minimum mean absolute percentage error for a larger error variance.
[Figure omitted. See PDF.]
Table 31 shows that, for a small error variance, the AR-MRQ1 method with SRS and RSS provided the minimum MSE. However, for middle and large error variances, AR-MCRQ3 provided the minimum MSE. Regarding the MAPE measure, for a small error variance, AR-MCRQ1 with SRS and AR-MRQ1 with RSS provided the minimum MAPE, respectively. However, AR-MCRQ1 with SRS and RSS provided the minimum MAPE for middle and large error variances. Finally, AR-MCRQ1 was the best method for missing value imputation in multiple regression analysis, followed by AR-MCRQ3.
[Figure omitted. See PDF.]
Table 32 shows that, for all error variances, as the sample size increased, the number of times that the RSS estimators provided smaller MSE than the SRS estimators across methods also increased. Moreover, the RSS estimators provided smaller MSE and MAPE than the SRS estimators for all error variances and sample sizes.
4. Discussion
This research compared the efficiency of four new adjusted missing value imputations in multiple regression analysis. The study results show that AR-MRQ1 with SRS achieved the minimum mean square error for a small error variance, whereas AR-MCRQ3 achieved the minimum mean square error for a large error variance. For all error variances, AR-MCRQ1 achieved the minimum mean absolute percentage error. The results of this study were similar to those reported in [14, 15]. In those papers, the proposed estimators were a multivariate chain ratio (MCR) estimator and a regression estimator that used a linear combination of two auxiliary variables. The MCR and regression estimators that used the two auxiliary variables were equally highly effective, followed by the traditional multivariate ratio (MR) and regression estimators that used the two auxiliary variables. This study was also consistent with [11] in that the chain ratio (CR) estimator with SRS was more efficient than the traditional ratio estimator under certain conditions. Our results may be contrasted with [22], which reported that the regression-ratio Q1 (R-RQ1) estimator was the most efficient. This discrepancy arose because that study did not include the AR-CRQ1,3, AR-MRQ1,3, and AR-MCRQ1,3 estimators in the test.
With RSS, AR-MRQ1 achieved the minimum mean square error for a small error variance, whereas AR-MCRQ3 achieved the minimum mean square error for a large error variance, followed by AR-MCRQ1. AR-MRQ1 achieved the minimum mean absolute percentage error for a small error variance, whereas AR-MCRQ1 achieved the minimum mean absolute percentage error for a large error variance. To the best of the authors’ knowledge, these imputation methods have not been studied with RSS before.
Moreover, the RSS estimators provided smaller mean square error and mean absolute percentage error than the SRS estimators for the same error variances, sample sizes, and missing value percentages. Therefore, the RSS estimators were more efficient than the SRS estimators. Our results agree well with those reported by [3–5, 13]: the RSS estimators were more efficient than the SRS estimators for the same quartile, coefficient of correlation, and sample size.
5. Conclusion
This study compared the efficiencies of four new adjusted missing value imputation methods in multiple regression analysis. The four estimation methods were the following: a regression-ratio quartile1,3 (R-RQ1,3) imputation of Al-Omari, Jemain, and Ibrahim; an adjusted regression-chain ratio quartile1,3 (AR-CRQ1,3) imputation of Kadilar and Cingi; an adjusted regression-multivariate ratio quartile1,3 (AR-MRQ1,3) imputation of Feng, Ni, and Zou; and an adjusted regression-multivariate chain ratio quartile1,3 (AR-MCRQ1,3) imputation of Lu, each under simple random sampling (SRS) and rank set sampling (RSS). The measures for comparing the performance were mean square error (MSE) and mean absolute percentage error (MAPE).
Future recommendations
5.1 Random forest imputation may be a suitable method for estimating missing values in multiple regression analysis [21]. A comparative study should be conducted between this imputation method and the authors’ methods.
5.2 Many kinds of variables can have missing values. We may consider missing value imputation for the dependent variable, or for both the independent and dependent variables [30]. An attempt at imputing missing values for the dependent variable, and for both the independent and dependent variables, although it will be extensive work, should be considered.
Supporting information
S1 Code. The data used to support this study were simulated from a normal distribution using the R studio program.
https://doi.org/10.1371/journal.pone.0316641.s001
(ZIP)
S2 Code. The data used to support this study were simulated from a normal distribution using the R studio program.
https://doi.org/10.1371/journal.pone.0316641.s002
(ZIP)
Actual Data. The data used for gold and platinum loss by customer groups.
https://doi.org/10.1371/journal.pone.0316641.s003
Acknowledgments
The funders had no role in study design, data collection and analysis, publication decisions, or manuscript preparation.
References
1. Jomprapan R. Missing value estimation in multiple linear regression analysis. M.Sc. Thesis. National Institute of Development Administration; 2012.
2. Suwatthi P. Sample survey: sampling designs and analysis. Bangkok: Academic Document Promotion and Development Project, National Institute of Development Administration; 2009.
3. McIntyre GA. A method for unbiased selective sampling using ranked sets. Austral J Agricult Res. 1952;3(4):385–90.
4. Takahasi K, Wakimoto K. On unbiased estimates of the population mean based on the sample stratified by means of ordering. Annals Inst Statist Math. 1968;20(1):1–31.
5. Samawi HM, Muttlak HA. Estimation of ratio using rank set sampling. Biometric J. 1996;36:753–64.
6. Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.
7. Graham JW, Cumsille PE, Elek-Fisk E. Methods for handling missing data. New York: Wiley; 2003.
8. Von Buuren S. Flexible imputation of missing data. New York: Chapman & Hall; 2018.
9. Little RJ, Rubin DB. Statistical analysis with missing data. New York: Wiley; 1987.
10. Little RJ. Regression with missing X’s: a review. J Am Statist Assoc. 1992;87:1227–37.
11. Kadilar C, Cingi H. A study on the chain ratio-type estimator. Hacet J Math Statist. 2003;32:105–8.
12. Sujitta S, Chancharoen R, Jaruchat B, Chanchai C, Wilaiwan N. A comparison of estimation methods for missing data in multiple linear regression with two independent variables. Thailand Statist. 2006;4:13–26.
13. Al-Omari AI, Jemain AA, Ibrahim K. New ratio estimators of the mean using simple random sampling and rank set sampling methods. Revista Investigat Oper. 2009;30(2):97–108.
14. Feng SY, Ni JX, Zou GH. The theory and methods of sampling survey. Beijing: China Statistics Press; 1998.
15. Lu J. The chain ratio estimator and regression estimator with linear combination of two auxiliary variables. PLoS ONE. 2013;8(11):e81085.
16. Chen Q, Ibrahim JG. A note on the relationships between multiple imputation, maximum likelihood and fully Bayesian methods for missing responses in linear regression models. Statist Interface. 2013;6(3):315–24.
17. Jomprapan R, Siripanich P. Missing value estimation in multiple linear regression analysis. J Develop Administ. 2015;55(1):183–202.
18. Paglinawan DM. Comparison of regression estimator and ratio estimator: a simulation study. Philippine Statist. 2017;66(1):49–57.
19. Hasan H, Ahmad S, Osman BM, Sapri S, Othman N. A comparison of model-based imputation methods for handling missing predictor values in a linear regression model: a simulation study. AIP Conf Proc. 2017;1870(1):060003.
20. Little RJ. Missing data assumptions. Annu Rev Statist Appl. 2021;8(1):89–107.
21. Thongsri T, Samart K. Development of imputation methods for missing data in multiple linear regression analysis. Lobachevskii J Math. 2022;43(11):3390–9.
22. Isabella Sayers J, Kerdmangmee P, Poomawong W. Missing value imputation in multiple regression analysis with multiple regression, regression-ratio-q1, regression-ratio-q3, stochastic regression and k-nearest stochastic regression with equivalent weighted. B.Sc. Special Problem. King Mongkut’s Institute of Technology Ladkrabang; 2022.
23. Song L, Guo G. Full information multiple imputation for linear regression model with missing response variable. J Appl Math. 2024;54(1):77–81.
24. Glass GV, Hopkins KD. Statistical methods in education and psychology. 2nd edn. New Jersey: Englewood Cliffs; 1984.
25. Howell DC. Statistical methods for psychology. 6th edn. California: Duxbury Press; 2007.
26. Minakshi, Rajan V, Gimpy. Missing value imputation in multi attribute data set. Int J Comput Sci Inf Technol. 2014;5(4):5315–21.
27. Sujitta S, Chancharoen R, Jaruchat B, Chanchai C, Wilaiwan N. A comparison of estimation methods for missing data in multiple linear regression with two independent variables. Thailand Statist. 2006;4:13–26.
28. Lamjaisue R, Thongteeraparp A, Sinsomboonthong J. Comparison of missing data estimation methods for the multiple regression analysis with missing at random. Thai Sci Technol J. 2017;5:766–77.
29. Muhammad A, Klairung S. Imputation for multiple regression with missing heteroscedastic data. Thailand Statist. 2002;20(1):1–15.
30. Shao J. Estimation and imputation in linear regression with missing values in both response and covariate. Statist Interface. 2013;6(3):361–8.
Citation: Sinsomboonthong J, Sinsomboonthong S (2025) New adjusted missing value imputation in multiple regression with simple random sampling and rank set sampling methods. PLoS ONE 20(3): e0316641. https://doi.org/10.1371/journal.pone.0316641
About the Authors:
Juthaphorn Sinsomboonthong
Roles: Investigation, Methodology, Software
Affiliation: Department of Statistics, Faculty of Science, Kasetsart University, Bangkok, Thailand
ORCID: https://orcid.org/0000-0002-3375-5982
Saichon Sinsomboonthong
Contributed equally to this work with: Saichon Sinsomboonthong
Roles: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Validation, Writing – original draft, Writing – review & editing
E-mail: [email protected]
Current address: Department of Statistics, King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand.
Affiliation: Department of Statistics, School of Science, King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand
ORCID: https://orcid.org/0000-0002-9158-2178
© 2025 Sinsomboonthong and Sinsomboonthong. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.