Regression Models in Complex Survey Sampling for Sensitive Quantitative Variables

1. Introduction

Standard randomized response (RR) methods are mainly used in surveys that elicit a binary response to a sensitive question in order to estimate the proportion of the study population presenting a given (sensitive) characteristic. Warner’s study generated a rapidly expanding body of research literature on alternative techniques for eliciting suitable RR schemes in order to estimate such a population proportion ([1,2,3,4,5,6]).

Some studies have addressed situations in which the response to a sensitive question is a quantitative variable and the researcher wishes to estimate a linear parameter, such as the mean or the total of the sensitive variable under study. In the method proposed by [7], the interviewee was asked to choose, by means of a randomization device, between two questions: one concerned the sensitive variable and the other was unrelated (both were of the same order of magnitude). Other important papers in this regard include [8,9,10,11,12,13,14,15,16,17,18,19,20,21], together with the contributions compiled by [22,23,24,25,26]. When dealing with quantitative sensitive variables, the idea is that respondents should not disclose the true value of the sensitive variable but rather provide a scrambled value, obtained by algebraically perturbing the true response. This is done by applying one or more scrambling random variables, independent of each other and of the sensitive variable, whose distributions are fully known to the researcher.

RR methods have also been applied to examine relationships between a qualitative sensitive variable and other variables. Thus, reference [27] showed that logistic regression may be performed with RR data, and [28] developed multivariate logistic regression techniques for four RR designs. In addition, reference [29] considered the univariate logistic regression model for binary RR response variables and presented this model as a generalized linear model. The same research group also developed a multivariate logistic regression model for RR response variables. Under simple random sampling, reference [30] considered a generalized linear model and generalized linear mixed models for RR designs where the probability of obtaining a positive response can be written as a linear function of the answer to the sensitive question. Finally, reference [31] presented a logistic regression model on RR data when the covariates for some subjects were randomly missing.

However, few prior studies have examined regression techniques for quantitative randomized response variables. Reference [32] performed a linear regression analysis using the model presented in [10] for the simple random sampling case, from which the variance of the estimate was calculated. In a related paper, reference [33] discussed the maximum likelihood estimation of an independently and identically distributed normal linear regression model when some of the covariates are subject to RR.

In this paper, we address the question of regression techniques for quantitative RR data under a general sampling design. Specifically, we consider a general class of RR methods ([34]) for quantitative variables and show how the RR can be used as the outcome in regression models.

The rest of this paper is organized as follows. First, we review the unified RR approach described by [21] to establish the framework and clarify the notation used (Section 2). We then show how RR can be used as the outcome in regression models, present estimators for the regression coefficients and investigate their theoretical properties (Section 3). Based on the asymptotic variance, we propose an estimator for the variance and discuss two resampling methods, the jackknife and the bootstrap. Simulation experiments carried out to confirm the finite-sample properties of the proposed estimators are discussed in Section 4, after which the method is applied to a real-world survey focused on sensitive characteristics (Section 5). Finally, in Section 6, we summarize the main findings and the conclusions drawn.

2. Randomized Response Survey Designs for Quantitative Variables

Let $U=\{1,\ldots,i,\ldots,N\}$ be a finite population consisting of $N$ different elements. Let $y_i$ be the value of the sensitive aspect under study for the $i$-th population element.

In this case, y is a sensitive variable that cannot be observed directly. We consider the unified approach given by [21] because some important RR techniques [8,10,11,13] can be viewed as particular cases of this approach.

The respondent performs a random experiment with three possible outcomes. If the first result is obtained, the respondent reports the real value of the variable; with the second result, the respondent reports the scrambled response $y_i S_{1i} + S_{2i}$; otherwise, the respondent reports a value $S_{3i}$, where $S_1$, $S_2$ and $S_3$ are scramble variables whose distributions are known. Under this randomization device, the response given by person $i$ is distributed as

$$z_i = \begin{cases} y_i & \text{with probability } p_1 \\ y_i S_{1i} + S_{2i} & \text{with probability } p_2 \\ S_{3i} & \text{otherwise, i.e., with probability } p_3 = 1 - p_1 - p_2. \end{cases}$$

$m_j$ and $\sigma_j^2$ denote the mean and the variance, respectively, of the variable $S_j$ ($j=1,2,3$).

The sample $s$ of individuals is chosen according to a sampling design $p(\cdot)$. The first- and second-order inclusion probabilities are $\pi_i = \sum_{s \ni i} p(s)$ and $\pi_{ij} = \sum_{s \ni i,j} p(s)$, where $i,j \in U$. We assume that the sampling design and the randomization stage are independent of each other and that the randomization stage is performed on each selected individual independently ([35]).

The main study goal is usually to estimate $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$. A design-unbiased estimator of the population mean $\bar{Y}$ is given by the Horvitz-Thompson (HT) estimator:

$$\bar{y}_{rrt} = \frac{1}{N}\sum_{i \in s} w_i r_i$$

where $w_i = 1/\pi_i$ is the sampling weight and

$$r_i = \frac{z_i - p_2 m_2 - p_3 m_3}{p_1 + p_2 m_1}.$$

The variance of this estimator and an estimator of this variance are given in [21]. In cases where the population size $N$ is unknown, it is usual to consider the Hájek estimator (see [36,37]). The Hájek estimator is generally preferred to the Horvitz-Thompson estimator for the mean, although it is not considered in this paper.
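The device and the estimator described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the design parameters ($p_1 = 0.3$, $p_2 = 0.5$), the scrambling distributions and the simple random sampling design are all hypothetical choices made for the example.

```python
import numpy as np

def scramble(y, p1, p2, draw_s1, draw_s2, draw_s3, rng):
    """Apply the three-outcome randomization device to each true value y_i."""
    y = np.asarray(y, dtype=float)
    n = y.size
    u = rng.random(n)
    s1, s2, s3 = draw_s1(n), draw_s2(n), draw_s3(n)
    # Outcome 1: true value; outcome 2: y * S1 + S2; outcome 3: S3.
    return np.where(u < p1, y, np.where(u < p1 + p2, y * s1 + s2, s3))

def transform(z, p1, p2, m1, m2, m3):
    """r_i = (z_i - p2*m2 - p3*m3) / (p1 + p2*m1), with p3 = 1 - p1 - p2."""
    p3 = 1.0 - p1 - p2
    return (np.asarray(z, dtype=float) - p2 * m2 - p3 * m3) / (p1 + p2 * m1)

def ht_mean(r, pi, N):
    """Horvitz-Thompson estimator of the population mean from the r_i."""
    return np.sum(np.asarray(r, dtype=float) / np.asarray(pi, dtype=float)) / N

# Illustrative run; every numeric choice below is an assumption.
rng = np.random.default_rng(42)
N, n = 5000, 500
y = rng.normal(20.0, 3.0, N)                     # hypothetical sensitive variable
sample = rng.choice(N, size=n, replace=False)    # simple random sampling
pi = np.full(n, n / N)                           # first-order inclusion probabilities
p1, p2 = 0.3, 0.5                                # so p3 = 0.2
z = scramble(
    y[sample], p1, p2,
    lambda k: rng.f(20, 20, k),                  # S1 ~ F(20, 20), mean 20/18
    lambda k: rng.normal(0.0, 1.0, k),           # S2 ~ N(0, 1), mean 0
    lambda k: rng.normal(20.0, 3.0, k),          # S3 ~ N(20, 3), mean 20
    rng,
)
r = transform(z, p1, p2, m1=20 / 18, m2=0.0, m3=20.0)
estimate = ht_mean(r, pi, N)
print(f"HT estimate: {estimate:.3f}, population mean: {y.mean():.3f}")
```

The transformation is what makes the estimator work: averaging $r_i$ over the randomization device returns $y_i$, so the usual design-based weighting can be applied to the $r_i$ as if they were the true values, at the cost of extra variance.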

3. Regression for RR Models

Consider a regression problem in which the data collected on the $i$-th subject are the outcome variable $y_i$ and a vector $x_i = (x_1, x_2, \ldots, x_K)'$ of $K$ covariates. Under this scenario, we can consider superpopulation models, in which it is assumed that the population under study $y = (y_1, \ldots, y_N)'$ constitutes a realization of superpopulation random variables $Y = (Y_1, \ldots, Y_N)'$ under a superpopulation model $M$. The value of the variable of interest, associated with the $i$-th unit of the population, has two terms: a deterministic element $\mu_i = g(x_i'\beta)$ and a random element:

$$Y_i = \mu_i + e_i, \quad i = 1, \ldots, N$$

where $g(\cdot)$ is a specific function and the random vector $e = (e_1, \ldots, e_N)$ is assumed to have a zero mean and independent components.

Now, our aim is to estimate the regression coefficients $\beta$. To do so, let $\mu_i = E_M(Y_i \mid x_i, \beta)$ denote the expectation under the model of $Y_i$ given the covariates and $\beta$.

Because the values of $Y_i$ cannot be observed directly, we need to relate the randomized response to the linear predictor of the sensitive question. This relation is given by:

$$\begin{aligned} E(Z_i \mid x_i, \beta) &= E_M E_R(Z_i \mid x_i, \beta) = E_M\big(Y_i p_1 + (Y_i m_1 + m_2) p_2 + m_3 p_3 \mid x_i, \beta\big) \\ &= g(x_i'\beta)(p_1 + m_1 p_2) + m_2 p_2 + m_3 p_3 \end{aligned}$$

where $E_R$ denotes the expectation under the RR mechanism.

A linear transformation of the observed values can then be performed:

$$r_i = \frac{z_i - m_2 p_2 - m_3 p_3}{p_1 + m_1 p_2}$$

which can be considered a realization of the variables

$$R_i = \frac{Z_i - m_2 p_2 - m_3 p_3}{p_1 + m_1 p_2}.$$

Thus, we consider the new regression model $R_i = g(x_i'\beta) + \epsilon_i$. The components of the random vector $\epsilon = (\epsilon_1, \ldots, \epsilon_N)$ are assumed to be independent with a zero mean and a positive definite, diagonal covariance matrix, $E(\epsilon_i^2 \mid x_i) = \sigma^2 v_i = \sigma_{Ri}^2$. The $v_i$ are known constants depending on $x_i$. This model satisfies $E(R_i) = g(x_i'\beta) = E_M(Y_i)$.

3.1. Estimation of the Regression Coefficients

Consider the population function:

$$U(\beta) = \frac{1}{N}\sum_{U} d_i \frac{r_i - g(x_i'\beta)}{\sigma_{Ri}^2} = \frac{1}{N}\sum_{U} u(r_i, x_i, \beta)$$

where $d_i = \partial g(x_i'\beta)/\partial\beta$.

The population regression coefficient $\beta_N$ is obtained as the solution of the estimating equations $U(\beta) = 0$. $\beta_N$ is an estimate of the model parameter $\beta$ if the census data set is known, and it defines a parameter for the survey population if it is unknown.

Given the values observed in the sample, we consider the weighted estimating function

$$\hat{U}(\beta) = \frac{1}{N}\sum_{s} w_i d_i \frac{r_i - g(x_i'\beta)}{\sigma_{Ri}^2}.$$

Let $\hat{\beta}_W$ be a solution to $\hat{U}(\beta) = 0$. We study the properties of $\hat{\beta}_W$ as an estimator of $\beta_N$.

The usual asymptotic framework in survey sampling is adopted: the finite population $U$ and the sampling design $p(\cdot)$ are embedded within a sequence of populations and designs indexed by $\nu$, $\{U_\nu, p_\nu\}$, with $\nu \to \infty$. The stochastic order $O_p(\cdot)$ is with respect to this sequence of designs. To establish our results, the following technical assumptions are made:

A.1. The survey design satisfies $\hat{U}(\beta) - U(\beta) = O_p(n^{-1/2})$ for any $\beta \in \Theta$.
A.2. The survey design ensures that $\hat{U}(\beta)$ is asymptotically normally distributed with mean $U(\beta)$ and entries of the variance-covariance matrix of order $n^{-1}$ for any $\beta \in \Theta$.
A.3. The survey design satisfies $\partial\hat{U}/\partial\beta = O_p(1)$ and $\partial^2\hat{U}/\partial\beta\,\partial\beta' = O_p(1)$ for any $\beta \in \Theta$.

Theorem 1.

Under assumptions A.1 and A.3, the solution to $\hat{U}(\beta) = 0$ provides a consistent estimator for the parameter $\beta_N$. If condition A.2 is also met, the weighted quasi-likelihood estimator $\hat{\beta}_W$ is asymptotically normally distributed with mean $\beta_N$ and variance-covariance matrix

$$V(\hat{\beta}_W) = [J(\beta_N)]^{-1}\, V\!\left(\frac{1}{N}\sum_{s} w_i d_i \frac{r_i - g(x_i'\beta_N)}{\sigma_{Ri}^2}\right) [J'(\beta_N)]^{-1} \qquad (2)$$

where $V$ is the design variance-covariance matrix and $J(\beta) = \frac{1}{N}\sum_{U} \partial u(r_i, x_i, \beta)/\partial\beta$.

Proof.

The estimating function $u(r_i, x_i, \beta) = d_i \frac{r_i - g(x_i'\beta)}{\sigma_{Ri}^2}$ is twice differentiable with respect to $\beta$. Reference [38] showed that, under these conditions, a general parameter $\theta_N$ given by the solution of the population equation $U(\theta) = 0$ is consistently estimated by $\hat{\theta}$, the solution to $\hat{U}(\theta) = 0$. In our case $\theta_N = \beta_N$ and $U(\theta) = \frac{1}{N}\sum_{U} d_i \frac{r_i - g(x_i'\beta)}{\sigma_{Ri}^2}$.

Consider the following Taylor series expansion

$$\hat{\beta}_W = \beta_N - [J(\beta_N)]^{-1}\hat{U}(\beta_N) + O_p(n^{-1}).$$

Thus, $\hat{\beta}_W$ is asymptotically normally distributed because $\hat{U}(\beta_N)$ is asymptotically normally distributed under assumption A.2. The asymptotic variance-covariance matrix of $\hat{\beta}_W$ is easily derived:

$$[J(\beta_N)]^{-1}\, V(\hat{U}(\beta_N))\, [J'(\beta_N)]^{-1}$$

and thus expression (2) is obtained. □

Remark 1.

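Please note that in the RR setting there are two sources of randomness (if we do not account for the model variability): the sampling design and the randomization device that scrambles the variable of interest. Thus, the variances in $V(\hat{U}(\beta_N))$ are composed of two terms.

Let $E_d$ and $V_d$ denote the expectation and variance operators for any sampling design $d$. Taking into account the two sources of variability induced by the sampling design and the randomization device, we have the variance decomposition formula:

$$\begin{aligned}
& V\!\left(\frac{1}{N}\sum_{s} w_i \frac{\partial g(x_i'\beta)}{\partial\beta_k}\,\frac{r_i - g(x_i'\beta)}{\sigma_{Ri}^2}\right) \\
&\quad = \frac{1}{N^2} E_d V_R\!\left(\sum_{s} w_i \frac{\partial g(x_i'\beta)}{\partial\beta_k}\,\frac{r_i - g(x_i'\beta)}{\sigma_{Ri}^2}\right) + \frac{1}{N^2} V_d E_R\!\left(\sum_{s} w_i \frac{\partial g(x_i'\beta)}{\partial\beta_k}\,\frac{r_i - g(x_i'\beta)}{\sigma_{Ri}^2}\right) \\
&\quad = \frac{1}{N^2}\left[ E_d\!\left(\sum_{i \in s} \frac{w_i^2}{\sigma_{Ri}^4}\left(\frac{\partial g(x_i'\beta)}{\partial\beta_k}\right)^{2} V_R(r_i)\right) + V_d\!\left(\sum_{s} w_i \frac{\partial g(x_i'\beta)}{\partial\beta_k}\,\frac{y_i - g(x_i'\beta)}{\sigma_{Ri}^2}\right)\right] \\
&\quad = \frac{1}{N^2}\left[ \sum_{i \in U} \frac{w_i}{\sigma_{Ri}^4}\left(\frac{\partial g(x_i'\beta)}{\partial\beta_k}\right)^{2} V_R(r_i) + \sum_{i,j \in U} (w_i w_j \pi_{ij} - 1)\,\frac{\partial g(x_i'\beta)}{\partial\beta_k}\,\frac{\partial g(x_j'\beta)}{\partial\beta_k}\,\frac{y_i - g(x_i'\beta)}{\sigma_{Ri}^2}\,\frac{y_j - g(x_j'\beta)}{\sigma_{Rj}^2}\right]
\end{aligned}$$

where $E_R$ and $V_R$ are the expectation and variance operators over the RR device. A detailed expression of $V_R(r_i)$ can be seen in ([21], formulae 3).

The expressions of the covariances are simpler, since the randomization stage is performed on each selected individual independently ($cov_R(r_i, r_j) = 0$ for $i \neq j$).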

Remark 2.

Software packages such as survey [39] in R, with the function svyglm, can be used to fit linear and generalized linear models incorporating the design weights and thus to calculate $\hat{\beta}_W$ from the randomized values $r_i$, but the reported variances and covariances are incorrect. Accordingly, the standard significance test based on these values is invalid and can lead to grossly misleading conclusions being drawn.

From (2) we can construct a design-based estimator for the variance-covariance matrix of $\hat{\beta}_W$ through the plug-in method:

$$v(\hat{\beta}_W) = \hat{J}^{-1}\,\hat{V}\,(\hat{J}')^{-1}$$

where

$$\hat{J} = \frac{1}{N}\sum_{s} w_i \left.\frac{\partial u}{\partial\beta}\right|_{\beta = \hat{\beta}_W}$$

and

$$\hat{V} = \frac{1}{N^2}\sum_{i,j \in s} \tilde{u}_i \tilde{u}_j'\,\frac{w_i w_j \pi_{ij} - 1}{\pi_{ij}}$$

with $\tilde{u}_i = d_i \frac{r_i - g(x_i'\hat{\beta}_W)}{\hat{\sigma}_{Ri}^2}$, where $\hat{\sigma}_{Ri}^2$ is an estimator of $\sigma_{Ri}^2$.

This variance estimator is not unbiased because it does not include the terms of variability induced by the randomization device; moreover, it is difficult to compute because an estimator of $\sigma_{Ri}^2$ is often unavailable. Furthermore, the estimator requires knowledge of second-order inclusion probabilities, which are often impossible to compute or are not available for complex sampling designs.

From a practical viewpoint therefore, it is better to use the jackknife ([40]) and bootstrap techniques ([41]), which are readily applicable under diverse conditions.

The application of the jackknife method to the regression coefficient under simple random sampling is given in Section 4.4 of [42], and its use in stratified sampling is given in Section 4.5 of [42]. We apply these methods to $r_i$ rather than $y_i$.
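To make the idea concrete, the following is a minimal sketch of a delete-one jackknife applied to the transformed responses for a weighted linear fit. It is a textbook variant written for illustration and is an assumption on our part: it is not the exact formula of [42], it covers only simple random sampling, it ignores the finite population correction, and it does not add any term for the variability of the randomization device. The function names are hypothetical.

```python
import numpy as np

def wls_beta(X, r, w):
    """Design-weighted least squares: solve (sum w x x') beta = sum w x r."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ r)

def jackknife_variance(X, r, w):
    """Delete-one jackknife covariance matrix of the weighted coefficients."""
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(r)
    beta_full = wls_beta(X, r, w)
    keep = ~np.eye(n, dtype=bool)            # row i of the mask drops unit i
    # Rescaling the remaining weights by n / (n - 1) would not change a
    # weighted least squares fit, so the weights are left as they are.
    reps = np.array([wls_beta(X[k], r[k], w[k]) for k in keep])
    diff = reps - beta_full
    return (n - 1) / n * diff.T @ diff       # K x K matrix

# Hypothetical data: intercept plus one covariate, equal weights.
x = np.arange(1.0, 9.0)
X = np.column_stack([np.ones_like(x), x])
r = 1.0 + 2.0 * x + np.array([0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5])
w = np.full(x.size, 3.0)
V_jack = jackknife_variance(X, r, w)
print(np.sqrt(np.diag(V_jack)))              # jackknife standard errors
```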

The jackknife estimation of variance of an estimator of the population mean based on a RR survey data is considered in [43,44]. The authors show that the jackknife estimator underestimates the variance of the Horvitz-Thompson estimator of the population mean and propose modifications of the conventional jackknife estimator. These modifications include an additional term that adds an estimate of the variance due to the randomization device that scrambles the variable of interest.

The bootstrap method developed by [41] has been adjusted for survey sampling, incorporating the sampling design, in several studies (see e.g., [45,46,47]). Direct applications of bootstrap methods for estimating the variance-covariance matrix (2) involve solving the equation $\hat{U}(\beta) = 0$ repeatedly for each bootstrap sample. A multiplier bootstrap with estimating functions was proposed by [48]. We use this method with the $r_i$ values to estimate the variance of the proposed estimator. See Section 10.3.1 of [49] for a detailed description of this bootstrap method.

Obtaining jackknife and bootstrap estimators for the variance of $\hat{\beta}_W$ that take into account the randomness due to the RR process is considerably more complex than in the case of estimating means. Measuring the influence of the randomization mechanism on the variance estimation using the jackknife or the bootstrap is an open problem that requires further investigation.

3.2. The Homoscedastic Linear Model

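Let us now consider the case of the homoscedastic linear model: $\mu_i = x_i'\beta$ and $var(R_i \mid x_i) = \sigma^2$. In this case the weighted quasi-likelihood estimate $\hat{\beta}_W$ reduces to the weighted least squares estimator, which is the solution to the equation:

$$\hat{U}(\beta) = \sum_{s} w_i x_i \frac{r_i - x_i'\beta}{\sigma^2} = 0.$$

The solution is given by the design-weighted estimator:

$$\hat{\beta}_W = \left(\sum_{s} w_i x_i x_i'\right)^{-1} \sum_{s} w_i x_i r_i.$$

This estimator is model-unbiased and design-consistent.

For this linear model, the matrix $J$ takes the simple expression

$$J = \frac{1}{N}\sum_{U} \frac{x_i x_i'}{\sigma^2}.$$

Thus, an estimator of the asymptotic variance of $\hat{\beta}_W$ is given by:

$$\widehat{var}(\hat{\beta}_W) = \left[\frac{1}{N}\sum_{s} \frac{w_i x_i x_i'}{\hat{\sigma}^2}\right]^{-1} \widehat{var}\big(\hat{U}(\hat{\beta}_W)\big) \left[\frac{1}{N}\sum_{s} \frac{w_i x_i x_i'}{\hat{\sigma}^2}\right]^{-1}$$

with $\hat{\sigma}^2 = \dfrac{\sum_{s} w_i (r_i - x_i'\hat{\beta})^2}{\sum_{s} w_i}$, where $\widehat{var}(\hat{U}(\hat{\beta}_W))$ is the estimated HT variance.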
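The point estimates of this subsection are simple enough to compute directly. The sketch below returns the design-weighted coefficients and the weighted residual variance; it does not attempt the HT variance of the estimating function, which needs second-order inclusion probabilities. The function name and the simulated data (an EH-type scrambling with an $F_{20,20}$ multiplier and equal weights) are assumptions made for illustration.

```python
import numpy as np

def weighted_ls(X, r, w):
    """Return (beta_hat_W, sigma2_hat) for the homoscedastic linear model."""
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    w = np.asarray(w, dtype=float)
    Xw = X * w[:, None]
    beta = np.linalg.solve(Xw.T @ X, Xw.T @ r)     # (sum w x x')^{-1} sum w x r
    resid = r - X @ beta
    sigma2 = np.sum(w * resid ** 2) / np.sum(w)    # weighted residual variance
    return beta, sigma2

# Hypothetical example: true values multiplied by S ~ F(20, 20), then rescaled.
rng = np.random.default_rng(7)
n = 400
x = rng.normal(8.0, 2.5, n)
X = np.column_stack([np.ones(n), x])               # intercept and one covariate
y = 1.0 + 0.25 * x + rng.normal(0.0, 0.5, n)       # unobserved sensitive variable
m1 = 20 / 18                                       # mean of F(20, 20)
r = y * rng.f(20, 20, n) / m1                      # transformed responses r_i
w = np.full(n, 2350 / n)                           # equal design weights
beta_hat, sigma2_hat = weighted_ls(X, r, w)
print(beta_hat, sigma2_hat)
```

Solving the normal equations with `np.linalg.solve` avoids forming an explicit matrix inverse, which is the usual numerically safer choice.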

3.3. The Ratio Model

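We now consider the case of a single auxiliary variable, $x$, and the following ratio model ([37]):

$$E(R_i) = \beta x_i \quad \text{and} \quad V(R_i) = \sigma^2 x_i.$$

The weighted quasi-likelihood estimate $\hat{\beta}_W$ can be reduced to the solution of the simple equation:

$$\hat{U}(\beta) = \sum_{s} w_i \frac{r_i - x_i\beta}{\sigma^2} = 0.$$

This solution is given by the design-weighted ratio estimator:

$$\hat{\beta}_R = \frac{\sum_{s} w_i r_i}{\sum_{s} w_i x_i} = \frac{\bar{y}_{rrt}}{\bar{x}_{HT}}$$

where $\bar{x}_{HT}$ is the HT estimator of the population mean $\bar{X}$. The estimator of the variance of a ratio estimator is straightforwardly obtained by Taylor linearization (see e.g., [42]):

$$\hat{V}(\hat{\beta}_R) = \frac{1}{\bar{x}_{HT}^2}\Big(\hat{V}(\bar{y}_{rrt}) + \hat{\beta}_R^2\,\hat{V}(\bar{x}_{HT}) - 2\hat{\beta}_R\,\widehat{cov}(\bar{y}_{rrt}, \bar{x}_{HT})\Big)$$

where

$$\hat{V}(\bar{y}_{rrt}) = \frac{1}{N^2}\left[\sum_{i \in s} v_i w_i^2 + \sum_{i,j \in s} r_i r_j\,\frac{w_i w_j \pi_{ij} - 1}{\pi_{ij}}\right]$$

and where $v_i = \dfrac{1}{(p_1 + p_2 m_1)^2}\,(r_i^2 A + r_i B + C)$ (see [21]) and

$$\hat{V}(\bar{x}_{HT}) = \frac{1}{N^2}\sum_{i,j \in s} x_i x_j\,\frac{w_i w_j \pi_{ij} - 1}{\pi_{ij}}.$$

Since

$$cov(\bar{y}_{rrt}, \bar{x}_{HT}) = E_d\,cov_R(\bar{y}_{rrt}, \bar{x}_{HT}) + cov_d\big(E_R(\bar{y}_{rrt}), \bar{x}_{HT}\big) = 0 + cov_d(\bar{y}_{HT}, \bar{x}_{HT})$$

an estimator for this covariance can be obtained as follows:

$$\widehat{cov}(\bar{y}_{rrt}, \bar{x}_{HT}) = \frac{1}{N^2}\sum_{i,j \in s} x_i r_j\,\frac{w_i w_j \pi_{ij} - 1}{\pi_{ij}}.$$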
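As a small illustration of the ratio model, the sketch below computes the design-weighted ratio estimator and combines variance and covariance estimates through the Taylor linearization formula. The component estimates are taken as inputs, because computing them requires second-order inclusion probabilities and the constants $A$, $B$ and $C$ from [21]. The function names and the example numbers are hypothetical.

```python
import numpy as np

def ratio_estimate(r, x, w):
    """Design-weighted ratio estimator: sum(w * r) / sum(w * x)."""
    r, x, w = (np.asarray(a, dtype=float) for a in (r, x, w))
    return np.sum(w * r) / np.sum(w * x)

def ratio_variance(beta_r, xbar_ht, v_y, v_x, cov_yx):
    """Linearization: (V(y) + beta^2 V(x) - 2 beta cov(y, x)) / xbar^2."""
    return (v_y + beta_r ** 2 * v_x - 2.0 * beta_r * cov_yx) / xbar_ht ** 2

# Hypothetical inputs: transformed responses exactly proportional to x.
x = np.array([28.0, 30.0, 31.0, 29.5])
r = 7.0 * x
w = np.array([10.0, 12.0, 8.0, 10.0])
beta_r = ratio_estimate(r, x, w)
v_hat = ratio_variance(beta_r, xbar_ht=30.0, v_y=4.0, v_x=0.05, cov_yx=0.1)
print(beta_r, v_hat)
```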

4. Simulation Study

This section describes an extensive simulation study, which was implemented in R. In the first study, the variables were simulated using the R package simstudy ([50]) and the samples were selected with the sampling package discussed in ([51]).

The population size was $N = 2350$. The main variable $y$ and two auxiliary variables $x_1$ and $x_2$ were generated using the genCorData function. The means, the standard deviations and the correlation matrix were:

$$\mu = (3, 8, 15), \quad \sigma = (1, 2.5, 3) \quad \text{and} \quad \rho = \begin{pmatrix} 1.0 & 0.5 & 0.7 \\ 0.5 & 1.0 & 0.2 \\ 0.7 & 0.2 & 1.0 \end{pmatrix}.$$
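For readers who want to reproduce a population of this kind outside R, the following sketch generates correlated variables with the stated means, standard deviations and correlation matrix. It is not the authors' code: it uses NumPy rather than simstudy, it assumes a multivariate normal distribution, and it assumes the variables are ordered as $(y, x_1, x_2)$; the seed is arbitrary.

```python
import numpy as np

mu = np.array([3.0, 8.0, 15.0])
sd = np.array([1.0, 2.5, 3.0])
rho = np.array([
    [1.0, 0.5, 0.7],
    [0.5, 1.0, 0.2],
    [0.7, 0.2, 1.0],
])

# Covariance matrix: cov[j, k] = sd[j] * sd[k] * rho[j, k].
cov = np.outer(sd, sd) * rho

rng = np.random.default_rng(2021)               # arbitrary seed
population = rng.multivariate_normal(mu, cov, size=2350)
y, x1, x2 = population.T                        # assumed column order

print(population.mean(axis=0))                  # sample means
print(np.corrcoef(population, rowvar=False))    # sample correlation matrix
```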

The sampling design is stratified simple random sampling from a population with six strata of sizes $N_h = 1000, 500, 150, 250, 150$ and $300$. Three different combinations of sample sizes were drawn from the population, corresponding to the following numbers of units per stratum:

$n_1 = (70, 35, 27, 38, 26, 54)$, with total sample size 250.

$n_2 = (230, 100, 32, 55, 38, 45)$, with total sample size 500.

$n_3 = (310, 215, 27, 65, 40, 93)$, with total sample size 750.
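The allocations above determine the design weights, since under stratified simple random sampling every unit in stratum $h$ has weight $N_h / n_h$. The sketch below draws such a sample and checks that the weights add up to the population size. It is an illustration in Python rather than the R code used in the study, and it assumes, hypothetically, that population units are stored stratum by stratum.

```python
import numpy as np

N_h = [1000, 500, 150, 250, 150, 300]
allocations = {
    250: [70, 35, 27, 38, 26, 54],
    500: [230, 100, 32, 55, 38, 45],
    750: [310, 215, 27, 65, 40, 93],
}

def stratified_srs(N_h, n_h, rng):
    """Return sampled unit indices and design weights N_h / n_h."""
    indices, weights = [], []
    start = 0                                   # first index of current stratum
    for Nh, nh in zip(N_h, n_h):
        chosen = rng.choice(int(Nh), size=int(nh), replace=False) + start
        indices.append(chosen)
        weights.append(np.full(int(nh), Nh / nh))
        start += Nh
    return np.concatenate(indices), np.concatenate(weights)

rng = np.random.default_rng(0)                  # arbitrary seed
for n, n_h in allocations.items():
    idx, w = stratified_srs(N_h, n_h, rng)
    print(n, idx.size, round(float(w.sum()), 6))
```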

Point estimators of the regression coefficients were computed using the Eichhorn and Hayre (EH) and the Bar-Lev, Bobovitch and Boukai (BBB) models. For both models, we let $S$ be an innocuous quantitative variable unrelated to the sensitive variable and assume that its distribution is known. In the Eichhorn and Hayre model, the $i$-th respondent reports the true value multiplied by a number $s_i$ generated from $S$. In the BBB model, the $i$-th respondent is asked to report the true value of the sensitive variable with probability $p$ and the true value multiplied by a number $s_i$ generated from $S$ with probability $1-p$. In this study, an $F_{20,20}$ distribution was used for the scramble variable $S$, and $p = 0.5$ was assumed in the BBB model. The use of the $F_{n,n}$ distribution as a scrambling distribution is justified by [10], who highlighted the protection it gives the respondent. For this reason, it is commonly used as a scramble variable in RR simulation studies; see e.g., [17,21].
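As a worked illustration, both models can be read as special cases of the three-outcome device of Section 2, which makes the transformation of the observed responses explicit. The parameter mapping below is our reading of the model descriptions above and should be taken as an assumption. It uses the standard fact that an $F_{d_1,d_2}$ variable has mean $d_2/(d_2-2)$ for $d_2 > 2$.

```latex
\begin{align*}
\text{EH: } & p_1 = 0,\ p_2 = 1,\ p_3 = 0,\ S_1 = S,\ S_2 \equiv 0
  && \Rightarrow\quad z_i = y_i s_i, \qquad r_i = \frac{z_i}{m_1}, \\
\text{BBB: } & p_1 = p,\ p_2 = 1 - p,\ p_3 = 0,\ S_1 = S,\ S_2 \equiv 0
  && \Rightarrow\quad r_i = \frac{z_i}{p + (1 - p)\, m_1}, \\
S \sim F_{20,20}: \ & m_1 = \frac{20}{20 - 2} = \frac{10}{9}
  && \Rightarrow\quad r_i^{EH} = \frac{9}{10}\, z_i, \qquad
     r_i^{BBB} = \frac{z_i}{0.5 + 0.5 \cdot \frac{10}{9}} = \frac{18}{19}\, z_i .
\end{align*}
```

In both cases the transformation is a pure rescaling, because neither model adds a constant or an unrelated response.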

For each estimator $\hat{\beta}_W$ of the population regression coefficient $\beta_N$, we computed the relative bias $RB = E_{MC}(\hat{\beta}_W - \beta_N)/\beta_N \times 100\%$ and the relative mean squared error $RMSE = E_{MC}[(\hat{\beta}_W - \beta_N)^2]/\beta_N^2 \times 100\%$, where $E_{MC}$ denotes the average over 1000 simulation runs.
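The two performance measures translate directly into code. The sketch below is a minimal Python version of these definitions; the function names and the toy estimates are hypothetical and stand in for the estimates collected over the simulation runs.

```python
import numpy as np

def relative_bias(estimates, beta_n):
    """RB = mean(beta_hat - beta_N) / beta_N * 100, in percent."""
    est = np.asarray(estimates, dtype=float)
    return np.mean(est - beta_n) / beta_n * 100.0

def relative_mse(estimates, beta_n):
    """RMSE = mean((beta_hat - beta_N)^2) / beta_N^2 * 100, in percent."""
    est = np.asarray(estimates, dtype=float)
    return np.mean((est - beta_n) ** 2) / beta_n ** 2 * 100.0

# Toy values standing in for Monte Carlo estimates of a coefficient equal to 10.
estimates = [9.0, 11.0, 10.0, 10.0]
rb = relative_bias(estimates, 10.0)
rmse = relative_mse(estimates, 10.0)
print(rb, rmse)
```

Note that RMSE as defined here is a relative mean squared error, not a root mean squared error, so no square root is taken.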

The results for every possible combination are shown in Table 1.

The RMSE values in this table confirm that the estimators $\hat{\beta}_{W1}$ and $\hat{\beta}_{W2}$ obtained using the EH method are less efficient than those obtained with the BBB method. Moreover, comparing the estimators of the two coefficients, the estimates of the first parameter, $\hat{\beta}_{W1}$, are worse than those of the second, $\hat{\beta}_{W2}$.

The second simulation study examines the behaviour of the variance estimators. In this study, we computed the plug-in estimator based on the asymptotic variance formulae (AV, described in Section 3.1), the jackknife (JK) and the bootstrap (BS) variance estimators. Table 2 shows the average length (L) of the 95% confidence intervals based on a normal distribution, the simulated coverage probability (Cov) for each method, the absolute relative bias (|RB|) and the relative mean squared error (RMSE) in percent. For each variance estimator (AV, JK and BS), RB and RMSE are calculated with respect to a simulated variance obtained as the average of 1000 independent runs.

The most important observation is that, in general, all the variance estimators and the associated confidence intervals perform well. The lengths of the confidence intervals are small and the coverage probabilities of the 95% confidence intervals are close to the nominal coverage. The jackknife variance estimator yields the shortest intervals, which leads to under-coverage for some sample sizes. The bootstrap variance estimator provides short intervals and the resulting coverage is very close to the nominal value. The percent relative biases of all variance estimators were small (at most 0.667% in absolute value for estimator AV, at most 0.233% for estimator JK and at most 0.141% for estimator BS). The model used to randomize the response has a low impact on the relative bias. For all models and sample sizes, we observed that the JK and BS estimators are similar in terms of relative mean squared error.

This study was then repeated with a sample size $n = 500$, also considering an $F_{5,5}$ distribution for the scramble variable $S$. The dispersion of the $\hat{\beta}_{W1}$ and $\hat{\beta}_{W2}$ values obtained for each randomization method and each choice of degrees of freedom is represented by boxplots (Figure 1).

The figure shows that the values of $\hat{\beta}_{W2}$ are higher and their dispersion is lower than for $\hat{\beta}_{W1}$ for all randomization methods. Moreover, the dispersion increases with the variance of the scramble variable.

Similarly, the values of the plug-in estimator based on the asymptotic variance and of the jackknife and bootstrap variance estimators obtained for each randomization method and each choice of degrees of freedom are represented by boxplots (Figure 2).

For each randomization method, we note that the greater the variance of the scramble variable $S$, the greater the dispersion. This behaviour is especially noticeable in the estimation of the parameter $\beta_{W1}$. This result is expected, since adding more noise increases the dispersion; in practice, however, scramble variables with little variance cannot be used, as this reduces the privacy protection obtained.

To compare the regression-based and ratio-based RR models, we conducted a third simulation study in which both models are included. The sampling design is simple random sampling from a population of size $N = 10{,}000$. Three sample sizes were drawn from the population, $n = 250, 500, 750$. As in the previous study, point estimators of the regression coefficient were computed using the Eichhorn and Hayre (EH) and the Bar-Lev, Bobovitch and Boukai (BBB) models. An $F_{20,20}$ distribution was used for the scramble variable $S$, and $p = 0.5$ was assumed in the BBB model. The main variable $y$ and an auxiliary variable $x$ were generated using the model $y_i = \beta x_i + \epsilon_i$ with $V(\epsilon_i) = \sigma^2 x_i$; in this case $x \sim N(30, 2)$, $\sigma = 0.5$, $\beta = 7$ and $\epsilon_i \sim N(0, \sigma^2 x_i)$.

For all randomization methods and for both models, regression and ratio, the relative bias and the relative mean squared error are small (Table 3). Focusing on the RMSE, we observe that the value decreases as the sample size increases, as expected, and that the ratio model behaves slightly better than the regression model.

5. Real Application

As a real application of the methods described above, we conducted a survey by stratified random sampling at the University of *** to investigate the consumption of alcohol and drugs among the university population (in a sample of 754 students).

The sensitive question in this case was “Indicate the age at which you started drinking alcohol and using drugs”, and the RR technique used was the model proposed by [11]. To apply this model, each student was asked to use the app “Baraja Española” as a randomizing device (a deck of 40 cards, divided into four families or suits, each with cards numbered one to seven plus three face cards). When the user touches the screen, a card is shown. When it is a face card, the sensitive question should be answered truthfully; otherwise, the real number should be given multiplied by the number shown on the card. Thus, the design parameter of the Bar-Lev model was 3/10.
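The design parameter and the scale factor of the transformation follow from the composition of the deck. The sketch below reconstructs them with exact fractions. Treating the device as the Section 2 model with $p_1 = 3/10$, $p_2 = 7/10$, $S_2 \equiv 0$ and $p_3 = 0$ is our reading of the description, and the function name is hypothetical.

```python
from fractions import Fraction

SUITS = 4
NUMBER_VALUES = range(1, 8)      # cards numbered one to seven in each suit
FACE_CARDS_PER_SUIT = 3

# Build the 40-card deck: None marks a face card, an integer a number card.
deck = [None] * (SUITS * FACE_CARDS_PER_SUIT) + [
    value for value in NUMBER_VALUES for _ in range(SUITS)
]

p = Fraction(deck.count(None), len(deck))        # probability of a truthful answer
numbers = [value for value in deck if value is not None]
m1 = Fraction(sum(numbers), len(numbers))        # mean of the multiplier
scale = p + (1 - p) * m1                         # E_R(z_i) = scale * y_i

def transform(z):
    """Map a reported value z_i to r_i = z_i / (p + (1 - p) * m1)."""
    return z / float(scale)

print(p, m1, scale)
```

Twelve of the forty cards are face cards, which gives the stated design parameter of 3/10, and the number cards have mean 4, so each reported value is divided by 3.1 before the regression is fitted.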

After the study data were compiled, a regression model was fitted in which the sensitive variable was taken as the dependent variable and the variable “Indicate on a scale of 0 (very bad) to 10 (optimal), how would you rate your relationship with your parents?” was the independent variable. After obtaining the estimate of the parameter, its variance was estimated by the jackknife technique and the corresponding 95% confidence interval was computed. This approach produced the following results:

$$\hat{\beta} = 2.392682, \quad \hat{v}_J(\hat{\beta}) = 9.45795 \times 10^{-6} \quad \text{and} \quad CI = [2.387;\ 2.399].$$
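The reported interval can be checked by hand. The check below assumes a normal-based interval with the quantile 1.96, which is an assumption on our part; the text only refers to the corresponding 95% confidence interval.

```latex
\begin{align*}
\sqrt{\hat{v}_J(\hat{\beta})} &= \sqrt{9.45795 \times 10^{-6}} \approx 0.003075, \\
\hat{\beta} \pm 1.96\,\sqrt{\hat{v}_J(\hat{\beta})} &\approx 2.392682 \pm 0.006028
  \approx [2.3867,\ 2.3987].
\end{align*}
```

After rounding to three decimals this agrees with the interval reported above.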

In other words, the better the relationship with their parents, the higher the age at which these students began to consume alcohol and drugs.

6. Conclusions

Indirect interview techniques effectively reduce voluntary bias in surveys referring to sensitive questions. In recent years, many new techniques have emerged for the estimation of proportions, means or totals of sensitive variables, but few studies have addressed the question of dependency parameters. In this paper, we propose a general scheme for a randomized response (RR) technique, under a general sampling design, for estimating regression coefficients. We study the theoretical properties of the proposed estimators and derive several estimators for their variances. To assess the accuracy of the proposed estimators, a simulation study was conducted using two RR techniques. In this simulation study, the proposed estimators obtained good results in terms of relative bias and relative mean squared error. The application of the proposed technique to a real survey enabled us to relate the age at which young people begin to consume alcohol and drugs with the perceived quality of the relationship with their parents.

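Table 1. Absolute relative bias (|RB|) and relative mean squared error (RMSE), in percent, of $\hat{\beta}_{W1}$ and $\hat{\beta}_{W2}$ for the BBB and EH methods.

	BBB Method				EH Method
	$\hat{\beta}_{W1}$		$\hat{\beta}_{W2}$		$\hat{\beta}_{W1}$		$\hat{\beta}_{W2}$
n	|RB|	RMSE	|RB|	RMSE	|RB|	RMSE	|RB|	RMSE
250	4.374	9.152	1.51	1.44	7.83	14.73	2.89	2.25
500	2.99	4.13	0.56	0.07	6.06	7.07	1.89	1.08
750	1.46	2.2	0.07	0.86	1.56	3.27	1.22	0.89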

Table 2. Average length (L) and coverage (Cov) of the 95% confidence intervals, and absolute relative bias (|RB|) and relative mean squared error (RMSE), in percent, for the asymptotic variance, jackknife and bootstrap variance estimators.

	Asymptotic Variance				Jackknife				Bootstrap
	$\hat{\beta}_{W1}$		$\hat{\beta}_{W2}$		$\hat{\beta}_{W1}$		$\hat{\beta}_{W2}$		$\hat{\beta}_{W1}$		$\hat{\beta}_{W2}$
n	L	Cov	L	Cov	L	Cov	L	Cov	L	Cov	L	Cov
	BBB method
250	0.161	0.967	0.085	0.952	0.122	0.936	0.066	0.931	0.129	0.954	0.070	0.940
500	0.116	0.969	0.060	0.965	0.085	0.926	0.045	0.924	0.095	0.950	0.051	0.953
750	0.082	0.982	0.043	0.971	0.058	0.911	0.031	0.905	0.070	0.960	0.038	0.966
	EH method
250	0.189	0.952	0.101	0.956	0.153	0.922	0.083	0.930	0.163	0.933	0.089	0.939
500	0.133	0.957	0.069	0.954	0.107	0.931	0.057	0.930	0.120	0.958	0.064	0.960
750	0.092	0.976	0.049	0.958	0.072	0.912	0.039	0.920	0.087	0.964	0.047	0.964
n	|RB|	RMSE	|RB|	RMSE	|RB|	RMSE	|RB|	RMSE	|RB|	RMSE	|RB|	RMSE
	BBB method
250	0.667	1.023	0.616	1.017	0.076	0.082	0.062	0.093	0.039	0.099	0.061	0.118
500	0.616	0.619	0.530	0.546	0.143	0.077	0.139	0.074	0.081	0.094	0.091	0.095
750	0.562	0.450	0.484	0.382	0.228	0.070	0.231	0.071	0.126	0.075	0.130	0.071
	EH method
250	0.391	0.489	0.397	0.534	0.109	0.043	0.071	0.044	0.009	0.048	0.057	0.061
500	0.353	0.251	0.303	0.238	0.129	0.042	0.119	0.039	0.094	0.052	0.109	0.053
750	0.263	0.145	0.244	0.149	0.233	0.040	0.222	0.032	0.121	0.046	0.141	0.050

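Table 3. Absolute relative bias (|RB|) and relative mean squared error (RMSE), in percent, of the ratio estimator $\hat{\beta}_R$ and the regression estimator $\hat{\beta}_W$ for the BBB and EH methods.

	BBB Method				EH Method
	$\hat{\beta}_R$		$\hat{\beta}_W$		$\hat{\beta}_R$		$\hat{\beta}_W$
n	|RB|	RMSE	|RB|	RMSE	|RB|	RMSE	|RB|	RMSE
250	0.042	0.090	0.083	0.092	0.083	0.050	0.085	0.051
500	0.128	0.047	0.158	0.048	0.132	0.026	0.129	0.027
750	0.168	0.029	0.201	0.030	0.119	0.016	0.116	0.017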

Author Contributions

Conceptualization, M.d.M.R.; Data curation, B.C.; Formal analysis, A.A.; Funding acquisition, M.d.M.R.; Investigation, M.d.M.R.; Methodology, A.A.; Software, B.C.; Writing-original draft, M.d.M.R.; Writing-review & editing, B.C.. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by Ministerio de Ciencia e Innovación of Spain [grant PID2019-106861RB-I00].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

© 2021. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Randomized response (RR) techniques are widely used in research involving sensitive variables, such as drugs, violence or crime, especially when a population mean or prevalence must be estimated. However, they are not generally applied to examine relationships between a sensitive variable and other characteristics. This type of technique was initially applied to qualitative variables, and studies later showed that a logistic regression may be performed with RR data. Since many of the variables considered in this context are quantitative, RR techniques were extended to these cases to estimate the values required. Regression analysis is a valuable statistical tool for exploring relationships among variables and for establishing associations between responses and covariates. In this article, we propose a design-based regression analysis for complex sample designs based on the unified RR approach. We present estimators of the regression coefficients, study their theoretical properties and consider different ways to estimate their variance. The properties of these estimation techniques were simulated using various quantitative randomized models. The method proposed was also used to analyse the findings from a real-world survey.

Details

Title: Regression Models in Complex Survey Sampling for Sensitive Quantitative Variables
First page: 609
Publication year: 2021
Publication date: 2021
Publisher: MDPI AG
e-ISSN: 22277390
Source type: Scholarly Journal
Language of publication: English
DOI: https://doi.org/10.3390/math9060609
ProQuest document ID: 2501907942
