1. Introduction
Economic measures based on income levels of the residents of a specific region play an important role in social, economic and socio-economic sciences. They are used to quantify both the actual balance of the economy as well as the wealthiness and poverty of the people. One of the most prominent candidates is the (normalized) Gini index,
$$G_F = G_F(X) = \frac{2}{\mu}\int_0^\infty x\,F(x)\,dF(x) - 1, \qquad \mu = E(X), \tag{1}$$
which quantifies the economic inequality of a region, state, country or the world. Here, the random variable $X$ denotes the income level, $F(x)$ its cumulative distribution function, and $\mu = E(X)$ its expected value. If $G_F = 0$, the economic system has maximal equality (e.g., everyone has the same income), while $G_F = 1$ represents perfect inequality (e.g., one individual has everything while the rest have nothing). For example, according to the Organization for Economic Cooperation and Development (2017), the Gini indices of the USA, Germany and South Africa were $G_F = 0.39$, $0.29$ and $0.62$ in 2017, respectively. These values suggest income inequality in all three countries and allow a comparison across regions: roughly speaking, income levels were more balanced (equal) in Germany than in the USA, and more balanced in the USA than in South Africa. Thus, the Gini index serves as an important measure in economics and in the social and political sciences. The estimation of the Gini index $G_F$ of a country or a region is, however, a rather challenging task, because income is usually measured at the household level and thus in a clustered and stratified way. In most countries (e.g., United States, European Union, India and others), complex household surveys are conducted annually, and their data can be used to estimate the Gini index given in (1); see Bhattacharya (2005, 2007).
Reporting only a point estimate of $G_F$, as most available resources do, is rather unsatisfactory, because a point estimate conveys neither the variability in the sample nor the influence of the sample and cluster sizes. Computing $100(1-\alpha)\%$ confidence intervals for $G_F$ in addition to point estimates is therefore much more informative for both descriptive and comparative conclusions. Binder and Kovacevic (1995) and Bhattacharya (2007) proposed point estimators of the Gini index, as well as estimators of their standard errors, in such complex survey designs (see Section 2 for details); these can be used to compute $100(1-\alpha)\%$ confidence intervals for $G_F$. Furthermore, Peng (2011) proposed an empirical likelihood-based approach to construct such confidence intervals (as well as a confidence interval for the difference of two Gini indices). Clearly, for a given confidence level, a narrower confidence interval is more informative about the parameter of interest. It is therefore the aim of the present article to develop confidence intervals for the Gini index $G_F$ in complex survey designs that control both the nominal confidence level $(1-\alpha)$ and the confidence interval width. To guarantee that both criteria are fulfilled, the optimal number of clusters will be computed using a 'learn-as-you-go', or sequential, procedure. We refer the readers to Ghosh and Sen (1991); Ghosh et al. (1997); Chattopadhyay and Kelley (2017); Kelley et al. (2018) and others for more on the sequential analysis literature.
The first known application of sequential analysis in surveys is due to Mahalanobis (1940), who described the design and implementation of the method (in a different context) for estimating the acreage of the jute crop in the whole state of Bengal in undivided India. This was even before the seminal works of Stein (1945, 1949) on sequential analysis. Kanninen (1993); Greene (1998); Arcidiacono and Jones (2003); Aguirregabiria and Mira (2007) and many others contributed to the application of sequential analysis in economics, data analysis, medicine, and other areas. Recently, Chattopadhyay and De (2016) and De and Chattopadhyay (2017) developed a sequential procedure for inference problems related to the Gini index under independent and identically distributed (i.i.d.) conditions, but that methodology cannot be used to find a sufficiently narrow $100(1-\alpha)\%$ confidence interval for the population Gini index under a complex household survey design. We propose a two-stage procedure and a purely sequential procedure to estimate the minimum number of clusters required to obtain a sufficiently narrow confidence interval in a distribution-free scenario. Both procedures are applied to the 64th round of household survey data collected in India. Further, a simulation study is carried out on observations taken from the Indian household survey data and from known income distributions to explore the properties of the procedures.
The remainder of this paper is organized as follows: Section 2 describes the sampling framework of the complex survey design that is considered in this work. In Section 3, we formulate the problem of finding a sufficiently narrow confidence interval for the Gini index and the reason for non-applicability of a procedure with fixed cluster size. In Section 4, we develop the purely sequential, as well as the two-stage, procedure followed by a discussion on the characteristics of our procedure in Section 5. Furthermore, an application of both of our procedures to real and synthetic data sets can be found in Section 6, while Section 7 describes an extension of the problem to the multivariate setup. We discuss the advantage and drawbacks of the proposed procedures in Section 8, and provide concluding comments in Section 9.
2. Survey Design and Point Estimation
In this section, the complex household survey design is described along with the notation used. Assume that the population is divided into $S$ strata, indexed by $s = 1, 2, \ldots, S$, and that the $s$th stratum is divided into $H_s$ clusters, indexed by $c_s = 1, \ldots, H_s$. The $c_s$th cluster in stratum $s$ contains $M_{sc_s}$ households, and household $h$ ($h = 1, 2, \ldots, M_{sc_s}$) has $\nu_{sc_sh}$ individuals or members. Therefore, the total number of clusters in the population is $H = \sum_{s=1}^{S} H_s$. The number of households in stratum $s$ is $M_s = \sum_{c_s=1}^{H_s} M_{sc_s}$, and the total number of households in the population is $M = \sum_{s=1}^{S} M_s = \sum_{s=1}^{S}\sum_{c_s=1}^{H_s} M_{sc_s}$.
For estimation purposes in such complex survey designs, a sample of $n_s$ clusters is selected from the $s$th stratum by simple random sampling with replacement. A simple random sample of $k$ households is then drawn (without replacement) from each of the selected clusters. Let the total number of clusters selected from the population be denoted by
$$n = \sum_{s=1}^{S} n_s \quad\text{with}\quad n_s = a_s n \quad\text{and}\quad a_s = \frac{H_s}{H}. \tag{2}$$
Thus, the total number of households in the sample is $kn = k\sum_{s=1}^{S} n_s$. For the $h$th household in the $c_s$th cluster from the $s$th stratum, the observed data (that is, the household monthly income, monthly expenditure, per capita income or others) are denoted by $x_{sc_sh}$. In the presence of stratification and clustering, the households are assigned different weights $W_{sc_sh}$, because the probability of inclusion in the sample varies. The weight assigned to a selected household is the inverse of the probability that the household is included in the sample (see Binder and Kovacevic 1995; Horvitz and Thompson 1952; Lee and Forthofer 2006). If researchers wish to increase (or decrease) the representation of a subgroup of the population that is of interest, they can employ oversampling (or undersampling) procedures and use appropriate weighting techniques. Wells (1998) discussed several weighting methods for such cases. For our survey framework, weights are assigned to the data ($x_{sc_sh}$) with respect to the number of observations in the population. The weight attached to all the $\nu_{sc_sh}$ members of the $h$th household belonging to the $c_s$th sampled cluster from the $s$th stratum, as given by Bhattacharya (2007), is
$$W_{sc_sh} = \frac{M_{sc_s}\,H_s}{k\,n_s}\,\nu_{sc_sh}.$$
It should be noted that the computation of the sampling weights will change depending on the sampling design and also on whether the analysis is being done at the district-, household- or individual level (Bhattacharya 2007). If the cluster size is large, sampling with or without replacement will result in similar values for the weights. Moreover, Bhattacharya (2005, 2007) noted that using sampling with or without replacement does not affect the asymptotic results of this work as in most practical situations the number of clusters per stratum are usually large.
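The weighting step can be illustrated with a minimal Python sketch. It assumes the weight has the form $(M_{sc_s} H_s)/(k\,n_s)$ multiplied by the household size $\nu_{sc_sh}$, that is, the inverse of the two-stage inclusion probability scaled by the number of members. The function names and the numbers in the example are hypothetical and serve only as an illustration.

```python
def household_weight(M_sc, H_s, k, n_s, nu_sch):
    """Sampling weight for all members of one sampled household.

    M_sc   : number of households in the sampled cluster
    H_s    : number of clusters in the stratum
    k      : households sampled per cluster
    n_s    : clusters sampled from the stratum
    nu_sch : number of members in the household
    Assumed form: (M_sc * H_s) / (k * n_s) * nu_sch.
    """
    return (M_sc * H_s) / (k * n_s) * nu_sch


def normalize(weights):
    """Divide each weight by the total so that the weights sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: a cluster of 120 households in a stratum of 50 clusters,
# with 4 households sampled per cluster and 10 clusters sampled per stratum.
raw = [household_weight(120, 50, 4, 10, nu) for nu in (5, 3, 2, 6)]
print(raw)             # [750.0, 450.0, 300.0, 900.0]
print(normalize(raw))  # normalized weights, summing to one
```

Larger households receive proportionally larger weights, which is how the relative household sizes enter the estimators.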
Under the above framework, let
$$W = \sum_{s=1}^{S}\sum_{c_s=1}^{n_s}\sum_{h=1}^{k} W_{sc_sh}$$
denote the total of per-household weights associated with the survey and define
$$w_{sc_sh} = W^{-1}\,W_{sc_sh}$$
as the normalized weights, which will be used to estimate the average income $\mu = E(X)$ and the cumulative distribution function $F(x)$ by
$$\hat{\mu} = \sum_{s=1}^{S}\sum_{c_s=1}^{n_s}\sum_{h=1}^{k} w_{sc_sh}\,x_{sc_sh}$$
and
$$\hat{F}(x) = \sum_{i=1}^{S}\sum_{j=1}^{n_i}\sum_{l=1}^{k} w_{ijl}\,\mathbf{1}(x_{ijl} \le x),$$
in order to take the relative household sizes into account. Then, under fairly mild conditions on the numbers of clusters, a consistent estimator of the Gini index $G_F$ given in (1) is
$$\hat{G}_n = 1 - \frac{2}{\hat{\mu}}\sum_{s=1}^{S}\sum_{c_s=1}^{n_s}\sum_{h=1}^{k} w_{sc_sh}\,x_{sc_sh}\left(1 - \hat{F}(x_{sc_sh})\right). \tag{3}$$
Thus, the estimated Gini index is essentially a ratio of two weighted averages of the income levels (see Bhattacharya 2007). In the next section, we discuss the construction of confidence intervals with bounded width.
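The plug-in estimator described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `weighted_gini` is hypothetical, the weights are normalized inside the function, and the empirical distribution function is evaluated with "less than or equal" as in the definition of $\hat{F}$.

```python
def weighted_gini(x, w):
    """Plug-in Gini estimate 1 - (2 / mu) * sum_i w_i * x_i * (1 - F(x_i)).

    x : observed incomes (or expenditures)
    w : raw sampling weights; they are normalized to sum to one here
    """
    total = sum(w)
    w = [wi / total for wi in w]
    mu = sum(wi * xi for wi, xi in zip(w, x))
    # Weighted empirical distribution function evaluated at each observation.
    F = [sum(wj for wj, xj in zip(w, x) if xj <= xi) for xi in x]
    return 1 - (2 / mu) * sum(wi * xi * (1 - Fi) for wi, xi, Fi in zip(w, x, F))


print(weighted_gini([1, 2, 3, 4], [1, 1, 1, 1]))  # 0.5
```

With equal weights and $n$ distinct values, this plug-in form exceeds the classical sample Gini coefficient by exactly $1/n$ (here 0.5 instead of 0.25), an offset that vanishes as the sample grows.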
3. Bounded Width Confidence Intervals
In order to derive bounded width confidence intervals, the (asymptotic) distribution of the empirical Gini index $\hat{G}_n$ is needed. Bhattacharya (2007) showed that if the moment condition E|X||s<∞ holds and if $n_s \to \infty$ for each stratum $s = 1, \ldots, S$ at the same rate, then
$$\sqrt{n}\left(\hat{G}_n - G_F\right) \xrightarrow{\;D\;} N(0, \xi^2). \tag{4}$$
Here, $\xi^2$ denotes the (asymptotic) variance of $\sqrt{n}\,\hat{G}_n$. Because its representation is quite involved, we refer to Bhattacharya (2007) for the specific variance formula. The asymptotic distribution can now be used to compute $100(1-\alpha)\%$ confidence intervals for the population Gini index whose width does not exceed a pre-specified value $\omega$, that is,
$$\Pr\left(\hat{G}_n - z_{\alpha/2}\frac{\xi}{\sqrt{n}} < G_F < \hat{G}_n + z_{\alpha/2}\frac{\xi}{\sqrt{n}}\right) \ge 1 - \alpha,$$
and
$$L = 2 z_{\alpha/2}\frac{\xi}{\sqrt{n}} \le \omega.$$
Here, $z_{\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution $N(0,1)$. Thus, the actual task is to compute the $n$ that guarantees that the width of the confidence interval is bounded by $\omega$, i.e.,
$$\frac{\omega\sqrt{n}}{2\xi} \ge z_{\alpha/2} \;\Rightarrow\; n \ge \frac{4 z_{\alpha/2}^2\,\xi^2}{\omega^2} = C.$$
Hence, $C$ denotes the optimal total number of clusters from all strata needed such that $L \le \omega$. The optimal number of clusters to be sampled from the $s$th stratum ($s = 1, 2, \ldots, S$) is therefore $C_s = C a_s$. Here, the term optimal is used in the sense of the minimum number of clusters that meets the requirements, not in the sense of optimal allocation used in sample survey methods (see Cochran 1997). If $C$ is known, one can find the sufficiently narrow confidence interval
$$\left(\hat{G}_C - z_{\alpha/2}\frac{\xi}{\sqrt{C}},\; \hat{G}_C + z_{\alpha/2}\frac{\xi}{\sqrt{C}}\right),$$
that satisfies (5). However, without knowing the underlying distribution of income (or assets or expenditure), the value of $\xi^2$ is unknown in practical scenarios, and so is the optimal total cluster size $C$. A supposed value (or a previous survey estimate) of $\xi^2$ may be used to obtain a value of $C$, but the supposed value may differ from the actual one. Moreover, using previous survey estimates is often not advisable, because they may not apply to the current population: socio-economic conditions may have changed, for instance through a shift in the distribution of income or expenditure that follows a change in economic policies or circumstances. As a result, the value of $C$ obtained in this way may differ widely from the value it would take if $\xi^2$ were known, and it does not guarantee that (5) is satisfied. The asymptotic variance $\xi^2$ of the estimated Gini index must therefore be estimated from the data in an appropriate way. Consistent estimators are discussed below.
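To make the dependence of $C$ on $\xi^2$ concrete, the following Python sketch evaluates $C = 4 z_{\alpha/2}^2 \xi^2/\omega^2$ and the proportional allocation $C_s = C a_s$. The function names and all numerical inputs (the value of $\xi^2$ and the stratum sizes) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def optimal_clusters(xi2, omega, alpha):
    """C = 4 * z_{alpha/2}^2 * xi^2 / omega^2 for a known asymptotic variance xi2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 4 * z ** 2 * xi2 / omega ** 2


def allocate(C, H_s):
    """Proportional allocation C_s = C * a_s with a_s = H_s / H."""
    H = sum(H_s)
    return [C * h / H for h in H_s]


C = optimal_clusters(xi2=0.04, omega=0.02, alpha=0.05)
print(round(C, 2))                    # about 1536.58
print(allocate(C, [300, 200, 100]))   # per-stratum requirements

# A supposed variance that is twice the true value doubles the requirement.
print(optimal_clusters(0.08, 0.02, 0.05) / C)
```

Because $C$ is linear in $\xi^2$, any relative error in a supposed variance carries over one-to-one to the number of clusters, which is why a misspecified $\xi^2$ can either waste resources or fail to deliver the target width.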
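Estimation of $\xi^2$
Several articles published in statistics and economics journals have proposed different estimators of the asymptotic variance parameter of the estimator of the Gini index under different sampling schemes. Zitikis and Gastwirth (2002) proposed explicit formulas for the asymptotic variance of a general class of the Gini index (i.e., the S-Gini index) for simple random sampling with observations coming from the Exponential and Pareto distributions. We refer to Langel and Tillé (2013) for a discussion on several techniques used in estimating the asymptotic variance of the Gini index for various sampling designs. Under the current framework, Binder and Kovacevic (1995) proposed an estimator of $\xi^2$ using the empirical variance
$$V_{n,1}^2 = \sum_{s=1}^{S}\frac{n_s}{n_s - 1}\sum_{c_s=1}^{n_s}\left(u_{sc_s} - \bar{u}_s\right)^2 \tag{6}$$
of the values
$$u_{sc_s} = \frac{2}{\hat{\mu}}\sum_{h=1}^{k} w_{sc_sh}\left[A(x_{sc_sh})\,x_{sc_sh} + B(x_{sc_sh}) - \frac{\hat{\mu}}{2}\left(\hat{G}_n + 1\right)\right].$$
Here, $\bar{u}_s = n_s^{-1}\sum_{c_s=1}^{n_s} u_{sc_s}$ denotes the empirical mean of the $u_{sc_s}$ in stratum $s$, and
$$A(x_{sc_sh}) = \hat{F}(x_{sc_sh}) - \frac{\hat{G}_n + 1}{2}, \quad\text{and}\quad B(x_{sc_sh}) = \sum_{a=1}^{S}\sum_{b=1}^{n_a}\sum_{c=1}^{k} w_{abc}\,x_{abc}\,\mathbf{1}(x_{abc} \ge x_{sc_sh}),$$
are weighted placements and averages of the income values obtained from $n$ clusters, respectively. Bhattacharya (2007) proposed an alternative estimator of $\xi^2$, which is given by
$$V_{n,2}^2 = \sum_{s=1}^{S}\sum_{c_s=1}^{n_s}\sum_{h=1}^{k} w_{sc_sh}^2\,\hat{\psi}_{sc_sh}^2 + \sum_{s=1}^{S}\sum_{c_s=1}^{n_s}\sum_{h=1}^{k}\sum_{h' \ne h} w_{sc_sh}\,\hat{\psi}_{sc_sh}\,w_{sc_sh'}\,\hat{\psi}_{sc_sh'} - \sum_{s=1}^{S}\frac{1}{n_s}\left(\sum_{c_s=1}^{n_s}\sum_{h=1}^{k} w_{sc_sh}\,\hat{\psi}_{sc_sh}\right)^2, \tag{7}$$
where
$$\hat{\psi}_{sc_sh} = -\frac{2}{\hat{\mu}}\sum_{g=1}^{kn} w_g\left[x_{sc_sh}\,\mathbf{1}(x_{sc_sh} \le x_{(g)}) + x_{(g)}\left(\hat{F}(x_{(g)}) - \mathbf{1}(x_{sc_sh} \le x_{(g)})\right)\right] + \frac{2}{\hat{\mu}^2}\sum_{g=1}^{kn}\sum_{a=1}^{S}\sum_{b=1}^{n_a}\sum_{c=1}^{k} w_{abc}\,x_{abc}\,\mathbf{1}(x_{abc} \le x_{(g)})\,x_{sc_sh},$$
$kn = k\sum_{s=1}^{S} n_s$ is the total number of observations, and $x_{(g)}$ is the $g$th ordered observation (among all $x_{sc_sh}$).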
However, Hoque and Clarke (2015) showed that the estimators in (6) and (7) are numerically the same, i.e., $V_{n,1}^2 = V_{n,2}^2$. We therefore choose $V_{n,1}^2$ as a consistent estimator of $\xi^2$ and drop the second subscript (i.e., we use $V_n^2$ as the estimator of $\xi^2$). With a consistent estimator of the (asymptotic) variance $\xi^2$ at hand, the optimal number of clusters $C$ defined in (5), which leads to the bounded width confidence interval, can now be estimated from the data. Different sequential methodologies for doing so are discussed in the next section.
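The linearization estimator attributed above to Binder and Kovacevic (1995) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: the record layout `(stratum, cluster, x, weight)` and the function name are hypothetical, and the scaling follows the empirical-variance formula literally as written (whether an additional factor $n$ is required to match $\xi^2$ should be checked against the original references).

```python
from collections import defaultdict


def gini_and_variance(records):
    """Return (G_hat, V2) from records of the form (stratum, cluster, x, raw_weight)."""
    total = sum(w for _, _, _, w in records)
    data = [(s, c, x, w / total) for s, c, x, w in records]  # normalized weights
    mu = sum(w * x for _, _, x, w in data)

    def F(v):  # weighted empirical distribution function
        return sum(w for _, _, x, w in data if x <= v)

    def B(v):  # weighted sum of the incomes that are at least v
        return sum(w * x for _, _, x, w in data if x >= v)

    G = 1 - (2 / mu) * sum(w * x * (1 - F(x)) for _, _, x, w in data)

    # Linearized variable, accumulated per (stratum, cluster).
    u = defaultdict(float)
    for s, c, x, w in data:
        A = F(x) - (G + 1) / 2
        u[(s, c)] += (2 / mu) * w * (A * x + B(x) - (mu / 2) * (G + 1))

    by_stratum = defaultdict(list)
    for (s, _), value in u.items():
        by_stratum[s].append(value)

    V2 = 0.0
    for values in by_stratum.values():
        n_s = len(values)
        if n_s < 2:
            raise ValueError("at least two clusters per stratum are required")
        u_bar = sum(values) / n_s
        V2 += n_s / (n_s - 1) * sum((v - u_bar) ** 2 for v in values)
    return G, V2


# One stratum, two clusters, two households each, equal weights.
toy = [(0, 0, 1.0, 1.0), (0, 0, 2.0, 1.0), (0, 1, 3.0, 1.0), (0, 1, 10.0, 1.0)]
print(gini_and_variance(toy))
```

The requirement of at least two clusters per stratum is visible in the factor $n_s/(n_s-1)$, and two clusters with identical contents contribute zero between-cluster variance.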
4. Sequential Methodology
In this section, different sequential methodologies, including two-stage and purely sequential approaches, will be discussed to find the sufficiently narrow confidence interval. First, the purely sequential method will be introduced.
4.1. Purely Sequential Procedure
The purely sequential confidence interval computation is based on consecutive sampling until a stopping rule is met, which ensures that the width of the confidence interval is smaller than or equal to the given bound. The sampling process begins with a pilot sample, the size of which is specified in Section 4.3. Recall that computing a bounded width confidence interval requires at least $C_s$ clusters from the $s$th stratum ($s = 1, 2, \ldots, S$). Therefore, choose a pilot cluster size of $t_s$ from each stratum $s$, which results in a total of $t = \sum_{s=1}^{S} t_s$ clusters in the pilot stage. Within each selected cluster, $k$ households are selected at random (without replacement). Next, collect the pilot observations $x_{s11}, \ldots, x_{s1k}, \ldots, x_{st_s1}, \ldots, x_{st_sk}$ in each stratum $s = 1, \ldots, S$. The estimator $V_n^2$ of $\xi^2$ is then computed to examine the following stopping rule:
$$N = N_\omega\,(\le H) \text{ is the smallest integer } n\,(\ge t) \text{ such that } n \ge \frac{4 z_{\alpha/2}^2}{\omega^2}\left(V_n^2 + \frac{1}{n}\right) = \hat{C} \text{ and } n_s \ge \hat{C}_s = \hat{C} a_s \text{ for all } s. \tag{8}$$
If the condition in the stopping rule is not satisfied, the surveyor collects data from $m'\,(\ge 1)$ additional clusters, with $k$ randomly chosen households each, from each stratum that has $n_s < \hat{C}_s$. Then $\xi^2$ is estimated based on all the observations collected up to that stage and the stopping condition is checked again. This process is repeated until the condition in the stopping rule is satisfied. Note that $m'\,(\ge 1)$ can be any integer that is appropriate, suitable or feasible for the survey.
The term $1/n$ in (8) is a correction term incorporated to avoid early stopping of the sequential procedure, because $V_n^2$ (the estimator of $\xi^2$) may be very small in the early stages. Without this term, the stopping rule in (8) can be satisfied for very small sample sizes due to sampling error. In general, any null sequence, e.g., $1/n^{\gamma}$, where $\gamma\,(> 0)$ is a fixed number, can be used as a correction term, because it does not affect the consistency of the variance estimator (see Mukhopadhyay and De Silva 2009, p. 260, for more details). The use of a correction term can be seen in several articles, e.g., Chattopadhyay and De (2016), Chattopadhyay and Kelley (2017), and Kelley et al. (2019). The final cluster size $N$ comprises $N_s$ clusters from each stratum $s$, where
$$N_s = N a_s, \quad\text{for } s = 1, 2, \ldots, S.$$
Based on the sampled data $x_{sc_sh}$ and their corresponding normalized weights $w_{sc_sh}$, where $s = 1, \ldots, S$, $c_s = 1, \ldots, N_s$, and $h = 1, \ldots, k$, the $100(1-\alpha)\%$ bounded width confidence interval for the Gini index $G_F$ is given by
$$\left(\hat{G}_N - z_{\alpha/2}\frac{V_N}{\sqrt{N}},\; \hat{G}_N + z_{\alpha/2}\frac{V_N}{\sqrt{N}}\right). \tag{9}$$
The purely sequential procedure may be numerically cumbersome due to the consecutive sampling and repeated computations of the variance estimators. Therefore, a less numerically intensive method—a two-stage procedure—will be examined in the next section.
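The sampling loop of the purely sequential procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `draw_cluster` and `estimate_v2` are hypothetical callbacks standing in for field sampling and for the variance estimator $V_n^2$, `m` plays the role of $m'$, and stopping when every deficient stratum is exhausted is an assumption added here so that the loop terminates on a finite population.

```python
from statistics import NormalDist


def purely_sequential(draw_cluster, estimate_v2, H_s, t_s, omega, alpha, m=1):
    """Sample clusters until n_s >= C_hat * a_s holds in every stratum.

    draw_cluster(s) : hypothetical callback returning the data of one new cluster
    estimate_v2(sample) : hypothetical callback returning V_n^2 for the sample
    H_s, t_s : available and pilot numbers of clusters per stratum
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    H = sum(H_s)
    a = [h / H for h in H_s]
    # Pilot stage: t_s clusters from each stratum.
    sample = [[draw_cluster(s) for _ in range(t)] for s, t in enumerate(t_s)]
    while True:
        n_s = [len(clusters) for clusters in sample]
        n = sum(n_s)
        v2 = estimate_v2(sample)
        c_hat = (2 * z / omega) ** 2 * (v2 + 1 / n)  # includes the 1/n correction
        # Since the a_s sum to one, no deficient stratum implies n >= c_hat.
        short = [s for s in range(len(H_s)) if n_s[s] < c_hat * a[s]]
        if not short:
            return sample, n, v2
        extendable = [s for s in short if n_s[s] < H_s[s]]
        if not extendable:  # assumption: stop when no more clusters are available
            return sample, n, v2
        for s in extendable:
            for _ in range(min(m, H_s[s] - n_s[s])):
                sample[s].append(draw_cluster(s))


# Toy run with a constant variance estimate of 0.01 and two equal strata.
sample, N, v2 = purely_sequential(lambda s: [1.0], lambda smp: 0.01,
                                  H_s=[100, 100], t_s=[2, 2],
                                  omega=0.1, alpha=0.05)
print(N, [len(c) for c in sample])  # 48 [24, 24]
```

In the toy run the variance estimate is held fixed, so the stopping point is driven only by the target width and the correction term; in practice `estimate_v2` would be recomputed from the accumulated data at every step, which is what makes the procedure computationally heavier than a two-stage design.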
4.2. Two-Stage Procedure
Unlike the purely sequential procedure, the two-stage procedure consists of only two stages. The first stage is the pilot stage, in which a pilot sample of $t_s$ clusters (with $\sum_{s=1}^{S} t_s = t$) is selected from each stratum $s$. Based on the pilot sample, $\xi^2$ is estimated as in (6). Then, the total final cluster size from all strata can be estimated as
$$Q = \min\left\{H,\; \max\left\{t,\; \left\lceil\frac{4 z_{\alpha/2}^2}{\omega^2}\,V_t^2\right\rceil\right\}\right\} = \min\{H, Q^*\}, \tag{10}$$
where $Q^*$ is the (unbounded) optimal cluster size and $\lceil\cdot\rceil$ is the ceiling function, that is, $\lceil x\rceil$ is the smallest integer that is greater than or equal to $x$. Thus, the estimated number of clusters to be sampled from the $s$th stratum is given by
$$Q_s = \min\{H_s, [Q a_s]\},$$
with $a_s$ as defined in (2) and $[\cdot]$ being the nearest integer function. In the second stage, observations from $k$ households are collected from each of $Q_s - t_s$ additional clusters in each stratum $s$. Using the combined data from the two stages, the estimator of $\xi^2$ is updated and the approximate $100(1-\alpha)\%$ confidence interval for the Gini index is given by
$$\left(\hat{G}_Q - z_{\alpha/2}\frac{V_Q}{\sqrt{Q}},\; \hat{G}_Q + z_{\alpha/2}\frac{V_Q}{\sqrt{Q}}\right). \tag{11}$$
We note that the final cluster size using either the two-stage procedure or the purely sequential procedure can be shown to be always finite. In addition, the number of clusters per stratum are mutually dependent as they all depend on the same stopping rule. In the next subsection, we derive the pilot cluster size formula.
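The two-stage size computation can be sketched in Python as below. The function name and the inputs are hypothetical; rounding halves upward in the nearest-integer step and flooring the second-stage increment at zero are assumptions made here, since the text does not specify either.

```python
from math import ceil, floor
from statistics import NormalDist


def two_stage_sizes(v_t2, t_s, H_s, omega, alpha):
    """Return (Q_star, Q, Q_s, extra) from the pilot variance estimate v_t2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    H, t = sum(H_s), sum(t_s)
    q_star = max(t, ceil((2 * z / omega) ** 2 * v_t2))  # unbounded optimal size
    q = min(H, q_star)                                   # capped by available clusters
    # Nearest integer of Q * a_s (halves rounded up, an assumption), capped by H_s.
    q_s = [min(h, floor(q * h / H + 0.5)) for h in H_s]
    # Clusters still to be collected in the second stage (never negative).
    extra = [max(0, qs - ts) for qs, ts in zip(q_s, t_s)]
    return q_star, q, q_s, extra


print(two_stage_sizes(0.01, [82, 82, 33], [250, 250, 100], 0.02, 0.05))
# (385, 385, [160, 160, 64], [78, 78, 31])
```

In this example the per-stratum sizes add up to 384 rather than 385, which shows that the number of clusters actually sampled can differ from $Q$ because of rounding.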
4.3. Pilot Cluster Size
Using (8) and proceeding along the lines of Chattopadhyay and De (2016), we have
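$$n \ge \frac{4 z_{\alpha/2}^2}{\omega^2}\left(V_n^2 + \frac{1}{n}\right) \ge \frac{4 z_{\alpha/2}^2}{\omega^2}\cdot\frac{1}{n} \;\Rightarrow\; n \ge \frac{2 z_{\alpha/2}}{\omega}.$$
Thus the total number of sampled clusters is at least $2 z_{\alpha/2}/\omega$. The maximum number of clusters available from the $s$th stratum is $H_s$, and at least 2 clusters per stratum are needed to estimate $\xi^2$. Considering all the constraints in (8), the number of clusters recommended to be sampled from the $s$th stratum at the pilot stage is
$$t_s = \min\left\{H_s,\; \max\left\{2,\; \frac{2 a_s z_{\alpha/2}}{\omega}\right\}\right\}. \tag{13}$$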
We note that this ensures that the minimum cluster size is met as well as the total possible cluster size is not exceeded.
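A short Python sketch of the pilot-size rule is given below. The function name and the stratum sizes are hypothetical, and rounding $2 a_s z_{\alpha/2}/\omega$ up to the next integer is an assumption made here so that $t_s$ is a whole number of clusters.

```python
from math import ceil
from statistics import NormalDist


def pilot_sizes(H_s, omega, alpha):
    """t_s = min(H_s, max(2, 2 * a_s * z / omega)), rounded up (an assumption)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    H = sum(H_s)
    return [min(h, max(2, ceil(2 * (h / H) * z / omega))) for h in H_s]


print(pilot_sizes([250, 250, 100], omega=0.02, alpha=0.05))  # [82, 82, 33]
print(pilot_sizes([3, 597], omega=0.02, alpha=0.05))         # [2, 196]
```

In the first example the pilot sizes sum to 197, just above the lower bound $2 z_{\alpha/2}/\omega \approx 196$; the second example shows the floor of 2 clusters taking effect in a very small stratum.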
5. Characteristics of the Procedures and Simulation Study
The purely sequential procedure and the two-stage procedure for constructing a sufficiently narrow confidence interval for the Gini index, unlike fixed cluster size procedures, require cluster sizes that are obtained from the data. The respective cluster sizes $N$ and $Q$ are therefore random. In the following subsection, we look at the characteristics of these random cluster sizes.
5.1. Characteristics
The following theorem provides some asymptotic properties (as $\omega \to 0$) of the final cluster sizes of the above procedures with sufficiently large $H$.
Theorem 1.
If the parent distribution(s) is (are) such that $E[V_n^2]$ exists and the $H_s$ (fixed) are sufficiently large for all $s = 1, \ldots, S$, then as $\omega \to 0$,
(i) $N/C \to 1$ in probability,
(ii) $Q/C \to 1$ in probability, and
(iii) $2 z_{\alpha/2}\,V_N/\sqrt{N} \le \omega$.
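Proof of Theorem 1.
(i) The definition of the stopping rule $N$ associated with the purely sequential procedure in (8) yields
$$\left(\frac{2 z_{\alpha/2}}{\omega}\right)^2 V_N^2 \le N \le t\,\mathbf{1}(N = t) + \left(\frac{2 z_{\alpha/2}}{\omega}\right)^2\left(V_{N-1}^2 + \frac{1}{N-1}\right). \tag{14}$$
Since $N \to \infty$ as $\omega \downarrow 0$ and $V_n^2 \to \xi^2$ in probability as $n \to \infty$, Theorem 2.1 of Gut (2009) gives $V_N^2 \to \xi^2$ in probability. Furthermore, $t\Pr(N = t)/C \le t/C \to 0$ as $\omega \downarrow 0$. Hence, dividing all sides of (14) by $C$ and letting $\omega \downarrow 0$, we obtain $N/C \to 1$ in probability as $\omega \downarrow 0$.
(ii) The definition of the final cluster size $Q$ related to the two-stage procedure in (10) yields
$$\left(\frac{2 z_{\alpha/2}}{\omega}\right)^2 V_t^2 \le Q \le t\,\mathbf{1}(Q = t) + \left(\frac{2 z_{\alpha/2}}{\omega}\right)^2\left(V_t^2 + \frac{1}{t}\right). \tag{15}$$
Furthermore, $t\Pr(Q = t)/C \le t/C \to 0$ as $\omega \downarrow 0$. Now, $V_t^2 \to \xi^2$ in probability as $\omega \downarrow 0$. Hence, dividing all sides of (15) by $C$ and letting $\omega \downarrow 0$, we obtain $Q/C \to 1$ in probability as $\omega \downarrow 0$.
(iii) Using the stopping rule $N$ in (8) we have, for all $N$,
$$\left(\frac{2 z_{\alpha/2}}{\omega}\right)^2 V_N^2 \le N \;\Rightarrow\; \frac{4 z_{\alpha/2}^2}{N}\,V_N^2 \le \omega^2 \;\Rightarrow\; 2 z_{\alpha/2}\frac{V_N}{\sqrt{N}} \le \omega.$$
$\square$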
Parts (i) and (ii) of the theorem show that the final cluster sizes obtained from the purely sequential and the two-stage procedures are asymptotically equivalent to $C$, the optimal cluster size that would be used if $\xi^2$ were known. Part (iii) shows that the purely sequential procedure yields a sufficiently narrow confidence interval (that is, one of length less than or equal to $\omega$). The same result cannot be proven for the two-stage procedure.
5.2. Simulation Study
We now use a detailed simulation study, presented in the Supplement, to illustrate and compare the properties of the purely sequential and the two-stage procedures in constructing a $100(1-\alpha)\%$ confidence interval for the Gini index under a complex survey whose width is less than $\omega$. We present two different simulation studies with 5000 simulation runs: (a) a simulation using the NSS survey data as the population and (b) a Monte Carlo simulation in which the observations are drawn from three different populations generated from three different distributions, namely the Pareto, Gamma and Lognormal distributions. The two simulation studies were performed in RStudio (RStudio Team 2018, version 1.2.1335), and the code is available upon request.
To begin with, we describe the simulation procedure for the purely sequential methodology. From the given populations, $t_s$ ($s = 1, 2, \ldots, S$) clusters are randomly sampled from the $s$th stratum without replacement. Four households are then selected from each cluster using simple random sampling without replacement, and these households from all $t$ clusters constitute the pilot sample. From the pilot sample, the asymptotic variance of the Gini index, $\xi^2$, is estimated using (6), and the optimal number of clusters $C$ is estimated from (8). The stopping rule is checked and, if it is satisfied, sampling is terminated. If the stopping rule is not satisfied, the strata in which fewer clusters have been selected than required, that is $\{s : t_s < \hat{C}_s\}$, are identified and $m'$ additional clusters are randomly selected from each of them without replacement. Here, $m'$ is chosen to be either 1, 10 or 20. In each of the $m'$ selected clusters, four households are randomly selected without replacement. At this stage, with the total number of sampled clusters being $n$ (say), the value of $V_n^2$ is updated and the stopping rule is checked. If the rule is met, sampling is stopped; otherwise the strata without enough clusters are identified again and $m'$ additional clusters are collected from each of them. This process continues until the stopping rule is met. At that point, based on the $N$ (say) clusters sampled from all strata, the $100(1-\alpha)\%$ confidence interval for the Gini index is constructed as given in (9).
Unlike the purely sequential procedure described above, the two-stage procedure has only two stages. The simulation algorithm for the two-stage procedure is as follows. From a given population, $t_s$ clusters are randomly selected without replacement from the $s$th stratum, and four households are randomly sampled without replacement from each of the selected clusters. The per capita monthly expenditures $x_{sc_sh}$ of the selected households, with their respective weights $W_{sc_sh}$, are used to estimate the asymptotic variance of the Gini index (from (6)). Next, (10) is used to obtain the optimal number of clusters $Q$ needed to achieve the desired confidence level and width. If $Q > t$, an additional $Q_s - t_s$ clusters are randomly selected without replacement from each stratum $s$. In each of the additional clusters, four households are again randomly selected without replacement. Finally, the per capita monthly expenditures of all households from the $Q$ clusters are used to construct the $100(1-\alpha)\%$ confidence interval for the Gini index as stated in (11).
From the simulations, we find that the coverage probabilities of the confidence intervals for both the purely sequential procedure and the two-stage procedure are close to the desired confidence level, provided that the cluster size (in all strata) is large, which is also a basic requirement for the asymptotic normality in (4). However, the two-stage procedure, unlike the purely sequential procedure, may produce confidence intervals that are wider than the pre-specified value $\omega$. For details, see Tables S21–S24 of the supplementary material. This outcome is not surprising, since the two-stage procedure relies only on the pilot sample, which is usually small, so the variability of the variance estimator $V_n^2$ is higher. The optimal cluster sizes obtained by the purely sequential procedure are smaller than those obtained by the two-stage procedure. The newly developed methods are applied to real data in the next section.
6. Gini Index Estimation in India
We now apply the sequential procedures to construct bounded width confidence intervals for the Gini index in India using the per capita monthly expenditures obtained from the 64th Round National Sample Survey (NSS), a stratified multi-stage survey conducted between July 2007 and June 2008. In 2008, the country was divided into 28 states and seven union territories, each of which was subdivided into districts. Within each district, two basic sectors were formed: all rural areas constituted the rural sector, while all urban areas constituted the urban sector. However, for the urban areas in a district, separate basic strata were formed for each town with a population of at least 10 lakhs (1 lakh is 100,000). The remaining areas were grouped as another basic stratum (National Sample Survey Office 2007). For the rural sector, the sampling frame was made up of villages, while for the urban sector it was made up of towns/blocks.
Census villages and the Urban Frame Survey blocks were the first stage units (FSU) in the rural and urban sectors, respectively. From each stratum, FSUs were selected from the rural sector with probability proportional to size with replacement and from the urban sector by simple random sampling without replacement. Within the FSU, the households in each sector were considered as the smallest unit of grouping, also referred to as the ultimate stage units. Households were selected by simple random sampling without replacement, and various information about the households was recorded during the survey, including demographics, household size, expenditure on education, food and clothing, corresponding weights, etc. A detailed description of the NSS data can be found online at National Sample Survey Office (2015).
The “Stratum” variable in the 64th NSS data set is used to stratify the states/sectors, while the “FSUno” (First Stage Unit Number) variable is used to cluster the households under each stratum. We discuss the results obtained from applying the proposed sequential methodologies to the data collected from two of the most populous states in India, namely Uttar Pradesh and West Bengal. We report the results for each whole state as well as for its rural and urban sectors. Here, all the households in each cluster were considered, since we are sampling from a survey that already has only a few households per cluster. However, the weight per household is adjusted at each sampling stage to reflect the actual weight that would be used during a survey.
In applying the sequential methodologies, the pilot cluster sizes $t_s$ for each stratum $s$ are computed using (13). At the outset, $t_s$ clusters are selected from stratum $s$ for $s = 1, \ldots, S$, where $t_s$ is the same for both the purely sequential procedure and the two-stage methodology. We apply each of the procedures considering the survey data as our population.
6.1. Application of Purely Sequential Procedure (PSP)
The proposed purely sequential procedure, with observations from one cluster collected at each stage after the pilot stage, is applied to the NSS 64th round data. The results for different combinations of pre-specified width ($\omega \in \{0.020, 0.025\}$) and confidence level ($1-\alpha$, $\alpha \in \{0.05, 0.10\}$) can be found in Table 1, Table 2, Table 3 and Table 4. The first column of the tables indicates the region to which the procedure was applied. The PSP was applied to the entire data from Uttar Pradesh (denoted as All) and then separately to the rural and urban sectors of Uttar Pradesh (denoted as Rural and Urban, respectively). The same process was repeated for West Bengal. The second column of the tables shows the estimated Gini index ($\hat{G}_H$) and its standard error ($se(\hat{G}_H)$) using all the clusters ($H$) available in the data set for that region (i.e., all of the state, the rural sector of the state, or the urban sector of the state). The third column gives the total number of clusters ($H$) available in the data set for that region. The fourth column shows the value of $\hat{C}$ when the procedure ended, $\hat{C}$ being the estimated optimal cluster size as in (8). The fifth column shows the collected cluster size $N$ using the stopping rule in (8) and the pilot cluster size $t$. The values of $\hat{G}_N$ and $se(\hat{G}_N)$ in the sixth column are the estimated Gini index and its standard error, respectively, based on $N$ clusters. The next two columns are the lower and upper limits of the confidence intervals obtained with the stopping rule in (8). The ninth column, $w_N$, is the estimated width of the confidence interval. The last column, $\Pr(N_s < \hat{C}_s)$, shows the proportion of strata whose collected cluster size $N_s$ from the purely sequential procedure is less than their estimated optimal cluster size $\hat{C}_s$ ($N_s$ is the final number of clusters selected from stratum $s$, while $\hat{C}_s$ is the estimated optimal number of clusters to be sampled from stratum $s$).
In Table 1, Table 2, Table 3 and Table 4, it can be seen that, when the maximum available cluster sizes ($H_s$) per stratum are large, the purely sequential procedure achieves the desired precision, i.e., a narrow confidence interval ($w_N \le \omega$) for the Gini index, with relatively few clusters sampled while maintaining the desired confidence level. This is shown in the results where $N < H$ for all of Uttar Pradesh and West Bengal, as well as for their individual rural sectors. The same cannot be said about their urban sectors, as they do not have enough available clusters from the outset. Thus, the procedure did not reach the optimal cluster size but stopped when there were no more clusters remaining to be sampled.
The results also show that, aside from the fact that the entire urban regions did not have enough clusters ($N = H < C$), each of the strata in those regions also did not have enough clusters (that is, $\Pr(N_s < \hat{C}_s) = 1$) to obtain a narrow confidence interval width. However, in the other regions (i.e., All and Rural for Uttar Pradesh and West Bengal), even though $\hat{C} < N < H$, some strata had $N_s < \hat{C}_s$. This is because some strata have more than enough clusters while others do not, and the two effects offset each other in the end. For example, it can be seen from Table 1 that in the rural sector of Uttar Pradesh, 40% of the strata did not have enough clusters even though, at the end, the confidence interval was 0.0186 wide, which is less than the desired width of 0.02.
Next, the results will be compared with those of the two-stage procedure discussed in Section 4.2.
6.2. Application of Two-Stage Procedure
First, the estimator $V_n^2$ of $\xi^2$ is obtained from the pilot stage and then the final cluster size $Q^*$ is computed. $Q^*$ is then adjusted to account for the limited availability of clusters per stratum in the NSS data to obtain the possible number of clusters $Q$ that can be sampled (see (10)). Here, $Q$ is distributed over the $S$ strata as $Q_s$ for stratum $s$, rounding off where $Q_s$ is not an integer. The sum of the $Q_s$ gives the actual number of clusters, $\tilde{Q}=\sum_{s=1}^{S}Q_s$, that are sampled from all strata. Using $\tilde{Q}$ clusters, the Gini index and $\xi^2$ are re-estimated (or updated) and a $100(1-\alpha)\%$ confidence interval is constructed according to (11).
Similar to the application of the purely sequential procedure, the two-stage procedure is applied to the NSS 64th round data for different combinations of pre-specified precision ($\omega$) and accuracy ($1-\alpha$), with the results shown in Table 5, Table 6, Table 7 and Table 8. The second column of the tables indicates the total number of clusters $H$ in the unit (i.e., the whole state, rural sector, or urban sector) of the NSS data. The third column displays the estimated optimal number of clusters ($Q^*$) required in order to achieve the desired precision and accuracy. Below $Q^*$ is the pilot number of clusters $t$. The next column shows the estimated optimal cluster size $Q$ taking into account the total number of clusters available in the data, because the number of clusters is finite and limited. Furthermore, $\tilde{Q}$ is the actual number of clusters that can be sampled from all strata, considering the fact that we can only sample an integer number of clusters from each stratum (i.e., rounding off where there are decimals in the number of clusters to be sampled from a stratum). Using (3) and (6), the Gini index estimate for the unit, $\hat{G}_H$, is computed using all $H$ clusters, with its standard error $se(\hat{G}_H)$; these are shown in the fifth column. The selected clusters are used to estimate the Gini index, denoted $\hat{G}_{\tilde{Q}}$, with its standard error $se(\hat{G}_{\tilde{Q}})$, in the sixth column. In the seventh and eighth columns, Lower CI and Upper CI are, respectively, the lower and upper limits of the $100(1-\alpha)\%$ confidence interval of the Gini index using $\tilde{Q}$ clusters. The last column shows the length of the confidence interval, $w_{\tilde{Q}}$. It must be noted that $Q^*$ is unbounded while, on the other hand, $Q$ and $\tilde{Q}$ cannot exceed $H$. $\tilde{Q}$ can be less than, equal to, or greater than $Q$ depending on the rounding off. $Q^*$ will be equal to $Q$ if and only if $Q^*$ is less than or equal to $H$.
From Table 5, Table 6, Table 7 and Table 8, it can be observed that in all cases, except the urban sectors of both states, the confidence interval widths were less than $\omega$. These results were achieved because the optimal number of clusters required ($Q^*$), according to the two-stage procedure, was less than the number available ($H$). On the other hand, in both Uttar Pradesh and West Bengal, the estimated optimal cluster sizes $Q^*$ for the urban sector exceeded the available number of clusters $H$ in the data. As a consequence, the confidence interval widths for the Gini index in the urban sectors were larger than the pre-specified bound, that is, $w_{\tilde{Q}}>\omega$.
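The relationship between $Q^*$, $Q$, $Q_s$ and $\tilde{Q}$ can be sketched in Python as follows. The cap $Q=\min(Q^*,H)$ follows from the statements above; the proportional allocation of $Q$ across strata according to the available clusters $H_s$ is only an assumption made for this illustration, since the allocation rule in (10) is not reproduced here. The function name `allocate_clusters` and the stratum sizes are hypothetical.

```python
def allocate_clusters(q_star, available):
    """Cap Q* at the available total and spread the result over strata.

    `available` holds the hypothetical numbers of clusters H_s per stratum.
    Allocation proportional to H_s is an assumption of this sketch.
    Returns (Q, [Q_s], Q_tilde).
    """
    total = sum(available)             # H, the total number of clusters
    q = min(q_star, total)             # Q equals Q* only when Q* <= H
    # Round each stratum share to an integer, never exceeding H_s.
    q_s = [min(h, round(q * h / total)) for h in available]
    return q, q_s, sum(q_s)


# Enough clusters: Q = Q* and rounding happens to preserve the total.
print(allocate_clusters(63, [40, 25, 35]))
# Too few clusters: Q is capped at H and every stratum is exhausted.
print(allocate_clusters(150, [40, 25, 35]))
# Rounding can push the realized total above Q.
print(allocate_clusters(20, [10, 10, 10]))
```

The third call shows why $\tilde{Q}$ may differ from $Q$: each stratum share of $20/3$ rounds up to 7, so 21 clusters would be sampled in total.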
7. Extension: Narrow Confidence Region
The methodology presented in this article for the Gini index can be extended to a multi-parameter setup in which we would like to make inference about a vector of parameters $\theta=(\theta_1,\theta_2,\ldots,\theta_p)^\top$ for $p\ge 2$. This situation arises when we are interested in making joint inference on a number of welfare-related measures computed from socio-economic survey data (e.g., the household consumer expenditure survey conducted by the National Sample Survey, India). Thus, instead of a sufficiently narrow confidence interval, we would like to construct a narrow confidence region for a vector of parameters. Let the vector of estimators be defined as $T_n=(T_{1n},\ldots,T_{pn})^\top$, based on the data on $n$ households collected using a complex household survey. We extend our proposed methodology for constructing the narrow confidence region in the spirit of Mukhopadhyay and De Silva (2009, pp. 284–89). We propose the following confidence region for $\theta$:
$$\Re_n=\left\{\theta\in\mathbb{R}^p:(T_n-\theta)^\top(T_n-\theta)\le\omega^2\right\}.$$
Using the regularity conditions by Bhattacharya (2005), we have,
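$$\sqrt{n}\,(T_n-\theta)\xrightarrow{D}N(0,\Sigma),\quad\text{i.e.,}\quad n\,(T_n-\theta)^\top\Sigma^{-1}(T_n-\theta)\overset{a}{\sim}\chi^2_p,$$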
with $\Sigma$ being a positive definite matrix and $\chi^2_p$ being a chi-squared distribution with $p$ degrees of freedom. Since $\Sigma$ is positive definite, there exist an orthogonal matrix $P$ and a diagonal matrix $\Delta$ such that $P\Sigma P^\top=\Delta$. The diagonal elements of $\Delta$ are the eigenvalues of $\Sigma$: if the (positive) eigenvalues of $\Sigma$ are $\lambda_1,\ldots,\lambda_p$, then $\Delta=\mathrm{diag}(\lambda_1,\ldots,\lambda_p)$. Furthermore, let $(PT_n-P\theta)=(Y_1,\ldots,Y_p)^\top$ and let $\lambda_{(p)}$ be the maximum of the $p$ eigenvalues of $\Sigma$. So, we have
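$$(T_n-\theta)^\top\Sigma^{-1}(T_n-\theta)=(T_n-\theta)^\top P^\top\Delta^{-1}P\,(T_n-\theta)=(PT_n-P\theta)^\top\Delta^{-1}(PT_n-P\theta)=\sum_{i=1}^{p}\frac{Y_i^2}{\lambda_i},$$
$$\lambda_{(p)}\,(T_n-\theta)^\top\Sigma^{-1}(T_n-\theta)\ge\sum_{i=1}^{p}Y_i^2=(PT_n-P\theta)^\top(PT_n-P\theta)=(T_n-\theta)^\top(T_n-\theta).$$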
Thus, using (16), we say,
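$$\Pr(\theta\in\Re_n)=\Pr\left((T_n-\theta)^\top(T_n-\theta)\le\omega^2\right)\ge\Pr\left(\lambda_{(p)}\,(T_n-\theta)^\top\Sigma^{-1}(T_n-\theta)\le\omega^2\right)=\Pr\left((T_n-\theta)^\top\Sigma^{-1}(T_n-\theta)\le\frac{\omega^2}{\lambda_{(p)}}\right).$$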
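The inequality behind this bound, namely that the squared Euclidean distance never exceeds the largest eigenvalue times the $\Sigma^{-1}$-weighted quadratic form, can be checked numerically. The Python sketch below does so for $p=2$ using closed-form expressions for the inverse and the largest eigenvalue of a symmetric $2\times 2$ matrix. The function name `quadratic_forms`, the covariance matrix and the difference vectors are hypothetical values chosen only for illustration.

```python
import math


def quadratic_forms(diff, sigma):
    """Return (diff'diff, lambda_max * diff' Sigma^{-1} diff) for 2x2 Sigma."""
    x, y = diff
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    if abs(s12 - s21) > 1e-12 or s11 <= 0 or det <= 0:
        raise ValueError("sigma must be symmetric positive definite")
    # diff' Sigma^{-1} diff via the closed-form inverse of a 2x2 matrix.
    weighted = (s22 * x * x - 2 * s12 * x * y + s11 * y * y) / det
    # Largest eigenvalue of a symmetric 2x2 matrix.
    lam_max = (s11 + s22) / 2 + math.sqrt(((s11 - s22) / 2) ** 2 + s12 * s12)
    return x * x + y * y, lam_max * weighted


# Hypothetical covariance matrix and estimation errors T_n - theta.
sigma = [[0.04, 0.01], [0.01, 0.09]]
for diff in [(0.03, -0.02), (0.0, 0.05), (-0.01, 0.01)]:
    euclid, scaled = quadratic_forms(diff, sigma)
    print(diff, euclid <= scaled + 1e-12)
```

Equality holds when the difference vector points along the eigenvector of the largest eigenvalue, which is why the resulting confidence region is conservative in every other direction.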
With $\chi^2_{\alpha;p}$ denoting the $100(1-\alpha)$th percentile of $\chi^2_p$, we claim that the coverage probability of the confidence region $\Re_n$ is at least $(1-\alpha)$ if
$$\frac{n\,\omega^2}{\lambda_{(p)}}\ge\chi^2_{\alpha;p},\quad\text{i.e.,}\quad n\ge\frac{\chi^2_{\alpha;p}\,\lambda_{(p)}}{\omega^2}=C.$$
Here, $C$ is the required optimal cluster size that should be used provided the covariance matrix ($\Sigma$) is known. If $\Sigma$ were known in advance, one could simply collect observations from $C_s$ clusters in stratum $s$, $s=1,\ldots,S$, for each of the $S$ strata. Since $\Sigma$ is not known in practice, we can estimate it using a consistent estimator ($V_n$, say), which can be obtained using the jackknife method. The consistency of the jackknife estimator follows from Sen (1988). Thus, using the jackknife estimator, we may propose either a two-stage or a sequential procedure. Results similar to those for the procedures described earlier are expected to hold under appropriate regularity conditions.
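To make the formula for $C$ concrete, here is a small Python sketch for the special case $p=2$, where both ingredients have closed forms: the chi-squared distribution with 2 degrees of freedom has distribution function $1-e^{-x/2}$, so its $100(1-\alpha)$th percentile is $-2\ln\alpha$, and the largest eigenvalue of a symmetric $2\times 2$ matrix is available explicitly. The function names, the covariance entries and the values of $\omega$ and $\alpha$ are hypothetical; rounding $C$ up to the next integer is also a choice made here for illustration.

```python
import math


def largest_eigenvalue_2x2(s11, s12, s22):
    """Largest eigenvalue of the symmetric matrix [[s11, s12], [s12, s22]]."""
    return (s11 + s22) / 2 + math.sqrt(((s11 - s22) / 2) ** 2 + s12 ** 2)


def chi2_percentile_2df(alpha):
    """100(1 - alpha)th percentile of chi-squared with 2 degrees of freedom."""
    return -2.0 * math.log(alpha)


def optimal_cluster_size(s11, s12, s22, omega, alpha):
    """Smallest integer n with n >= chi2_{alpha;2} * lambda_max / omega^2."""
    lam_max = largest_eigenvalue_2x2(s11, s12, s22)
    return math.ceil(chi2_percentile_2df(alpha) * lam_max / omega ** 2)


# Hypothetical known covariance matrix, radius omega and level alpha.
print(optimal_cluster_size(0.04, 0.01, 0.09, omega=0.05, alpha=0.05))
```

For general $p$, or when $\Sigma$ must be replaced by an estimate such as $V_n$, the same computation would require a numerical eigenvalue routine and chi-squared quantile function, and $C$ itself becomes an estimate that changes as data accumulate.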
8. Discussion
At the outset, we would like to caution readers not to confuse two-stage sampling with the two-stage procedure from the sequential sampling literature discussed in Section 4.2. For the two-stage procedure, we refer to Chattopadhyay and Mukhopadhyay (2013), Stein (1945) and others. Two-stage sampling (e.g., see Fuller (2009)) is a sampling technique in which a sample of clusters is selected and, within those selected clusters, a sample of units is selected, assuming the units to be independent of one another and the selection rule to depend only on the cluster. Under two-stage sampling, Fuller (2009) discussed the use of the Horvitz–Thompson estimator to estimate the population total and mean and their respective variances. In addition, Fuller (2009) elaborated on the use of Horvitz–Thompson estimators and their (asymptotic) variances for functions of means and complex estimators in general, under the assumption that the population distribution has a finite fourth moment. However, in the asymptotic framework of Fuller (2009), it was assumed that observations are independently and identically distributed (iid), which is a stronger assumption than that of the framework of Bhattacharya (2007), which is also used in this work. Furthermore, Fuller (2009) discussed the classical optimal sample allocation problem under two-stage sampling for estimating the mean per element in a population. In his discussion, he assumed an equal number of units to be sampled from each cluster, an equal total number of units in each cluster, and known population variances of the cluster size and the sampling units. Under these assumptions, Fuller (2009) obtained the optimal number of units to be sampled per cluster by minimizing the variance of the mean per element subject to a cost constraint.
Our work differs from the survey procedures discussed in Fuller (2009). As indicated earlier, it is based on the survey framework used in Bhattacharya (2007). To obtain a sufficiently narrow confidence interval for the Gini index, we estimate the unknown optimal number of clusters in each stratum, with the number of strata fixed in advance. Apart from the survey framework, in our work the optimal cluster size depends on the data, unlike in the procedures discussed in Fuller (2009). The total cluster size (as well as the cluster size per stratum) is a random variable that depends on a stopping criterion. This also makes the estimated cluster sizes mutually dependent, as they are all estimated using the same stopping rule. Thus, the method discussed in Fuller (2009), or any other existing work, cannot be applied to find such a confidence interval.
We believe this is the first work to develop sufficiently narrow confidence intervals for an economic inequality index based on complex household surveys. We now discuss some issues or limitations of our proposed procedures: (a) the procedures depend on the pre-specified number of households in each cluster; (b) the sequential procedure depends on a pre-specified $m'$; (c) the procedures consider the large cluster size scenario; and (d) the procedures do not consider the sampling cost and/or a fixed budget.
To begin with, the purely sequential procedure requires observations from an additional $m'$ clusters, after the first stage, every time the condition in the stopping rule is not met. Thus, there is a need to fix the value of $m'$. In some situations, it is as easy to collect observations from more than one cluster as it is to collect observations from a single cluster at every stage. So, as per convenience, the value of $m'$ should be decided based on economic considerations. The purely sequential procedure itself is not affected by the choice of $m'$; however, the larger the value of $m'$, the fewer the stages and the higher the chances of overestimating the optimal number of clusters. On the other hand, the smaller the value of $m'$, the greater the number of stages and the higher the chances of stopping accurately at the optimal number of clusters. Thus, when choosing $m'$, there is a trade-off between the number of stages and stopping accurately at the optimal cluster size.
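This trade-off can be illustrated with a deliberately simplified Python sketch in which the optimal number of clusters is treated as known and fixed, whereas in the actual procedure it is re-estimated at every stage. The function name `stages_and_overshoot`, the pilot size and the optimal size are hypothetical.

```python
import math


def stages_and_overshoot(pilot, optimal, batch):
    """Stages after the pilot, final size, and overshoot for batch size m'.

    Simplification: `optimal` is treated as a known constant, so sampling
    stops at the first stage where the collected size reaches it.
    """
    if optimal <= pilot:
        return 0, pilot, pilot - optimal
    stages = math.ceil((optimal - pilot) / batch)
    final = pilot + stages * batch
    return stages, final, final - optimal


# Hypothetical pilot size t = 100 and optimal size C = 237.
for batch in (1, 10, 50):
    stages, final, overshoot = stages_and_overshoot(100, 237, batch)
    print(f"m' = {batch:2d}: stages = {stages}, N = {final}, overshoot = {overshoot}")
```

Under this simplification the overshoot is at most $m'-1$ clusters, while the number of stages shrinks roughly in proportion to $1/m'$.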
Furthermore, our proposed procedures are based on the central limit theorem (when the numbers of clusters per stratum are large). If the number of clusters is small, neither the confidence interval for the Gini index based on Bhattacharya (2005, 2007) (the fixed-cluster size method) nor the narrow confidence interval based on our proposed procedures can be constructed. When the number of available clusters ($H_s$) is small for a few strata, the sequence of sampling distributions of the empirical Gini indices may not reach the asymptotic limiting normal distribution. In a situation where limiting normality cannot be reached, our proposed procedures should not be applied. If one of our proposed procedures is nevertheless applied, one may not achieve the desired confidence interval for the population Gini index because of not having enough clusters in a few strata. This scenario was encountered in the application section of this work, for both the purely sequential and two-stage procedures, when there were not enough available clusters in the urban sectors, which resulted in confidence intervals that were wider than desired.
Lastly, a very important question raised by Bhattacharya (2005) concerned developing a survey design that takes economic factors into account. Both of our proposed procedures can be extended to include cost factors, whereby optimization would be done at several levels for the construction of a narrow confidence interval or confidence region under cost constraints. However, we do not explore that possibility in this article. A related issue is that a budget is usually allocated by a country to its survey agency to carry out the survey. Under such budget constraints, the funding agency is not likely to willingly hand out more money if the stopping rule is not met with the available amount. Without question, the issue of budget constraints is important. Here, we do not discuss the estimation of cluster sizes under a fixed budget. We feel that our current work is a first step towards addressing this important issue in the sense of achieving a sufficiently narrow confidence interval or region, and that it may yield different outcomes under cost constraints. We believe our work will lead to further research on this topic.
9. Conclusions
Working within the asymptotic framework for complex survey data developed by Bhattacharya (2005, 2007), we have developed purely sequential and two-stage procedures for constructing sufficiently narrow confidence intervals for the Gini index, which is one of the most popular measures of economic inequality. Our procedures may be applied to surveys in which stratified clustered sample data are drawn from a large number of clusters per stratum, which is a reasonable assumption to make. Moreover, our procedures may also be applied to special cases of multi-stage survey designs, including cases without stratification (i.e., $S=1$) and those that have independent observations within clusters (intraclass correlation is zero).
Undoubtedly, the two-stage procedure is practically more feasible under this survey design than the purely sequential procedure. The confidence intervals of both procedures yielded a coverage probability close to the desired confidence coefficient; however, the purely sequential procedure produces confidence intervals whose widths are always less than the desired bound $\omega$. The two-stage procedure is also known to overestimate the optimal cluster size as compared to the purely sequential procedure (Mukhopadhyay and De Silva 2009), and this property can be seen in the results from the simulation (in the supplementary material) and the application to the NSS data. Furthermore, the estimated optimal cluster sizes have smaller standard errors under the purely sequential procedure than under the two-stage procedure.
Table 1.
Region | $\hat{G}_H$ ($se(\hat{G}_H)$) | $H$ | $\hat{C}$ | $N$ ($t$) | $\hat{G}_N$ ($se(\hat{G}_N)$) | Lower CI | Upper CI | $w_N$ | $\Pr(N_s<\hat{C}_s)$ |
---|---|---|---|---|---|---|---|---|---|
Uttar Pradesh | | | | | | | | | |
All | 0.2163 (0.0042) | 1262 | 622 | 672 (321) | 0.2116 (0.0057) | 0.2023 | 0.2209 | 0.0186 | 0.2138 |
Rural | 0.1997 (0.0041) | 903 | 505 | 523 (198) | 0.2024 (0.0057) | 0.1931 | 0.2117 | 0.0186 | 0.4 |
Urban | 0.2229 (0.0092) | 359 | 903 | 359 (180) | 0.2229 (0.0092) | 0.2077 | 0.2381 | 0.0304 | 1.0 |
West Bengal | | | | | | | | | |
All | 0.2320 (0.0051) | 878 | 587 | 593 (190) | 0.2334 (0.0058) | 0.2239 | 0.2430 | 0.0191 | 0.1282 |
Rural | 0.1812 (0.0048) | 551 | 450 | 450 (172) | 0.1816 (0.0057) | 0.1723 | 0.1909 | 0.0186 | 0.2353 |
Urban | 0.2609 (0.0077) | 327 | 612 | 327 (185) | 0.2609 (0.0077) | 0.2482 | 0.2736 | 0.0254 | 1.0 |
Table 2.
Region | $\hat{G}_H$ ($se(\hat{G}_H)$) | $H$ | $\hat{C}$ | $N$ ($t$) | $\hat{G}_N$ ($se(\hat{G}_N)$) | Lower CI | Upper CI | $w_N$ | $\Pr(N_s<\hat{C}_s)$ |
---|---|---|---|---|---|---|---|---|---|
Uttar Pradesh | | | | | | | | | |
All | 0.2163 (0.0042) | 1262 | 834 | 878 (333) | 0.2117 (0.0048) | 0.2022 | 0.2212 | 0.0190 | 0.2138 |
Rural | 0.1997 (0.0041) | 903 | 643 | 667 (226) | 0.2024 (0.0048) | 0.1930 | 0.2117 | 0.0187 | 0.4 |
Urban | 0.2229 (0.0092) | 359 | 1282 | 359 (254) | 0.2229 (0.0092) | 0.2048 | 0.2410 | 0.0362 | 1.0 |
West Bengal | | | | | | | | | |
All | 0.2320 (0.0051) | 878 | 906 | 878 (223) | 0.2320 (0.0051) | 0.2221 | 0.2419 | 0.0198 | 1.0 |
Rural | 0.181 (0.0048) | 551 | 552 | 551 (203) | 0.1812 (0.0048) | 0.1719 | 0.1906 | 0.01871 | 1.0 |
Urban | 0.2609 (0.0077) | 327 | 869 | 327 (207) | 0.2609 (0.0077) | 0.2458 | 0.2761 | 0.0303 | 1.0 |
Table 3.
Region | $\hat{G}_H$ ($se(\hat{G}_H)$) | $H$ | $\hat{C}$ | $N$ ($t$) | $\hat{G}_N$ ($se(\hat{G}_N)$) | Lower CI | Upper CI | $w_N$ | $\Pr(N_s<\hat{C}_s)$ |
---|---|---|---|---|---|---|---|---|---|
Uttar Pradesh | | | | | | | | | |
All | 0.2163 (0.0042) | 1262 | 401 | 540 (302) | 0.2138 (0.0063) | 0.2035 | 0.2242 | 0.0207 | 0.0 |
Rural | 0.1997 (0.0041) | 903 | 386 | 400 (168) | 0.2014 (0.0070) | 0.1899 | 0.2130 | 0.0231 | 0.1714 |
Urban | 0.2229 (0.0092) | 359 | 578 | 359 (168) | 0.2229 (0.0092) | 0.2077 | 0.2381 | 0.0304 | 1.0 |
West Bengal | | | | | | | | | |
All | 0.2320 (0.0051) | 878 | 324 | 319 (158) | 0.2288 (0.0069) | 0.2175 | 0.2401 | 0.0226 | 0.1795 |
Rural | 0.1812 (0.00477) | 551 | 276 | 289 (138) | 0.1829 (0.0066) | 0.1721 | 0.1937 | 0.0216 | 0.2353 |
Urban | 0.2609 (0.0077) | 327 | 392 | 327 (142) | 0.2609 (0.0077) | 0.2482 | 0.2736 | 0.0254 | 1.0 |
Table 4.
Region | $\hat{G}_H$ ($se(\hat{G}_H)$) | $H$ | $\hat{C}$ | $N$ ($t$) | $\hat{G}_N$ ($se(\hat{G}_N)$) | Lower CI | Upper CI | $w_N$ | $\Pr(N_s<\hat{C}_s)$ |
---|---|---|---|---|---|---|---|---|---|
Uttar Pradesh | | | | | | | | | |
All | 0.2163 (0.0042) | 1262 | 572 | 653 (728) | 0.2123 (0.0058) | 0.2010 | 0.2236 | 0.0226 | 0.2138 |
Rural | 0.1997 (0.0041) | 903 | 496 | 510 (197) | 0.2010 (0.0060) | 0.1893 | 0.2128 | 0.0234 | 0.1714 |
Urban | 0.2229 (0.0092) | 359 | 821 | 359 (717) | 0.2229 (0.0092) | 0.2048 | 0.2410 | 0.0362 | 1.0 |
West Bengal | | | | | | | | | |
All | 0.2320 (0.0051) | 878 | 517 | 519 (186) | 0.2318 (0.0061) | 0.2199 | 0.2437 | 0.0238 | 0.1538 |
Rural | 0.1812 (0.0048) | 551 | 351 | 352 (163) | 0.1815 (0.0057) | 0.1703 | 0.1927 | 0.0223 | 0.2353 |
Urban | 0.2609 (0.0077) | 327 | 556 | 327 (162) | 0.2609 (0.0077) | 0.2458 | 0.2761 | 0.0303 | 1.0 |
Table 5.
Region | $H$ | $Q^*$ ($t$) | $\tilde{Q}$ ($Q$) | $\hat{G}_H$ ($se(\hat{G}_H)$) | $\hat{G}_{\tilde{Q}}$ ($se(\hat{G}_{\tilde{Q}})$) | Lower CI | Upper CI | $w_{\tilde{Q}}$ |
---|---|---|---|---|---|---|---|---|
Uttar Pradesh | | | | | | | | |
All | 1262 | 1146 (321) | 1171 (1146) | 0.2163 (0.0042) | 0.2137 (0.0040) | 0.2072 | 0.2202 | 0.0131 |
Rural | 903 | 398 (198) | 406 (398) | 0.1997 (0.0041) | 0.2027 (0.0053) | 0.1940 | 0.2114 | 0.0174 |
Urban | 359 | 1177 (180) | 359 (359) | 0.2229 (0.0092) | 0.2229 (0.0092) | 0.2077 | 0.2381 | 0.0304 |
West Bengal | | | | | | | | |
All | 878 | 624 (190) | 626 (624) | 0.2320 (0.0051) | 0.2307 (0.0055) | 0.2216 | 0.2398 | 0.0182 |
Rural | 551 | 422 (173) | 420 (422) | 0.1812 (0.0048) | 0.1785 (0.0047) | 0.1707 | 0.1862 | 0.0155 |
Urban | 327 | 857 (185) | 327 (327) | 0.2609 (0.0077) | 0.2609 (0.0077) | 0.2482 | 0.2736 | 0.0254 |
Table 6.
Region | $H$ | $Q^*$ ($t$) | $\tilde{Q}$ ($Q$) | $\hat{G}_H$ ($se(\hat{G}_H)$) | $\hat{G}_{\tilde{Q}}$ ($se(\hat{G}_{\tilde{Q}})$) | Lower CI | Upper CI | $w_{\tilde{Q}}$ |
---|---|---|---|---|---|---|---|---|
Uttar Pradesh | | | | | | | | |
All | 1262 | 1665 (333) | 1262 (1262) | 0.2163 (0.0042) | 0.2163 (0.0042) | 0.2081 | 0.2245 | 0.0164 |
Rural | 903 | 593 (226) | 595 (593) | 0.2000 (0.0041) | 0.2000 (0.0044) | 0.1914 | 0.2085 | 0.0171 |
Urban | 359 | 1712 (254) | 359 (359) | 0.2229 (0.0092) | 0.2229 (0.0092) | 0.2048 | 0.2410 | 0.0362 |
West Bengal | | | | | | | | |
All | 878 | 874 (223) | 878 (874) | 0.2320 (0.0051) | 0.2320 (0.0051) | 0.2221 | 0.2419 | 0.0198 |
Rural | 551 | 535 (203) | 534 (535) | 0.1812 (0.0048) | 0.1814 (0.0049) | 0.1719 | 0.1910 | 0.0191 |
Urban | 327 | 1110 (207) | 327 (327) | 0.2609 (0.0077) | 0.2609 (0.0077) | 0.2458 | 0.2761 | 0.0303 |
Table 7.
Region | $H$ | $Q^*$ ($t$) | $\tilde{Q}$ ($Q$) | $\hat{G}_H$ ($se(\hat{G}_H)$) | $\hat{G}_{\tilde{Q}}$ ($se(\hat{G}_{\tilde{Q}})$) | Lower CI | Upper CI | $w_{\tilde{Q}}$ |
---|---|---|---|---|---|---|---|---|
Uttar Pradesh | | | | | | | | |
All | 1262 | 688 (302) | 680 (688) | 0.2163 (0.0042) | 0.2104 (0.0049) | 0.2023 | 0.2185 | 0.0162 |
Rural | 903 | 299 (168) | 308 (299) | 0.1997 (0.0041) | 0.2026 (0.0061) | 0.1927 | 0.2126 | 0.0199 |
Urban | 359 | 1087 (168) | 359 (359) | 0.2229 (0.0092) | 0.2229 (0.0092) | 0.2077 | 0.2381 | 0.0304 |
West Bengal | | | | | | | | |
All | 878 | 396 (158) | 396 (396) | 0.2320 (0.0051) | 0.2293 (0.0074) | 0.2171 | 0.2414 | 0.0243 |
Rural | 551 | 275 (138) | 275 (275) | 0.1812 (0.0048) | 0.1750 (0.0055) | 0.1660 | 0.1840 | 0.0180 |
Urban | 327 | 582 (142) | 327 (327) | 0.2609 (0.0077) | 0.2609 (0.0077) | 0.2482 | 0.2736 | 0.0254 |
Table 8.
Region | $H$ | $Q^*$ ($t$) | $\tilde{Q}$ ($Q$) | $\hat{G}_H$ ($se(\hat{G}_H)$) | $\hat{G}_{\tilde{Q}}$ ($se(\hat{G}_{\tilde{Q}})$) | Lower CI | Upper CI | $w_{\tilde{Q}}$ |
---|---|---|---|---|---|---|---|---|
Uttar Pradesh | | | | | | | | |
All | 1262 | 976 (302) | 947 (946) | 0.2163 (0.0042) | 0.2124 (0.0042) | 0.2041 | 0.2207 | 0.0166 |
Rural | 903 | 364 (197) | 353 (364) | 0.1997 (0.0041) | 0.2032 (0.0056) | 0.1922 | 0.2142 | 0.0220 |
Urban | 359 | 1081 (177) | 359 (359) | 0.2229 (0.0092) | 0.2229 (0.0092) | 0.2048 | 0.2410 | 0.0362 |
West Bengal | | | | | | | | |
All | 878 | 607 (186) | 608 (607) | 0.2320 (0.0051) | 0.2315 (0.0057) | 0.2204 | 0.2427 | 0.0224 |
Rural | 551 | 391 (163) | 392 (391) | 0.1812 (0.0048) | 0.1759 (0.0045) | 0.1670 | 0.1849 | 0.0178 |
Urban | 327 | 754 (162) | 327 (327) | 0.2609 (0.0077) | 0.2609 (0.0077) | 0.2458 | 0.2761 | 0.0303 |
Supplementary Materials
The following are available online at https://www.mdpi.com/2225-1146/8/2/26/s1.
Author Contributions
Conceptualization, B.C.; methodology, F.B.D. and B.C.; software, F.B.D.; validation, F.B.D., F.K. and B.C.; formal analysis, F.B.D., F.K. and B.C.; investigation, F.B.D. and B.C.; writing-original draft preparation, F.B.D., F.K. and B.C.; writing-review and editing, F.B.D., F.K. and B.C.; visualization, F.B.D., F.K. and B.C.; supervision, B.C.; project administration, B.C.; funding acquisition, B.C. All authors have read and agreed to the published version of the manuscript.
Funding
The research work of Bhargab Chattopadhyay was part of the project sanctioned by Science and Engineering Research Board, Government of India (ECR/2017/001213).
Acknowledgments
This author is also grateful to the Ministry of Statistics and Program Implementation, Government Of India for permitting the use of the household data related to the consumer expenditure for the year 2007-2008 (Round 64, Schedule 1.0).
Conflicts of Interest
The authors declare no conflict of interest. The funding agency had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
References
1. Aguirregabiria, Victor, and Pedro Mira. 2007. Sequential estimation of dynamic discrete games. Econometrica 75: 1-53.
2. Arcidiacono, Peter, and John Bailey Jones. 2003. Finite mixture distributions, sequential likelihood and the EM algorithm. Econometrica 71: 933-46.
3. Bhattacharya, Debopam. 2005. Asymptotic inference from multi-stage samples. Journal of Econometrics 126: 145-71.
4. Bhattacharya, Debopam. 2007. Inference on inequality from household survey data. Journal of Econometrics 137: 674-707.
5. Binder, David A., and Milorad S. Kovacevic. 1995. Estimating some measures of income inequality from survey data: An application of the estimating equations approach. Survey Methodology 21: 137-46.
6. Chattopadhyay, Bhargab, and Shyamal Krishna De. 2016. Estimation of Gini index within pre-specified error bound. Econometrics 4: 30.
7. Chattopadhyay, Bhargab, and Ken Kelley. 2017. Estimating the standardized mean difference with minimum risk: Maximizing accuracy and minimizing cost with sequential estimation. Psychological Methods 22: 94-113.
8. Chattopadhyay, Bhargab, and Nitis Mukhopadhyay. 2013. Two-stage fixed-width confidence intervals for a normal mean in the presence of suspect outliers. Sequential Analysis 32: 134-57.
9. Cochran, William G. 1997. Sampling Techniques, 3rd ed. Hoboken: John Wiley & Sons.
10. De, Shyamal K., and Bhargab Chattopadhyay. 2017. Minimum risk point estimation of Gini index. Sankhya B 79: 247-77.
11. Fuller, Wayne A. 2009. Sampling Statistics. Hoboken: Wiley.
12. Ghosh, Bhaskar Kumar, and Pranab Kumar Sen. 1991. Handbook of Sequential Analysis. New York: CRC Press, vol. 118.
13. Ghosh, Malay, Nitis Mukhopadhyay, and Pranab K. Sen. 1997. Sequential Estimation. New York: John Wiley & Sons, Inc.
14. Greene, William H. 1998. Gender economics courses in liberal arts colleges: Further results. The Journal of Economic Education 29: 291-300.
15. Gut, Allan. 2009. Stopped Random Walks. New York: Springer.
16. Hoque, Ahmed Anisul, and Judith Anne Clarke. 2015. On variance estimation for a Gini coefficient estimator obtained from complex survey data. Communications in Statistics: Case Studies, Data Analysis and Applications 1: 39-58.
17. Horvitz, D. G., and D. J. Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47: 663-85.
18. Kanninen, Barbara J. 1993. Design of sequential experiments for contingent valuation studies. Journal of Environmental Economics and Management 25: S1-S11.
19. Kelley, Ken, Francis Bilson Darku, and Bhargab Chattopadhyay. 2018. Accuracy in parameter estimation for a general class of effect sizes: A sequential approach. Psychological Methods 23: 226-43.
20. Kelley, Ken, Francis Bilson Darku, and Bhargab Chattopadhyay. 2019. Sequential accuracy in parameter estimation for population correlation coefficients. Psychological Methods 24: 492-515.
21. Langel, Matti, and Yves Tillé. 2013. Variance estimation of the Gini index: Revisiting a result several times published. Journal of the Royal Statistical Society: Series A (Statistics in Society) 176: 521-40.
22. Lee, Eun, and Ronald Forthofer. 2006. Analyzing Complex Survey Data. New York: SAGE Publications, Inc.
23. Mahalanobis, Prasanta Chandra. 1940. A sample survey of the acreage under jute in Bengal. Sankhyā: The Indian Journal of Statistics 4: 511-30.
24. Mukhopadhyay, Nitis, and Basil M. De Silva. 2009. Sequential Methods and Their Applications. Boca Raton: CRC Press.
25. National Sample Survey Office. 2007. Note on Estimation Procedure of NSS 64th Round. Available online: http://catalog.ihsn.org/index.php/catalog/1906/download/35538 (accessed on 21 July 2019).
26. National Sample Survey Office. 2015. India-Household Consumer Expenditure Survey: 64th Round, Schedule 1.0, July 2007-June 2008. Available online: http://www.icssrdataservice.in/datarepository/index.php/catalog/4/study-description (accessed on 21 July 2019).
27. Organization for Economic Cooperation and Development. 2017. Income Inequality. Available online: https://data.oecd.org/inequality/income-inequality.htm (accessed on 21 July 2019).
28. Peng, Liang. 2011. Empirical likelihood methods for the Gini index. Australian & New Zealand Journal of Statistics 53: 131-39.
29. RStudio Team. 2018. RStudio: Integrated Development Environment for R. Boston: RStudio, Inc.
30. Sen, Pranab Kumar. 1988. Functional jackknifing: Rationality and general asymptotics. The Annals of Statistics 16: 450-69.
31. Stein, Charles. 1945. A two-sample test for a linear hypothesis whose power is independent of the variance. The Annals of Mathematical Statistics 16: 243-58.
32. Stein, Ch. 1949. Some problems in sequential estimation. Econometrica 17: 77-78.
33. Wells, J. 1998. Applications: Oversampling through households or other clusters: Comparisons of methods for weighting the oversampled elements. Australian & New Zealand Journal of Statistics 40: 269-78.
34. Zitikis, Ričardas, and Joseph L. Gastwirth. 2002. The asymptotic distribution of the S-Gini index. Australian & New Zealand Journal of Statistics 44: 439-46.
Francis Bilson Darku1,†, Frank Konietschke2,3 and Bhargab Chattopadhyay4,*
1Mendoza College of Business, University of Notre Dame, Notre Dame, IN 46556, USA
2Institute of Biometry and Clinical Epidemiology, Charité—Universitätsmedizin Berlin, 10117 Berlin, Germany
3Berlin Institute of Health, Anna-Louisa-Karsch-Straße 2, 10178 Berlin, Germany
4Department of Decision Sciences and Information Systems, Indian Institute of Management Visakhapatnam, Visakhapatnam, Andhra Pradesh 530003, India
*Author to whom correspondence should be addressed.
†This work is part of the final dissertation of Francis Bilson Darku that was submitted to the Department of Mathematical Sciences at The University of Texas at Dallas.
© 2020. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
The Gini index, a widely used economic inequality measure, is computed using data whose designs involve clustering and stratification, generally known as complex household surveys. Under complex household surveys, we develop two novel procedures for estimating the Gini index with a pre-specified error bound and confidence level. The two proposed approaches are based on the concept of sequential analysis, which is known to be economical in the sense of obtaining an optimal cluster size that reduces the project cost (that is, the total sampling cost) while achieving the pre-specified error bound and confidence level under reasonable assumptions. Some large sample properties of the proposed procedures are examined without assuming any specific distribution. Empirical illustrations of both procedures are provided using the consumption expenditure data obtained by the National Sample Survey (NSS) Organization in India.