This study investigates the determination of stratification points for two study variables within the framework of simple random sampling, with a focus on estimating the population mean using a closely related auxiliary variable. Employing a superpopulation model, the research aims to minimize overall variance by deriving simplified equations that enhance the precision of parameter estimates. Instead of categorizing variables, the study emphasizes continuous variables to establish optimal strata boundaries (OSB), which are essential for creating homogeneous groups within each stratum. This stratification leads to more efficient sample sizes (SS) and improved accuracy in parameter estimation. However, achieving optimal OSB and SS poses challenges in scenarios with a fixed total sample size, such as survey designs constrained by limited budgets. To address this, the study proposes a robust methodology for calculating OSB and SS, leveraging knowledge of the survey’s per-unit stratum measurement costs or its probability density function. An empirical application of the method is demonstrated using breast cancer data, where the mean perimeter is estimated based on mean radius and mean texture. Additionally, hypothetical examples using Cauchy and standard power distributions are provided to illustrate the versatility of the proposed approach. The newly developed method has been integrated into the updated stratifyR package and implemented in LINGO software, facilitating its practical application. Comparative analysis reveals that this approach consistently outperforms or matches existing methods in enhancing the precision of population parameter estimation. Furthermore, simulation studies confirm its higher relative efficiency, making it a valuable contribution to the field of stratified sampling.
1. Introduction
The problem of optimal stratification was first explored in [1], laying the foundation for subsequent research. Building on this, [2] extended the investigation by considering variables as stratification factors under Neyman allocation, focusing on univariate cases as an extension of [1]. However, handling multiple features in estimation complicates achieving straightforward optimum allocation. Previous studies, such as [3,4], have examined proportional allocation techniques, particularly for cases involving two features. Numerous methods have been proposed across different contexts, as demonstrated by [4–9]. Addressing the multivariate stratification problem, [10] introduced an algorithm employing a penalized objective function optimized via the Simulated Annealing technique. Similarly, [11] proposed algorithms for stratifying asymmetric populations using power allocation to estimate sample sizes. In another notable approach, [12] utilized Dynamic Programming combined with Neyman allocation to address stratification issues, assuming a Weibull distribution for the stratification variable. Several R packages, such as GA4Stratification (https://cran.r-project.org/src/contrib/Archive/GA4Stratification/), stratifyR ([13]), and sample (available on CRAN), offer practical tools for stratification. To address inconsistencies between stratification and study variables, [14] introduced a method using two models. Over the decades, stratification methods have been broadly classified into approximation and optimization techniques [15]. Among notable developments, an exact resource allocation technique was presented in [16], employing algorithms such as BRKGA (Biased Random Key Genetic Algorithm) and GRASP (Greedy Randomized Adaptive Search Procedure) [17]. Additionally, [18] applied a dynamic programming approach to compute stratification points for two correlated variables.
That study compared stratification points derived through various methods, measuring accuracy by the percentage relative efficiency based on variance estimates. Survey precision and implementation costs usually have to be balanced, and decision-making is often hindered by insufficient cost-related data. This study addresses these challenges by optimizing the variance function under basic cost constraints, formally linking survey precision with cost [19–21]. In cases of limited cost and variance information, approximate stratified designs based on cost considerations become necessary. Recent survey data, often the most representative, provide valuable insights for planning. Within total survey cost constraints, this technique minimizes the variance of the estimated population mean. Optimal stratification inherently involves allocating sample sizes effectively based on preconstructed strata [1,5]. The core principle lies in partitioning the population using Optimal Strata Boundaries (OSB) so as to minimize the total stratum variance for a given sample size. The determination of OSB has been extensively studied, initially by Dalenius, who used the study variable as the primary stratification criterion [22–26]. Approximation methods, such as partitioning the cumulative square root of the frequencies into equal intervals, have also been explored [27]. While the cumulative root frequency method remains popular, its arbitrary nature has been critiqued [28,29]. A further line of work presents a methodology for achieving approximately optimal stratification for sensitive quantitative variables in probability proportional to size (PPS) sampling; that approach is demonstrated through theoretical formulation and empirical applications, improving the precision of estimates in stratified sampling designs. Alternatives, such as the Geometric method, were developed for skewed populations but lacked universal applicability [30–33].
In a recent development, [34] proposed a methodology for OSB and optimal sample size (OSS) determination, leveraging known per-unit measurement costs or the probability density function of the survey. They demonstrated this approach using Wave 18 of the HILDA Survey dataset to determine average annual disposable income in Australia, providing insights relevant to policy decisions. Further advancements in OSB determination include dynamic programming (DP) approaches, which divide populations into subsets to maximize precision [35,36]. Refinements by [37] explored precise OSB under Neyman allocation but excluded cost constraints, complicating practical implementation.
In this study, two auxiliary variables were employed as stratification variables for the study variable. Breast cancer patient data were analyzed to evaluate the precision of the proposed method.
2. Optimum strata boundaries and sample size allocation
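Let $L \times M$ strata of sizes $N_{hk}$, $h = 1, 2, \dots, L$, $k = 1, 2, \dots, M$, be formed from a population of size $N$, so that $N = \sum_{h=1}^{L}\sum_{k=1}^{M} N_{hk}$. The population mean of the study variable $Z$ is obtained by summing the products of the number of units in each stratum and the corresponding stratum mean $\bar{Z}_{hk}$, then dividing by the total population size:
$$\bar{Z} = \frac{1}{N}\sum_{h=1}^{L}\sum_{k=1}^{M} N_{hk}\,\bar{Z}_{hk}.$$
In the case of SRSWOR with $n_{hk}$ units sampled from stratum $(h,k)$, the stratum mean is estimated by the sample mean:
$$\bar{z}_{hk} = \frac{1}{n_{hk}}\sum_{i=1}^{n_{hk}} z_{hki}.$$
The unbiased estimator of the population mean in stratified sampling is
$$\bar{z}_{st} = \sum_{h=1}^{L}\sum_{k=1}^{M} W_{hk}\,\bar{z}_{hk}$$
and its variance is
$$V(\bar{z}_{st}) = \sum_{h=1}^{L}\sum_{k=1}^{M}\left(\frac{1}{n_{hk}} - \frac{1}{N_{hk}}\right) W_{hk}^{2} S_{hk}^{2}, \tag{1}$$
in which $W_{hk} = N_{hk}/N$ denotes the weight of the $(h,k)$-th stratum, while
$$S_{hk}^{2} = \frac{1}{N_{hk}-1}\sum_{i=1}^{N_{hk}}\left(z_{hki} - \bar{Z}_{hk}\right)^{2}$$
represents the variance of the $(h,k)$-th stratum and $n_{hk}$ denotes the sample size to be taken from that stratum. When the variance (1) is estimated from the sample, the stratum variance is replaced by the corresponding sample variance.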
There are several methods for allocating the sample size to each stratum, such as equal allocation, proportional allocation, Neyman allocation, and optimum allocation. The choice of allocation method affects the precision and accuracy of the estimates, so choosing the best way to allocate the sample is a challenging stage of the design. The appropriate method depends on the total sample size, the stratum variances and the costs associated with each stratum. If the per-unit cost is the same in every stratum, Neyman allocation minimizes the variance (1); if the cost varies from unit to unit, optimum allocation may be used instead.
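Let us consider the cost function
$$C = C_0 + \sum_{h=1}^{L}\sum_{k=1}^{M} c_{hk}\, n_{hk}, \tag{2}$$
where $C$ is the total cost, $C_0$ represents the overhead cost, encompassing expenses related to administration, conducting interviews, training sessions, and other associated activities, $n_{hk}$ is the sample size taken from stratum $(h,k)$, and $c_{hk}$ is the mean per-unit cost in stratum $(h,k)$, which contains $N_{hk}$ units:
$$c_{hk} = \frac{1}{N_{hk}}\sum_{i=1}^{N_{hk}} c_{hki},$$
where $c_{hki}$ is the cost associated with the $i$-th unit in the $(h,k)$-th stratum. Note that the cost will differ from unit to unit; prior information may be used to estimate the unit cost.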
The average costs can be obtained as
where f(c) is the Probability Density Function of unit cost measurement.
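In order to determine the optimum sample size $n_{hk}$ to be taken from stratum $(h,k)$, we minimize the variance defined in (1) subject to the total cost defined in (2). Here $W_{hk}$, $S_{hk}^{2}$ and $c_{hk}$ denote the weight, variance and mean per-unit cost of stratum $(h,k)$, $C$ the total cost and $C_0$ the overhead cost.
Using the Lagrange multiplier technique, the sample size is obtained as
$$n_{hk} = \frac{(C - C_0)\, W_{hk} S_{hk} / \sqrt{c_{hk}}}{\sum_{h=1}^{L}\sum_{k=1}^{M} W_{hk} S_{hk} \sqrt{c_{hk}}}. \tag{3}$$
The optimum allocation ensures that a larger sample is selected within a stratum when the stratum is more heterogeneous, that is, when its variance is large.
Substituting (3) in (1), we have
$$V(\bar{z}_{st}) = \frac{\left(\sum_{h=1}^{L}\sum_{k=1}^{M} W_{hk} S_{hk} \sqrt{c_{hk}}\right)^{2}}{C - C_0} - \frac{1}{N}\sum_{h=1}^{L}\sum_{k=1}^{M} W_{hk} S_{hk}^{2}. \tag{4}$$
Since the overhead cost and the total budget (and, with them, the affordable sample size) are fixed and cannot be minimized, minimizing (4) is equivalent, if the finite population correction (FPC) is ignored, to minimizing
$$\sum_{h=1}^{L}\sum_{k=1}^{M} W_{hk} S_{hk} \sqrt{c_{hk}}. \tag{5}$$
Hence, the quantity to be minimized is given in equation (5).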
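The allocation step can be illustrated with a minimal Python sketch. It assumes the standard cost-constrained result that the stratum sample size is proportional to the stratum weight times the stratum standard deviation divided by the square root of the per-unit cost, scaled so that the variable budget is spent exactly. The function name `optimum_allocation` and the three-stratum inputs are hypothetical and are not taken from the paper; the strata are flattened into one list for simplicity.

```python
import math


def optimum_allocation(weights, sds, costs, budget, overhead=0.0):
    """Cost-constrained optimum allocation (hypothetical helper).

    Each stratum receives a share proportional to W * S / sqrt(c), scaled
    so that sum(c * n) equals budget - overhead.
    """
    denom = sum(w * s * math.sqrt(c) for w, s, c in zip(weights, sds, costs))
    available = budget - overhead
    return [available * w * s / math.sqrt(c) / denom
            for w, s, c in zip(weights, sds, costs)]


# Illustrative (made-up) inputs for three strata.
weights = [0.5, 0.3, 0.2]   # stratum weights W
sds = [1.0, 2.0, 4.0]       # stratum standard deviations S
costs = [1.0, 1.0, 4.0]     # mean per-unit costs c

alloc = optimum_allocation(weights, sds, costs, budget=120.0, overhead=20.0)

# With equal unit costs the same formula reduces to Neyman allocation.
neyman = optimum_allocation(weights, sds, [1.0, 1.0, 1.0], budget=60.0)
```

The returned values are real numbers; in practice they would be rounded to integers, which is not handled in this sketch.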
In sample surveys there may be several target variables, and the information available on the study variable is not necessarily sufficient for stratification. However, related variables that are easily available can be used to improve precision. Hence, let $X$ and $Y$ be two auxiliary variables associated with the study variable $Z$. These two auxiliary variables will be utilized as stratification variables. $X$ and $Y$ have continuous distributions with probability density functions $f(x)$ and $g(y)$ respectively. The total number of strata to be made is $L \times M$. If $x_0$, $x_L$ and $y_0$, $y_M$ are the first and last values of $X$ and $Y$, then the total ranges of $X$ and $Y$ are
$$x_L - x_0 \quad \text{and} \quad y_M - y_0. \tag{6}$$
In a nutshell, we need to find the intermediate points between $x_0$ and $x_L$ and between $y_0$ and $y_M$ so that the variance defined in equation (5) is minimum.
Let the functional relationship between the study variable and the auxiliary variables be
$$Z = \lambda(X, Y) + e,$$
where $\lambda(X, Y)$ is a function of $X$ and $Y$, which may be linear or non-linear, and $e$ is the error term.
Let the boundary points for $X$ and $Y$ be
$$x_0 \le x_1 \le \dots \le x_{L-1} \le x_L$$
and
$$y_0 \le y_1 \le \dots \le y_{M-1} \le y_M.$$
All of these points have to be obtained, except $x_0$, $x_L$, $y_0$ and $y_M$, which are known and fixed, so that equation (5) is minimized.
The probability density functions of the stratification variables should be known; the quantities appearing in equation (5) can then be obtained by
(7)(8)
where
(9)
and , denotes the boundary points of stratum. Thus, it can be seen from equations (7)-(9) that all the terms can be expressed in terms of boundary points of stratum.
Thus equation (5) can be expressed as a function of the boundary points of the two stratification variables as
so that the OSB can be obtained by finding the intermediate points that solve the following problem:
optimize
subject to the constraints
with and denotes the total width of stratum. Thus, the total deviation can be expressed in equation (6) can be denoted in strata width as:
and
We can define the strata point as:
The problem of determining the Optimum Strata Boundaries (OSB) can therefore be approached as one of determining the Optimum Strata Widths (OSW) along both stratification variables. This can be formulated as the following mathematical programming problem (MPP):
Optimize
Subject to the constraints
(10)
and ,
and .
With the initial boundary values given, the first term of the objective function of (10) is a function of the first pair of strata widths only. Once these are determined, the next term is a function of the second pair of strata widths only, and the same pattern continues.
Owing to this special structure, MPP (10) can be solved using a dynamic programming approach. Hence MPP (10) can be written as a function of the strata widths only, as follows:
optimize
Subject to the constraints
(11)
and ,
and .
3. Solution procedure using dynamic programming
It can be observed that the problem defined in (11) is a multistage problem in which the objective function and the constraints are sums of separable functions of the strata widths. Given this structure, dynamic programming is a viable solution approach. Dynamic programming solves an optimization problem by decomposing it into more manageable subproblems and storing their solutions in a table for later retrieval. Each subproblem is solved just once and its solution retained, which eliminates redundant computation; the stored solutions are then combined efficiently to obtain the solution to the overall problem. This technique is widely applied in fields such as algorithm design, artificial intelligence, economics and bioinformatics, and provides an efficient way of reducing a complex problem such as (11) to a sequence of optimally solved subproblems.
Now let us assume a fraction of problem defined in (11) for
Optimize
Subject to the constraints
and .
where and the total deviation to be subdivided into strata. It is to be noted here that
and if and
and
which are the transformation function as:
Similarly, we have
Let denotes minimum value of the function (21), that is
and
with the above definition of , the MPP defined in equation (11) is equivalent to finding recursively by defining for and ,
and
For the fixed value of , .
Utilizing the same methodology, we can derive the forward recursive equation for the dynamic programming problem, ultimately leading us to determine the optimal strata boundaries.
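The recursive idea can be sketched in Python for a deliberately simplified case. The sketch assumes a single stratification variable, equal unit costs, a uniform density on [0, 1], and boundaries restricted to a grid; the paper's problem is bivariate and is solved with other software. The names `uniform_term` and `dp_boundaries` are hypothetical. Each stage adds the contribution of one stratum (its weight times its standard deviation) to the best value found for the preceding strata.

```python
import math
from functools import lru_cache


def uniform_term(a, b):
    """W * sigma for a uniform density on [0, 1] restricted to [a, b]."""
    width = b - a
    return width * width / math.sqrt(12.0)


def dp_boundaries(term, x0, x_last, n_strata, grid=100):
    """Minimize the sum of per-stratum terms over grid-restricted boundaries."""
    step = (x_last - x0) / grid

    @lru_cache(maxsize=None)
    def best(h, j):
        # Best value and cut indices when the first j grid cells form h strata.
        if h == 1:
            return term(x0, x0 + j * step), (j,)
        candidates = []
        for i in range(h - 1, j):  # i is where stratum h starts
            prev_value, prev_cuts = best(h - 1, i)
            value = prev_value + term(x0 + i * step, x0 + j * step)
            candidates.append((value, prev_cuts + (j,)))
        return min(candidates)

    value, cuts = best(n_strata, grid)
    # Drop the last index, which is the fixed upper end point.
    return value, [x0 + c * step for c in cuts[:-1]]


value, cuts = dp_boundaries(uniform_term, 0.0, 1.0, n_strata=4)
```

Memoizing `best` stores each subproblem once, which is the table lookup described above. Replacing `uniform_term` with a term derived from another density changes the result without changing the recursion.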
Let’s consider the functional relationship between the study variable and auxiliary variable to be linear, as depicted below:
then
where
and denotes the weights and variances of and and are given by
(12)(13)(14)
where
4. Numerical illustration
To illustrate the computational details of the design, we assume certain distributions for the auxiliary variables. The Cauchy distribution, named in honor of the mathematician Augustin-Louis Cauchy, is a probability distribution commonly encountered in statistics and physics, with applications in fields such as optics and signal processing. It is characterized by its heavy tails, meaning that it assigns a higher probability to extreme outcomes than distributions with finite variance. Because of these heavy tails, the Cauchy distribution has no well-defined mean or variance, which makes it challenging to work with in some statistical contexts. Despite this, it is used to model phenomena with uncertain or heavy-tailed behaviour, such as financial markets and turbulent fluid dynamics. A plot of its probability density function shows a symmetric, bell-like curve that peaks at the centre, but whose tails decay much more slowly than those of the normal distribution. Examples of the probability density function (PDF) and cumulative distribution function (CDF) of the Cauchy distribution are presented in Fig 1 below:
[Figure omitted. See PDF.]
These graphs illustrate the characteristic shape and behaviour of the Cauchy distribution, showcasing its heavy-tailed nature and its symmetry around the location parameter (the median), since the mean does not exist.
Let $X$ follow the Cauchy distribution, which is continuous with probability density function
$$f(x; \mu, \gamma) = \frac{1}{\pi\gamma\left[1 + \left(\dfrac{x-\mu}{\gamma}\right)^{2}\right]}, \quad -\infty < x < \infty,$$
where $\mu$ represents the location parameter, indicating the location of the peak of the distribution, and $\gamma > 0$ denotes the scale parameter, representing the half-width at half maximum. When $\mu = 0$ and $\gamma = 1$, the distribution is referred to as the standard Cauchy distribution, with probability density function
$$f(x) = \frac{1}{\pi\left(1 + x^{2}\right)}.$$
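Although the Cauchy distribution has no mean or variance on the whole real line, both exist once the variable is restricted to a finite stratum. The following Python sketch, with the hypothetical helper name `cauchy_stratum`, computes the weight, mean and variance of a standard Cauchy variable on an interval [a, b] from the closed-form antiderivatives of f(x), x f(x) and x² f(x). It is an illustration under these assumptions and not necessarily the form used in the paper's own equations.

```python
import math


def cauchy_stratum(a, b):
    """Weight, mean and variance of a standard Cauchy variable on [a, b]."""
    arc = math.atan(b) - math.atan(a)
    weight = arc / math.pi                      # integral of f(x)
    # integral of x f(x) is log(1 + x^2) / (2 pi)
    first = (math.log1p(b * b) - math.log1p(a * a)) / (2.0 * math.pi)
    # integral of x^2 f(x) is (x - atan x) / pi
    second = ((b - a) - arc) / math.pi
    mean = first / weight
    variance = second / weight - mean * mean
    return weight, mean, variance


weight, mean, variance = cauchy_stratum(0.0, 1.0)
```

For the stratum [0, 1] the weight is exactly one quarter, because arctan(1) = π/4.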
The standard power distribution is a statistical model frequently utilized across diverse disciplines due to its versatility in capturing a wide array of phenomena. This distribution is characterized by two parameters: δ, which determines its shape, and θ, which sets its scale. The probability density function (PDF) of the standard power distribution is given by a formula that encapsulates the interplay between these parameters. Additionally, important statistical properties such as moments (mean, variance, etc.) are associated with this distribution, providing insights into its behavior and characteristics. The standard power distribution is applied in reliability engineering to model the lifetimes of components and systems. By analyzing failure rates and predicting reliability, engineers can optimize maintenance schedules and improve product performance. Survival analysis, prevalent in medical studies and epidemiology, investigates the time until an event (e.g., death) occurs. The standard power distribution proves useful in modeling survival times, aiding in understanding disease progression and treatment effectiveness. In queuing systems, this distribution can represent service times or interarrival times of customers. Queuing theory utilizes these models to optimize service levels, reduce waiting times, and enhance operational efficiency. The standard power distribution finds applications in modeling financial phenomena such as stock price movements and trading volumes. By understanding the distribution of duration between events, analysts gain insights into market dynamics and risk assessment. Environmental studies utilize the standard power distribution to analyze the timing of natural events like floods or species extinctions. By modeling these durations, researchers can assess environmental risks and develop strategies for mitigation and adaptation. Fig 2 presents the standard power distribution:
[Figure omitted. See PDF.]
let us assume that the other auxiliary variable follows standard power distribution as
using the equations (12) to (14), the values of and can be computed
(15)(16)(17)
where
and
Substituting the values obtained in equations (15) to (17) into the optimization problem, we get
Optimize
(18)
Subject to the constraints
and .
In order to the OSB we need to define the auxiliary variables in a finite interval such as defining in the interval , i.e., and the other variable with the same interval . Further, it is assumed that the value of and . Then the optimization problem defined in equation (18) can be written as
where
subject to the constraints
and .
Using the above optimization problem the solution was obtained using LINGO and the optimum strata boundaries are obtained presented in Tables 1 and 2 below for . The cost of the units has been assumed to be respectively .
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
5. An application on a real data set of breast cancer
Breast cancer represents a significant health concern globally, particularly among women. It is characterized by the abnormal growth of cells in breast tissue, often leading to the formation of tumours. With diverse subtypes and varying degrees of aggressiveness, breast cancer poses complex challenges in diagnosis, treatment, and management. Early detection through screening methods such as mammography plays a crucial role in improving prognosis and survival rates. Additionally, advancements in treatment modalities, including surgery, chemotherapy, radiation therapy, and targeted therapies, have contributed to improved outcomes for many patients. Despite these advancements, breast cancer remains a leading cause of cancer-related mortality worldwide, highlighting the ongoing need for research, education, and advocacy efforts to enhance prevention, early detection, and treatment strategies. Breast cancer data are incorporated in this study as a pivotal component. The data set is taken from the UCI repository [38]. It has 31 predictor variables and one target variable; the target variable has two classes, indicating whether the tumour is cancerous or non-cancerous. The inclusion of these data allows for a comprehensive analysis of various factors related to breast cancer, including risk factors, diagnostic procedures, treatment outcomes, and survival rates. By leveraging breast cancer data, this study aims to contribute to a deeper understanding of the disease and its impact on affected individuals and communities. Furthermore, utilizing this dataset enables the exploration of novel approaches to early detection, personalized treatment strategies, and interventions aimed at improving patient outcomes and quality of life.
Overall, the incorporation of breast cancer data serves as a cornerstone in advancing knowledge and informing evidence-based practices in the field of oncology and public health. In this dataset, mean perimeter is used as the study variable, and mean radius and mean texture as the auxiliary variables. The data are displayed in Figs 3–5.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Fitting the model yields the following estimated coefficients:
(19)
Fig 6 illustrates the behaviour of the fitted model.
[Figure omitted. See PDF.]
Taking the model obtained in equation (19) and substituting it into equation (18), we obtained the strata boundaries along with the sample sizes for the total of 569 units, assuming equal unit costs, for a total of 20 (5 × 4) strata; these are presented in Table 3 along with the total variance.
[Figure omitted. See PDF.]
Fig 7 shows the variance against the number of strata. The variance decreases as the number of strata increases, but only up to 25 strata, beyond which it starts to increase again. Hence, the number of strata should be taken into consideration when obtaining the optimum strata boundaries.
[Figure omitted. See PDF.]
6. Simulation study
To validate the effectiveness of the proposed stratification method, a simulation study was conducted. The primary objective was to assess its precision and compare its performance against several established methods. This evaluation involved the application of dynamic programming and was carried out using the R statistical software. The comparative methods included in this study are as follows:
1. Dalenius and Hodges (1959) cumulative square root of frequency (cum √f) method [39]: A widely recognized method for stratification that divides the cumulative square root of the frequencies into equal intervals.
2. Gunning and Horgan (2004) geometric method [40]: Known for its suitability in skewed populations by incorporating geometric intervals for stratification.
3. Lavallee-Hidiroglou (1988) method [41]: A stratification approach, refined by Kozak (2004) [42], to improve applicability to asymmetric data distributions.
4. Khan et al. (2018) mathematical programming approach [43]: A contemporary optimization-based method for determining stratification boundaries.
5. Proposed method: A novel stratification method introduced in this study, leveraging dynamic programming to minimize total variance under cost constraints.
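Two of the comparison rules listed above are simple enough to sketch directly. The Python code below is an illustrative implementation, assuming a histogram with a hypothetical default of 50 equal-width bins for the cum √f rule and a strictly positive minimum value for the geometric rule; the function names are hypothetical and this is not the code used in the simulation study.

```python
import math


def cum_sqrt_f_boundaries(data, n_strata, n_bins=50):
    """Cum sqrt(f) rule: cut the cumulated sqrt(frequency) into equal parts."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        idx = min(int((x - lo) / width), n_bins - 1)  # keep the maximum in the last bin
        counts[idx] += 1
    cumulative, total = [], 0.0
    for count in counts:
        total += math.sqrt(count)
        cumulative.append(total)
    boundaries = []
    for h in range(1, n_strata):
        target = h * total / n_strata
        # first bin whose cumulated sqrt(f) reaches the target
        idx = next(i for i, v in enumerate(cumulative) if v >= target)
        boundaries.append(lo + (idx + 1) * width)
    return boundaries


def geometric_boundaries(lo, hi, n_strata):
    """Geometric rule: boundaries form a geometric progression (needs lo > 0)."""
    ratio = (hi / lo) ** (1.0 / n_strata)
    return [lo * ratio ** h for h in range(1, n_strata)]


sample = [i / 1000 for i in range(1, 1001)]   # evenly spread illustrative data
cum_cuts = cum_sqrt_f_boundaries(sample, n_strata=2)
geo_cuts = geometric_boundaries(1.0, 16.0, n_strata=4)
```

The cum √f boundaries depend on the number of bins, which is one reason the rule has been described as somewhat arbitrary.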
Simulation design and implementation
A synthetic dataset of 10,000 observations was generated for this study. The auxiliary variable used in the simulation followed a uniform distribution, a standard choice for stratification studies due to its simplicity and controlled variance properties. The parameters for the uniform distribution were specified as a = 0.002 and b = 1.92.
Key statistics derived from the generated data include:
* Minimum value (a): 0.00395
* Maximum value (b): 1.8962
* Total deviation (k): k = b − a = 1.89225
Using this dataset, stratification points were calculated for the proposed method and the comparative methods mentioned above. The primary evaluation criterion was total variance, a critical measure for determining stratification effectiveness.
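A data set of this kind can be generated and evaluated with a short Python sketch. The seed, the helper name `stratum_objective` and the use of equal-width boundaries are assumptions made for illustration; the sketch does not reproduce the paper's data or the values in Table 4. The objective is the sum over strata of the stratum weight times the stratum standard deviation, which corresponds to equal unit costs.

```python
import random
import statistics


def stratum_objective(data, boundaries):
    """Sum over strata of W_h * sigma_h for the given interior boundaries."""
    edges = [float("-inf")] + sorted(boundaries) + [float("inf")]
    total = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        stratum = [x for x in data if lower < x <= upper]
        if len(stratum) > 1:
            total += len(stratum) / len(data) * statistics.pstdev(stratum)
    return total


random.seed(2024)  # hypothetical seed, chosen only for reproducibility
data = [random.uniform(0.002, 1.92) for _ in range(10_000)]

a, b = min(data), max(data)
k = b - a  # total deviation (range) of the generated data

objectives = {}
for n_strata in range(2, 7):
    # equal-width boundaries as a simple baseline
    cuts = [a + h * k / n_strata for h in range(1, n_strata)]
    objectives[n_strata] = stratum_objective(data, cuts)
```

Boundaries produced by any of the compared methods can be passed to `stratum_objective` in place of the equal-width cuts to compare them on the same data.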
Stratification points and variance
The stratification points identified by the proposed method were compared to those obtained using the other methods. Variances were calculated for each method under varying numbers of strata (L) to assess their relative performance. The results are summarized in Table 4 below:
[Figure omitted. See PDF.]
Results and analysis
The detailed observations from Table 4 are:
1. Performance Across Strata Configurations: The proposed method consistently achieved the lowest total variance for all configurations of L, ranging from L = 2 to L = 6. For L = 2, the variance achieved by the proposed method (0.6212) was significantly lower than that of the cumulative method (0.9878) and the geometric method (1.0243), highlighting its efficiency even with fewer strata.
2. Comparison with Traditional Methods: The cumulative method, while widely used, exhibited higher variances, particularly as L increased. The geometric method performed better than the cumulative method but still lagged behind the proposed approach.
3. Advanced Techniques: The Lavallee-Hidiroglou method, known for its robust application to asymmetric data, displayed variances higher than the proposed method, especially for L = 2 and L = 3. The Khan et al. (2018) mathematical programming approach, though more precise than traditional methods, was consistently outperformed by the proposed method.
4. Impact of Increased Strata: Increasing the number of strata (L) generally reduced variance across all methods. However, the rate of improvement was most pronounced for the proposed method, demonstrating its adaptability to complex stratification scenarios.
Advantages of the Proposed Method:
* Efficiency: The proposed method achieves superior precision with lower variances across all tested scenarios.
* Scalability: Its performance remains consistent even as the number of strata increases, making it suitable for both simple and complex datasets.
* Cost-Effectiveness: By minimizing variance under cost constraints, the proposed method ensures practical applicability in survey design and resource allocation.
The simulation study highlights the effectiveness of the proposed method in achieving optimal stratification. By consistently outperforming traditional and advanced methods, the proposed approach demonstrates its potential for real-world applications requiring precise and cost-effective stratification. Its robust performance, coupled with the flexibility to handle varying strata configurations, makes it a valuable contribution to the field of stratified sampling and survey design.
7. Conclusion
This study explores a novel approach to optimal stratification for Cauchy and standard power distributions, focusing on determining stratification points and improving estimation precision. By leveraging closely related auxiliary variables, the proposed method enhances the accuracy of population estimates. It also facilitates Optimum Sample Allocation by considering the frequency distribution of auxiliary variables. Additionally, the study introduces a framework for determining Optimum Strata Boundaries (OSB) and Sample Size (SS) within a stratified sampling design, accounting for budget constraints and varying per-unit measurement costs across strata. The stratification problem is formulated as a Mathematical Programming Problem, where stratum-wise costs are incorporated, and solutions are obtained using Dynamic Programming. The effectiveness of this approach is demonstrated through its application to breast cancer data, with similar promising results observed in the illustrations using Cauchy and standard power distributions. Comparative analysis shows that the proposed method performs as well as or better than conventional stratification techniques, making it a suitable choice for cost-sensitive survey planning. Unlike traditional methods, this research integrates financial constraints directly into OSB and SS estimation, ensuring practical applicability in real-world survey designs. Future research directions include developing advanced sampling strategies to further minimize data collection costs, optimizing sample sizes by balancing trade-offs between cost, mode, and statistical power, and exploring adaptive survey designs for cost efficiency. Additionally, Bayesian methods, leveraging prior information, could enhance the precision of sample allocation.
The incorporation of neutrosophic statistics in complex data environments also presents a promising avenue for future research, offering a flexible alternative to classical statistical approaches and building on existing studies in this area [44,45].
Acknowledgments
I sincerely thank the Editor and the anonymous reviewers for their valuable comments and constructive suggestions, which have significantly contributed to the improvement of this study.
References
1. 1. Dalenius T. The Problem of Optimum Stratification. Scandinavian Actuarial Journal. 1950;1950(3–4):203–13.
* View Article
* Google Scholar
2. 2. Sadasivan G, Aggarwal R. Optimum points of stratification in bivariate populations. Sankhya C. 1978;40:84–97.
* View Article
* Google Scholar
3. 3. Ghosh SP. Optimum stratification with two characters. Ann Math Statist. 1963;34:866–72.
* View Article
* Google Scholar
4. 4. Singh R. Approximately optimum stratification on the auxiliary variable. J Am Stat Assoc. 1971;66:829–33.
* View Article
* Google Scholar
5. 5. Dalenius T, Gurney M. The problem of optimum stratification. II. Scandinavian Actuarial Journal. 1951;1951(1–2):133–48.
* View Article
* Google Scholar
6. 6. Danish F, Rizvi SEH, Jeelani MI, Reashi JA. Obtaining Strata Boundaries under Proportional Allocation with Varying Cost of Every Unit. Pak.j.stat.oper.res. 2017;13(3):567.
* View Article
* Google Scholar
7. 7. Danish F, Rizvi SEH. Optimum stratification in bivariate auxiliary variables under Neyman allocation. J Mod Appl Stat Methods. 2018;17(1).
* View Article
* Google Scholar
8. 8. Danish F, Rizvi SEH. Approximately optimum strata boundaries for two concomitant stratification variables under proportional allocation. Statistics in Transition New Series. 2021;22(4):19–40.
* View Article
* Google Scholar
9. 9. Bhuwaneshwar K, Ahamed M. Optimum stratification for a generalized auxiliary variable proportional allocation under a superpopulation model. Commun Stat Theory Methods. 2020;1–16.
* View Article
* Google Scholar
10. 10. Kozak M, Verma M. Geometric versus optimization approach to stratification: a comparison of efficiency. Surv Methodol. 2006;32(2):157–63.
* View Article
* Google Scholar
11. 11. Kozak M. Comparison of random search method and genetic algorithm for stratification. Commun Stat Simul Comput. 2014;43(2):249–53.
* View Article
* Google Scholar
12. 12. Reddy KG, Khan MGM. Optimal stratification in stratified designs using weibull-distributed auxiliary information. Communications in Statistics - Theory and Methods. 2019;48(12):3136–52.
* View Article
* Google Scholar
13. 13. Reddy KG, Khan MGM. stratifyR: An R Package for optimal stratification and sample allocation for univariate populations. Aus NZ J of Statistics. 2020;62(3):383–405.
* View Article
* Google Scholar
14. Rivest LP. A generalization of the Lavallée and Hidiroglou algorithm for stratification in business surveys. Surv Methodol. 2002;28(2):191–8.
15. Silva Semaan G, de Moura Brito JA, Machado Coelho I, Franco Silva E, Cesar Fadel A, Satoru Ochi L, et al. A Brief History of Heuristics: from Bounded Rationality to Intractability. IEEE Latin Am Trans. 2020;18(11):1975–86.
16. André Brito J, de Lima L, Henrique González P, Oliveira B, Maculan N. Heuristic approach applied to the optimum stratification problem. RAIRO-Oper Res. 2021;55(2):979–96.
17. Brito JA, Veiga T, Silva P. An optimization algorithm applied to the one-dimensional stratification problem. Surv Methodol. 2019;45(2):295–315.
18. Rizvi SEH, Danish F. Approximately optimum strata boundaries under super population model. IJMOR. 2022;1(1):1.
19. Cochran WG. Sampling techniques. 3rd ed. New York: John Wiley & Sons; 2007.
20. Groves RM. Survey errors and survey costs. Hoboken: John Wiley & Sons; 2005.
21. Kish L. Survey sampling. New York: John Wiley & Sons; 1965.
22. Lavallée P. Two-way optimal stratification using dynamic programming. Proc Sect Surv Res Methods. 1988;646–651.
23. Mahalanobis P. Some aspects of the design of sample surveys. Sankhya, Indian J Stat. 1952;1–7.
24. Hansen M, Hurwitz W. On the theory of sampling from finite populations. Ann Math Stat. 1943;14(4):333–62.
25. Aoyama H. A study of the stratified random sampling. Ann Inst Stat Math. 1954;6(1):1–36.
26. Ekman G. Approximate expressions for the conditional mean and variance over small intervals of a continuous distribution. Ann Math Stat. 1959;30(4):1131–4.
27. Dalenius T, Hodges JJ. The choice of stratification points. Scand Actuar J. 1957;3–4:198–203.
28. Hedlin D. A procedure for stratification by an extended Ekman rule. J Off Stat. 2000;16(1):15.
29. Khan M, Reddy K, Rao D. Designing stratified sampling in economic and business surveys. J Appl Stat. 2015;42(10):2080–99.
30. Reddy KG, Khan MGM, Khan S. Optimum strata boundaries and sample sizes in health surveys using auxiliary variables. PLoS One. 2018;13(4):e0194787. pmid:29621265
31. Reddy K, Khan M. Optimal stratification in stratified designs using Weibull-distributed auxiliary information. Commun Stat Theory Methods. 2018;1–20.
32. Reddy KG, Khan MGM. Constructing efficient strata boundaries in stratified sampling using survey cost. Heliyon. 2023;9(11):e21407. pmid:37964820
33. Bühler W, Deutler T. Optimal stratification and grouping by dynamic programming. Metrika. 1975;22(1):161–75.
34. Khan M, Nand N, Ahmad N. Determining the optimum strata boundary points using dynamic programming. Surv Methodol. 2008;34(2):205–14.
35. Khan E, Khan M, Ahsan M. Optimum stratification: a mathematical programming approach. Calc Stat Assoc Bull. 2002;52:323–33.
36. Dua D, Graff C. UCI Machine Learning Repository [Internet]. Irvine, CA: University of California, School of Information and Computer Science; 2019. Available from: http://archive.ics.uci.edu/ml
37. Martínez CR, German AH, Marvelio AM, Florentin S. Neutrosophy for survey analysis in social sciences. Neutrosophic Sets Syst. 2020;37(1).
38. Kozak M. Optimal stratification using mathematical programming. Surv Methodol. 2004;30(1):57–65.
39. Khan M, Ahmed S, Shabbir J. Mathematical programming approaches to stratified sampling design. Stat Oper Res Trans. 2018;42(2):139–64.
40. Valencia-Cruzaty LE, Reyes-Tomalá M, Castillo-Gallo CM, Smarandache F. A neutrosophic statistic method to predict tax time series in Ecuador. Neutrosophic Sets Syst. 2020;34:33–9.
41. Dalenius T, Hodges J. Minimum variance stratification. J Am Stat Assoc. 1959;54(285):88–101.
42. Gunning P, Horgan J. A new algorithm for the construction of strata boundaries in skewed populations. Surv Methodol. 2004;30(2):159–66.
43. Gunning P, Horgan J. A new algorithm for the construction of stratum boundaries in skewed populations. Surv Methodol. 2004;30(2):159–66.
44. Lavallée P, Hidiroglou M. On the stratification of skewed populations. Surv Methodol. 1988;14(1):33–43.
45. Verma MR, Singh Joorel JP, Agnihotri RK. Approximately optimum stratification for sensitive quantitative variables in pps sampling. Int J Agric Stat Sci. 2019;15(1).
Citation: Danish F (2025) A design-based framework for optimal stratification using super-population models with application on real data set of breast cancer. PLoS One 20(5): e0323619. https://doi.org/10.1371/journal.pone.0323619
About the Authors:
Faizan Danish
Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing
E-mail: [email protected]
Affiliation: Department of Mathematics, School of Advanced Sciences, VIT-AP University, Inavolu, Beside AP Secretariat, Amaravati, Andhra Pradesh, India
ORCID: https://orcid.org/0000-0002-0476-9744
1. Dalenius T. The Problem of Optimum Stratification. Scandinavian Actuarial Journal. 1950;1950(3–4):203–13.
2. Sadasivan G, Aggarwal R. Optimum points of stratification in bivariate populations. Sankhya C. 1978;40:84–97.
3. Ghosh SP. Optimum stratification with two characters. Ann Math Statist. 1963;34:866–72.
4. Singh R. Approximately optimum stratification on the auxiliary variable. J Am Stat Assoc. 1971;66:829–33.
5. Dalenius T, Gurney M. The problem of optimum stratification. II. Scandinavian Actuarial Journal. 1951;1951(1–2):133–48.
6. Danish F, Rizvi SEH, Jeelani MI, Reashi JA. Obtaining Strata Boundaries under Proportional Allocation with Varying Cost of Every Unit. Pak.j.stat.oper.res. 2017;13(3):567.
7. Danish F, Rizvi SEH. Optimum stratification in bivariate auxiliary variables under Neyman allocation. J Mod Appl Stat Methods. 2018;17(1).
8. Danish F, Rizvi SEH. Approximately optimum strata boundaries for two concomitant stratification variables under proportional allocation. Statistics in Transition New Series. 2021;22(4):19–40.
9. Bhuwaneshwar K, Ahamed M. Optimum stratification for a generalized auxiliary variable proportional allocation under a superpopulation model. Commun Stat Theory Methods. 2020;1–16.
10. Kozak M, Verma M. Geometric versus optimization approach to stratification: a comparison of efficiency. Surv Methodol. 2006;32(2):157–63.
11. Kozak M. Comparison of random search method and genetic algorithm for stratification. Commun Stat Simul Comput. 2014;43(2):249–53.
12. Reddy KG, Khan MGM. Optimal stratification in stratified designs using weibull-distributed auxiliary information. Communications in Statistics - Theory and Methods. 2019;48(12):3136–52.
13. Reddy KG, Khan MGM. stratifyR: An R Package for optimal stratification and sample allocation for univariate populations. Aus NZ J of Statistics. 2020;62(3):383–405.
14. Rivest LP. A generalization of the Lavallée and Hidiroglou algorithm for stratification in business surveys. Surv Methodol. 2002;28(2):191–8.
15. Silva Semaan G, de Moura Brito JA, Machado Coelho I, Franco Silva E, Cesar Fadel A, Satoru Ochi L, et al. A Brief History of Heuristics: from Bounded Rationality to Intractability. IEEE Latin Am Trans. 2020;18(11):1975–86.
16. André Brito J, de Lima L, Henrique González P, Oliveira B, Maculan N. Heuristic approach applied to the optimum stratification problem. RAIRO-Oper Res. 2021;55(2):979–96.
17. Brito JA, Veiga T, Silva P. An optimization algorithm applied to the one-dimensional stratification problem. Surv Methodol. 2019;45(2):295–315.
18. Rizvi SEH, Danish F. Approximately optimum strata boundaries under super population model. IJMOR. 2022;1(1):1.
19. Cochran WG. Sampling techniques. 3rd ed. New York: John Wiley & Sons; 2007.
20. Groves RM. Survey errors and survey costs. Hoboken: John Wiley & Sons; 2005.
21. Kish L. Survey sampling. New York: John Wiley & Sons; 1965.
22. Lavallée P. Two-way optimal stratification using dynamic programming. Proc Sect Surv Res Methods. 1988;646–651.
23. Mahalanobis P. Some aspects of the design of sample surveys. Sankhya, Indian J Stat. 1952;1–7.
24. Hansen M, Hurwitz W. On the theory of sampling from finite populations. Ann Math Stat. 1943;14(4):333–62.
25. Aoyama H. A study of the stratified random sampling. Ann Inst Stat Math. 1954;6(1):1–36.
26. Ekman G. Approximate expressions for the conditional mean and variance over small intervals of a continuous distribution. Ann Math Stat. 1959;30(4):1131–4.
27. Dalenius T, Hodges JJ. The choice of stratification points. Scand Actuar J. 1957;3–4:198–203.
28. Hedlin D. A procedure for stratification by an extended Ekman rule. J Off Stat. 2000;16(1):15.
29. Khan M, Reddy K, Rao D. Designing stratified sampling in economic and business surveys. J Appl Stat. 2015;42(10):2080–99.
30. Reddy KG, Khan MGM, Khan S. Optimum strata boundaries and sample sizes in health surveys using auxiliary variables. PLoS One. 2018;13(4):e0194787. pmid:29621265
31. Reddy K, Khan M. Optimal stratification in stratified designs using Weibull-distributed auxiliary information. Commun Stat Theory Methods. 2018;1–20.
32. Reddy KG, Khan MGM. Constructing efficient strata boundaries in stratified sampling using survey cost. Heliyon. 2023;9(11):e21407. pmid:37964820
33. Bühler W, Deutler T. Optimal stratification and grouping by dynamic programming. Metrika. 1975;22(1):161–75.
34. Khan M, Nand N, Ahmad N. Determining the optimum strata boundary points using dynamic programming. Surv Methodol. 2008;34(2):205–14.
35. Khan E, Khan M, Ahsan M. Optimum stratification: a mathematical programming approach. Calc Stat Assoc Bull. 2002;52:323–33.
36. Dua D, Graff C. UCI Machine Learning Repository [Internet]. Irvine, CA: University of California, School of Information and Computer Science; 2019. Available from: http://archive.ics.uci.edu/ml
37. Martínez CR, German AH, Marvelio AM, Florentin S. Neutrosophy for survey analysis in social sciences. Neutrosophic Sets Syst. 2020;37(1).
38. Kozak M. Optimal stratification using mathematical programming. Surv Methodol. 2004;30(1):57–65.
39. Khan M, Ahmed S, Shabbir J. Mathematical programming approaches to stratified sampling design. Stat Oper Res Trans. 2018;42(2):139–64.
40. Valencia-Cruzaty LE, Reyes-Tomalá M, Castillo-Gallo CM, Smarandache F. A neutrosophic statistic method to predict tax time series in Ecuador. Neutrosophic Sets Syst. 2020;34:33–9.
41. Dalenius T, Hodges J. Minimum variance stratification. J Am Stat Assoc. 1959;54(285):88–101.
42. Gunning P, Horgan J. A new algorithm for the construction of strata boundaries in skewed populations. Surv Methodol. 2004;30(2):159–66.
43. Gunning P, Horgan J. A new algorithm for the construction of stratum boundaries in skewed populations. Surv Methodol. 2004;30(2):159–66.
44. Lavallée P, Hidiroglou M. On the stratification of skewed populations. Surv Methodol. 1988;14(1):33–43.
45. Verma MR, Singh Joorel JP, Agnihotri RK. Approximately optimum stratification for sensitive quantitative variables in pps sampling. Int J Agric Stat Sci. 2019;15(1).
© 2025 Faizan Danish. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.