Estimation of population parameters using sample extremes from nonconstant sample sizes

1 Introduction

We consider a data set where each observation is the minimum or maximum of a sample of independent, identically distributed (iid) random variables drawn from a population with unknown parameters. That is, our data consists of $Y_j = \max\{X_{1j}, \ldots, X_{n_j j}\}$ or $W_j = \min\{X_{1j}, \ldots, X_{n_j j}\}$ for j = 1, …, m. Here m represents the number of known sample extremes, while n_j represents the sample size of the jth sample over which the minimum or maximum is computed. While the X_ij are iid draws from a population with unknown parameters, the X_ij values themselves are not directly observable. Rather, we consider the case where it is only possible to measure the minimum or maximum of each sample. We seek to use the minimum and maximum values in order to estimate the unknown population parameters.

In [1], estimators for the exponential and normal distributions were derived assuming the n_j sample sizes are constant. Those two population distributions were considered since they arise frequently in applications and since they allow for explicit derivations. However, constant sample sizes rarely occur in applications. For example, [2, 3] examined the fertilization process in the flowering plant Arabidopsis thaliana. Their experiments begin by placing pollen on the stigma of the plant; they then wait a certain time period, kill the plant, and image it in order to view the fertilization progress. In the images, the researchers are able to view the pollen tubes, which grow down from the stigma towards the plant ovules. When a pollen tube reaches an ovule, it can fertilize the ovule, which then develops into a seed. The average fertilization speed is of biological interest and can be calculated by dividing the average pollen tube length by the amount of time elapsed since the pollen was placed on the stigma. However, due to the high density of pollen tubes, it is only feasible for the researchers to measure the longest pollen tube within each plant, and yet they wish to use these longest measurements to estimate the overall average length. Hence, their data fits our framework of estimating a population parameter with sample extremes. The n_j sample sizes in this application represent the number of pollen tubes within the jth plant. It is not practical for researchers to place exactly the same number of pollen grains on each plant, and hence the n_j values will naturally vary. While [1] estimated the average pollen tube length by assuming a constant sample size for the number of pollen tubes within each plant, we seek to improve the estimate by deriving a new estimator for the population mean that allows for the sample sizes themselves to be random variables.

In Section 2, which focuses on the case where the population is exponentially distributed, we derive new estimators for the population mean under increasingly realistic assumptions for the n_j sample sizes. We also derive the variances of our new estimators, and consider cases where the n_j sample sizes follow either a uniform or Poisson distribution. We compare the accuracy and precision of our new estimators to the original estimator computed in [1] under the constant sample size assumption. When the population is normally distributed, it is not feasible to analytically derive unbiased estimators for the population parameters using sample extremes with nonconstant sample sizes. Hence, in Section 3, we focus solely on analyzing the performance of the original estimators from [1] for the normal distribution when the constant sample size assumption is violated and the average sample size is instead used. In Section 4, we compare our methodology to that of maximum likelihood estimation. Lastly, in Section 5, we apply our results to the plant pollination example.

2 Estimator for exponential distribution

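In this section we assume that $X_{ij} \sim \text{Exponential}(\beta)$, where β is the unknown population mean. We set $Y_j = \max\{X_{1j}, \ldots, X_{n_j j}\}$ and $W_j = \min\{X_{1j}, \ldots, X_{n_j j}\}$ for j = 1, …, m.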

2.1 Sample sizes n_j are all equal to a known value n

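In [1], it was shown that in the case of constant sample size n, the estimator

$$\hat{\beta}_{\max} = \frac{\bar{Y}}{\sum_{i=1}^{n} \frac{1}{i}} \tag{1}$$

is an unbiased estimator for β with

$$\operatorname{Var}\left(\hat{\beta}_{\max}\right) = \frac{\beta^2 \sum_{i=1}^{n} \frac{1}{i^2}}{m \left(\sum_{i=1}^{n} \frac{1}{i}\right)^2} \tag{2}$$

where $\bar{Y} = \frac{1}{m}\sum_{j=1}^{m} Y_j$ is the mean of the sample maxima. While the prior work did not consider the case of a sample of sample minima, it follows directly from properties of the exponential distribution [4] and the methods in [1] that the estimator

$$\hat{\beta}_{\min} = n \bar{W} = \frac{n}{m}\sum_{j=1}^{m} W_j \tag{3}$$

is also an unbiased estimator for β with

$$\operatorname{Var}\left(\hat{\beta}_{\min}\right) = \frac{\beta^2}{m}. \tag{4}$$

While there exists an unbiased estimator using either sample maxima or sample minima, we observe from Eqs (2) and (4) that the variance using sample maxima is strictly smaller for n ≥ 2, with $\operatorname{Var}(\hat{\beta}_{\max}) \to 0$ at a rate proportional to $1/(\ln n)^2$ as n → ∞, while $\operatorname{Var}(\hat{\beta}_{\min})$ is constant with respect to n. Hence, more precise estimates for the exponential distribution can be obtained from using maximum values compared to using minimum values.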
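The constant-sample-size estimators can be sketched in a few lines. This is only an illustration, not the authors' code: it assumes Eq (1) is the mean of the maxima divided by the harmonic sum 1 + 1/2 + … + 1/n and Eq (3) is n times the mean of the minima, which is what the exponential distribution's properties give. The function names, the seed, and the values β = 3, n = 100, m = 10 are choices made for the example.

```python
import random

def harmonic(n, power=1):
    """Return sum_{i=1}^{n} 1 / i**power."""
    return sum(1.0 / i**power for i in range(1, n + 1))

def beta_hat_max(maxima, n):
    """Assumed form of Eq (1): mean of the maxima divided by the harmonic sum of n."""
    return sum(maxima) / (len(maxima) * harmonic(n))

def beta_hat_min(minima, n):
    """Assumed form of Eq (3): n times the mean of the minima."""
    return n * sum(minima) / len(minima)

def var_ratio(n):
    """Var(max-based estimator) / Var(min-based estimator) for the same m and beta."""
    return harmonic(n, 2) / harmonic(n) ** 2

# Illustrative data: m = 10 samples of n = 100 exponential draws with mean beta = 3.
rng = random.Random(0)
beta, n, m = 3.0, 100, 10
samples = [[rng.expovariate(1.0 / beta) for _ in range(n)] for _ in range(m)]
print("estimate from maxima:", beta_hat_max([max(s) for s in samples], n))
print("estimate from minima:", beta_hat_min([min(s) for s in samples], n))
print("variance ratio (max / min):", var_ratio(n))
```

The variance ratio depends only on n, so it shows directly how much precision is gained by recording maxima rather than minima.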

2.2 Sample sizes n_j are unequal, but are all known values

As mentioned in Section 1, it is unrealistic for the sample sizes to be equal in applications. Hence, we begin to improve the estimation framework by allowing the sample sizes to vary, but with the restriction that the sample sizes are known values. It follows directly from properties of the exponential distribution [4] and the methods in [1] that the estimator

$$\hat{\beta}_{\max}^{\text{known}} = \frac{1}{m}\sum_{j=1}^{m} \frac{Y_j}{\sum_{i=1}^{n_j} \frac{1}{i}} \tag{5}$$

with variance

$$\operatorname{Var}\left(\hat{\beta}_{\max}^{\text{known}}\right) = \frac{\beta^2}{m^2}\sum_{j=1}^{m} \frac{\sum_{i=1}^{n_j} \frac{1}{i^2}}{\left(\sum_{i=1}^{n_j} \frac{1}{i}\right)^2} \tag{6}$$

and the estimator

$$\hat{\beta}_{\min}^{\text{known}} = \frac{1}{m}\sum_{j=1}^{m} n_j W_j \tag{7}$$

with variance

$$\operatorname{Var}\left(\hat{\beta}_{\min}^{\text{known}}\right) = \frac{\beta^2}{m} \tag{8}$$

are unbiased for estimating the population mean β. The estimators introduced in Eqs (5) and (7) reduce to the previous estimators given in Eqs (1) and (3) when all the n_j are equal. Hence, these estimators are simply generalizations that allow them to be utilized in a broader range of applications. Yet, in some applications, not only do the sample sizes vary, but the sample sizes are also unknown. For example, in the plant pollination application described in Section 1, hundreds of pollen grains are placed on each plant and it is not possible for the researchers to count the exact numbers. Hence, the sample sizes are unknown, but the researchers do have an idea of the probability distribution of the sample sizes. In the following subsection, we further improve the estimation framework by considering the case where the sample sizes are themselves random variables.
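A small simulation can illustrate unbiasedness when the sample sizes are unequal but known. The sketch below assumes that the maxima-based estimator averages Y_j divided by the harmonic sum of n_j and that the minima-based estimator averages n_j W_j; the second form matches the maximum likelihood estimator described in Section 4, while the first is an assumption made by analogy. The sample sizes, seed, and number of replications are arbitrary illustrative choices.

```python
import random

def harmonic(n):
    """Return the harmonic sum 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def beta_hat_max_known(maxima, sizes):
    """Average of Y_j / harmonic(n_j); assumed form of Eq (5)."""
    return sum(y / harmonic(n) for y, n in zip(maxima, sizes)) / len(maxima)

def beta_hat_min_known(minima, sizes):
    """Average of n_j * W_j; assumed form of Eq (7)."""
    return sum(n * w for w, n in zip(minima, sizes)) / len(minima)

def average_estimates(beta, sizes, reps, seed):
    """Average both estimators over `reps` simulated data sets."""
    rng = random.Random(seed)
    total_max = total_min = 0.0
    for _ in range(reps):
        samples = [[rng.expovariate(1.0 / beta) for _ in range(n)] for n in sizes]
        total_max += beta_hat_max_known([max(s) for s in samples], sizes)
        total_min += beta_hat_min_known([min(s) for s in samples], sizes)
    return total_max / reps, total_min / reps

# Both long-run averages should land near the true mean beta = 3.
avg_max, avg_min = average_estimates(beta=3.0, sizes=[5, 10, 20, 40], reps=2000, seed=1)
print("average of maxima-based estimates:", avg_max)
print("average of minima-based estimates:", avg_min)
```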

2.3 Sample sizes n_j are random with known distribution

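We now suppose that the n_j sample sizes are iid random variables. We assume that the n_j values are unknown, but come from some known probability distribution, such as the uniform or Poisson distribution. It immediately follows from the proof of Theorem 1 in [1] that

$$E[Y_j \mid n_j] = \beta \sum_{i=1}^{n_j} \frac{1}{i} \quad \text{and} \quad \operatorname{Var}(Y_j \mid n_j) = \beta^2 \sum_{i=1}^{n_j} \frac{1}{i^2}.$$

By properties of conditional expectation and conditional variance [4],

$$E[Y_j] = E\big[E[Y_j \mid n_j]\big] = \beta\, E\left[\sum_{i=1}^{n_j} \frac{1}{i}\right]$$

and

$$\operatorname{Var}(Y_j) = E\big[\operatorname{Var}(Y_j \mid n_j)\big] + \operatorname{Var}\big(E[Y_j \mid n_j]\big) = \beta^2 \left( E\left[\sum_{i=1}^{n_j} \frac{1}{i^2}\right] + \operatorname{Var}\left(\sum_{i=1}^{n_j} \frac{1}{i}\right) \right).$$

Thus, the estimator

$$\hat{\beta}_{\max}^{\text{rand}} = \frac{\bar{Y}}{E\left[\sum_{i=1}^{n_j} \frac{1}{i}\right]} \tag{9}$$

with

$$\operatorname{Var}\left(\hat{\beta}_{\max}^{\text{rand}}\right) = \frac{\beta^2 \left( E\left[\sum_{i=1}^{n_j} \frac{1}{i^2}\right] + \operatorname{Var}\left(\sum_{i=1}^{n_j} \frac{1}{i}\right) \right)}{m \left( E\left[\sum_{i=1}^{n_j} \frac{1}{i}\right] \right)^2} \tag{10}$$

is unbiased for estimating β. Likewise,

$$E[W_j] = \beta\, E\left[\frac{1}{n_j}\right]$$

and

$$\operatorname{Var}(W_j) = \beta^2 \left( 2 E\left[\frac{1}{n_j^2}\right] - \left( E\left[\frac{1}{n_j}\right] \right)^2 \right).$$

Hence, the estimator

$$\hat{\beta}_{\min}^{\text{rand}} = \frac{\bar{W}}{E\left[\frac{1}{n_j}\right]} \tag{11}$$

with

$$\operatorname{Var}\left(\hat{\beta}_{\min}^{\text{rand}}\right) = \frac{\beta^2 \left( 2 E\left[\frac{1}{n_j^2}\right] - \left( E\left[\frac{1}{n_j}\right] \right)^2 \right)}{m \left( E\left[\frac{1}{n_j}\right] \right)^2} \tag{12}$$

is also unbiased for estimating β. Here $\bar{Y}$ and $\bar{W}$ denote the means of the m sample maxima and the m sample minima, respectively.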

We observe from comparing Eqs (10) and (12) with Eqs (2) and (4), respectively, that the variance of the estimators is higher when the sample sizes are random variables compared to when the sample sizes are constant. This relationship is unsurprising since it is intuitive that higher variation in the sample sizes will result in higher variation in the estimation of the population mean.

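In order to actually calculate the estimators $\hat{\beta}_{\max}^{\text{rand}}$ (Eq 9) and $\hat{\beta}_{\min}^{\text{rand}}$ (Eq 11), the values of $E\left[\sum_{i=1}^{n_j} \frac{1}{i}\right]$ and $E\left[\frac{1}{n_j}\right]$ must first be computed. These expected values can be computed exactly if the probability distribution of the n_j is known. For example, if $n_j \sim \text{Uniform}\{a, a+1, \ldots, b\}$ with known values for a and b, then $E[n_j] = \frac{a+b}{2}$,

$$E\left[\sum_{i=1}^{n_j} \frac{1}{i}\right] = \frac{1}{b-a+1}\sum_{k=a}^{b}\sum_{i=1}^{k} \frac{1}{i}$$

and

$$E\left[\frac{1}{n_j}\right] = \frac{1}{b-a+1}\sum_{k=a}^{b} \frac{1}{k}.$$

Or if the n_j follow a Poisson distribution (shifted to start at 1), i.e. $n_j - 1 \sim \text{Poisson}(\lambda)$ with known λ, then E[n_j] = λ + 1,

$$E\left[\sum_{i=1}^{n_j} \frac{1}{i}\right] = \sum_{k=1}^{\infty} \frac{e^{-\lambda}\lambda^{k-1}}{(k-1)!}\sum_{i=1}^{k} \frac{1}{i}$$

and

$$E\left[\frac{1}{n_j}\right] = \sum_{k=1}^{\infty} \frac{e^{-\lambda}\lambda^{k-1}}{k!} = \frac{1 - e^{-\lambda}}{\lambda}.$$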
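The expected values needed for the random-sample-size estimators are finite or rapidly converging sums, so they are easy to evaluate numerically. The following sketch assumes the uniform distribution is discrete on the integers a through b and truncates the shifted Poisson sum far into the tail; the helper names and the truncation rule are illustrative choices, not part of the original work.

```python
import math

def harmonic(n):
    """Return the harmonic sum 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def expect(pmf, g):
    """E[g(N)] for a probability mass function given as {n: probability}."""
    return sum(p * g(n) for n, p in pmf.items())

def uniform_pmf(a, b):
    """Discrete uniform distribution on the integers a, a+1, ..., b."""
    return {n: 1.0 / (b - a + 1) for n in range(a, b + 1)}

def shifted_poisson_pmf(lam):
    """pmf of N = 1 + Poisson(lam), truncated far into the upper tail."""
    n_terms = int(lam + 20.0 * math.sqrt(lam) + 50)
    pmf, p = {}, math.exp(-lam)  # p = P(Poisson = 0); underflows for very large lam
    for k in range(n_terms):
        pmf[k + 1] = p
        p *= lam / (k + 1)       # P(Poisson = k + 1) from P(Poisson = k)
    return pmf

def beta_hat_max_random(maxima, pmf):
    """Mean of the maxima divided by E[harmonic(n_j)]; assumed form of Eq (9)."""
    return (sum(maxima) / len(maxima)) / expect(pmf, harmonic)

def beta_hat_min_random(minima, pmf):
    """Mean of the minima divided by E[1 / n_j]; assumed form of Eq (11)."""
    return (sum(minima) / len(minima)) / expect(pmf, lambda n: 1.0 / n)

for name, pmf in (("uniform on 50..150", uniform_pmf(50, 150)),
                  ("shifted Poisson, mean 100", shifted_poisson_pmf(99.0))):
    print(name, expect(pmf, harmonic), expect(pmf, lambda n: 1.0 / n))
```

For the shifted Poisson case, the computed value of E[1/n_j] can be checked against the closed form (1 − e^(−λ))/λ.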

2.4 Sample sizes n_j are random with only known mean

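If the full probability distribution of the n_j is not known, but the average sample size, E[n_j], is known, it would be reasonable to approximate $E\left[\sum_{i=1}^{n_j} \frac{1}{i}\right]$ and $E\left[\frac{1}{n_j}\right]$ with $\sum_{i=1}^{E[n_j]} \frac{1}{i}$ and $\frac{1}{E[n_j]}$, respectively. Using these approximations yields the following estimators:

$$\hat{\beta}_{\max}^{\text{avg}} = \frac{\bar{Y}}{\sum_{i=1}^{E[n_j]} \frac{1}{i}} \tag{13}$$

and

$$\hat{\beta}_{\min}^{\text{avg}} = E[n_j]\, \bar{W}. \tag{14}$$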

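Since $\sum_{i=1}^{n_j} \frac{1}{i}$ is a concave function of n_j, $E\left[\sum_{i=1}^{n_j} \frac{1}{i}\right] \le \sum_{i=1}^{E[n_j]} \frac{1}{i}$ by Jensen’s inequality [5]. Hence, $\hat{\beta}_{\max}^{\text{avg}}$ (Eq 13) is a biased estimator of β that tends to underestimate the true value of β. Conversely, since $\frac{1}{n_j}$ is a convex function of n_j, $E\left[\frac{1}{n_j}\right] \ge \frac{1}{E[n_j]}$, which makes $\hat{\beta}_{\min}^{\text{avg}}$ (Eq 14) a biased estimator that tends to overestimate the true value of β.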

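Note that $\hat{\beta}_{\max}^{\text{avg}}$ (Eq 13) and $\hat{\beta}_{\min}^{\text{avg}}$ (Eq 14) are equivalent to using the original estimators $\hat{\beta}_{\max}$ (Eq 1) and $\hat{\beta}_{\min}$ (Eq 3) and replacing the constant sample size n with E[n_j]. In the next subsection, we compare the performance of the various estimators.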

2.5 Comparison of estimators

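The estimators $\hat{\beta}_{\max}^{\text{known}}$ (Eq 5), $\hat{\beta}_{\max}^{\text{rand}}$ (Eq 9), and $\hat{\beta}_{\max}^{\text{avg}}$ (Eq 13) based upon a sample of sample maxima all reduce to the estimator $\hat{\beta}_{\max}$ (Eq 1) when the sample sizes n_j are all equal to a constant value n. Likewise, the estimators $\hat{\beta}_{\min}^{\text{known}}$ (Eq 7), $\hat{\beta}_{\min}^{\text{rand}}$ (Eq 11), and $\hat{\beta}_{\min}^{\text{avg}}$ (Eq 14) based upon a sample of sample minima all reduce to the estimator $\hat{\beta}_{\min}$ (Eq 3) in the case of known equal sample sizes. As mentioned previously, the estimators $\hat{\beta}_{\max}^{\text{known}}$ and $\hat{\beta}_{\min}^{\text{known}}$ allow for unequal sample sizes, but still require that the sample sizes are known values. On the other hand, the estimators $\hat{\beta}_{\max}^{\text{rand}}$, $\hat{\beta}_{\min}^{\text{rand}}$, $\hat{\beta}_{\max}^{\text{avg}}$, and $\hat{\beta}_{\min}^{\text{avg}}$ allow for the sample sizes to be unknown iid random variables, which is the most realistic scenario to occur in applications.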

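Although $\hat{\beta}_{\max}^{\text{rand}}$ (Eq 9), $\hat{\beta}_{\min}^{\text{rand}}$ (Eq 11), $\hat{\beta}_{\max}^{\text{avg}}$ (Eq 13), and $\hat{\beta}_{\min}^{\text{avg}}$ (Eq 14) all allow for random sample sizes, $\hat{\beta}_{\max}^{\text{rand}}$ and $\hat{\beta}_{\min}^{\text{rand}}$ require that the full probability distribution of the sample sizes is known, while $\hat{\beta}_{\max}^{\text{avg}}$ and $\hat{\beta}_{\min}^{\text{avg}}$ only require the average sample size to be known. Thus, the estimators $\hat{\beta}_{\max}^{\text{avg}}$ and $\hat{\beta}_{\min}^{\text{avg}}$, which are equivalent to the original estimators $\hat{\beta}_{\max}$ (Eq 1) and $\hat{\beta}_{\min}$ (Eq 3) when the constant sample size is simply replaced with the average sample size, have the advantage of applying in the broadest range of circumstances. Yet, they have the disadvantage of being the only biased estimators presented here.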

Fig 1 illustrates the ratio between $\hat{\beta}_{\max}^{\text{rand}}$ (Eq 9) and $\hat{\beta}_{\max}^{\text{avg}}$ (Eq 13), which is equal to the ratio between $\sum_{i=1}^{E[n_j]} \frac{1}{i}$ and $E\left[\sum_{i=1}^{n_j} \frac{1}{i}\right]$, as well as the ratio between $\hat{\beta}_{\min}^{\text{rand}}$ (Eq 11) and $\hat{\beta}_{\min}^{\text{avg}}$ (Eq 14), which is equal to the ratio between $1/E\left[\frac{1}{n_j}\right]$ and E[n_j], in the case where the sample sizes follow a uniform distribution. We observe in Fig 1 that the ratio when using sample maxima ($\hat{\beta}_{\max}^{\text{rand}} / \hat{\beta}_{\max}^{\text{avg}}$) is always greater than one since $\hat{\beta}_{\max}^{\text{avg}}$ tends to underestimate the true population mean. In contrast, the ratio when using sample minima ($\hat{\beta}_{\min}^{\text{rand}} / \hat{\beta}_{\min}^{\text{avg}}$) is always less than one since $\hat{\beta}_{\min}^{\text{avg}}$ tends to overestimate the true population mean. In either case, the ratio between the two estimators is further from one when the width of the uniform distribution is larger since a wider interval results in more variability in the sample sizes. However, the ratio of the two estimators converges to one as the average sample size E[n_j] increases. The convergence is faster when using sample maxima compared to when using sample minima. Similar patterns can be observed when the sample sizes follow other probability distributions, such as the Poisson distribution. Overall, Fig 1 illustrates that although $\hat{\beta}_{\max}^{\text{avg}}$ and $\hat{\beta}_{\min}^{\text{avg}}$, which simply use the average sample size, are biased, they can still result in good estimates, especially when the average sample size is large.
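The ratios discussed above do not depend on the data, only on the distribution of the sample sizes, so the qualitative pattern can be reproduced with a short calculation. The sketch below is not the code behind Fig 1; it assumes a discrete uniform distribution centred on the average sample size, and the particular means and widths are illustrative.

```python
def harmonic(n):
    """Return the harmonic sum 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def ratios(mean_n, width):
    """Return (maxima ratio, minima ratio) of the distribution-based estimator to the
    average-based estimator, for n_j uniform on the integers within width/2 of mean_n."""
    a, b = mean_n - width // 2, mean_n + width // 2
    sizes = range(a, b + 1)
    e_harmonic = sum(harmonic(n) for n in sizes) / len(sizes)  # E[sum_{i<=n_j} 1/i]
    e_inverse = sum(1.0 / n for n in sizes) / len(sizes)       # E[1/n_j]
    return harmonic(mean_n) / e_harmonic, (1.0 / e_inverse) / mean_n

for mean_n in (60, 100, 500):
    for width in (10, 50, 100):
        r_max, r_min = ratios(mean_n, width)
        print(f"E[n_j]={mean_n:4d} width={width:4d} maxima ratio={r_max:.5f} minima ratio={r_min:.5f}")
```

Under these assumptions the maxima ratio stays above one, the minima ratio stays below one, and both move toward one as the width shrinks or the average sample size grows.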

[Figure omitted. See PDF.]

3 Estimators for normal distribution

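We now consider the case where $X_{ij} \sim \text{Normal}(\mu, \sigma^2)$ with unknown mean μ and unknown variance σ². We again set $Y_j = \max\{X_{1j}, \ldots, X_{n_j j}\}$ and $W_j = \min\{X_{1j}, \ldots, X_{n_j j}\}$ for j = 1, …, m.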

3.1 Sample sizes n_j are all equal to a known value n

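The estimators given in [1] when using a sample of maximum values and assuming a constant sample size n are

$$\hat{\mu}_{\max} = \bar{Y} - k_n \sqrt{\frac{S_Y^2}{c_n}} \quad \text{and} \quad \hat{\sigma}^2_{\max} = \frac{S_Y^2}{c_n}. \tag{15}$$

Here $\bar{Y}$ and $S_Y^2$ denote the sample mean and sample variance, respectively, of the Y_j’s, while k_n denotes the mean and c_n denotes the variance of the maximum of n iid Normal(0, 1) random variables. It was shown that $\hat{\sigma}^2_{\max}$ is an unbiased estimator for σ², but that $E[\hat{\mu}_{\max}] \ge \mu$ with $E[\hat{\mu}_{\max}] \to \mu$ as m → ∞.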

Although [1] did not consider the case of a sample of minimum values, we can define the analogous estimators

$$\hat{\mu}_{\min} = \bar{W} + k_n \sqrt{\frac{S_W^2}{c_n}} \quad \text{and} \quad \hat{\sigma}^2_{\min} = \frac{S_W^2}{c_n}, \tag{16}$$

where now $\bar{W}$ and $S_W^2$ denote the sample mean and sample variance, respectively, of the W_j’s. The constants k_n and c_n appear in both sets of estimators since the mean of the minimum of n iid Normal(0, 1) random variables is equal to the negative of the mean of the maximum of n iid Normal(0, 1) random variables, while the variance is the same in both cases. The values for k_n and c_n can be approximated either analytically or through simulations [6].
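As one way to obtain k_n and c_n, the sketch below integrates the density of the maximum of n iid standard normal variables numerically with the trapezoid rule. This is a substitute for the analytical or simulation approaches mentioned above, chosen here because it is deterministic; the integration limits and step count are assumptions. The estimator function assumes the maxima-based estimators take the form implied by the definitions of k_n and c_n: the sample variance of the maxima divided by c_n, and the sample mean of the maxima minus k_n times the estimated standard deviation.

```python
import math
import statistics

def max_normal_moments(n, lo=-10.0, hi=10.0, steps=20000):
    """Mean k_n and variance c_n of the maximum of n iid Normal(0, 1) draws (trapezoid rule)."""
    h = (hi - lo) / steps
    m1 = m2 = 0.0
    for s in range(steps + 1):
        x = lo + s * h
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        dens = n * pdf * cdf ** (n - 1)      # density of the maximum of n draws
        w = 0.5 if s in (0, steps) else 1.0  # trapezoid end-point weights
        m1 += w * x * dens * h
        m2 += w * x * x * dens * h
    return m1, m2 - m1 * m1

def normal_estimates_from_maxima(maxima, n):
    """Return (mu_hat, sigma2_hat) from sample maxima, assuming the form described above."""
    k_n, c_n = max_normal_moments(n)
    sigma2_hat = statistics.variance(maxima) / c_n
    mu_hat = statistics.mean(maxima) - k_n * math.sqrt(sigma2_hat)
    return mu_hat, sigma2_hat

k_100, c_100 = max_normal_moments(100)
print("k_n and c_n for n = 100:", k_100, c_100)
```

For n = 2 the results can be checked against the known closed forms k_2 = 1/√π and c_2 = 1 − 1/π. By symmetry, the minima-based version would add k_n times the estimated standard deviation to the mean of the minima instead of subtracting it.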

Due to the symmetry of the normal distribution, the estimators using minimum values will have equivalent performance to the estimators using maximum values. More specifically, it follows from symmetry that $\hat{\sigma}^2_{\min}$ is also an unbiased estimator for σ² and that $E[\hat{\mu}_{\min}] \le \mu$ with $E[\hat{\mu}_{\min}] \to \mu$ as m → ∞. This equivalent performance using either sample maxima or sample minima is in stark contrast to the case when the population is exponentially distributed. For an exponential population, it was shown in Section 2 that it is more advantageous to estimate the population parameters using a sample of maximum values due to the significantly smaller variance of the estimators in the maxima setting compared to the minima setting.

3.2 Sample sizes n_j are random with known mean

Due to the complexity of the normal probability density function, it is not feasible to derive unbiased estimators for μ and σ² in the case where the sample sizes are unequal known values or are random variables with a known distribution. However, if the n_j values are iid random variables with known average sample size, E[n_j], it is reasonable to replace the k_n and c_n values in Eqs (15) and (16) with $k_{E[n_j]}$ and $c_{E[n_j]}$, respectively.

We evaluate the performance of the estimators when the constant sample size n is replaced with the average sample size E[n_j] through simulations. Although the true population mean μ and true population variance σ² are unknown, we set them equal to 0 and 1, respectively, for our simulations so that we can explicitly compare in our simulations how close $\hat{\mu}$ and $\hat{\sigma}^2$ are to the true values. True values other than 0 or 1 would shift and rescale the distribution, but would not affect the relative performance of the estimators. Our simulations focus on $\hat{\mu}_{\max}$ and $\hat{\sigma}^2_{\max}$, which are based on sample maxima, but due to the symmetry of the normal distribution, $\hat{\mu}_{\min}$ and $\hat{\sigma}^2_{\min}$, which are based on sample minima, would have analogous performance.

Fig 2 compares the performance of both $\hat{\mu}_{\max}$ (top) and $\hat{\sigma}^2_{\max}$ (bottom) when the distribution of the sample sizes n_j is constant, uniform with width 10, uniform with width 50, uniform with width 100, or Poisson. In all cases, m = 10 sample maxima are used, where each maximum is computed over a sample with average size E[n_j] = 100. We observe from Fig 2 that the accuracy and precision of the estimators is essentially identical regardless of whether the sample sizes are constant or randomly distributed. Increasing the variability of the sample sizes through a larger width of the uniform interval does not have a noticeable effect. While changing the values of m and E[n_j] does change the variability of the simulated distributions, it does not alter the relative performance across the various distributions of n_j. Thus, we conclude that when estimating the parameters of the Normal distribution using either sample maxima or sample minima, the constant sample size can be replaced with the average sample size with no significant change in the accuracy and precision of the estimation.

[Figure omitted. See PDF.]

The maxima are computed over samples with average size E[n_j] = 100 for various distributions of the n_j.

4 Comparison to maximum likelihood estimation

Suppose X_ij are drawn iid from an arbitrary population with probability density function f(x|Θ) and cumulative distribution function F(x|Θ), where Θ is the vector of unknown population parameters. Let $W_j = \min\{X_{1j}, \ldots, X_{n_j j}\}$ represent the sample minima and $Y_j = \max\{X_{1j}, \ldots, X_{n_j j}\}$ represent the sample maxima, for j = 1, …, m.

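When the sample sizes n_j are known, then the likelihood function is equal to

$$L(\Theta) = \prod_{j=1}^{m} n_j \left[F(Y_j \mid \Theta)\right]^{n_j - 1} f(Y_j \mid \Theta) \tag{17}$$

when using the sample maxima, and it is equal to

$$L(\Theta) = \prod_{j=1}^{m} n_j \left[1 - F(W_j \mid \Theta)\right]^{n_j - 1} f(W_j \mid \Theta) \tag{18}$$

when using the sample minima.

When the sample sizes n_j are not known values, but rather are random variables drawn iid from a population with probability mass function p(n), then the likelihood function is equal to

$$L(\Theta) = \prod_{j=1}^{m} \sum_{n} p(n)\, n \left[F(Y_j \mid \Theta)\right]^{n - 1} f(Y_j \mid \Theta) \tag{19}$$

when using the sample maxima, and it is equal to

$$L(\Theta) = \prod_{j=1}^{m} \sum_{n} p(n)\, n \left[1 - F(W_j \mid \Theta)\right]^{n - 1} f(W_j \mid \Theta) \tag{20}$$

when using the sample minima.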

Maximum likelihood estimation chooses to estimate Θ with the value that maximizes the likelihood function L(Θ). This value of Θ, called the maximum likelihood estimator or MLE, is often found by differentiating the logarithm of the likelihood function and setting the derivative equal to zero. However, in many cases the equation cannot be solved exactly, and numerical methods must be used in order to approximate the maximum likelihood estimator.

In [7], data from high-voltage power lines was collected to estimate the variance of the corona noise (often heard as a crackling or hissing sound in power lines). Due to limitations of the instruments, only the maximum value of the corona noise was recorded in each three second time lapse, but each maximum value was computed over a sample of size exactly 400. An explicit formula for the maximum likelihood estimator was derived in [7] for the framework of data consisting of sample maxima with equal sample sizes. However, their formula is dependent upon their specific assumption for the population distribution of the corona noise in power lines.

When the population distribution of the X_ij is exponential, as considered in Section 2, then the maximum likelihood estimator can be computed explicitly when using sample minima with known sample sizes, and it is equivalent to our estimator given in Eq (3) when the sample sizes are all equal and is equivalent to our estimator given in Eq (7) when the sample sizes are unequal. When the population is exponentially distributed with known sample sizes, but the data consists of sample maxima, the maximum likelihood estimator cannot be solved for explicitly due to the complexity of the likelihood function. Likewise, when the population is normally distributed or when the sample sizes are unknown with some probability distribution, explicit formulas cannot be found for the maximum likelihood estimators. However, numerical methods can be used to approximate the maximum likelihood estimators in these cases.
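To illustrate the numerical route for exponential data observed through sample maxima with known sample sizes, the sketch below maximizes the log-likelihood with a golden-section search. It assumes the likelihood is the product over j of n_j F(Y_j)^(n_j − 1) f(Y_j) with exponential F and f. The search routine, the bracket for β, and the simulated data are illustrative choices and are unrelated to the Matlab mle function used for the tables.

```python
import math
import random

def log_lik_max(beta, maxima, sizes):
    """Log-likelihood of sample maxima from an exponential population with mean beta."""
    total = 0.0
    for y, n in zip(maxima, sizes):
        log_cdf = math.log(-math.expm1(-y / beta))  # log(1 - exp(-y / beta))
        total += math.log(n) + (n - 1) * log_cdf - math.log(beta) - y / beta
    return total

def golden_max(f, lo, hi, tol=1e-8):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c, d = hi - g * (hi - lo), lo + g * (hi - lo)
        if f(c) > f(d):
            hi = d
        else:
            lo = c
    return (lo + hi) / 2.0

def mle_beta_from_maxima(maxima, sizes):
    """Approximate MLE of beta; the bracket assumes beta lies below the largest maximum."""
    top = max(maxima)
    return golden_max(lambda b: log_lik_max(b, maxima, sizes), 1e-3 * top, top)

rng = random.Random(2)
sizes = [50] * 200
maxima = [max(rng.expovariate(1.0 / 3.0) for _ in range(n)) for n in sizes]
print("approximate MLE of beta:", mle_beta_from_maxima(maxima, sizes))
```

The golden-section search relies on the log-likelihood having a single peak in β, which holds here because it is concave in the rate 1/β.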

Table 1 compares the estimates for the mean β of the exponential distribution when using our formula described in Eq (9) versus when approximating the maximum likelihood estimator using the built-in mle function in Matlab for a custom likelihood function of the form given in Eq (19). Likewise, Table 2 compares the estimates for the mean μ and variance σ² of the normal distribution when using our formulas described in Eq (15) versus when using maximum likelihood estimation. In all cases, 100 simulations were run, each with m = 10 sample maxima and with the raw data drawn from either an exponential distribution with true β = 3 or a normal distribution with true μ = 0 and σ² = 1. The n_j values were simulated from five possible probability distributions, all with E[n_j] set equal to 100. The tables display the mean and standard deviation of the parameter estimates across the 100 simulations for each case, as well as the mean square error (MSE). The mean square error is calculated as the bias (mean of the estimates minus the true parameter value) squared plus the variance of the estimates. Changing the values of β, μ, σ², m, and E[n_j] shifts and rescales the results, but does not change the relative performance of the estimates from our formulas compared to maximum likelihood estimation.
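The summary statistics described above can be computed as in the following sketch. The function name and the example numbers are hypothetical. The population variance (divisor equal to the number of estimates) is used so that bias squared plus variance equals the average squared error exactly; whether the tables used this divisor or the sample variance is not stated in the text.

```python
import math
import statistics

def summarize(estimates, true_value):
    """Mean, standard deviation, bias, and MSE (bias squared plus variance) of the estimates."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - true_value
    variance = statistics.pvariance(estimates)  # divisor len(estimates), an assumption
    return {
        "mean": mean_est,
        "sd": math.sqrt(variance),
        "bias": bias,
        "mse": bias ** 2 + variance,
    }

# Hypothetical estimates of a parameter whose true value is 3.
print(summarize([2.0, 3.0, 4.0, 5.0], true_value=3.0))
```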

[Figure omitted. See PDF.]

The results displayed in Tables 1 and 2 indicate that the parameter estimates from our formulas tend to be centered closer to the true values on average, whereas maximum likelihood estimation consistently overestimates β and μ, while consistently underestimating σ². When the population is exponentially distributed, our formulas produce similar variability in the parameter estimates and a substantially smaller mean square error compared to maximum likelihood estimation. When the population is normally distributed, maximum likelihood estimation produces smaller variability in the parameter estimates and a slightly smaller mean square error compared to our formulas. However, the fact that our formulas almost always result in less bias indicates that our approach is a valuable alternative to maximum likelihood estimation in both cases. Moreover, the key advantage of our methodology is that we have presented explicit formulas for estimators that can be computed quickly by simply plugging in sample data.

5 Biological application

We return now to the example described in Section 1 regarding estimating the mean pollen tube length based upon measurements of the longest pollen tube length in a sample of plants. In the laboratory experiments performed by Swanson et al. [2], either m = 8 or m = 9 individual plants were used for each time point (3, 6, 9, or 24 hours) for each accession (Columbia or Landsberg) of Arabidopsis thaliana. In [1], the mean pollen tube length was estimated assuming a constant number n = 933 of pollen tubes within each plant of the Columbia accession and a constant number n = 727 of pollen tubes within each plant of the Landsberg accession. The estimation was performed assuming that the individual pollen tube lengths are exponentially distributed, which was shown to be a reasonable fit for the data.

The number of pollen tubes is not in fact constant across all the plants, with the 933 and 727 values actually representing the average number of pollen tubes within each plant for the Columbia and Landsberg accessions, respectively. Data from [2] indicates that it is reasonable to assume that the number of pollen tubes within each plant follows a uniform distribution, with range (640, 1226) for the Columbia accession and range (342, 1112) for the Landsberg accession. Hence, we apply our estimation framework derived in Section 2.3 to estimate the overall mean pollen tube length taking into account the probability distribution of the sample sizes. The resulting estimates, along with the original estimates assuming a constant sample size, are listed in Table 3.
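Because the constant-sample-size estimate and the distribution-based estimate differ only in the constant that divides the mean of the maxima, the first can be converted to the second by a single multiplicative factor. The sketch below computes that factor for the two accessions, assuming the stated ranges are inclusive integer ranges of a discrete uniform distribution. The measured pollen tube lengths are not reproduced here, so the code reports only the factor.

```python
def harmonic(n):
    """Return the harmonic sum 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def correction_factor(a, b):
    """Ratio of the harmonic sum at the average sample size to the expected harmonic sum,
    for n_j uniform on the integers a..b (assumes a + b is even so the mean is an integer)."""
    mean_n = (a + b) // 2
    expected_h = sum(harmonic(n) for n in range(a, b + 1)) / (b - a + 1)
    return harmonic(mean_n) / expected_h

# Ranges for the number of pollen tubes per plant, as stated in the text.
for accession, (a, b) in {"Columbia": (640, 1226), "Landsberg": (342, 1112)}.items():
    print(accession, "mean sample size:", (a + b) // 2, "factor:", correction_factor(a, b))
```

Multiplying a constant-sample-size estimate by this factor gives the corresponding estimate that accounts for the uniform distribution of the sample sizes. The factor is slightly above one, and larger for the wider Landsberg range.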

[Figure omitted. See PDF.]

The original estimates assuming a constant sample size n are equivalent to the estimates that arise when the sample sizes are random and we simply use the average sample size E[n_j] in place of n. However, it was shown in Section 2.4 that simply replacing the constant sample size with the mean sample size leads to a biased estimator that tends to underestimate the true population mean. Thus, although the two sets of estimates are fairly similar, the new estimates, which take into account the distribution of the sample sizes, are more likely to be closer to the true mean pollen tube lengths.

Acknowledgments

We would like to thank Alex Capaldi and Rob Swanson for providing the motivation for this research.

Citation: Kolba TN, Bruno A (2023) Estimation of population parameters using sample extremes from nonconstant sample sizes. PLoS ONE 18(1): e0280561. https://doi.org/10.1371/journal.pone.0280561

About the Authors:

Tiffany N. Kolba

Roles: Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

Affiliation: Department of Mathematics and Statistics, Valparaiso University, Valparaiso, IN, United States of America

ORCID: https://orcid.org/0000-0003-0828-3968

Alexander Bruno

Roles: Software, Visualization

Affiliation: Department of Mathematics and Statistics, Valparaiso University, Valparaiso, IN, United States of America

References

1. Capaldi A, Kolba TN. Using the sample maximum to estimate the parameters of the underlying distribution. PLoS ONE. 2019 Apr; 14(4):e0215529. pmid:31022209

2. Swanson RJ, Hammond AT, Carlson AL, Gong H, Donovan TK. Pollen performance traits reveal prezygotic nonrandom mating and interference competition in Arabidopsis thaliana. American Journal of Botany. 2016 Feb; 103(3):498–513. pmid:26928008

3. Beckford C, Ferita M, Fucarino J, Elzinga DC, Bassett K, Carlson AL, et al. Pollen interference emerges as a property of agent-based modeling of pollen competition in Arabidopsis thaliana. in silico Plants. 2022 Aug; 4(2):1–12.

4. Wackerly D, Mendenhall W, Scheaffer RL. Mathematical statistics with applications. Cengage Learning; 2014.

5. Mood AM, Graybill FA, Boes DC. Introduction to the theory of statistics. McGraw-Hill; 1974.

6. Cramér H. Mathematical methods of statistics. Princeton University Press; 1946.

7. Kosir A, Mujcic A, Suljanovic N, Tasic JF. Noise variance estimation based on measured maximums of sampled subsets. Mathematics and Computers in Simulation. 2004 Feb; 65:629–639.

© 2023 Kolba, Bruno. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

We examine the accuracy and precision of parameter estimates for both the exponential and normal distributions when using only a collection of sample extremes. That is, we consider a collection of random variables, where each of the random variables is either the minimum or maximum of a sample of n_j independent, identically distributed random variables drawn from a normal or exponential distribution with unknown parameters. Previous work derived estimators for the population parameters assuming the n_j sample sizes are constant. Since sample sizes are often not constant in applications, we derive new unbiased estimators that take into account the varying sample sizes. We also perform simulations to assess how the previously derived estimators perform when the constant sample size is simply replaced with the average sample size. We explore how varying the mean, standard deviation, and probability distribution of the sample sizes affects the estimation error. Overall, our results demonstrate that using the average sample size in place of the constant sample size still results in reliable estimates for the population parameters, especially when the average sample size is large. Our estimation framework is applied to a biological example involving plant pollination.

Details

Title

Estimation of population parameters using sample extremes from nonconstant sample sizes

Author

Kolba, Tiffany N; Bruno, Alexander

First page

e0280561

Section

Research Article

Publication year

2023

Publication date

Jan 2023

Publisher

Public Library of Science

e-ISSN

19326203

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1371/journal.pone.0280561

ProQuest document ID

2767424672
