1 Introduction
We consider a data set where each observation is the minimum or maximum of a sample of independent, identically distributed (iid) random variables drawn from a population with unknown parameters. That is, our data consist of $W_j = \min_{1 \le i \le n_j} X_{ij}$ or $Y_j = \max_{1 \le i \le n_j} X_{ij}$ for j = 1, …, m. Here m represents the number of known sample extremes, while nj represents the sample size of the jth sample over which the minimum or maximum is computed. While the Xij are iid draws from a population with unknown parameters, the Xij values themselves are not directly observable. Rather, we consider the case where it is only possible to measure the minimum or maximum of each sample. We seek to use the minimum and maximum values in order to estimate the unknown population parameters.
In [1], estimators for the exponential and normal distributions were derived assuming the nj sample sizes are constant. Those two population distributions were considered since they arise frequently in applications and since they allow for explicit derivations. However, constant sample sizes rarely occur in applications. For example, [2, 3] examined the fertilization process in the flowering plant Arabidopsis thaliana. Their experiments begin by placing pollen on the stigma of the plant; they then wait a certain time period, kill the plant, and image it in order to view the fertilization progress. In the images, the researchers are able to view the pollen tubes, which grow down from the stigma towards the plant ovules. When a pollen tube reaches an ovule, it can fertilize the ovule, which will then develop into a seed. The average fertilization speed is of biological interest and can be calculated by dividing the average pollen tube length by the amount of time elapsed since the pollen was placed on the stigma. However, due to the high density of pollen tubes, it is only feasible for the researchers to measure the longest pollen tube within each plant, and yet they wish to use these longest measurements to estimate the overall average length. Hence, their data fit our framework of estimating a population parameter with sample extremes. The nj sample sizes in this application represent the number of pollen tubes within the jth plant. It is not practical for researchers to place exactly the same number of pollen grains on each plant, and hence the nj values will naturally vary. While [1] estimated the average pollen tube length by assuming a constant sample size for the number of pollen tubes within each plant, we seek to improve the estimate by deriving a new estimator for the population mean that allows for the sample sizes themselves to be random variables.
In Section 2, which focuses on the case where the population is exponentially distributed, we derive new estimators for the population mean under increasingly more realistic assumptions for the nj sample sizes. We also derive the variances of our new estimators, and consider cases where the nj sample sizes follow either a uniform or Poisson distribution. We compare the accuracy and precision of our new estimators to the original estimator computed in [1] under the constant sample size assumption. When the population is normally distributed, it is not feasible to analytically derive unbiased estimators for the population parameters using sample extremes with nonconstant sample sizes. Hence, in Section 3, we focus solely on analyzing the performance of the original estimators from [1] for the normal distribution when the constant sample size assumption is violated and the average sample size is instead used. In Section 4, we compare our methodology to that of maximum likelihood estimation. Lastly, in Section 5, we apply our results to the plant pollination example.
2 Estimator for exponential distribution
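In this section we assume that $X_{ij} \overset{iid}{\sim} \text{Exponential}(\beta)$, where β is the unknown population mean. We set $Y_j = \max_{1 \le i \le n_j} X_{ij}$ and $W_j = \min_{1 \le i \le n_j} X_{ij}$ for j = 1, …, m.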
2.1 Sample sizes nj are all equal to a known value n
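In [1], it was shown that in the case of constant sample size n, the estimator
$$\hat{\beta}_{\max} = \frac{1}{m H_n} \sum_{j=1}^{m} Y_j \tag{1}$$
is an unbiased estimator for β with
$$\mathrm{Var}(\hat{\beta}_{\max}) = \frac{\beta^2}{m H_n^2} \sum_{i=1}^{n} \frac{1}{i^2}, \tag{2}$$
where $H_n = \sum_{i=1}^{n} \frac{1}{i}$ is the nth harmonic number. While the prior work did not consider the case of a sample of sample minima, it follows directly from properties of the exponential distribution [4] and the methods in [1] that the estimator
$$\hat{\beta}_{\min} = \frac{n}{m} \sum_{j=1}^{m} W_j \tag{3}$$
is also an unbiased estimator for β with
$$\mathrm{Var}(\hat{\beta}_{\min}) = \frac{\beta^2}{m}. \tag{4}$$
While there exists an unbiased estimator using either sample maxima or sample minima, we observe from Eqs (2) and (4) that the variance using sample maxima is strictly smaller whenever n ≥ 2, with $\mathrm{Var}(\hat{\beta}_{\max}) \to 0$ at a rate proportional to $1/(\log n)^2$ as n → ∞, while $\mathrm{Var}(\hat{\beta}_{\min})$ is constant with respect to n. Hence, more precise estimates for the exponential distribution can be obtained from using maximum values compared to using minimum values.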
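The constant-sample-size estimators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seed, and the simulation settings (β = 3, n = 100, m = 2000) are hypothetical choices, and the formulas used are the sample-maximum mean divided by the harmonic number H_n = 1 + 1/2 + … + 1/n and the sample-minimum mean multiplied by n.

```python
import random


def harmonic(n):
    """Return the harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def beta_hat_max_const(maxima, n):
    """Estimate beta from sample maxima, each taken over n exponential draws."""
    return sum(maxima) / (len(maxima) * harmonic(n))


def beta_hat_min_const(minima, n):
    """Estimate beta from sample minima, each taken over n exponential draws."""
    return n * sum(minima) / len(minima)


def var_max_const(beta, n, m):
    """Theoretical variance of the maxima-based estimator."""
    return beta ** 2 * sum(1.0 / i ** 2 for i in range(1, n + 1)) / (m * harmonic(n) ** 2)


def var_min_const(beta, m):
    """Theoretical variance of the minima-based estimator (does not depend on n)."""
    return beta ** 2 / m


def simulate_extremes(beta, n, m, rng):
    """Draw m samples of size n from an exponential with mean beta; return (maxima, minima)."""
    maxima, minima = [], []
    for _ in range(m):
        sample = [rng.expovariate(1.0 / beta) for _ in range(n)]
        maxima.append(max(sample))
        minima.append(min(sample))
    return maxima, minima


rng = random.Random(0)  # hypothetical seed, for reproducibility only
maxima, minima = simulate_extremes(beta=3.0, n=100, m=2000, rng=rng)
est_max = beta_hat_max_const(maxima, 100)
est_min = beta_hat_min_const(minima, 100)
```

Both estimates should land near the true mean, with the maxima-based estimate typically the closer of the two, in line with the variance comparison above.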
2.2 Sample sizes nj are unequal, but are all known values
As mentioned in Section 1, it is unrealistic for the sample sizes to be equal in applications. Hence, we begin to improve the estimation framework by allowing the sample sizes to vary, but with the restriction that the sample sizes are known values. It follows directly from properties of the exponential distribution [4] and the methods in [1] that the estimator
$$\hat{\beta}_{\max}^{K} = \frac{1}{m} \sum_{j=1}^{m} \frac{Y_j}{H_{n_j}} \tag{5}$$
with variance
$$\mathrm{Var}(\hat{\beta}_{\max}^{K}) = \frac{\beta^2}{m^2} \sum_{j=1}^{m} \frac{1}{H_{n_j}^2} \sum_{i=1}^{n_j} \frac{1}{i^2} \tag{6}$$
and the estimator
$$\hat{\beta}_{\min}^{K} = \frac{1}{m} \sum_{j=1}^{m} n_j W_j \tag{7}$$
with variance
$$\mathrm{Var}(\hat{\beta}_{\min}^{K}) = \frac{\beta^2}{m} \tag{8}$$
are unbiased for estimating the population mean β, where $H_{n_j} = \sum_{i=1}^{n_j} \frac{1}{i}$. The estimators introduced in Eqs (5) and (7) reduce to the previous estimators given in Eqs (1) and (3) when all the nj are equal. Hence, these estimators are generalizations that can be used in a broader range of applications. Yet, in some applications, not only do the sample sizes vary, but the sample sizes are also unknown. For example, in the plant pollination application described in Section 1, hundreds of pollen grains are placed on each plant and it is not possible for the researchers to count the exact numbers. Hence, the sample sizes are unknown, but the researchers do have an idea of the probability distribution of the sample sizes. In the following subsection, we further improve the estimation framework by considering the case where the sample sizes are themselves random variables.
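A minimal sketch of estimators for known, unequal sample sizes is given below. It assumes the per-sample scaling form, in which each maximum is divided by its own harmonic number H_nj = 1 + 1/2 + … + 1/nj and each minimum is multiplied by its own nj before averaging; the function names and the numerical inputs are hypothetical and serve only as an illustration.

```python
def harmonic(n):
    """Return the harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def beta_hat_max_known(maxima, sizes):
    """Average of Y_j / H_{n_j}; unbiased because E[Y_j] = beta * H_{n_j}."""
    return sum(y / harmonic(n) for y, n in zip(maxima, sizes)) / len(maxima)


def beta_hat_min_known(minima, sizes):
    """Average of n_j * W_j; unbiased because E[W_j] = beta / n_j."""
    return sum(n * w for w, n in zip(minima, sizes)) / len(minima)


# Hypothetical measurements, not taken from any data set in the paper.
example_maxima = [12.0, 15.5, 14.1]
example_sizes = [80, 120, 100]
example_estimate = beta_hat_max_known(example_maxima, example_sizes)
```

When every entry of `sizes` equals the same n, both functions collapse to the constant-sample-size estimators, which is a convenient sanity check.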
2.3 Sample sizes nj are random with known distribution
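We now suppose that the nj sample sizes are iid random variables. We assume that the nj values are unknown, but come from some known probability distribution, such as the uniform or Poisson distribution. It immediately follows from the proof of Theorem 1 in [1] that
$$E[Y_j \mid n_j] = \beta H_{n_j} \quad \text{and} \quad \mathrm{Var}(Y_j \mid n_j) = \beta^2 \sum_{i=1}^{n_j} \frac{1}{i^2},$$
where $H_{n_j} = \sum_{i=1}^{n_j} \frac{1}{i}$. By properties of conditional expectation and conditional variance [4],
$$E[Y_j] = E\big[E[Y_j \mid n_j]\big] = \beta\, E[H_{n_j}]$$
and
$$\mathrm{Var}(Y_j) = E\big[\mathrm{Var}(Y_j \mid n_j)\big] + \mathrm{Var}\big(E[Y_j \mid n_j]\big) = \beta^2 \left( E\Big[\sum_{i=1}^{n_j} \frac{1}{i^2}\Big] + \mathrm{Var}(H_{n_j}) \right).$$
Thus, the estimator
$$\hat{\beta}_{\max}^{R} = \frac{1}{m\, E[H_{n_j}]} \sum_{j=1}^{m} Y_j \tag{9}$$
with
$$\mathrm{Var}(\hat{\beta}_{\max}^{R}) = \frac{\beta^2 \left( E\Big[\sum_{i=1}^{n_j} \frac{1}{i^2}\Big] + \mathrm{Var}(H_{n_j}) \right)}{m \left( E[H_{n_j}] \right)^2} \tag{10}$$
is unbiased for estimating β. Likewise,
$$E[W_j] = E\big[E[W_j \mid n_j]\big] = \beta\, E\Big[\frac{1}{n_j}\Big]$$
and
$$\mathrm{Var}(W_j) = E\Big[\frac{\beta^2}{n_j^2}\Big] + \mathrm{Var}\Big(\frac{\beta}{n_j}\Big) = \beta^2 \left( 2 E\Big[\frac{1}{n_j^2}\Big] - \Big( E\Big[\frac{1}{n_j}\Big] \Big)^2 \right).$$
Hence, the estimator
$$\hat{\beta}_{\min}^{R} = \frac{1}{m\, E[1/n_j]} \sum_{j=1}^{m} W_j \tag{11}$$
with
$$\mathrm{Var}(\hat{\beta}_{\min}^{R}) = \frac{\beta^2 \left( 2 E[1/n_j^2] - (E[1/n_j])^2 \right)}{m \left( E[1/n_j] \right)^2} \tag{12}$$
is also unbiased for estimating β.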
We observe from comparing Eqs (10) and (12) with Eqs (2) and (4), respectively, that the variance of the estimators is higher when the sample sizes are random variables compared to when the sample sizes are constant. This relationship is unsurprising since it is intuitive that higher variation in the sample sizes will result in higher variation in the estimation of the population mean.
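In order to actually calculate the estimators $\hat{\beta}_{\max}^{R}$ (Eq 9) and $\hat{\beta}_{\min}^{R}$ (Eq 11), the values of $E[H_{n_j}]$ and $E[1/n_j]$, where $H_{n} = \sum_{i=1}^{n} \frac{1}{i}$, must first be computed. These expected values can be computed exactly if the probability distribution of the nj is known. For example, if $n_j \sim \text{Uniform}\{a, a+1, \ldots, b\}$ with known values for a and b, then $E[n_j] = \frac{a+b}{2}$,
$$E[H_{n_j}] = \frac{1}{b-a+1} \sum_{n=a}^{b} H_n,$$
and
$$E\Big[\frac{1}{n_j}\Big] = \frac{1}{b-a+1} \sum_{n=a}^{b} \frac{1}{n}.$$
Or if the nj follow a Poisson distribution (shifted to start at 1), i.e. $n_j - 1 \sim \text{Poisson}(\lambda)$ with known λ, then E[nj] = λ + 1,
$$E[H_{n_j}] = \sum_{k=0}^{\infty} H_{k+1} \frac{e^{-\lambda} \lambda^k}{k!},$$
and
$$E\Big[\frac{1}{n_j}\Big] = \sum_{k=0}^{\infty} \frac{1}{k+1} \cdot \frac{e^{-\lambda} \lambda^k}{k!} = \frac{1 - e^{-\lambda}}{\lambda}.$$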
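The expected values needed for the random-sample-size estimators can be evaluated numerically. The sketch below assumes a discrete uniform distribution on the integers a, …, b and a Poisson distribution shifted to start at 1; the function names are hypothetical, and the Poisson series is truncated at a cutoff chosen here (mean plus ten standard deviations plus 50), which is an assumption rather than anything prescribed by the text.

```python
import math


def harmonic(n):
    """Return the harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def expect_uniform(g, a, b):
    """E[g(N)] for N uniform on the integers a, a+1, ..., b."""
    return sum(g(n) for n in range(a, b + 1)) / (b - a + 1)


def expect_shifted_poisson(g, lam):
    """E[g(N)] for N = 1 + Poisson(lam), lam > 0, using a truncated series."""
    upper = int(lam + 10.0 * math.sqrt(lam) + 50)  # assumed truncation point
    total = 0.0
    for k in range(upper + 1):
        # Poisson pmf computed in log space to avoid overflow in lam**k / k!.
        pmf = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        total += g(k + 1) * pmf
    return total


def beta_hat_max_random(maxima, mean_harmonic):
    """Sample mean of the maxima divided by E[H_{n_j}]."""
    return sum(maxima) / len(maxima) / mean_harmonic


def beta_hat_min_random(minima, mean_inverse):
    """Sample mean of the minima divided by E[1/n_j]."""
    return sum(minima) / len(minima) / mean_inverse


mean_h_uniform = expect_uniform(harmonic, 50, 150)
mean_inv_poisson = expect_shifted_poisson(lambda n: 1.0 / n, 99.0)
```

For the shifted Poisson case, `mean_inv_poisson` can be checked against the closed form (1 − e^(−λ))/λ.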
2.4 Sample sizes nj are random with only known mean
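If the full probability distribution of the nj is not known, but the average sample size, E[nj], is known, it would be reasonable to approximate $E[H_{n_j}]$ and $E[1/n_j]$ with $H_{E[n_j]}$ and $1/E[n_j]$, respectively, where $H_n = \sum_{i=1}^{n} \frac{1}{i}$. Using these approximations yields the following estimators:
$$\hat{\beta}_{\max}^{A} = \frac{1}{m\, H_{E[n_j]}} \sum_{j=1}^{m} Y_j \tag{13}$$
and
$$\hat{\beta}_{\min}^{A} = \frac{E[n_j]}{m} \sum_{j=1}^{m} W_j. \tag{14}$$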
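Since $H_{n_j} = \sum_{i=1}^{n_j} \frac{1}{i}$ is a concave function of nj, $E[H_{n_j}] \le H_{E[n_j]}$ by Jensen’s inequality [5]. Hence, $\hat{\beta}_{\max}^{A}$ (Eq 13) is a biased estimator of β that tends to underestimate the true value of β. Conversely, since $1/n_j$ is a convex function of nj, $E[1/n_j] \ge 1/E[n_j]$, which makes $\hat{\beta}_{\min}^{A}$ (Eq 14) a biased estimator that tends to overestimate the true value of β.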
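Note that $\hat{\beta}_{\max}^{A}$ (Eq 13) and $\hat{\beta}_{\min}^{A}$ (Eq 14) are equivalent to using the original estimators $\hat{\beta}_{\max}$ (Eq 1) and $\hat{\beta}_{\min}$ (Eq 3) and replacing the constant sample size n with E[nj]. In the next subsection, we compare the performance of the various estimators.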
2.5 Comparison of estimators
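The estimators $\hat{\beta}_{\max}^{K}$ (Eq 5), $\hat{\beta}_{\max}^{R}$ (Eq 9), and $\hat{\beta}_{\max}^{A}$ (Eq 13) based upon a sample of sample maxima all reduce to the estimator $\hat{\beta}_{\max}$ (Eq 1) when the sample sizes nj are all equal to a constant value n. Likewise, the estimators $\hat{\beta}_{\min}^{K}$ (Eq 7), $\hat{\beta}_{\min}^{R}$ (Eq 11), and $\hat{\beta}_{\min}^{A}$ (Eq 14) based upon a sample of sample minima all reduce to the estimator $\hat{\beta}_{\min}$ (Eq 3) in the case of known equal sample sizes. As mentioned previously, the estimators $\hat{\beta}_{\max}^{K}$ and $\hat{\beta}_{\min}^{K}$ allow for unequal sample sizes, but still require that the sample sizes are known values. On the other hand, the estimators $\hat{\beta}_{\max}^{R}$, $\hat{\beta}_{\min}^{R}$, $\hat{\beta}_{\max}^{A}$, and $\hat{\beta}_{\min}^{A}$ allow for the sample sizes to be unknown iid random variables, which is the most realistic scenario in applications.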
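Although $\hat{\beta}_{\max}^{R}$ (Eq 9), $\hat{\beta}_{\min}^{R}$ (Eq 11), $\hat{\beta}_{\max}^{A}$ (Eq 13), and $\hat{\beta}_{\min}^{A}$ (Eq 14) all allow for random sample sizes, $\hat{\beta}_{\max}^{R}$ and $\hat{\beta}_{\min}^{R}$ require that the full probability distribution of the sample sizes is known, while $\hat{\beta}_{\max}^{A}$ and $\hat{\beta}_{\min}^{A}$ only require the average sample size to be known. Thus, the estimators $\hat{\beta}_{\max}^{A}$ and $\hat{\beta}_{\min}^{A}$, which are equivalent to the original estimators $\hat{\beta}_{\max}$ (Eq 1) and $\hat{\beta}_{\min}$ (Eq 3) when the constant sample size is simply replaced with the average sample size, have the advantage of applying in the broadest range of circumstances. Yet, they have the disadvantage of being the only biased estimators presented here.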
Fig 1 illustrates the ratio between $\hat{\beta}_{\max}^{R}$ (Eq 9) and $\hat{\beta}_{\max}^{A}$ (Eq 13), which is equal to the ratio between $H_{E[n_j]}$ and $E[H_{n_j}]$, as well as the ratio between $\hat{\beta}_{\min}^{R}$ (Eq 11) and $\hat{\beta}_{\min}^{A}$ (Eq 14), which is equal to the ratio between $1/E[1/n_j]$ and E[nj], in the case where the sample sizes follow a uniform distribution. We observe in Fig 1 that the ratio when using sample maxima ($\hat{\beta}_{\max}^{R} / \hat{\beta}_{\max}^{A}$) is always greater than one since $\hat{\beta}_{\max}^{A}$ tends to underestimate the true population mean. In contrast, the ratio when using sample minima ($\hat{\beta}_{\min}^{R} / \hat{\beta}_{\min}^{A}$) is always less than one since $\hat{\beta}_{\min}^{A}$ tends to overestimate the true population mean. In either case, the ratio between the two estimators is further from one when the width of the uniform distribution is larger since a wider interval results in more variability in the sample sizes. However, the ratio of the two estimators converges to one as the average sample size E[nj] increases. The convergence is faster when using sample maxima compared to when using sample minima. Similar patterns can be observed when the sample sizes follow other probability distributions, such as the Poisson distribution. Overall, Fig 1 illustrates that although $\hat{\beta}_{\max}^{A}$ and $\hat{\beta}_{\min}^{A}$, which simply use the average sample size, are biased, they can still result in good estimates, especially when the average sample size is large.
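The two ratios discussed for Fig 1 can be computed directly for a discrete uniform distribution of sample sizes. The following sketch is an illustration only: it assumes the sample sizes are uniform on the integers a, …, b with a + b even (so that the average is an integer), and the function names and interval endpoints are hypothetical.

```python
def harmonic(n):
    """Return the harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def ratio_max(a, b):
    """H_{E[n]} / E[H_n] for n uniform on a..b (requires a + b even)."""
    mean_n = (a + b) // 2
    running = harmonic(a - 1)  # H_{a-1}; equals 0 when a == 1
    total = 0.0
    for n in range(a, b + 1):
        running += 1.0 / n  # running is now H_n
        total += running
    return harmonic(mean_n) / (total / (b - a + 1))


def ratio_min(a, b):
    """(1 / E[1/n]) / E[n] for n uniform on a..b."""
    mean_n = (a + b) / 2.0
    mean_inv = sum(1.0 / n for n in range(a, b + 1)) / (b - a + 1)
    return 1.0 / (mean_n * mean_inv)


# Same mean (100), two different widths, then a larger mean with the wide interval.
narrow = (ratio_max(95, 105), ratio_min(95, 105))
wide = (ratio_max(50, 150), ratio_min(50, 150))
large_mean = (ratio_max(950, 1050), ratio_min(950, 1050))
```

Comparing `narrow`, `wide`, and `large_mean` reproduces the qualitative pattern described above: wider intervals push both ratios away from one, and larger average sample sizes pull them back.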
[Figure omitted. See PDF.]
3 Estimators for normal distribution
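We now consider the case where $X_{ij} \overset{iid}{\sim} \text{Normal}(\mu, \sigma^2)$ with unknown mean μ and unknown variance σ2. We again set $Y_j = \max_{1 \le i \le n_j} X_{ij}$ and $W_j = \min_{1 \le i \le n_j} X_{ij}$ for j = 1, …, m.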
3.1 Sample sizes nj are all equal to a known value n
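The estimators given in [1] when using a sample of maximum values and assuming a constant sample size n are
$$\hat{\mu}_{\max} = \bar{Y} - k_n \sqrt{\frac{S_Y^2}{c_n}}, \qquad \hat{\sigma}^2_{\max} = \frac{S_Y^2}{c_n}. \tag{15}$$
Here $\bar{Y}$ and $S_Y^2$ denote the sample mean and sample variance, respectively, of the Yj’s, while kn denotes the mean and cn denotes the variance of the maximum of n iid Normal(0, 1) random variables. It was shown that $\hat{\sigma}^2_{\max}$ is an unbiased estimator for σ2, but that $E[\hat{\mu}_{\max}] \ge \mu$ with $E[\hat{\mu}_{\max}] \to \mu$ as m → ∞.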
Although [1] did not consider the case of a sample of minimum values, we can define the analogous estimators
$$\hat{\mu}_{\min} = \bar{W} + k_n \sqrt{\frac{S_W^2}{c_n}}, \qquad \hat{\sigma}^2_{\min} = \frac{S_W^2}{c_n}, \tag{16}$$
where now $\bar{W}$ and $S_W^2$ denote the sample mean and sample variance, respectively, of the Wj’s. The constants kn and cn appear in both sets of estimators since the mean of the minimum of n iid Normal(0, 1) random variables is equal to the negative of the mean of the maximum of n iid Normal(0, 1) random variables, while the variance is the same in both cases. The values for kn and cn can be approximated either analytically or through simulations [6].
Due to the symmetry of the normal distribution, the estimators using minimum values will have equivalent performance to the estimators using maximum values. More specifically, it follows from symmetry that $\hat{\sigma}^2_{\min}$ (Eq 16) is also an unbiased estimator for σ2 and that $E[\hat{\mu}_{\min}] \le \mu$ with $E[\hat{\mu}_{\min}] \to \mu$ as m → ∞. This equivalent performance using either sample maxima or sample minima is in stark contrast to the case when the population is exponentially distributed. For an exponential population, it was shown in Section 2 that it is more advantageous to estimate the population parameters using a sample of maximum values due to the significantly smaller variance of the estimators in the maxima setting compared to the minima setting.
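As a rough illustration of the simulation route, kn and cn can be approximated by Monte Carlo and then plugged into the moment-matching estimators, which use the relations E[Yj] = μ + σkn and Var(Yj) = σ2cn. The sketch below is built on assumptions: the function names, the seed, the number of replications, and the choice n = 10 are all hypothetical.

```python
import random
import statistics


def simulate_kn_cn(n, reps, rng):
    """Monte Carlo mean and variance of the maximum of n iid Normal(0, 1) draws."""
    maxima = [max(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(reps)]
    return statistics.fmean(maxima), statistics.variance(maxima)


def normal_estimates_from_maxima(maxima, k_n, c_n):
    """Return (mu_hat, sigma2_hat) from sample maxima with constant sample size."""
    sigma2_hat = statistics.variance(maxima) / c_n
    mu_hat = statistics.fmean(maxima) - k_n * sigma2_hat ** 0.5
    return mu_hat, sigma2_hat


def normal_estimates_from_minima(minima, k_n, c_n):
    """Return (mu_hat, sigma2_hat) from sample minima with constant sample size."""
    sigma2_hat = statistics.variance(minima) / c_n
    mu_hat = statistics.fmean(minima) + k_n * sigma2_hat ** 0.5
    return mu_hat, sigma2_hat


rng = random.Random(1)  # hypothetical seed, for reproducibility only
k10, c10 = simulate_kn_cn(n=10, reps=20000, rng=rng)

# Simulated maxima from a Normal(0, 1) population, m = 500 samples of size 10.
data = [max(rng.gauss(0.0, 1.0) for _ in range(10)) for _ in range(500)]
mu_hat, sigma2_hat = normal_estimates_from_maxima(data, k10, c10)
```

Negating the maxima and passing them to `normal_estimates_from_minima` returns the negated mean estimate and the same variance estimate, which mirrors the symmetry argument above.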
3.2 Sample sizes nj are random with known mean
Due to the complexity of the normal probability density function, it is not feasible to derive unbiased estimators for μ and σ2 in the case where the sample sizes are unequal known values or are random variables with a known distribution. However, if the nj values are iid random variables with known average sample size, E[nj], it is reasonable to replace the kn and cn values in Eqs (15) and (16) with $k_{E[n_j]}$ and $c_{E[n_j]}$, respectively.
We evaluate the performance of the estimators when the constant sample size n is replaced with the average sample size E[nj] through simulations. Although the true population mean μ and true population variance σ2 are unknown in practice, we set them equal to 0 and 1, respectively, for our simulations so that we can explicitly compare how close the estimates $\hat{\mu}$ and $\hat{\sigma}^2$ are to the true values. True values other than 0 and 1 would shift and rescale the distribution, but would not affect the relative performance of the estimators. Our simulations focus on $\hat{\mu}_{\max}$ and $\hat{\sigma}^2_{\max}$ (Eq 15), which are based on sample maxima, but due to the symmetry of the normal distribution, $\hat{\mu}_{\min}$ and $\hat{\sigma}^2_{\min}$ (Eq 16), which are based on sample minima, would have analogous performance.
Fig 2 compares the performance of both $\hat{\mu}_{\max}$ (top) and $\hat{\sigma}^2_{\max}$ (bottom) when the distribution of the sample sizes nj is constant, uniform with width 10, uniform with width 50, uniform with width 100, or Poisson. In all cases, m = 10 sample maxima are used, where each maximum is computed over a sample with average size E[nj] = 100. We observe from Fig 2 that the accuracy and precision of the estimators are essentially identical regardless of whether the sample sizes are constant or randomly distributed. Increasing the variability of the sample sizes through a larger width of the uniform interval does not have a noticeable effect. While changing the values of m and E[nj] does change the variability of the simulated distributions, it does not alter the relative performance across the various distributions of nj. Thus, we conclude that when estimating the parameters of the normal distribution using either sample maxima or sample minima, the constant sample size can be replaced with the average sample size with no significant change in the accuracy and precision of the estimation.
[Figure omitted. See PDF.]
The maxima are computed over samples with average size E[nj] = 100 for various distributions of the nj.
4 Comparison to maximum likelihood estimation
Suppose Xij are drawn iid from an arbitrary population with probability density function f(x|Θ) and cumulative distribution function F(x|Θ), where Θ is the vector of unknown population parameters. Let $W_j = \min_{1 \le i \le n_j} X_{ij}$ represent the sample minima and $Y_j = \max_{1 \le i \le n_j} X_{ij}$ represent the sample maxima, for j = 1, …, m.
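When the sample sizes nj are known, the likelihood function is equal to
$$L(\Theta) = \prod_{j=1}^{m} n_j\, f(Y_j \mid \Theta)\, F(Y_j \mid \Theta)^{n_j - 1} \tag{17}$$
when using the sample maxima, and it is equal to
$$L(\Theta) = \prod_{j=1}^{m} n_j\, f(W_j \mid \Theta)\, \big(1 - F(W_j \mid \Theta)\big)^{n_j - 1} \tag{18}$$
when using the sample minima.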
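When the sample sizes nj are not known values, but rather are random variables drawn iid from a population with probability mass function p(n), the likelihood function is equal to
$$L(\Theta) = \prod_{j=1}^{m} \sum_{n} p(n)\, n\, f(Y_j \mid \Theta)\, F(Y_j \mid \Theta)^{n - 1} \tag{19}$$
when using the sample maxima, and it is equal to
$$L(\Theta) = \prod_{j=1}^{m} \sum_{n} p(n)\, n\, f(W_j \mid \Theta)\, \big(1 - F(W_j \mid \Theta)\big)^{n - 1} \tag{20}$$
when using the sample minima, where each inner sum runs over the support of p(n).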
Maximum likelihood estimation chooses to estimate Θ with the value that maximizes the likelihood function L(Θ). This value of Θ, called the maximum likelihood estimator or MLE, is often found by differentiating the logarithm of the likelihood function and setting the derivative equal to zero. However, in many cases the equation cannot be solved exactly, and numerical methods must be used in order to approximate the maximum likelihood estimator.
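To make the numerical route concrete, the sketch below maximizes the log-likelihood for sample maxima from an exponential population when the sample sizes are random with a known probability mass function. It is a minimal illustration under assumptions: a golden-section search stands in for whatever optimizer is actually used, the search assumes the log-likelihood has a single peak on the bracket, and the function names, seed, bracket, and simulation settings are hypothetical.

```python
import math
import random


def log_likelihood_max(beta, maxima, pmf):
    """Log-likelihood of beta for exponential sample maxima with random sizes.

    pmf maps each possible sample size n to its probability p(n).
    """
    total = 0.0
    for y in maxima:
        density = math.exp(-y / beta) / beta  # f(y | beta)
        cdf = 1.0 - math.exp(-y / beta)       # F(y | beta)
        total += math.log(sum(p * n * density * cdf ** (n - 1) for n, p in pmf.items()))
    return total


def golden_section_max(func, lo, hi, tol=1e-6):
    """Locate a maximizer of func on [lo, hi], assuming func is unimodal there."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if func(c) > func(d):
            hi = d  # the peak lies in [lo, d]
        else:
            lo = c  # the peak lies in [c, hi]
    return (lo + hi) / 2.0


rng = random.Random(2)  # hypothetical seed, for reproducibility only
sizes = list(range(90, 111))  # assumed uniform sample sizes on 90..110
pmf = {n: 1.0 / len(sizes) for n in sizes}

# Simulate m = 200 maxima from an exponential population with mean 3.
maxima = []
for _ in range(200):
    n = rng.choice(sizes)
    maxima.append(max(rng.expovariate(1.0 / 3.0) for _ in range(n)))

beta_mle = golden_section_max(lambda b: log_likelihood_max(b, maxima, pmf), 0.5, 20.0)
```

The bracket [0.5, 20] is wide relative to the simulated mean; in practice the bracket would need to be chosen from the scale of the observed maxima.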
In [7], data from high-voltage power lines was collected to estimate the variance of the corona noise (often heard as a crackling or hissing sound in power lines). Due to limitations of the instruments, only the maximum value of the corona noise was recorded in each three second time lapse, but each maximum value was computed over a sample of size exactly 400. An explicit formula for the maximum likelihood estimator was derived in [7] for the framework of data consisting of sample maxima with equal sample sizes. However, their formula is dependent upon their specific assumption for the population distribution of the corona noise in power lines.
When the population distribution of the Xij is exponential, as considered in Section 2, then the maximum likelihood estimator can be computed explicitly when using sample minima with known sample sizes, and it is equivalent to our estimator given in Eq (3) when the sample sizes are all equal and is equivalent to our estimator given in Eq (7) when the sample sizes are unequal. When the population is exponentially distributed with known sample sizes, but the data consists of sample maxima, the maximum likelihood estimator cannot be solved for explicitly due to the complexity of the likelihood function. Likewise, when the population is normally distributed or when the sample sizes are unknown with some probability distribution, explicit formulas cannot be found for the maximum likelihood estimators. However, numerical methods can be used to approximate the maximum likelihood estimators in these cases.
Table 1 compares the estimates for the mean β of the exponential distribution when using our formula described in Eq (9) versus when approximating the maximum likelihood estimator using the built-in mle function in Matlab for a custom likelihood function of the form given in Eq (19). Likewise, Table 2 compares the estimates for the mean μ and variance σ2 of the normal distribution when using our formulas described in Eq (15) versus when using maximum likelihood estimation. In all cases, 100 simulations were run, each with m = 10 sample maxima and with the raw data drawn from either an exponential distribution with true β = 3 or a normal distribution with true μ = 0 and σ2 = 1. The nj values were simulated from five possible probability distributions, all with E[nj] set equal to 100. The tables display the mean and standard deviation of the parameter estimates across the 100 simulations for each case, as well as the mean square error (MSE). The mean square error is calculated as the bias (mean of the estimates minus the true parameter value) squared plus the variance of the estimates. Changing the values of β, μ, σ2, m, and E[nj] shifts and rescales the results, but does not change the relative performance of the estimates from our formulas compared to maximum likelihood estimation.
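The summary statistics described above can be computed as in the following sketch. The function name and the example inputs are hypothetical, and the use of the sample variance with divisor one less than the number of estimates is an assumption, since the text does not state which divisor was used.

```python
import statistics


def summarize_estimates(estimates, true_value):
    """Return mean, standard deviation, and MSE (bias squared plus variance)."""
    mean_est = statistics.fmean(estimates)
    var_est = statistics.variance(estimates)  # assumed divisor: len(estimates) - 1
    bias = mean_est - true_value
    return {
        "mean": mean_est,
        "sd": var_est ** 0.5,
        "mse": bias ** 2 + var_est,
    }


# Hypothetical estimates of a parameter whose true value is 3.
summary = summarize_estimates([2.0, 4.0], true_value=3.0)
```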
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
The results displayed in Tables 1 and 2 indicate that the parameter estimates from our formulas tend to be centered closer to the true values on average, whereas maximum likelihood estimation consistently overestimates β and μ, while consistently underestimating σ2. When the population is exponentially distributed, our formulas produce similar variability in the parameter estimates and a substantially smaller mean square error compared to maximum likelihood estimation. When the population is normally distributed, maximum likelihood estimation produces smaller variability in the parameter estimates and a slightly smaller mean square error compared to our formulas. However, the fact that our formulas almost always result in less bias indicates that our approach is a valuable alternative to maximum likelihood estimation in both cases. Moreover, the key advantage of our methodology is that we have presented explicit formulas for estimators that can be computed quickly by simply plugging in sample data.
5 Biological application
We return now to the example described in Section 1 regarding estimating the mean pollen tube length based upon measurements of the longest pollen tube length in a sample of plants. In the laboratory experiments performed by Swanson et al., either m = 8 or m = 9 individual plants were used for each time point (3, 6, 9, or 24 hours) for each accession (Columbia or Landsberg) of Arabidopsis thaliana. In [1], the mean pollen tube length was estimated assuming a constant n = 933 number of pollen tubes within each plant of the Columbia accession and a constant n = 727 number of pollen tubes within each plant of the Landsberg accession. The estimation was performed assuming that the individual pollen tube lengths are exponentially distributed, which was shown to be a reasonable fit for the data.
The number of pollen tubes is not in fact constant across all the plants, with the 933 and 727 values actually representing the average number of pollen tubes within each plant for the Columbia and Landsberg accessions, respectively. Data from [2] indicates that it is reasonable to assume that the number of pollen tubes within each plant follows a uniform distribution, with range (640, 1226) for the Columbia accession and range (342, 1112) for the Landsberg accession. Hence, we apply our estimation framework derived in Section 2.3 to estimate the overall mean pollen tube length taking into account the probability distribution of the sample sizes. The resulting estimates, along with the original estimates assuming a constant sample size, are listed in Table 3.
[Figure omitted. See PDF.]
The original estimates assuming a constant sample size n are equivalent to the estimates that arise when the sample sizes are random and we simply use the average sample size E[nj] in place of n. However, it was shown in Section 2.4 that simply replacing the constant sample size with the mean sample size leads to a biased estimator that tends to underestimate the true population mean. Thus, although the two sets of estimates are fairly similar, the new estimates, which take into account the distribution of the sample sizes, are more likely to be closer to the true mean pollen tube lengths.
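The size of the adjustment can be gauged without the raw measurements, because for sample maxima the new estimate equals the original estimate multiplied by H_E[nj] / E[H_nj], where H_n = 1 + 1/2 + … + 1/n. The sketch below computes this factor for the two stated ranges, assuming the number of pollen tubes is uniform on the integers in each range with both endpoints included; the function names are hypothetical.

```python
def harmonic(n):
    """Return the harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def correction_factor(a, b):
    """H_{E[n]} / E[H_n] for n uniform on the integers a..b (requires a + b even)."""
    mean_n = (a + b) // 2
    running = harmonic(a - 1)  # H_{a-1}; equals 0 when a == 1
    total = 0.0
    for n in range(a, b + 1):
        running += 1.0 / n  # running is now H_n
        total += running
    return harmonic(mean_n) / (total / (b - a + 1))


# Ranges taken from the text; the inclusive integer endpoints are an assumption.
columbia_factor = correction_factor(640, 1226)   # mean sample size 933
landsberg_factor = correction_factor(342, 1112)  # mean sample size 727
```

Both factors exceed one by only a small margin, which is consistent with the two sets of estimates being fairly similar, and the wider Landsberg range gives the larger factor.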
Acknowledgments
We would like to thank Alex Capaldi and Rob Swanson for providing the motivation for this research.
Citation: Kolba TN, Bruno A (2023) Estimation of population parameters using sample extremes from nonconstant sample sizes. PLoS ONE 18(1): e0280561. https://doi.org/10.1371/journal.pone.0280561
About the Authors:
Tiffany N. Kolba
Roles: Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing
E-mail: [email protected]
Affiliation: Department of Mathematics and Statistics, Valparaiso University, Valparaiso, IN, United States of America
ORCID: https://orcid.org/0000-0003-0828-3968
Alexander Bruno
Roles: Software, Visualization
Affiliation: Department of Mathematics and Statistics, Valparaiso University, Valparaiso, IN, United States of America
1. Capaldi A, Kolba TN. Using the sample maximum to estimate the parameters of the underlying distribution. PLoS ONE. 2019 Apr; 14(4):e0215529. pmid:31022209
2. Swanson RJ, Hammond AT, Carlson AL, Gong H, Donovan TK. Pollen performance traits reveal prezygotic nonrandom mating and interference competition in Arabidopsis thaliana. American Journal of Botany. 2016 Feb; 103(3):498–513. pmid:26928008
3. Beckford C, Ferita M, Fucarino J, Elzinga DC, Bassett K, Carlson AL, et al. Pollen interference emerges as a property of agent-based modeling of pollen competition in Arabidopsis thaliana. in silico Plants. 2022 Aug; 4(2):1–12.
4. Wackerly D, Mendenhall W, Scheaffer RL. Mathematical statistics with applications. Cengage Learning; 2014.
5. Mood AM, Graybill FA, Boes DC. Introduction to the theory of statistics. McGraw-Hill; 1974.
6. Cramér H. Mathematical methods of statistics. Princeton University Press; 1946.
7. Kosir A, Mujcic A, Suljanovic N, Tasic JF. Noise variance estimation based on measured maximums of sampled subsets. Mathematics and Computers in Simulation. 2004 Feb; 65:629–639.
© 2023 Kolba, Bruno. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
We examine the accuracy and precision of parameter estimates for both the exponential and normal distributions when using only a collection of sample extremes. That is, we consider a collection of random variables, where each of the random variables is either the minimum or maximum of a sample of nj independent, identically distributed random variables drawn from a normal or exponential distribution with unknown parameters. Previous work derived estimators for the population parameters assuming the nj sample sizes are constant. Since sample sizes are often not constant in applications, we derive new unbiased estimators that take into account the varying sample sizes. We also perform simulations to assess how the previously derived estimators perform when the constant sample size is simply replaced with the average sample size. We explore how varying the mean, standard deviation, and probability distribution of the sample sizes affects the estimation error. Overall, our results demonstrate that using the average sample size in place of the constant sample size still results in reliable estimates for the population parameters, especially when the average sample size is large. Our estimation framework is applied to a biological example involving plant pollination.