Improving the estimation of the Boyce index using statistical smoothing methods for evaluating species distribution models with presence‐only data

Abstract

Species distribution models (SDMs) underpin a wide range of decisions concerning biodiversity. Although SDMs can be built using presence‐only data, rigorous evaluation of these models remains challenging. One evaluation method is the Boyce index (BI), which uses the relative frequencies between presence sites and background sites within a series of bins or moving windows spanning the entire range of predicted values from the SDM. Obtaining accurate estimates of the BI using these methods relies upon having a large number of presences, which is often not feasible, particularly for rare or restricted species that are often the focus of modelling. Wider application of the BI requires a method that can accurately and reliably estimate the BI using small numbers of presence records. In this study, we investigated the effectiveness of five statistical smoothing methods (i.e. thin plate regression splines, cubic regression splines, B‐splines, P‐splines and adaptive smoothers) and the mean of these five methods (denoted as ‘mean') to estimate the BI. We simulated 600 species with varying prevalence and built distribution models using random forest and Maxent methods. For training data, we used two levels for the number of presences (NP_train: 20 and 500), along with 2 × NP_train and 10000 random points (i.e. random background sites) for each modelling method. We used the number of presences at four levels (NP_bi: 1000, 200, 50 and 10) to investigate its effect, together with 5000 random points to calculate the BI. Our results indicate that the BI estimates from the binning and moving window methods are severely affected by the decrease of NP_bi, but all the estimates of the BI from smoothing‐based methods were almost always unbiased for realistic situations. Hence, we recommend these methods for estimating the BI for evaluating SDMs when verified absence data are unavailable.

Full text

Turn on search term navigation

Introduction

As primary biodiversity data have become increasingly accessible from various databases, species distribution models (SDMs) have been more widely developed and utilized in conservation management, as well as for ecological and evolutionary studies (Lissovsky et al. 2021). Necessarily, the successful development of distribution models relies on the quality of the data (Fei and Yu 2016, Aubry et al. 2017). While SDMs benefit from having both presence and absence data, the latter are rarely available. Consequently, models are often developed using presence data in conjunction with background data rather than true absence data.

While various methods have been employed to build models using presence-only data (Elith et al. 2006, Barbet-Massin et al. 2012, Qiao et al. 2015, Liu et al. 2019), effective techniques for evaluating such models with presence-only data remain limited (Hirzel et al. 2006, Phillips et al. 2006, Peterson et al. 2008, Liu et al. 2013b, Jiménez and Soberón 2020). Boyce et al. (2002) introduced a method to evaluate models with presence-only data using ratios of the relative frequencies of presences to background sites (i.e. predicted/expected ratios or P/E ratios) across a series of bins (i.e. intervals of predicted values from SDMs). The Spearman correlation coefficient is then calculated between these ratios and the midpoints of the bins. Hirzel et al. (2006) subsequently named this metric the Boyce index (BI). Their rationale is that a good model will have a higher P/E ratio associated with higher levels of habitat suitability. Recognizing that this relationship may not be linear, they used the Spearman correlation coefficient between the P/E ratios and the mid-values of the bins as the index. The BI ranges from a maximum value of 1 for an ideal model with the highest discrimination to a minimum value of −1 for a model completely opposite to the ideal model, with 0 indicating a random model. For clarity, the BI estimate derived from this original binning method will be referred to as the original BI (OBI).

Furthermore, Hirzel et al. (2006) identified that the OBI is influenced by the number (i.e. effectively the width) and the starting location of bins. They subsequently developed a modified method that utilizes moving windows or overlapping bins, resulting in the continuous Boyce index (i.e. CBI). However, their findings revealed that the CBI is still influenced by the width of the moving window, as evidenced by the decrease in the Pearson correlation coefficients between the CBI(0.1) (using window width of 0.1) and the CBI(w) (with window width of w = 0.2, 0.25 and 0.5). This presents a limitation of the CBI.

Both the OBI and the CBI rely on the frequencies of both presences and background sites within a series of bins or windows. Accurately estimating the ratios of the two series of frequencies requires a large number of data points (Phillips and Elith 2010). While large numbers of background points can be easily generated, presence records are generally limited, especially for rare species that are often the focus of modelling studies. Using a small sample of species records to estimate the ratios could lead to significant sampling error, thereby decreasing the estimated value of the BI. Hence, it would be advantageous to identify a method that can accurately and reliably estimate the BI using a relatively small number of presences.

The P/E ratios essentially proportionally estimate the probabilities of species presence within a series of bins (or windows). Therefore, the BI measures the extent to which the probability of species presence increases as predicted suitability increases. Since the BI uses the Spearman correlation coefficient, it only assesses the monotonic relationship between the probability of species presence and predicted suitability. Thus, any (increasingly) monotonic transformation of them will yield the same conclusion.

While we cannot directly estimate the probability of species presence using presence-only data (Pearce and Boyce 2006), we can construct a model (e.g. using smoothing methods) to estimate a quantity that is increasingly monotonically related to the probability of species presence. By calculating the Spearman correlation coefficient between this quantity and the corresponding predicted suitability from the original SDM, we can effectively estimate the BI. This approach circumvents the use of bins and moving windows and mitigates associated issues.

In reality, obtaining a large number of reliable and representative observations for individual species is often problematic, which poses challenges for many methodological investigations. As an alternative, virtual species have become an increasingly important tool for this purpose (Miller 2014). Given that the true distributions of virtual species are known, we can conduct experiments with various designs to observe the effectiveness of different methods.

In this study, we utilized virtual species to investigate how presence sample size affected the OBI and CBI, and to evaluate the accuracy of various smoothing techniques in estimating the BI with two sets of simulations. In simulation 1, we generated species with prevalences ranging from 0.05 to 0.9. In simulation 2, we focused on three specific prevalence levels: (0.05, 0.1), (0.2, 0.3) and (0.7, 0.8). The results from these simulations demonstrated the effectiveness of our proposed methods in improving the BI estimation and addressing challenges associated with varying presence sample sizes.

Methods

Boyce index (BI)

The logic of the BI is that if an SDM is reasonable, then the relative frequency of presence points should increase with the suitability values predicted by the model (Boyce et al. 2002). Assume there are n₁ presences, n₀ random points (i.e. random background sites, or sites randomly selected within the study area) and the predicted suitability values range from 0 to 1; then, divide the interval [0, 1] into m bins of width w, and obtain two series of frequencies n_1i and n_0i, which are the number of presences and number of random points in the i-th bin, respectively; and further obtain a ratio series (i.e. P/E ratio) r_i = (n_1i/n₁)/(n_0i/n₀) = (n_1i/n_0i)(n₀/n₁) (i = 1, 2, …, m); finally, calculate the Spearman rank correlation coefficient between the two series i and r_i (i = 1, 2, …, m) and obtain the BI. As n₀/n₁ remains constant, it can be factored out of the calculation of the ratio, so utilizing r_i = n_1i/n_0i (i = 1, 2, …, m) yields an equivalent value for the BI.

Hirzel et al. (2006) modified the algorithm by introducing a moving window approach (with successive windows partially overlapping), utilizing a window of width w. Let the first window cover [0, w], and count the numbers of evaluation presences and background points falling into the window (n₁₁ and n₀₁); then move the window forward with a step size h (h < w), which covers [h, w + h], and count the two numbers n₁₂ and n₀₂; and so on; until the last window which covers [1 – w, 1], count the two numbers n_1m and n_0m, where m is the total number of windows used; then, calculate the ratio r_i (i = 1, 2, …, m); finally, calculate the BI as above, resulting in their continuous version termed the CBI (continuous Boyce index).

Actually, r_i = n_1i/n_0i is an estimate of cP(y = 1|V_i) (i = 1, 2, …, m), where V_i is the mid value of the ith bin or window, c = n₁/(n₀p) is a constant, p is species prevalence, P(y = 1|V_i) is the probability of species presence given the value V_i, and y denotes species occurrence (either presence or absence).

In addition, using the n₁ presences and the n₀ random points, we can construct another model where the predicted value V from the original model serves as the independent variable. Essentially, this is a presence-only model with the original model prediction V as the single independent variable. It predicts the probability that a site belongs to the presence stratum given the model predicted value V, denoted as P(s = 1|V), employing the following formulation:1 $\ln (\frac{P (s = 1 | V)}{1 - P (s = 1 | V)}) = f (V) .$

Here s denotes stratum (either presence stratum 1 or background stratum 0). We used smoothing functions to represent f(V).

Phillips et al. (2009) has shown that:2 $P (s = 1 |V) = \frac{1}{1 + q / P (y = 1 | V)},$

where q = pn₀/n₁. It means that P(s = 1|V) is a monotonically increasing function of P(y = 1|V). Therefore, the rank correlation between P(s = 1|V) and V is the same as that between P(y = 1|V) and V, and is also the same as that between {r_i|i = 1, 2, …, m} and {V_i|i = 1, 2, …, m} (and equivalently {i|i = 1, 2, …, m}). To estimate the BI, we just need to estimate P(s = 1|V), which was carried out using generalized additive modelling.

Smoothing methods

Dataset smoothing is used to create a smooth function that approximates the data, while removing noise. Smoothing splines are a common method used for dataset smoothing (Luo and Wahba 1997). They are piecewise polynomials going through given data points and satisfying certain continuity conditions (Taavitsainen 2009). Specifically, ‘knots' are placed within the data range to identify the points where adjacent functional pieces join each other, and smooth functional pieces (usually low-order polynomials) are chosen to fit the data between two consecutive knots (Perperoglou et al. 2019). The type of polynomial and the number and placement of knots define the type of spline (Perperoglou et al. 2019).

In this study we used penalized regression splines, which offer a balance between flexibility and smoothness (Wood 2017). They allow for the estimation of complex relationships in the data while preventing excessive fluctuations that might arise from overfitting. The methods we used include thin plate regression splines (tp), cubic regression splines (cr), B-splines (bs), P-splines (ps) and adaptive smoothers (ad). They were implemented using the gam function with binomial family in the R ‘mgcv' package (Wood 2003, 2017). We also calculated the mean of the predictions from these five spline models, referred to as ‘mean'.

The regression splines were fitted by using the predicted suitability from the original SDM as the independent variable and species occurrence (either presence with value 1 or background with value 0) as the dependent variable. The data for this fitting included the predicted suitabilities from the original SDM for NP_bi presences and NRP_bi random points, where NP_bi and NRP_bi denote the number of presences and random points used to calculate the BI, respectively. We then selected NRP_bi equally spaced points (ESPs) across the range of the predicted values from the original SDM and applied the fitted smoothers to these ESPs to obtain the predictions from these smoothers. It is important to note that these ESPs are not necessarily corresponding to real sites within the study area. For instance, if the predicted suitabilities range from 0 to 1, we can take 0, 0.002, 0.004, 0.006, …, 0.998, and 1 as the ESPs if NRP_bi = 501. If the predicted suitabilities are not in the range [0, 1], we can rescale them using the transformation (x – min(x))/(max(x) −min(x)), where x is the predicted suitability. This linear transformation ensures that the discrimination ability of the model remains unchanged. This rescaling method also applies when using the binning and moving window method to calculate the BI.

We repeated the above procedure for all the five smoothing methods, and subsequently averaged the predictions from these methods to obtain mean predictions. Finally, for each method (including the ‘mean'), we calculated the Spearman correlation coefficient between the predictions from each smoothing method (or the mean predictions) and the corresponding values of the NRP_bi ESPs to obtain an estimate of the BI. Our custom R function, sfbi() (included in the Supporting information), was used to estimate the BI with this approach. We denote the BI estimate using smoothing methods as SBI, and SBI_tp, SBI_cr, SBI_bs, SBI_ps, SBI_ad and SBI_m are used to denote BI estimates using the individual smoothers and the ‘mean'.

Creation of virtual species

We utilized the first six principal components (PCs) of 18 environmental variables, encompassing bioclimatic, physiographic and radiometric factors, across a 250 km × 250 km region in western central Victoria, Australia (Supporting information). These variables were at a resolution of 1 × 1 km. Together, these six PCs explained more than 85% of the total variation in the 18 original variables. The original variables have been used to construct distributional models for 523 vertebrate species within Victoria (Liu et al. 2013b). The six PCs have been used to study the methodological issues related to threshold selection and sample size in SDM (Liu et al. 2013a, 2016, 2019).

There are many methods to create virtual species (Miller 2014, Qiao et al. 2016, Meynard et al. 2019). One method involves combining the initial suitability functions for each environmental variable through addition or multiplication operations. Subsequently, the combined suitability function undergoes transformation via a logistic function to derive the theoretical probability of species occurrence. Finally, the theoretical probability is converted to presence/absence by comparing it to a random number drawn from a uniform distribution over the interval (0, 1) (Meynard and Kaplan 2012). In this study, we employed this methodology.

We defined the combined initial suitability function as:3 $f_{k} = c + \sum_{j = 1}^{6} a_{j} {(x_{kj} - b_{j})}^{2} .$

Here, k (where k = 1, 2, …, 62 500) represents the site ID; {x_kj | j = 1, 2, …, 6} denotes the values of the six environmental variables at site k. Parameters c, a_j and b_j (j = 1, 2, …, 6) were randomly selected from (0, 100), (−2, 0) and (min (x_j), max (x_j)), respectively. A negative value of parameter a_j caused the related parabola to open downward. The absolute value of a_j served as a weight for the variable x_j. Instead of fixing b_j at the center of the values of variable x_j, it could be positioned anywhere within the range of the variable. This approach caused the marginal response (i.e. species responded only to this variable when other variables were fixed) to this variable to become skewed, with the length of the curve on either side of the vertex differing. Parameter c acted like a scaling factor.

The theoretical probability of species presence was derived through the following transformation:4 ${prob}_{k} = 1 / (1 + e^{- f_{k}}) .$

Subsequently, we selected random numbers {q_k|k = 1, 2, …, 62 500}. If q_k < prob_k, site k was labelled as presence; otherwise, it was labelled as absence. This resulted in the realized distribution. We provided an example of the created virtual species (Supporting information).

Design of simulations

We used two modelling techniques: random forest (RF), implemented using the R package ‘randomForest' (Liaw and Wiener 2002), and Maxent (ver. 3.3.3k), implemented with the R package ‘dismo' (Hijmans et al. 2021), in conjunction with Java code provided by Phillips et al. (2006). These are two commonly used modelling techniques known for their high performance (Elith et al. 2006, Valavi et al. 2021).

To explore the impact of the number of presences in the evaluation data on estimating the BI, we conducted two sets of simulations. In the first set (simulation 1), we generated species with prevalences ranging from 0.05 to 0.9. Using the virtual species creation method described above, if the prevalence (p) fell outside the specified interval, it was abandoned, and a new species was created until the prevalence requirement was fulfilled. For each simulated species meeting this criterion, we allocated 2000 points for test data. Among these, 2000p and 2000(1 − p) were randomly sampled from all presences and all absences, respectively, where p represents species prevalence. We designated two levels for the number of presences in our training data (NP_train: 500 and 20). The corresponding number of random points (i.e. random background sites) for our training data (NRP_train) was set to 2 × NP_train for RF and fixed at 10000 for Maxent. For each level of NP_train, we selected training presences (after test presences were chosen) and the required number of random points (which were selected completely randomly), constructed RF and Maxent models, applied the models to the test sites, and then computed accuracy metrics, including the area under the receiver's operating characteristic curve (AUC), sensitivity, specificity and true skill statistic (TSS). To calculate the latter three metrics, which are utilized with binary predictions, we determined the optimal threshold by maximizing the sum of sensitivity and specificity (Liu et al. 2005) from the test data and applied this threshold to transform the model predictions from continuous to binary form. Definitions for these accuracy measures can be found in Liu et al. (2011).

We compiled another set of evaluation data to calculate the BI. We defined four levels for the number of presences in estimating the BI (NP_bi: 1000, 200, 50, 10), each paired with 5000 random points (NRP_bi). For each NP_bi level, we selected the requisite number of presences (after selecting test and training presences) and NRP_bi random points, and applied the SDMs to them. Using these presence and random points data, we computed both the OBI and the CBI, and fitted models using the five smoothing methods mentioned earlier. Then, we selected 5000 ESPs across the range of the predicted values from the original SDM (actually it was [0, 1] since they had already been rescaled), and obtained estimates of the BI (corresponding to the five smoothing methods and the ‘mean'), i.e. SBI_tp, SBI_cr, SBI_bs, SBI_ps, SBI_ad and SBI_m. This entire procedure was repeated 300 times.

To discern variations among species with differing prevalences, we conducted an additional set of simulations (simulation 2) comprising three groups of virtual species with prevalences falling within the intervals (0.05, 0.1), (0.2, 0.3) and (0.7, 0.8), respectively. For each prevalence level, the procedure mirrored that of simulation 1, with the exception of repeating the process 100 times. We provided the R code to perform these simulations (Supporting information).

In our analysis, we adhered to the recommendations of Boyce et al. (2002) and Hirzel et al. (2006) by employing a bin and window width of 0.1 for calculating the OBI and the CBI. But Hirzel et al. (2006) did not specify the step size, and only stated that ‘the moving window is shifted from a small amount upwards'. We adopted a step size (h) of 0.0001 in CBI calculations. Additionally, we utilized a basis dimension (k) of 10 for the smoothers in calculating the SBI. To ensure the robustness and appropriateness of these parameter choices, we conducted supplementary simulations (Supporting information).

In order to calculate the OBI, CBI and SBI, we must provide predicted values from a model for both some random points and some presence points. While presence points are constrained within the species' range, the area from which the random points are selected is relatively flexible. How did the calculated values of the BI (and other commonly used accuracy measures) respond to the selected area? We carried out additional simulations to explore this by selecting some random points from an extended unsuitable area (Supporting information).

Results

Examples using two simple hypothetical species

We illustrate our method using two simple hypothetical species. Both species are influenced by a single environmental variable x∈ [−10, 5]. We assume a total of 10 000 sites that are equally spaced within the range [−10, 5]. Their relationships with the environment, at the logit scale, are linear for species 1 (SP1) and quadratic for species 2 (SP2), respectively:5 $f_{k} = x_{k}$

and6 $f_{k} = 3 - 0.3 {(x_{k} + 2)}^{2},$

where k = 1, 2, …, 10 000..

They are equivalently represented as:7 ${prob}_{k} = 1 / (1 + exp (- x_{k}))$

and8 ${prob}_{k} = 1 / (1 + exp (- 3 + 0.3 {(x_{k} + 2)}^{2})),$

where k = 1, 2, …, 10 000.

These functions are depicted in Fig. 1(a, e) by solid black lines. We establish the realized distribution for each species using the same method as described above. The presence sites (3327 for SP1 and 3974 for SP2) are indicated by rug marks at the top of the plots, while the absence sites (6673 for SP1 and 6026 for SP2) are represented at the bottom. Their prevalences are 0.3327 and 0.3974, respectively.

As the logistic function is a monotonically increasing function, the monotonicity of the two functions is retained after logistic transformation, although their exact shapes have altered. Specifically, the parabola opening downward has transformed into a bell shape (Fig. 1(e)), and the straight line has changed into an S-shape (similar to the left half of the bell shape) (Fig. 1(a)).

For each species, we randomly selected NP_train = 1000 presence sites and NRP_train = 5000 random sites to construct a Maxent model. The model was then applied to all sites to obtain their predicted suitabilities (depicted by green dashed lines in Fig. 1(a, e)) along with their corresponding theoretical probabilities displayed by black solid lines. The Spearman rank correlation coefficients between theoretical probabilities and predicted suitabilities are 0.9986 for SP1 and 0.9959 for SP2. This indicates that theoretical probabilities and predicted suitabilities are strongly monotonically related. Their relations are shown in Fig. 1(b, f). Subsequently, we selected 2000 sites (excluding the training presence sites) to test the models and obtained AUC values of 0.9662 for SP1 and 0.9544 for SP2.

Additionally, we selected 1000 presence sites (excluding the training and test presence sites) and 5000 random sites to calculate the OBI and the CBI, as well as fit a smoother (thin plate regression spline). The P/E ratios corresponding to the binning, moving window and smoothing methods are illustrated in Fig. 1(c, g). Further, the OBI values obtained were 0.9879 for both SP1 and SP2, with CBI values of 0.9733 for SP1 and 0.9812 for SP2. We applied the fitted smoother to 5000 ESPs across the range of the predicted values from the original SDM, and obtained the BI by calculating the Spearman correlation coefficient between the predictions of the smoothers on these 5000 ESPs and the corresponding values of the ESPs: SBI_tp = 1 for SP1 and 0.9951 for SP2.

To assess the effect of the number of presences on the BI estimation, we randomly selected 10 sites from the existing 1000 presence sites. The OBI and CBI were calculated using these selected presence sites and the same random sites as before. The P/E ratios corresponding to the binning, moving window and smoothing methods are depicted in Fig. 1(d, h). The obtained OBI values were 0.6282 for SP1 and 0.6833 for SP2, with CBI values of 0.5355 for SP1 and 0.66 for SP2. Additionally, the BI values calculated from the smoothing method (i.e. SBI_tp) were 1 for SP1 and 0.9884 for SP2.

A clear decrease in the OBI and CBI was observed with a decrease in the number of presences, whereas the SBI values remained relatively unchanged. Given the highly correlated predicted suitabilities and theoretical probabilities, any fluctuations observed in the P/E ratios represent errors that impair BI estimation. Consequently, the SBI outperforms the OBI and CBI. As shown in Fig. 1(h), the smoothing method principally captures the trend. However, the moving window method fluctuates, while the binning method has only two points with non-zero values, one of which is in the wrong order. These factors render the binning and moving window methods less effective compared to the smoothing method.

We also provided a more realistic example using a virtual species with a prevalence of 0.262 (Supporting information).

The effect of the number of presences on the estimation of the BI

In simulation 1 (for species with prevalence ranging from 0.05 to 0.9), most models exhibited high discrimination ability (Supporting information). Out of 1200 models, 1074 showed an AUC greater than 0.8, and 979 models had a TSS exceeding 0.6. Only 10 models had an AUC below 0.5, and only one model had a TSS below 0. Despite this, we retained all the models for subsequent analyses to include a variety of ‘typical' SDMs.

SBI_tp, SBI_cr, SBI_bs, SBI_ps, SBI_ad and SBI_m were generally unaffected by low numbers of presences (NP_bi) (Fig. 2), except the increased variation with decreased NP_bi. There was no obvious difference between smoothing methods. The variation was bigger for random forest models than for Maxent models. However, the OBI and CBI were influenced by low NP_bi, with their estimated BI decreasing as NP_bi decreased.

[IMAGE OMITTED. SEE PDF]

In simulation 2 (where three levels of prevalence were set: 0.05–0.1, 0.2–0.3 and 0.7–0.8), most models demonstrated high discrimination ability (Supporting information). For instance, out of 1200 models, 1103 had an AUC above 0.8, sensitivity exceeded 0.8 for 1035 models, specificity was above 0.8 for 1015 models and TSS surpassed 0.6 for 1013 models. However, five models had an AUC below 0.5, indicating performance worse than random. Despite this, all models had TSS above 0, suggesting that the binary models, transformed from continuous predictions using a selected threshold, still retained some discrimination ability. We included all the models in subsequent analyses.

For all three levels of prevalence, both the OBI and CBI were affected by low NP_bi, resulting in decreased estimated BI values as NP_bi decreased (Fig. 3, 4, 5). However, for the first two levels of prevalence (0.05–0.1 and 0.2–0.3), the SBI was almost not impacted by low NP_bi, except for a slight increase in variation with decreased NP_bi. Conversely, for species with prevalence set at 0.7–0.8, the SBI for Maxent models was largely unaffected by low NP_bi, except for an increased variation with decreased NP_bi when NP_train = 20. Low NP_bi did affect the SBI when evaluating random forest models. In most situations, the BI was overestimated, and variation was large when NP_bi was low. However, the BI could be estimated almost unbiasedly with larger NP_bi (> 50).

[IMAGE OMITTED. SEE PDF]

Discussion

Through simulations, we have demonstrated that both the original calculation of the BI through binning (OBI) and its modified version employing a moving window (i.e. the continuous BI or CBI) are severely affected by low numbers of presences (NP_bi) used in the calculation. Generally, if the NP_bi is not large enough, the BI is very likely to be underestimated by these two methods. This highlights the need to improve methods for calculating the BI, specifically to identify or derive a method to accurately calculate the BI using relatively small values of NP_bi.

A challenge in evaluating the performance of the methods to calculate the BI is that we do not know the true value of the BI for a particular model, and hence we must use a series of large NP_bi to calculate the BI and then find the asymptote, which can be considered an approximation of the true value for the focal model. In our simulations, we know the identities of all the points (both presences and absences, although only the identities of the presence points are needed in calculating the BI), and therefore can use a large NP_bi (up to a few thousand). For simplicity, we used 1000 as the largest NP_bi for the species simulated. With this level of NP_bi, it appears that the asymptote was reached for the two lower levels of prevalence (0.05–0.1 and 0.2–0.3). For the highest level of prevalence (0.7–0.8), the asymptote was also reached for realistic situations, e.g. when there are more presence sites for model training (NP_train > 20) and for calculating the BI (NP_bi > 50).

For the moving window method, through simulation (with nine levels of step size: 0.05, 0.02, 0.01, 0.005, 0.002, 0.001, 0.0005, 0.0002, 0.0001), we have found that using the first few levels of step size, the BI was overestimated and was highly variable, and the estimated BI values became stable from 0.001 (Supporting Information). We used the lowest level of step size (0.0001) to ensure that we obtain a stable estimation of the BI each time.

Our results have demonstrated that all the smoothing-based methods are better than the binning and moving window methods, i.e. the SBI outperforms the OBI and CBI. The reason is that the binning and moving window methods only use local data (i.e. the data within a bin or a window) to estimate the P/E ratio. When the NP_bi is low, the estimated P/E ratios will fluctuate. They may be overestimated in some bins (or windows) and underestimated in others. Since the BI is the Spearman correlation coefficient between the P/E ratios and the mid- or average suitability values within the bins (or windows) or even the rank of the bins (or windows), even if the general trend maintains, a single fluctuation will reduce the estimated value of the BI. This is the nature of correlation coefficient, which is different from fitting curves or calculating average values where positive bias and negative bias can be cancelled. On the contrary, when fitting the smoothers, all evaluation data are used in estimating the parameters of the smoothing functions. It is more likely that the general trend will be captured by the smoothing functions. Therefore, it is more likely to obtain a better estimation of the BI. Although all the six smoothing-based methods (SBI_tp, SBI_cr, SBI_bs, SBI_ps, SBI_ad and SBI_m) performed equally well in this study, SBI_m is expected to exhibit greater robustness due to its ensemble modelling nature (Araújo and New 2007, Pintelas and Livieris 2020).

Our findings are important for real-world applications, as most species do not typically demonstrate high levels of prevalence within a given geographical area. For these species, using the smoothing methods is effective for estimating the BI. However, for species with high levels of prevalence, there are generally more presence records available (often up to thousands). In these cases, we would not be concerned with the issue of low NP_bi and could use a large number of presence records for both the model training (NP_train) and the BI calculation (NP_bi), each of which facilitates the estimation of the BI. The amount of data required for an accurate estimation of the BI also depends on model accuracy. For a highly accurate model, as few as 10 presence sites can provide a sufficiently accurate estimate of the BI. The Maxent models in this study have achieved this level of accuracy. By implementing a commensurate modelling strategy, we can also build very high-quality models using random forest method (e.g. by ensembling multiple random forests). However, if possible, a large number of presence sites (for both NP_train and NP_bi) is always preferred. Therefore, for any species, the continued collection of species records is strongly advocated.

We have demonstrated that AUC increases as unsuitable area (from which the random points for calculating the BI were selected) increases (Supporting information), which is consistent with previous findings (Lobo et al. 2008, Jiménez-Valverde 2012, Jiménez and Soberón 2020). We have also found that sensitivity, specificity and TSS also increase as unsuitable area increases. This means that these accuracy measures could be artificially inflated by extending background area to include more unsuitable areas. It seems interesting for sensitivity since presence sites only exist in the small area, i.e. the two areas (both large area and small area) have the same presence sites. Why would the same model have produced different sensitivity values for the same presence sites? This difference comes from the threshold selection, where specificity is involved. The calculation of specificity uses the absence data, which depends entirely on the bounds of the study area. However, our results have shown that the OBI, CBI and SBI remain unchanged as the unsuitable area increases (Supporting information). This can be explained through simple logical reasoning. As we previously mentioned, the BI is the Spearman correlation coefficient between the P/E ratio r_i = (n_1i/n₁)/(n_0i/n₀) = (n_1i/n_0i)(n₀/n₁) and i (i = 1, 2,…, m). This is equivalent to the correlation between n_1i/n_0i and i (i = 1, 2, …, m), since n₀/n₁ is a constant for a given evaluation dataset. Suppose we enlarge the evaluation dataset with t sites randomly selected from the extended unsuitable area. For reasonable models, predicted suitabilities for these sites will be very low and will most likely fall into the first few bins (e.g. the first three bins with t₁, t₁ and t₃ sites satisfying t = t₁ + t₂ + t₃). Usually, no presences will fall into these bins, i.e. n_1i = 0, resulting in n_1i/n_0i = n_1i/(n_0i + t_i) = 0 (i = 1, 2, 3). The ratios for the remaining bins (i = 4, 5, …, m) do not change at all. Therefore, the extended unsuitable sites will not affect the BI. This is an advantageous property of the BI, as there is no universal rule for selecting the background area. However, this property is only useful for model evaluation with the BI. For model training, we still need to carefully select the background area. Even for model evaluation, it is better to use a carefully selected background area; arbitrarily choosing the background area is not recommended.

We acknowledge that our investigation was conducted using random sampling of presences for both training and evaluation data. While bias in the training data can impact model performance (O'Toole et al. 2021), bias in the evaluation data can affect the assessment of the model's performance (Bean et al. 2012). The extent to which these biases may influence our conclusions will require further investigation in the future.

Acknowledgements

– The authors thank Dr Michael P. Scroggie for comments which improved this manuscript.

Funding

– The authors were funded by Victoria Department of Energy, Environment and Climate Action.

Author contributions

Canran Liu: Conceptualization (lead); Formal analysis (lead); Methodology (lead); Writing - original draft (lead). Graeme Newell: Conceptualization (supporting); Writing - review and editing (equal). Matt White: Conceptualization (supporting); Writing - review and editing (equal). Josephine Machunter: Conceptualization (supporting); Writing - review and editing (equal).

Transparent peer review

The peer review history for this article is available at .

Data availability statement

Data are available from the Mendeley Data: (Liu et al. 2018).

References

Araújo, M. B. and New, M. 2007. Ensemble forecasting of species distributions. – Trends Ecol. Evol. 22: 42–47.

Aubry, K. B., Raley, C. M. and McKelvey, K. S. 2017. The importance of data quality for generating reliable distribution models for rare, elusive, and cryptic species. – PLoS One 12: [eLocator: e0179152].

Barbet‐Massin, M., Jiguet, F., Albert, C. H. and Thuiller, W. 2012. Selecting pseudo‐absences for species distribution models: how, where and how many? – Methods Ecol. Evol. 3: 327–338.

Bean, W. T., Stafford, R. and Brashares, J. S. 2012. The effects of small sample size and sample bias on threshold selection and accuracy assessment of species distribution models. – Ecography 35: 250–258.

Boyce, M. S., Vernier, P. R., Nielsen, S. E. and Schmiegelow, F. K. 2002. Evaluating resource selection functions. – Ecol. Modell. 157: 281–300.

Elith, J. et al. 2006. Novel methods improve prediction of species' distributions from occurrence data. – Ecography 29: 129–151.

Fei, S. and Yu, F. 2016. Quality of presence data determines species distribution model performance: a novel index to evaluate data quality. – Landscape Ecol. 31: 31–42.

Hijmans, R. J., Phillips, S., Leathwick, J. and Elith, J. 2021. dismo: species distribution modeling. – R package ver. 1.3‐5, https://CRAN.R‐project.org/package=dismo.

Hirzel, A. H., Le Lay, G., Helfer, V., Randin, C. and Guisan, A. 2006. Evaluating the ability of habitat suitability models to predict species presences. – Ecol. Modell. 199: 142–152.

Huang, Y., Li, W., Macheret, F., Gabriel, R. A. and Ohno‐Machado, L. 2020. A tutorial on calibration measurements and calibration models for clinical prediction models. – J. Am. Med. Inform. Assoc. 27: 621–633.

Jiménez, L. and Soberón, J. 2020. Leaving the area under the receiving operating characteristic curve behind: an evaluation method for species distribution modelling applications based on presence‐only data. – Methods Ecol. Evol. 11: 1571–1586.

Jiménez‐Valverde, A. 2012. Insights into the area under the receiver operating characteristic curve AUC as a discrimination measure in species distribution modelling. – Global Ecol. Biogeogr. 21: 498–507.

Liaw, A. and Wiener, M. 2002. Classification and regression by randomForest. – R News 2: 18–22.

Lissovsky, A. A., Dudov, S. V. and Obolenskaya, E. V. 2021. Species‐distribution modeling: advantages and limitations of its application. 1. General approaches. – Biol. Bull. Rev. 11: 254–264.

Liu, C., Berry, P. M., Dawson, T. P. and Pearson, R. G. 2005. Selecting thresholds of occurrence in the prediction of species distributions. – Ecography 28: 385–393.

Liu, C., White, M. and Newell, G. 2011. Measuring and comparing the accuracy of species distribution models with presence–absence data. – Ecography 34: 232–243.

Liu, C., White, M. and Newell, G. 2013a. Selecting thresholds for the prediction of species occurrence with presence‐only data. – J. Biogeogr. 40: 778–789.

Liu, C., White, M., Newell, G. and Griffioen, P. 2013b. Species distribution modelling for conservation planning in Victoria, Australia. – Ecol. Modell. 249: 68–74.

Liu, C., Newell, G. and White, M. 2016. On the selection of thresholds for predicting species occurrence with presence‐only data. – Ecol. Evol. 6: 337–348.

Liu, C., White, M. and Newell, G. 2018. Data from: The first six principal components derived from eighteen environmental variables. – Mendeley Data, http://dx.doi.org/10.17632/kyy8f8w79j.1.

Liu, C., Newell, G. and White, M. 2019. The effect of sample size on the accuracy of species distribution models: considering both presences and pseudo‐absences or background sites. – Ecography 42: 535–548.

Lobo, J. M., Jiménez‐Valverde, A. and Real, R. 2008. AUC: a misleading measure of the performance of predictive distribution models. – Global Ecol. Biogeogr. 17: 145–151.

Luo, Z. and Wahba, G. 1997. Hybrid adaptive splines. – J. Am. Stat. Assoc. 92: 107–116.

Meynard, C. N. and Kaplan, D. M. 2012. The effect of a gradual response to the environment on species distribution modeling performance. – Ecography 35: 499–509.

Meynard, C. N., Leroy, B. and Kaplan, D. M. 2019. Testing methods in species distribution modelling using virtual species: what have we learnt and what are we missing? – Ecography 42: 2021–2036.

Miller, J. A. 2014. Virtual species distribution models: using simulated data to evaluate aspects of model performance. – Prog. Phys. Geogr. 38: 117–128.

O'Toole, M., Queiroz, N., Humphries, N. E., Sims, D. W. and Sequeira, A. M. 2021. Quantifying effects of tracking data bias on species distribution models. – Methods Ecol. Evol. 12: 170–181.

Pearce, J. L. and Boyce, M. S. 2006. Modelling distribution and abundance with presence‐only data. – J. Appl. Ecol. 43: 405–412.

Perperoglou, A., Sauerbrei, W., Abrahamowicz, M. and Schmid, M. 2019. A review of spline function procedures in R. – BMC Med. Res. Methodol. 19: 46.

Peterson, A. T., Papeş, M. and Soberón, J. 2008. Rethinking receiver operating characteristic analysis applications in ecological niche modeling. – Ecol. Modell. 213: 63–72.

Phillips, S. J. and Elith, J. 2010. POC plots: calibrating species distribution models with presence‐only data. – Ecology 91: 2476–2484.

Phillips, S. J., Anderson, R. P. and Schapire, R. E. 2006. Maximum entropy modeling of species geographic distributions. – Ecol. Modell. 190: 231–259.

Phillips, S. J., Dudík, M., Elith, J., Graham, C. H., Lehmann, A., Leathwick, J. and Ferrier, S. 2009. Sample selection bias and presence‐only distribution models: implications for background and pseudo‐absence data. – Ecol. Appl. 19: 181–197.

Pintelas, P. and Livieris, I. E. 2020. Special issue on ensemble learning and applications. – Algorithms 13: 140.

Qiao, H., Soberón, J. and Peterson, A. T. 2015. No silver bullets in correlative ecological niche modelling: insights from testing among many potential algorithms for niche estimation. – Methods Ecol. Evol. 6: 1126–1136.

Qiao, H., Peterson, A. T., Campbell, L. P., Soberón, J., Ji, L. and Escobar, L. E. 2016. NicheA: creating virtual species and ecological niches in multivariate environmental scenarios. – Ecography 39: 805–813.

Taavitsainen, V.‐M. 2009. Denoising and signal‐to‐noise ratio enhancement: splines. – In: Brown, S. D., Tauler, R. and Walczak, B. (eds), Comprehensive chemometrics. Elsevier, pp. 67–83.

Valavi, R., Elith, J., Lahoz‐Monfort, J. J. and Guillera‐Arroita, G. 2021. Modelling species presence‐only data with random forests. – Ecography 44: 1731–1742.

Wood, S. N. 2003. Thin‐plate regression splines. – Proc. R. Soc. B 65: 95–114.

Wood, S. N. 2017. Generalized additive models: an introduction with R, 2nd edn. – Chapman and Hall/CRC.

Word count: 6539

Show less

© 2025. This work is published under http://creativecommons.org/licenses/by/3.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Improving the estimation of the Boyce index using statistical smoothing methods for evaluating species distribution models with presence‐only data

Content area

Abstract

Full text