1 Introduction
1.1 Epistemology of scientific inquiry: “model-based” data analysis
Most of the methods reported in handbooks of applied statistics have been developed under the assumption of independence, distributional identity and stationarity, and well-behaved bell-shaped or exponential distributions. Of course, there is a wide array of statistical literature focused on dependence and the related stochastic processes, the lack of distributional identity and nonstationarity, and skewed sub-exponential distributions. However, in some applied sciences such as hydro-climatology, analysts have often neglected the fact that moving from the former set of assumptions (commonly reported in introductory handbooks) to the latter is not just moving to a more general model from one that can be considered to be a special case. A typical example is the generalized extreme value (GEV) distribution, widely used in the study of hydro-climatic extreme events, which converges to the Gumbel distribution as the shape parameter converges to zero. While Gumbel is mathematically a special case of the GEV, assuming that Gumbel is the distribution of choice has relevant consequences as it means assuming exponential tails instead of super-exponential (upper-bounded) or sub-exponential (possibly heavy) tails. In particular, high values of skewness and heavy tails imply possible non-existence of the moments of high order, as well as bias and/or high variability in the estimates of the moments themselves, including variance, covariance, and autocorrelation, as well as long-range dependence.
Similar remarks hold for nonstationarity, which denotes the dependence of the (joint) distribution function of a set of generic random variables on time via well-specified deterministic functions of time.
As with the effects of nonstationarity and heavy tails, there is a wide array of literature on the consequences of the assumption of dependence for statistical inference. Generally, dependence implies information redundancy and a reduced effective sample size, along with variance inflation and bias in the standard estimators of summary statistics such as marginal and joint moments.
The foregoing examples highlight that statistical inference cannot be reduced to the usage of multiple competing models and methods as every aspect of statistical analysis and its interpretation depends on the underlying assumptions according to the rationale of statistical inference. Even the simplest diagnostic diagram relies on underlying assumptions and models:
-
“A statistical model is adopted that supposedly describes both the stochastic characteristics of the observed process and the properties of the method of observation. It is important to be aware of the models implicit in the chosen statistical method and the constraints those models necessarily impose on the extraction and interpretation of information.”
-
“The observations are analysed in the context of the adopted statistical model.”
Based on the foregoing remarks, appropriate statistical inference (and scientific learning) is an iterative “model-based” procedure, which can be summarized as follows:
-
Make assumptions that are deemed to be reasonable for the data and facts at hand.
-
Build tentative theories and models and make inferences accounting for the effect and consequences of the underlying assumptions.
-
Interpret results according to the nature of the adopted assumptions and models.
-
Retain or change/update assumptions and models based on the agreement or disagreement of the developed theories and models with (new) data and facts.
1.2 Box's cookbookery and mathematistry: “test-based” data analysis
Being the prerequisite to any sound scientific investigation, the epistemological principles described in Sect. should be obvious and well-known. However, this does not seem to be the case in some applied sciences where data analysis and modeling often neglect or ignore such principles, thus calling into question the scientific validity of results and conclusions.
In this respect, the statistician George E. P. Box highlighted that scientific progress results from the feedback between theory and practice, and this feedback requires a closed loop. Therefore, when the loop is open, progress stops. Box referred to the main consequences of a lack of feedback between theory and practice as maladies called “cookbookery” and “mathematistry”, claiming that “The symptoms of the former are a tendency to force all problems into the molds of one or two routine techniques, insufficient thought being given to the real objectives of the investigation or to the relevance of the assumptions implied by the imposed method… Mathematistry is characterized by development of theory for theory's sake, which since it seldom touches down with practice, has a tendency to redefine the problem rather than solve it… In such areas as sociology, psychology, education, and even, I sadly say, engineering, investigators who are not themselves statisticians sometimes take mathematistry seriously. Overawed by what they do not understand, they mistakenly distrust their own common sense and adopt inappropriate procedures devised by mathematicians with no scientific experience”.
In the context of climate science, similar remarks were raised in the preface of a well-known textbook: “Cookbook recipes for a variety of standard statistical situations are not offered by this book because they are dangerous for anyone who does not understand the basic concepts of statistics”.
This problem is not new in hydrology and hydro-climatology either, having already been stressed in earlier studies discussing some misconceptions concerning the study and interpretation of hydrological processes and variables by methods borrowed from other disciplines such as systems and decision theory, mathematics, and statistics. For instance, stochastic processes are often no longer considered to be convenient descriptors of hydrological processes for practical purposes; instead, the former are identified with the latter and vice versa, thus generating confusion and a questionable approach to data analysis and modeling that contrasts with the logic of statistical inference recalled in Sect. .
A large body of the literature on data analysis of unrepeatable hydro-climatic processes seems to neglect or ignore epistemological principles, thus confusing the role of observations, assumptions, and models. This results in fallacious procedures that share the following general structure, which we call the “test-based” method:
-
Select several models and methods based on different and often incompatible assumptions.
-
Make inferences neglecting the constraints imposed by different underlying assumptions.
-
Interpret results attempting to prove or disprove models' assumptions, which are often (if not always) erroneously attributed to physical processes, whereas they refer to models used to describe such processes.
1.3 Aims and organization of this study
While previous works provided extemporaneous commentaries about the foregoing issues, we address the problem from a different perspective. Instead of discussing several misconceptions from a general and purely conceptual point of view, we focus on a specific issue (here, the role of dependence in the statistical analysis of the occurrence of extreme precipitation), examining theoretical inconsistencies and showing practical consequences by performing a detailed data analysis. In this way, theoretical remarks are complemented by a side-by-side comparison of the output of model-based and test-based methods, emphasizing the concrete effect of conceptual mistakes. Therefore, this work is a proper neutral validation and/or falsification study.
Focusing on the assumption of spatio-temporal dependence in the analysis of extreme-precipitation frequency, we attempt to show the practical consequences of underestimating and not properly considering and interpreting the effects of dependence, as well as the logical fallacies of the test-based method in this context. In particular, we re-analyze a worldwide precipitation data set, comparing the output of a model-based framework (relying on theoretically informed stochastic modeling and diagnostic plots) with that of a test-based approach that led its authors to conclude that “accounting or not for the possible presence of serial correlation has a very limited impact on the assessment of trend significance in the context of the model and range of autocorrelations considered here” and that “Accounting for serial correlation in observed extreme precipitation frequency has limited impact on statistical trend analyses”. Therefore, this study double-checks the role of scientific logic and spatio-temporal dependence in the analysis and characterization of extreme-precipitation frequency at various spatio-temporal scales. Eventually, we compare the two methodologies (model-based and test-based) in terms of their rationale and output to better understand why and how epistemological fallacies and the consequent improper use of statistical analysis and the treatment of dependence might result in misleading conclusions.
This study is organized according to its specific purpose. Therefore, it does not follow the standard structure of problem–model–application–results. In particular, all technical details of models and methods are reported in the Supplement. Indeed, the aim is to emphasize the practical importance of epistemology in data analysis rather than to focus on models' technicalities, which is one of the conceptual mistakes affecting test-based analysis. In this respect, the specific models and methods used are secondary and replaceable with others, whereas the logical reasoning leading the analysis (but systematically neglected in the test-based approach) stays unchanged.
Based on the foregoing remarks, Sect. introduces the data set. Section presents and discusses the test-based methodology, highlighting shortcomings and pitfalls that lead us to introduce the rationale of a model-based approach, which is described in Sect. . In Sect. , we analyze the frequency data, focusing on (i) the marginal distribution of the annual number of over-threshold (OT) events, (ii) the relationship between the lag-1 autocorrelation and the slope of linear trends estimated from the time series, and (iii) trend analysis of the time series. In this respect, we expand the analysis reported in the original study by focusing on the role of spatio-temporal dependence at various spatio-temporal scales.
2 Data
We analyze daily precipitation time series from a sub-set of gauges extracted from more than 100,000 stations of the Global Historical Climatology Network-Daily (GHCN-D) database.
Figure 1
Map of GHCN rain gauges used in this study, with the four sub-regions (denoted as North America, Eurasia, northwestern Europe, and Australia) that are discussed in the subsequent analysis.
To allow a fair comparison with the existing literature, we followed the original study and selected extreme values to be those exceeding given percentage thresholds according to the at-site empirical cumulative distribution function (ECDF) (including zeros). For each station, the annual number of OT exceedances forms the time series of extreme-precipitation frequencies. Note that “exceedances on consecutive days are counted as separate events”. The original study considered several percentage thresholds from 90 % to 97.5 % and different sub-sets (i.e., the most recent 30 and 50 years, as well as the complete sequences of 100 years). Since their results are consistent across different thresholds, we limit our analysis to the 95 % and 99.5 % thresholds: the former is the one discussed most extensively in that study, while the latter serves to highlight the behavior of the occurrence process of exceedances over a high threshold. As far as record length is concerned, we only use 100 and 50 years as shorter time series of 30 annual data points do not provide reliable information on occurrence processes in terms of spatio-temporal properties. The GHCN-D data set was retrieved and handled via the R-contributed package rnoaa.
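As an illustration of this extraction step, the annual OT counting for the 95 % threshold can be sketched as follows. The daily record here is synthetic and purely hypothetical (the GHCN-D retrieval itself is not reproduced), but the threshold is computed as in the text: from the at-site ECDF including zeros.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical daily precipitation record: ~70 % dry days,
# wet-day amounts gamma-distributed (illustrative values only)
n_years = 50
n_days = n_years * 365
daily = rng.gamma(0.4, 8.0, size=n_days) * (rng.random(n_days) > 0.7)

# At-site 95 % threshold from the empirical CDF, computed INCLUDING zeros
threshold = np.quantile(daily, 0.95)

# Annual number of over-threshold (OT) events; exceedances on consecutive
# days are counted as separate events, as in the study re-analyzed here
annual_counts = (daily.reshape(n_years, 365) > threshold).sum(axis=1)

print(annual_counts.mean())  # close to 0.05 * 365 ≈ 18 events per year
```

By construction, roughly 5 % of all days exceed the threshold, so the mean annual count is about 18 regardless of the (hypothetical) wet-day distribution.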
3 Test-based methodology
3.1 Assumptions and statistical tests
Before introducing the model-based methodology, we first present the test-based approach to data analysis. The aim is two-fold: (i) to explain the motivation for moving from test-based to model-based analysis and (ii) to better understand the discussion in Sect. concerning the differences between the outputs of the two methodologies. For the sake of comparison, we apply the same test-based procedure used in the original study. It consists of the following steps:
-
Firstly, Kolmogorov–Smirnov (KS) and goodness-of-fit tests are used to check whether the annual counts follow a Poisson distribution.
-
Based on the outcome of the first step, two competing models are selected: (i) the non-homogeneous Poisson (NHP) process describing a collection of independent Poisson random variables, with the rate of occurrence linearly varying with time, and (ii) the first-order Poisson integer autoregressive (Poisson-INAR(1)) process, which is a specific kind of stationary and temporally correlated process with a Poisson marginal distribution (see Sect. S1 in the Supplement). These models are used to study the effect of serial correlation.
-
Finally, various statistical tests for trend detection are applied. Due to the similar performance of parametric and nonparametric tests, only one test for each category was retained and discussed, that is, a nonparametric test based on the Mann–Kendall (MK) test statistic and a parametric test based on the slope parameter of a Poisson regression (PR).
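For reference, the nonparametric step of the procedure above can be sketched in a few lines. This is a minimal Mann–Kendall implementation (normal approximation, no tie correction), not the exact code used in the study under scrutiny:

```python
import math
import numpy as np

def mann_kendall(x):
    """Two-sided Mann-Kendall trend test: a minimal sketch using the
    normal approximation and omitting the tie correction for brevity."""
    x = np.asarray(x, float)
    n = len(x)
    s = 0.0
    for i in range(n - 1):
        s += np.sign(x[i + 1:] - x[i]).sum()   # concordant minus discordant pairs
    var_s = n * (n - 1) * (2 * n + 5) / 18.0   # variance of S under independence
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided p value
    return s, z, p

# A strictly increasing series gives the maximum statistic and a tiny p value
s, z, p = mann_kendall(np.arange(20.0))
print(s, p)
```

Note that var_s above is the variance of S under independence; this is exactly the point questioned later in the text, since under serial dependence this null distribution is no longer valid.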
The aim of this trend analysis of the annual counts is to investigate the impact of serial correlation on trend detection and the effect of performing multiple tests. Even though technicalities (such as the selection of a particular test or model) can change from case to case, it is easy to recognize that the foregoing procedure follows the same test-based rationale of trend analyses reported in the majority of the literature on this topic. However, are we sure that such a way of analyzing data is compliant with the scientific method and its logic? Can we consider test-based analysis to be theoretically consistent solely due to its widespread application? In the next section, we start to answer these questions.
3.2 Remarks on logical fallacies of the test-based methodology
As mentioned in Sect. , every statistical analysis (including diagnostic plots) relies on some assumptions, and these assumptions guide the interpretation of results. In this respect, while the simultaneous application of the techniques or tests listed in the previous section seems reasonable, a closer look reveals that the assumptions behind these statistical methods are not compatible with each other and might yield logical contradictions.
For example, KS and goodness-of-fit tests, which are used to check the suitability of the Poisson distribution for the annual counts, are valid under the assumption that the underlying process is independent and identically distributed. Therefore, if these tests pointed to a lack of rejection of Poisson distributions, this would also imply independence and distributional identity (i.e., stationarity). In fact, if the process were not independent, dependence would generate information redundancy and over-dispersion (variance larger than the mean) of the observed finite-size samples, thus making the Poisson model unsuitable from a theoretical standpoint. On the other hand, if the process were time-nonstationary (i.e., not identically distributed), we could not conclude that the distribution of the counts is a single, specific distribution, such as Poisson. In that case, the distribution of the counts can be, at most, conditionally Poisson (i.e., a Poisson distribution describes the process resulting from filtering nonstationarity out) or compound Poisson (i.e., a compound Poisson distribution describes the process resulting from incorporating nonstationarity by model averaging over time). Therefore, if the Poisson assumption is not rejected according to tests that imply independence and stationarity, we must necessarily exclude the alternative assumptions of dependence and nonstationarity, thus concluding that there is no reason to further proceed with any subsequent analysis of trends and/or dependence. This is the first logical contradiction of the test-based procedure described in Sect. .
Nonetheless, let us overlook the foregoing contradiction and move to the second step of the procedure. If we assume that the process is NHP, this model implies that the data follow a different Poisson distribution at each time step (see Sect. S1 in the Supplement). In this case, the overall distribution of the observed counts is compound Poisson, and, more importantly, the nonstationarity of NHP hinders the application of KS and goodness-of-fit tests as the process does not fulfill the underlying assumptions of these tests. Indeed, these tests can, at most, be applied to conditional processes – that is, to values resulting from filtering the effect of “non-homogeneity” out.
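The over-dispersion produced by pooling a non-homogeneous Poisson process over time is easy to verify with a short simulation. The linearly drifting rates below are hypothetical, chosen only to make the effect visible:

```python
import numpy as np

rng = np.random.default_rng(1)

# Non-homogeneous Poisson sketch: each year's count is Poisson, but with a
# DIFFERENT rate that drifts linearly in time (hypothetical rate values)
years = 100
rates = np.linspace(10.0, 30.0, years)
counts = rng.poisson(rates)

# Pooling over time mixes the per-year Poisson laws: the pooled variance
# picks up the variance of the rate itself, so the variance-to-mean ratio
# (VMR) exceeds 1 even though every single year is exactly Poisson
vmr = counts.var(ddof=1) / counts.mean()
print(round(vmr, 2))  # well above 1: over-dispersed, hence not marginally Poisson
```

This is the sense in which the pooled sample can be, at most, compound (mixed) Poisson: a KS or goodness-of-fit test applied to the pooled counts does not test the per-year Poisson hypothesis at all.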
Similar remarks hold for the assumption of dependence. It is well known that dependence introduces information redundancy that impacts goodness-of-fit tests, inflating the variance of their test statistics, which must be adjusted accordingly. Variance inflation affects any sampling summary statistics, including sample variance and auto- or cross-correlation, as well as the shape of the distribution of the counts, which is expected to be over-dispersed. Therefore, under dependence, the marginal distribution of the counts cannot be Poisson, and Poisson-INAR(1) models are likely to be unsuitable. In other words, the preliminary application of goodness-of-fit tests neglecting the subsequent assumption of dependence yields results that are also theoretically invalid in this case because (i) the distribution of the test statistics under independence is not valid under dependence, and (ii) the Poisson distribution is not a valid candidate model for the counts under dependence. Therefore, assuming Poisson-INAR(1) models is incompatible not only with the first step of the test-based procedure but also with its own assumption of dependence.
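The over-dispersion induced by serial dependence in the daily occurrence process can likewise be illustrated by simulation. Here daily exceedance indicators come from a two-state Markov chain, a deliberately simple stand-in for persistent daily precipitation (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def annual_counts_markov(p11, n_years=500, p=0.05, block=365):
    """Annual counts of daily exceedances from a two-state Markov chain with
    stationary exceedance probability p and persistence P(1|1) = p11."""
    p01 = p * (1.0 - p11) / (1.0 - p)   # keeps the marginal exceedance probability at p
    state = 0
    counts = np.zeros(n_years)
    for y in range(n_years):
        c = 0
        for _ in range(block):
            state = int(rng.random() < (p11 if state == 1 else p01))
            c += state
        counts[y] = c
    return counts

iid = annual_counts_markov(p11=0.05)   # P(1|1) = p: no persistence (iid days)
dep = annual_counts_markov(p11=0.50)   # strong day-to-day persistence

vmr_iid = iid.var(ddof=1) / iid.mean()
vmr_dep = dep.var(ddof=1) / dep.mean()
print(round(vmr_iid, 2), round(vmr_dep, 2))
```

Both chains have the same marginal exceedance probability and hence the same mean annual count, but only the persistent one yields a variance-to-mean ratio well above 1, i.e., annual counts incompatible with a Poisson marginal.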
These remarks should clarify why adding or relaxing fundamental assumptions such as (in)dependence and (non)stationarity cannot be reduced to just introducing or removing (or setting to zero) some parameters. Changing these assumptions deeply changes the inferential framework, as well as the expected properties of the observed processes. Generally, conclusions and results obtained under one set of assumptions cannot be used to support further analysis and models that are valid under a different set of hypotheses. In fact, the results obtained under one set might not be valid under the other and vice versa. In some cases, models and tests need to be adjusted, and results need to be updated accordingly. In other cases, models and tests used under one set of assumptions might not even exist under the other.
The foregoing discussion indicates that an analysis based on models or tests relying on mixed assumptions is prone to severe logical inconsistencies and misleading conclusions. A proper statistical analysis should instead rely on well-specified assumptions, adopting inference procedures and methods that agree with those assumptions and thus guaranteeing a coherent interpretation of results according to the genuine scientific logic recalled in Sect. . In the next sections, we introduce this kind of approach and further investigate the aforementioned issues and their practical consequences in real-world data analysis.
4 Model-based methodology: recovering the seemingly forgotten scientific method
The approach described in Sect. was referred to as test-based as it generally involves extensive application of several statistical tests and massive use of Monte Carlo simulations, with little or no attention to exploratory data analysis and theoretical assumptions and, thus, their consequences on the interpretation of results. Therefore, we move from statistical tests and their binary and often uninformative outputs (see further discussion in Sect. ) to a model-based approach supported by preliminary theoretical considerations and simple but effective graphical exploratory analysis. The underlying idea is to avoid moving among different and possibly incompatible assumptions, focusing on a single one that is considered to be realistic for the process studied. Thus, we build theoretically supported models and methods that fulfill that set of assumptions, and we check whether the framework is able to reproduce key properties of the process of interest (here, the annual counts of OT events) over a reasonable range of spatio-temporal scales. Note that this approach is nothing but the standard procedure of scientific inquiry, which has, however, seemingly been forgotten in a large body of literature dealing with statistical analysis of hydro-climatic data.
As recalled in Sect. , the model-based approach implies the following steps: (i) introduce reasonable assumptions based on preliminary observations and knowledge of the process of interest, (ii) deduce models and diagnostic tools that are consistent with those assumptions, (iii) compare model outputs and observations, (iv) update assumptions and/or models based on the outcome of stage (iii), and (v) iterate the procedure if required.
In the present case, the first two steps of the foregoing procedure specialize as follows:
-
Since the precipitation process exhibits recognizable spatio-temporal patterns evolving over various spatio-temporal scales, the assumptions of independence and distributional identity are reasonably untenable. Therefore, we relax the assumption of independence for the annual counts while retaining that of stationarity. The aim is to keep the modeling task as simple as possible and to check whether models and methods incorporating spatio-temporal dependence provide a reasonable description of the observations.
-
Based on the assumption of dependence, we introduce models for marginal distributions and temporal dependence, attempting to balance parsimony and generality. These models are complemented with diagnostic diagrams and statistical tests purposely selected to be consistent with the assumption of dependence. Note that the statistical tests are introduced in the model-based framework only to allow comparison with the test-based procedure, although they are not even applicable to data from unrepeatable hydro-climatic processes (see further discussion in Sect. ).
4.1 Modeling marginal distributions
The choice of potential distributions for the annual counts should consider four factors: (i) the size of the blocks of observations over which we compute the counts, (ii) the finite sample size of records, (iii) the threshold used to select OT events, and (iv) the effect of dependence.
The number of OT events is calculated over 365 d time windows, meaning that each annual count can be interpreted as the number of successes or failures occurring over a finite number of Bernoulli trials. The sample size of the resulting time series is, at most, 100, which is the number of available years of records, excluding possible missing values. Moreover, since the ECDF-based thresholds include zeros, a realistic average probability of zero daily precipitation implies that the corresponding non-exceedance probability of the non-zero values is distinctly lower than the nominal threshold and, thus, not very high if one would like to focus on extreme values; for example, if 70 % of days were dry, a 95 % threshold including zeros would correspond to the (0.95 − 0.70)/(1 − 0.70) ≈ 0.83 quantile of the non-zero values. Only for the highest thresholds does the probability of the non-zero values remain suitably extreme, even in wet climates.
More importantly, spatio-temporal dependence affects the marginal distribution of the counts and the (inter-)arrival times of OT events over finite-size blocks of observations (such as the 365 d forming 1-year blocks). As mentioned in Sect. , spatio-temporal dependence results in information redundancy and over-dispersion so that the distribution of (inter-)arrival times is expected to depart from the exponential (which is instead valid for independent events), becoming sub-exponential and Weibull-like.
Recalling that a Poisson distribution is characterized by equi-dispersion (i.e., equality of mean and variance), plotting sample variances versus sample means can provide an effective diagnostic plot to check whether a Poisson distribution can be a valid model for the observed counts.
To allow a direct and fair comparison with the test-based methodology described in Sect. , we complemented the KS test with two additional tests whose test statistics are (i) the Pearson product moment correlation coefficient (PPMCC) on the probability–probability plots and (ii) the variance-to-mean ratio (VMR).
These tests are purposely chosen to highlight the inconsistencies resulting from a test-based methodology if we neglect the rationale of the tests, as well as informative exploratory analysis. In fact, as the PPMCC test relies on empirical and theoretical frequencies, that is, standardized ranks, it misses the information about the absolute values of the counts, and it is therefore the least powerful test among the three. On the other hand, the KS test includes such information. However, it is also a general test devised for any distribution, while the VMR test focuses on a specific property characterizing the Poisson distribution, meaning that it is specifically tailored for the problem at hand. Therefore, the VMR test tends to have higher discrimination power under the expected over-dispersion of correlated OT events. Indeed, it was found to be the most powerful among several alternatives in these circumstances.
For each location, the distributions of the test statistics under the null hypothesis are estimated by simulating samples with the same size as the observed time series from a Poisson distribution, with the rate parameter being equal to the observed rate of occurrence.
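This Monte Carlo scheme can be sketched for the VMR statistic as follows. The sample sizes, simulation count, and the negative-binomial example data are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(123)

def vmr_pvalue(obs, n_sim=5000):
    """One-sided Monte Carlo test of Poisson equi-dispersion: under H0 the
    sample is iid Poisson with the observed rate; large VMR values indicate
    the over-dispersion expected for correlated OT occurrences."""
    obs = np.asarray(obs)
    vmr_obs = obs.var(ddof=1) / obs.mean()
    # simulate same-size Poisson samples with the observed rate of occurrence
    sims = rng.poisson(obs.mean(), size=(n_sim, len(obs)))
    vmr_sim = sims.var(axis=1, ddof=1) / sims.mean(axis=1)
    return (np.sum(vmr_sim >= vmr_obs) + 1.0) / (n_sim + 1.0)

# Hypothetical over-dispersed counts: negative binomial with mean 20, VMR = 3
over = rng.negative_binomial(10, 1.0 / 3.0, size=100)
p_over = vmr_pvalue(over)
print(p_over < 0.05)  # the Poisson hypothesis is clearly rejected
```

The same function applied to a genuinely Poisson sample would return an approximately uniform p value, which is the behavior expected under the null.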
4.2 Modeling dependence and nonstationarity: distinguishing assumptions and models
The aim of the test-based methodology described in Sect. is to investigate the nature of possible trends in the count time series. The underlying idea is that such trends could be a spurious effect of serial dependence or, vice versa, that the serial dependence could be a spurious effect of deterministic trends. Therefore, the Poisson-INAR(1) model is used to check the former case, while NHP is used to check the latter.
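The Poisson-INAR(1) process named above is commonly defined through binomial thinning; a minimal sketch of this standard construction (the paper's own formulation is in Sect. S1 of the Supplement) is:

```python
import numpy as np

rng = np.random.default_rng(3)

def poisson_inar1(lam, alpha, n):
    """Poisson-INAR(1): X_t = alpha o X_{t-1} + eps_t, where 'o' is binomial
    thinning and eps_t ~ Poisson(lam * (1 - alpha)), so the stationary
    marginal is exactly Poisson(lam) with lag-1 autocorrelation alpha."""
    x = np.empty(n, dtype=int)
    x[0] = rng.poisson(lam)
    for t in range(1, n):
        survivors = rng.binomial(x[t - 1], alpha)        # alpha o X_{t-1}
        x[t] = survivors + rng.poisson(lam * (1 - alpha))
    return x

x = poisson_inar1(lam=10.0, alpha=0.5, n=20000)
vmr = x.var(ddof=1) / x.mean()
print(round(x.mean(), 2), round(vmr, 2))
# the marginal stays equi-dispersed (VMR close to 1) despite serial correlation
```

The simulation makes the text's point concrete: Poisson-INAR(1) is serially correlated yet marginally equi-dispersed, so it cannot reproduce the over-dispersion that dependence in the underlying daily process is expected to induce in the annual counts.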
While this approach seems to be reasonable at first glance, it suffers from technical problems related to neglecting formal definitions of stationarity and trend, along with lacking a clear distinction between population and sample properties. Note that the word “stationarity” used throughout this study refers to the formal definition adopted in mathematical statistics, which is the basis of theoretical derivations. The word “stationarity” is often used with different meanings in the hydro-climatological literature. However, such informal definitions do not apply to statistical inference and might generate confusion. These issues are discussed in depth in the literature.
As mentioned in Sects. and , a selected statistical model should describe the stochastic properties of the observed process, and results should be interpreted according to the constraints posed by model assumptions. In the case of the annual counts, the underlying question is whether possible “monotonic” fluctuations are deterministic (resulting from a well-identifiable generating mechanism) or stochastic (as an effect of dependence, for instance). In the former case, we work under the assumption of independence and nonstationarity, whereas, in the latter case, we work under the assumption of dependence and stationarity. Both assumptions are very general and correspond to a virtually infinite set of possible model classes and structures. Therefore, we need to recall the following:
-
Every model developed under a specific set of assumptions is only valid under its own set of assumptions and cannot be used to validate the assumptions it relies on as it cannot exist under different assumptions. For example, Poisson-INAR(1) cannot be used to assess the stationarity assumption as it is not defined (it does not exist) under nonstationarity.
-
No specific model can be representative of the infinite types of models complying with the same set of assumptions. For example, if Poisson-INAR(1) models do not provide a good description of the counts, this does not exclude the fact that other dependent and stationary models with different marginal distributions and linear or nonlinear dependence structures can faithfully describe them.
-
Every model developed under a specific set of assumptions cannot provide information about different sets of assumptions. For example, if we assume independence and nonstationarity and show that the NHP models describe all the properties of interest of an observed process, we can conclude that the NHP models provide a good description of the data, but we cannot say anything about the performance of models complying with the assumption of dependence and stationarity and vice versa.
-
Stationarity and nonstationarity are not properties of the observed hydro-climatic processes (finite-size observed time series) but are instead assumptions of the models we deem to be suitable to describe physical processes.
-
Poisson-INAR(1) and NHP are valid only under their own assumptions and do not represent the entire class of possible stationary and nonstationary models. Therefore, discarding one of them does not imply invalidating their own assumptions (and the corresponding infinite classes of models), and, for sure, it does not allow invalidation of the assumptions of the alternative model as each model could not even exist under the assumptions of the other one.
In this study, we use so-called “surrogate” data to represent the class of stationary dependent models, minimizing the number of additional assumptions and constraints. In particular, we apply the iterative amplitude-adjusted Fourier transform (IAAFT), which is a simulation framework devised to preserve (almost) exactly the observed marginal distribution and power spectrum – and therefore the autocorrelation function (ACF) – under the assumption of stationarity (see Sect. S4 in the Supplement). In the present context, IAAFT can be considered to be a “semi-parametric” approach. Indeed, it does not make any specific assumption regarding the shape of the distribution of (preserving the empirical one), while a parametric dependence structure is needed to correct the bias of the empirical periodogram caused by temporal dependence (see Sect. S4 in the Supplement).
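A compact implementation of the IAAFT iteration described above might look as follows. This is a simplified sketch (the convergence criterion is reduced to a fixed number of iterations, and the spectrum-bias correction mentioned in the text is omitted):

```python
import numpy as np

def iaaft(x, n_iter=100, seed=0):
    """IAAFT surrogate: preserves the empirical marginal distribution exactly
    and the power spectrum (hence the ACF) approximately, under stationarity."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float)
    sorted_x = np.sort(x)
    target_amp = np.abs(np.fft.rfft(x))   # spectral amplitudes of the original
    s = rng.permutation(x)                # start from a random shuffle
    for _ in range(n_iter):
        # step 1: impose the target power spectrum, keeping the current phases
        phases = np.angle(np.fft.rfft(s))
        s = np.fft.irfft(target_amp * np.exp(1j * phases), n=len(x))
        # step 2: restore the exact empirical marginal by rank-ordering
        s = sorted_x[np.argsort(np.argsort(s))]
    return s

# Example: a surrogate of a correlated series keeps exactly the same values,
# reshuffled so as to (approximately) match the original power spectrum
x = np.cumsum(np.random.default_rng(1).normal(size=256))
s = iaaft(x)
print(np.allclose(np.sort(s), np.sort(x)))  # True: identical marginal
```

Because the final operation of each iteration is the rank remap, the surrogate's marginal distribution matches the observed one exactly, while the spectrum is matched only approximately, as stated in the text.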
If the model of choice is well devised, it is expected to mimic the observed fluctuations of the counts, including apparent trends. It follows that the statistical tests for trends used in the test-based methodology are expected to yield a rejection rate close to the nominal significance level. Here, we use IAAFT samples as a more general stationary alternative closer to the observed time series in terms of marginal distribution and ACF. IAAFT simulations are used to derive the sampling distributions of the MK and PR tests under the assumption of temporal dependence.
4.3 Field significance under dependence
The false discovery rate (FDR) approach is devised to control the rate of false rejections with respect to the number of rejections rather than with respect to the significance level – that is, the rate of false rejections with respect to the total number of performed tests. As a consequence, it has been highlighted that “The power of all the methods [Bonferroni, Hochberg, and Benjamini–Hochberg methods] decreases when the number of hypotheses tested increases – this is the cost of multiplicity control”. However, the decreased power does not justify the recommendation of going back to at-site results, as suggested in the literature, thus de facto overlooking field significance. In fact, field significance might be affected by unknown factors generating spurious results in terms of statistical tests at the local, regional, or global scale. In this respect, looking at clusters of rejections of the hypothesis of “no trend”, whereby trends have the same sign in a given area, as suggested in some studies, might be misleading as this is exactly the expected behavior under spatio-temporal dependence or dependence on an exogenous (common) forcing factor.
Technically, the output of all statistical tests is analyzed in terms of values and FDR diagrams reporting the sorted values versus their ranks
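The FDR criterion behind these diagrams (sorted values compared with a diagonal reference line) is the Benjamini–Hochberg step-up rule. A minimal Python sketch, assuming independent or positively dependent tests:

```python
import numpy as np

def bh_fdr_rejections(p_values, q=0.05):
    """Benjamini-Hochberg step-up rule: find the largest rank k such that
    the k-th sorted p value lies below the diagonal q * k / n, then
    reject the k smallest p values. Returns a boolean mask in the
    original order of the tests."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    diagonal = q * np.arange(1, n + 1) / n        # the FDR reference line
    below = p[order] <= diagonal
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k_max = np.max(np.nonzero(below)[0])      # largest rank below line
        reject[order[: k_max + 1]] = True
    return reject
```

Points below the diagonal in the FDR diagrams correspond to `True` entries of this mask.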
4.4 Summary of model-based analysis
Summarizing Sect. , , and , in a model-based approach, marginal distributions are parameterized by models (when needed), which are consistent with the assumption of spatio-temporal dependence. Goodness of fit is checked by suitable diagnostic diagrams, such as plots of versus and probability plots, and statistical tests purposely devised to discriminate under-, equi-, or over-dispersion.
For the sake of comparison with results reported in the existing literature, IAAFT is used to simulate synthetic samples of . This simulation method is semi-parametric in the sense that it uses the empirical marginal distributions of , whereas a Hurst–Kolmogorov parametric dependence structure is used to correct the bias of the empirical periodogram caused by temporal dependence (see Sect. S4 in the Supplement).
IAAFT samples are used to build the empirical distribution of the test statistics of the MK and PR tests, accounting for the effect of dependence in trend analysis. Such tests are performed both locally and globally, accounting for test multiplicity via FDR. Note that the use of statistical tests is not very meaningful in the context of unrepeatable processes; they are used here only for the sake of comparison with the test-based approach, to highlight their inherent redundancy and/or inconsistency.
Finally, we stress once again that, in a model-based approach (i.e., standard statistical inference, as it should be), the choice of the foregoing candidate models and methods is based on theoretical considerations about the effect of a finite sample size, threshold selection, and dependence, as discussed in Sect. , , and . This contrasts with the test-based approach, which relies instead on several distributions, diagnostic plots, and statistical tests corresponding to heterogeneous assumptions. Such test-based methods are generally selected without assessing whether they are fit for purpose and are used while neglecting the effect of their assumptions on the inference procedure. This approach yields contradictory results that are further discussed in Sect. .
5 Data analysis
5.1 Marginal distribution of extreme- occurrences: Poissonian?
As mentioned in Sect. , preliminary exploratory analysis based on simple but effective diagnostic plots is often neglected even though it might be more informative than the binary output of statistical tests. Graphical diagnostics are a key step of the model-based approach. Concerning the marginal distribution of , the most obvious preliminary analysis consists of checking under-, equi-, and/or over-dispersion.
Figure a–f show the diagrams of variance versus mean ( versus ), comparing the scatterplots corresponding to observed OT values over the 95 % and 99.5 % thresholds with those corresponding to samples of the same size drawn from Poisson, NHP, and distributions with parameters estimated based on the observed samples. In more detail, for each of the 1106 records, 100 synthetic time series of are simulated from Poisson, NHP, and models, thus calculating the ensemble averages of the 100 sample means and variances estimated for each simulated sample of . Figure a–f display such ensemble averages as circles, along with horizontal and vertical segments denoting the range of mean and variance values obtained over 100 simulations for each of the 1106 records.
As expected, the variance and mean of observed are not aligned along the theoretical line (dashed line) characterizing the Poisson behavior, not even when considering the sampling uncertainty (Fig. a–f). The patterns of observed means and variances of OT data do not even match those of NHP samples, which under-represent the expected and remarkable over-dispersion (Fig. b and e). On the other hand, the distribution provides variance values closer to the observed ones, indicating that the variance inflation is consistent with the assumption of temporally dependent occurrences, as expected from preliminary theoretical considerations.
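The dispersion diagnostic underlying these diagrams reduces to comparing the variance-to-mean ratio (VMR) of observed and simulated counts. A minimal sketch with hypothetical parameter values (rate, block length, and over-dispersion parameter chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(42)

def vmr(counts):
    """Variance-to-mean ratio: ~1 for Poisson counts, >1 under
    over-dispersion."""
    return np.var(counts, ddof=1) / np.mean(counts)

# Equi-dispersed reference: annual OT counts from a Poisson model
poisson_counts = rng.poisson(lam=18.0, size=5000)

# Over-dispersed alternative: beta-binomial counts over n trials with
# over-dispersion parameter rho (hypothetical values)
n, p, rho = 365, 0.05, 0.1
a = p * (1 - rho) / rho
b = (1 - p) * (1 - rho) / rho
bb_counts = rng.binomial(n, rng.beta(a, b, size=5000))
```

The Poisson sample clusters around VMR ≈ 1, while the beta-binomial sample shows variance inflation by a factor of roughly 1 + (n − 1)ρ, which is the kind of over-dispersion exhibited by the observed OT counts.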
Figure 2
Diagrams of sample variance versus sample mean of the annual occurrences of OT values and POT for the 95 % and 99.5 % thresholds. Observed values (in black) are compared with those corresponding to simulated samples from Poisson, NHP, and beta-binomial () distributions (in orange). Orange circles denote ensemble averages, while horizontal and vertical segments denote the range of mean and variance values obtained over 100 simulations for each of the 1106 records. Dashed gray lines indicate the lines representing the equality of mean and variance characteristics of Poisson distributions.
[Figure omitted. See PDF]
For the sake of completeness, we also considered the peaks over threshold (POT) – that is, the maximum values of independent clusters of OT values, where each cluster is considered to be a single event. Clusters are identified as sequences of positive values of daily precipitation separated by one or more dry days. Different inter-arrival times for cluster identification can be used, but this parameter is secondary in the present context.
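The declustering step can be made concrete with a short sketch: wet spells are runs of positive daily precipitation, and each spell contributes at most one peak.

```python
def cluster_peaks(daily_precip, threshold):
    """Peaks over threshold (POT): split the record into wet spells
    (runs of positive daily values separated by at least one dry day),
    take each spell's maximum, and keep it if it exceeds the threshold.
    Minimal sketch of the cluster identification described in the text."""
    peaks = []
    current_max, in_spell = 0.0, False
    for value in daily_precip:
        if value > 0:                      # wet day: spell continues
            in_spell = True
            current_max = max(current_max, value)
        elif in_spell:                     # dry day closes the spell
            if current_max > threshold:
                peaks.append(current_max)
            current_max, in_spell = 0.0, False
    if in_spell and current_max > threshold:
        peaks.append(current_max)          # spell running to record end
    return peaks
```

For example, `cluster_peaks([0, 3, 7, 0, 0, 2, 0, 9, 11, 4], 5)` returns `[7, 11]`: two of the three wet spells have a maximum above the threshold.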
The Poisson distribution might be a valid asymptotic model for peaks over high thresholds under suitable conditions
Since Poisson and NHP provide similar results in terms of VMR for all cases, we retain the former and use probability plots (probability versus quantiles) to further check the agreement between observed data and models. We compare ECDFs with the CDFs and of the Poisson and models, respectively. Probability plots are complemented with diagrams of the differences versus . For the 95 % threshold, Fig. a and b show that the Poisson distribution cannot account for the observed variability in the empirical distributions, while models cover the range of variability thanks to the additional parameter summarizing the intra-block autocorrelation. This is consistent with the interpretation of over-dispersion as an effect of temporal dependence already highlighted in Fig. . The diagrams of versus in Fig. e and f confirm that the models are closer to the empirical distributions for a wider range of values compared with Poisson. For the 99.5 % threshold, we have similar results (Fig. c, d, g, and h), indicating that the temporal dependence of the generating processes still plays a role despite the apparently low correlation of time series.
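The difference diagrams can be reproduced with a few lines: compare the ECDF of the counts with the CDF of a Poisson model fitted by its mean. A minimal numpy-only sketch:

```python
import numpy as np

def poisson_cdf(k_max, mu):
    """CDF of a Poisson(mu) variable on 0..k_max, via the recursive pmf
    p_k = p_{k-1} * mu / k."""
    pmf = np.empty(int(k_max) + 1)
    pmf[0] = np.exp(-mu)
    for k in range(1, int(k_max) + 1):
        pmf[k] = pmf[k - 1] * mu / k
    return np.cumsum(pmf)

def ecdf_minus_poisson(counts):
    """Point-wise difference between the ECDF of the counts and the
    fitted Poisson CDF, as in the difference diagrams described in the
    text: deviations beyond sampling noise flag mis-specified
    dispersion."""
    counts = np.asarray(counts)
    k = np.arange(counts.max() + 1)
    ecdf = np.searchsorted(np.sort(counts), k, side="right") / counts.size
    return k, ecdf - poisson_cdf(counts.max(), counts.mean())
```

Applied to equi-dispersed counts the differences stay small, whereas over-dispersed counts with the same mean produce large systematic excursions.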
Figure 3
Panels (a)–(d) show the ECDFs of for the 95 % and 99.5 % thresholds and for each of the 1106 rain gauges. ECDFs are complemented with the median, lower, and upper limits of the ensemble of the Poisson and beta-binomial () models corresponding to each rain gauge. Panels (e)–(h) show the median, lower, and upper limits of the differences between ECDFs and Poisson and distributions. Median, lower, and upper values are computed point-wise for each value of the number of OT events .
[Figure omitted. See PDF]
The outcome of our exploratory analysis disagrees with results of goodness-of-fit tests reported by , who concluded that the hypothesis of the Poisson distribution for cannot be rejected in more than 95 % of the gauges at for all values of the percentage threshold and number of years. Therefore, we applied the three goodness-of-fit tests described in Sect. to double-check the results of our exploratory analysis and to allow a direct comparison. Figure shows the FDR diagrams reporting the sorted values versus their ranks
Figure 4
Illustration of FDR criterion for (diagonal gray line) corresponding to . Plotted points are the sorted values of 1106 local tests for KS, PPMCC, and VMR tests. Points below the diagonal lines represent significant results (i.e., rejections of ) according to FDR control level.
[Figure omitted. See PDF]
Therefore, the theoretical arguments discussed in Sect. and the foregoing exploratory analysis indicate that the distribution might be a good candidate distribution to describe , while neither Poisson nor NHP are suitable options. This calls into question the use of the Poisson-INAR(1) model as a stationary dependent reference to be used in the subsequent trend analysis. Note that the emergence of the distribution is inherently related to the assumption of serial dependence. In other words, even though we can build autocorrelated processes with whatever marginal distribution, and even though these models (e.g., Poisson-INAR(1)) can be technically correct from a mathematical standpoint, they are not necessarily consistent with the studied process. In the present case, theoretical reasoning tells us that the distribution of over finite-size blocks under the assumption of temporal dependence is over-dispersed and cannot be Poissonian. It follows that autocorrelated processes with Poisson marginal distributions are known in advance to be unsuitable for these OT processes, even though they can be mathematically correct and suitable in other circumstances.
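The theoretical point that temporal dependence alone produces over-dispersed block counts can be checked with a short Monte Carlo experiment. The sketch below uses a binary Markov chain with hypothetical marginal probability and lag-1 autocorrelation (values chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)

def markov_binary(n_days, p=0.05, rho=0.3):
    """Stationary binary (event / no-event) Markov chain with marginal
    event probability p and lag-1 autocorrelation rho. Hypothetical
    parameter values, for illustration only."""
    p11 = p + rho * (1 - p)          # P(event | event on previous day)
    p01 = p * (1 - rho)              # P(event | no event on previous day)
    u = rng.random(n_days)
    x = np.empty(n_days, dtype=np.int64)
    x[0] = u[0] < p
    for t in range(1, n_days):
        x[t] = u[t] < (p11 if x[t - 1] else p01)
    return x

def block_vmr(x, block=365):
    """Variance-to-mean ratio of annual (365 d) event counts."""
    counts = x[: (x.size // block) * block].reshape(-1, block).sum(axis=1)
    return np.var(counts, ddof=1) / np.mean(counts)
```

For an independent Bernoulli series the annual counts are binomial with VMR ≈ 1 − p, whereas the dependent chain with the same marginal probability inflates the VMR well above 1: a Poisson marginal model for the counts is ruled out before any fitting is attempted.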
It is also worth noting how simple diagnostic plots supported by theoretical arguments concerning the stochastic properties of the studied process provide more information than the binary output (rejection or no rejection) of whatever statistical test and also help with the identification of consistent models. On the other hand, statistical tests might be misleading. They suffer not only from several logical and technical inconsistencies but also from trivial problems related to their choice (see discussion in Sect. ). In fact, while KS, PPMCC, and VMR tests seem to be suitable choices, they have very different power for the specific problem at hand and can lead to contrasting conclusions without providing insights into the actual nature of the investigated process.
5.2 Stationary or nonstationary models?
5.2.1 Linear correlation versus linear trends: practical consequences of confusing assumptions with models
The test-based approach led to discarding temporal dependence as a possible cause of apparent trends in time series based on the disagreement between trend slopes estimated using observed data and Poisson-INAR(1)-simulated samples. In particular, this conclusion is based on diagrams of the lag-1 autocorrelation coefficient versus the slope of the linear trend estimated based on observed time series and sequences simulated by NHP and Poisson-INAR(1). Since the foregoing exploratory analysis indicated that the NHP and Poisson-INAR(1) models are not consistent with the marginal distributions of , we used IAAFT samples as a more realistic alternative.
Figure a compares the observed pairs with those resulting from the Poisson-INAR(1) models. Figure a is similar to Fig. 4a in , which compares observed pairs with those corresponding to independent Poisson variables. Figure a confirms that the Poisson-INAR(1) models do not provide a good description of the observed behavior, as expected for models that cannot even reproduce marginal distributions. On the other hand, the pattern of the pairs estimated from IAAFT samples matches that of the observed pairs much better (Fig. b).
Figure 5
Scatterplots of the pairs for the 1106 observed time series over the 95 % threshold and over 100 years (1916–2015), along with pairs corresponding to Poisson-INAR(1) samples (a), pairs corresponding to IAAFT samples (b), 95 % CIs of for IAAFT and NHP (c), and 95 % CIs of for IAAFT and Poisson-INAR(1) (d).
[Figure omitted. See PDF]
Following , we also simulated 10 000 samples from (i) the NHP model for fixed values of ranging between and 0.2, (ii) Poisson-INAR(1) for ranging between 0 and 0.8, and (iii) IAAFT. The 10 000 samples from the three models allow the estimation of two different conditional probabilities – that is, and , respectively. Therefore, we can estimate the confidence intervals (CIs) of the conditional variables for NHP and for Poisson-INAR(1). Figure c and d show these point-wise CIs. Even though their comparison is unfair since they refer to different conditional distributions, discarded Poisson-INAR(1) as the CIs of for NHP cover the observed pairs better than the CIs of .
However, as mentioned in Sect. , if a specific model in the class of stationary models does not fit well, this does not enable us to discard the entire class. In fact, the conditional CIs of built from the IAAFT samples indicate that alternative stationary processes can yield results similar to NHP. On the other hand, IAAFT CIs of are much wider than those from Poisson-INAR(1) samples, thus confirming that such a model is clearly inappropriate to describe . Therefore, the class of dependent stationary models and the assumption of stationarity cannot be discarded based on the poor performance of a single misspecified stationary model that does not even reproduce the marginal distribution of the observed data.
The foregoing analysis is complemented by an additional one focusing on sub-samples. Following , we sampled values recorded every 4 years, thus extracting four sub-series of size 25 and therefore removing the effect of potential autocorrelation at lags from 1 to 3 years. For each sub-sample, we estimated the linear trend slopes (with ) and plotted these against the slope estimated based on the full series.
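The sub-sampling construction is straightforward to express in code; the sketch below estimates the least-squares slope of the full series and of the four 4-year-lagged sub-series:

```python
import numpy as np

def subsample_slopes(y):
    """Least-squares slope of the full series, together with the slopes
    of the four 4-year-lagged sub-series y[i::4] (i = 0..3), mirroring
    the sub-sampling analysis described in the text."""
    y = np.asarray(y, dtype=float)
    t = np.arange(y.size, dtype=float)
    full_slope = np.polyfit(t, y, 1)[0]
    sub_slopes = [np.polyfit(t[i::4], y[i::4], 1)[0] for i in range(4)]
    return full_slope, sub_slopes
```

On a noise-free linear series all five slopes coincide; on real records the scatter of the four sub-sample slopes around the full-series slope is what the diagnostic diagrams display.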
Figure a and b reproduce Fig. 5 in . Figure a shows the scatterplot of the pairs of the observed , while Fig. b displays the pairs for synthetic series from the NHP model. The similarity of the patterns led to the conclusion that serial dependence does not play any significant role and that the observed trends are consistent with NHP behavior. However, the additional Fig. c shows that the pairs from stationary and temporally correlated IAAFT samples also exhibit behavior similar to that of the observed , thus contradicting the foregoing conclusion.
Figure 6
(a) Scatterplot of the pairs , where is the slope of the linear trend estimated based on the observed time series for the 95 % threshold and 100 years, while (with ) is the slope estimated based on four sub-samples of lengths equal to 25 years. (b, c) Similar to (a) but for time series simulated from NHP and IAAFT, respectively.
[Figure omitted. See PDF]
Analogously to the analysis of pairs , the analysis of the pairs seems to be reasonable at first glance. However, it suffers from similar inconsistencies:
-
The 4-year-lagged sub-samples are supposed to be approximately independent based on the belief that the correlation is weak and that the value of ACF terms is generally low. However, this assumption misses the fact that, under dependence, the ACF estimates based on finite samples are negatively biased and need to be adjusted according to a parametric model of choice. In this respect, Fig. and the corresponding Fig. 4 in are not even consistent because all panels show the values obtained by standard estimators that are only valid under independence, whereas the panels referring to Poisson-INAR(1) and IAAFT should show values adjusted for estimation bias. This further confirms that assumptions like (in)dependence and (non)stationarity influence not only the model parameterization (e.g., Poisson with or without linear trend) but also the entire inference procedure, including diagnostic plots and their interpretation.
-
The performance of a specific independent nonstationary model (NHP) is incorrectly considered to be informative about the performance of an entire alternative class of dependent stationary models, while NHP is not even defined under those alternative assumptions. The ability of NHP to reproduce the patterns of cannot exclude the existence of equally good or better models based on different assumptions. The only way to understand if a class of models (and the underlying assumptions) is suitable for a given data set is to use credible members of such a class. This conceptual mistake is the same one affecting statistical tests, where the rejection of is often misinterpreted and leads us to uncritically embrace the alternative hypothesis , while rejection can be due to unknown factors that are not included in either or (see discussion in Sect. ).
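The negative bias invoked in the first point can be verified with a short Monte Carlo experiment: for short samples of a dependent process, the standard lag-1 ACF estimator systematically underestimates the true autocorrelation. The sketch below uses an AR(1) process with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)

def ar1(n, phi):
    """Stationary Gaussian AR(1) sample with lag-1 autocorrelation phi."""
    x = np.empty(n)
    x[0] = rng.normal() / np.sqrt(1.0 - phi**2)   # stationary start
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

def lag1_acf(x):
    """Standard lag-1 ACF estimator; negatively biased for short
    samples of dependent processes."""
    d = x - x.mean()
    return np.dot(d[:-1], d[1:]) / np.dot(d, d)

phi, n = 0.5, 50
mean_estimate = np.mean([lag1_acf(ar1(n, phi)) for _ in range(4000)])
```

The Monte Carlo mean falls clearly below the true value φ = 0.5, in line with the approximate bias −(1 + 4φ)/n, which is why the plotted values need a model-based adjustment before observed and simulated samples can be compared on equal terms.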
After analyzing marginal distributions and temporal dependence, we study spatio-temporal fluctuations of , comparing model-based and test-based methods. We expand the analysis of (number of annual OT values at each gauging location) by considering the number of daily and annual OT events aggregated over the five regions described in Sect. and shown in Fig. . This allows us to check if the assumption of dependence and the corresponding models provide a reasonable description of OT frequency over a range of spatio-temporal scales.
5.3.1 Trend analysis under temporal dependence
We analyze the presence of trends in time series recorded at the 1106 stations selected from the GHCN gauge network. For the sake of comparison with test-based results, trends are investigated by applying the same MK and PR tests. However, the distributions of test statistics (and therefore of critical values) are estimated from 10 000 IAAFT samples to properly account for the over-dispersion of the marginal distributions and the temporal correlation. Moreover, tests are firstly performed at the local 5 % significance level without applying FDR to check if the empirical rejection rate is close to the nominal one, as expected under correct model specification.
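The surrogate-based testing procedure can be sketched compactly. For brevity, the sketch below builds the null distribution of the Mann–Kendall statistic from plain phase-randomized (FT) surrogates, a simpler stand-in for the IAAFT scheme used in the text; both preserve the power spectrum, and hence the ACF, under stationarity.

```python
import numpy as np

def mk_stat(x):
    """Mann-Kendall S statistic: sum of signs over all pairs i < j."""
    x = np.asarray(x, dtype=float)
    diff = np.sign(x[None, :] - x[:, None])
    return np.triu(diff, k=1).sum()

def ft_surrogate(x, rng):
    """Phase-randomized surrogate: preserves the amplitude spectrum
    (hence the ACF) under stationarity; a simpler stand-in for IAAFT."""
    amp = np.abs(np.fft.rfft(x))
    phases = rng.uniform(0.0, 2.0 * np.pi, amp.size)
    phases[0] = 0.0                         # keep the series mean
    return np.fft.irfft(amp * np.exp(1j * phases), n=len(x))

def mk_pvalue(x, n_surr=200, seed=0):
    """Two-sided MK p value with critical values taken from stationary
    dependent surrogates rather than from the independence formula."""
    rng = np.random.default_rng(seed)
    s_obs = abs(mk_stat(x))
    s_null = [abs(mk_stat(ft_surrogate(x, rng))) for _ in range(n_surr)]
    return (1 + sum(s >= s_obs for s in s_null)) / (n_surr + 1)
```

Because the surrogates inherit the autocorrelation of the data, apparent trends generated by dependence alone inflate the null distribution of S and are no longer automatically flagged as significant.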
Figures and show the maps of trend test results for the 95 % threshold; the 100-year sample size; and the three regions of North America, Eurasia, and Australia, with and without FDR, respectively (results for the other combinations of thresholds and sample sizes with and without FDR are reported in Figs. S2–S7 in the Supplement). Results in Figs. and are different from those reported by in their analogous Fig. 9. To better understand such differences, we examine the rejection rates of both MK and PR tests for the four combinations of two thresholds (95 % and 99.5 %) and two sample sizes (50 and 100 years) and four different cases: () critical values of test statistics obtained by IAAFT without bias correction of the autocorrelation and without FDR, () critical values obtained by IAAFT with bias correction and without FDR, () critical values obtained by IAAFT without bias correction and with FDR, and () critical values obtained by IAAFT with bias correction and FDR (Table ). Case highlights the effect of adjusting for the bias of classical ACF estimators under the assumption of temporal dependence, while cases and show the effects of spatial correlation (albeit indirectly via FDR).
Figure 7
Maps of statistically significant trends at the GHCN gauges of the three regions of North America (a), Eurasia (b), and Australia (c). Results refer to MK and PR tests applied to time series for the 100-year sample size and the 95 % threshold without FDR. Statistical tests are performed at the local 5 % significance level without applying FDR. The distributions of test statistics (and therefore critical values) are estimated from 10 000 IAAFT samples. Gray circles denote lack of rejection by both tests. Results for the other combinations of thresholds and sample sizes are reported in Figs. S2–S4 in the Supplement.
[Figure omitted. See PDF]
Figure 8
Similar to Fig. but with FDR. Results for the other combinations of thresholds and sample sizes are reported in Figs. S5–S7 in the Supplement.
[Figure omitted. See PDF]
Table 1. Rejection rates of PR and MK tests for the four combinations of two thresholds (95 % and 99.5 %) and two sample sizes (50 and 100 years) and different treatments of spatio-temporal dependence (cases , , , and ; the four rightmost columns report the rejection rate in % for each case).

| Test | Threshold (%) | Sample size (years) | Case | Case | Case | Case |
|---|---|---|---|---|---|---|
| PR | 95 | 100 | 22.3 | 13.4 | 3.0 | 0 |
| MK | 95 | 100 | 20.3 | 12.6 | 0 | 0 |
| PR and MK | 95 | 100 | 18.2 | 9.9 | 4.6 | 0 |
| PR | 95 | 50 | 6 | 5.8 | 0 | 0 |
| MK | 95 | 50 | 6 | 5.6 | 0 | 0 |
| PR and MK | 95 | 50 | 3.9 | 3.9 | 0 | 0 |
| PR | 99.5 | 100 | 13.7 | 12.6 | 0 | 0 |
| MK | 99.5 | 100 | 14.2 | 11.9 | 0 | 0 |
| PR and MK | 99.5 | 100 | 11 | 9.1 | 0 | 0 |
| PR | 99.5 | 50 | 5.5 | 5.4 | 0 | 0 |
| MK | 99.5 | 50 | 5.6 | 5.3 | 0 | 0 |
| PR and MK | 99.5 | 50 | 3.7 | 3.6 | 0 | 0 |
Focusing on case , for 50-year time series, local rejection rates are always close to the nominal 5 %, as expected. For 100-year time series, local rejection rates corresponding to the 95 % and 99.5 % thresholds reach a maximum of 22 % and 14 %, respectively. These values seem higher than expected. However, after correcting ACF bias (case ), the maximum rejection rate for the 95 % threshold drops to 13 %, whereas the rejection rates for the 99.5 % threshold stay almost unchanged. This is due to the higher (lower) autocorrelation of OT values corresponding to lower (higher) thresholds and, therefore, stronger (weaker) bias correction. Overall, accounting for ACF bias results in rejection rates ranging between 9 % and 13 %. After considering (indirectly) the effects of spatial correlation via FDR (cases and ), the rejection rate drops to zero in all cases if we correct ACF bias (case ), meaning that all tests are globally not significant at . If we do not adjust ACF bias (case ), a small percentage of tests indicate global significance at for a sample size of 100 and a threshold equal to 95 %. This is expected as the correction of ACF bias is more effective for larger sample sizes and lower thresholds. In fact, decreasing thresholds generally correspond to an increasing temporal correlation of samples, and larger samples allow a better quantification of the properties of a correlated process.
Some simple diagnostic plots can provide a clearer picture. Figure shows the scatterplots of the pairs , with the rejections highlighted by different markers. Figure is analogous to Fig. 10 in but using IAAFT samples and considering the cases , , , and . It shows how the number of rejections decreases as the effects of temporal and spatial correlation are progressively compounded. Focusing on case , rejections tend to occur for higher values of conditioned to the value of , but there is no systematic rejection for all exceeding a specified value, as for the case . On the contrary, the pairs marked as rejections overlap with the pairs marked as no rejection, indicating that we can have both rejections and no rejections for time series with similar values of and .
Figure 9
Scatterplots of the pairs , with the rejections highlighted by different markers. Markers refer to rejections of the MK test, PR test, or both for time series corresponding to the four combinations of two thresholds (95 % and 99.5 %) and two sample sizes (50 and 100 years) and the three cases , , , and . Gray circles denote lack of rejection by both tests.
[Figure omitted. See PDF]
These results disagree with those reported by , who found that the hypothesis of a significant trend is always rejected for events per year and concluded that “the occurrence of the different cases is controlled by , while is not influential, thus providing additional evidence on the limited effect of autocorrelation on trend detection”. However, those rejections result from MK and PR tests performed under the assumption of temporal independence (case ; in ). In this case, rejections are necessarily independent of due to the implicit model under which the tests are performed. As recalled in Sect. , results should be interpreted in light of the underlying statistical model and not vice versa.
As mentioned throughout this study, a suitable diagnostic plot might be more informative than just reporting the number and/or rate of rejections. Figure displays the FDR diagrams of the sorted values versus their ranks for MK and PR tests. All values are above the FDR reference line. Independently of the geographic region one focuses on, all tests are not significant at , and cannot be rejected at the global level. Moreover, FDR diagrams provide additional information. In fact, when is consistent with the underlying (implicit or explicit) model and corresponding assumptions, values are expected to be uniformly distributed – that is, aligned along a straight line connecting the origin and the point with abscissa equal to the maximum rank and ordinate equal to the maximum value, where the latter is equal to 1 (0.5) for one-sided (two-sided) tests
Figure 10
Illustration of FDR criterion with (diagonal gray line). Plotted points are the sorted values resulting from the application of MK and PR tests to data recorded at each location. Results are reported for the five regions described in Sect. . The distributions of test statistics (and therefore critical values) are estimated from 10 000 IAAFT samples. Points below the diagonal lines represent significant results (i.e., rejections of ) according to FDR control levels.
[Figure omitted. See PDF]
5.3.2 Space beyond time: the non-negligible effects of spatial dependence
Figure (and Figs. S2–S4 in the Supplement) shows that the locally significant trends tend to cluster in geographic areas where such trends exhibit the same sign (e.g., southwestern Australia, North America, and European coastal areas around the North Sea). Often, this behavior is interpreted as further evidence of trend existence. However, this interpretation neglects the fact that the spatial clustering is also an inherent expression of spatial dependence in the same way temporal clustering is the natural expression of temporal dependence
For example, state that the detection of significant trends with similar signs or magnitudes over spatially coherent areas “is also supported by the physical argument that extreme is often controlled by synoptic processes (Barlow et al., 2019), and that their occurrence is changing in time (Zhang & Villarini, 2019)”. However, while a similar evolution of the occurrence of extreme and synoptic systems is due to their physical relationships, statistical tests for trends cannot provide information about the nature of the temporal evolution of such processes. Indeed, as shown in Sect. , the outcome of statistical tests depends on the underlying assumptions. Therefore, the jointly evolving fluctuations of both processes (extreme and synoptic systems) can be identified as not significant or significant based on the assumptions used to perform the statistical tests. Loosely speaking, if we observe an increasing trend in the occurrence of synoptic systems, a similar behavior is likely to emerge in local records observed over the area where the synoptic processes occur, as the former drives the latter. Therefore, what we actually need is to identify the physical mechanism causing trends in the synoptic systems, as trends in extreme are just a consequence. In this respect, performing massive statistical testing is rather uninformative as it does not matter whether the observed fluctuations are statistically significant or not. Detecting trends in multiple local processes that are known to react to fluctuations in synoptic generating processes does not add evidence and just reflects information redundancy due to their common causal factor.
To support the foregoing statements with quantitative analysis, we checked the consistency of the spatio-temporal behavior of observed OT occurrences with the assumption of spatio-temporal dependence. Following the model-based approach, we avoid statistical tests for trend detection and rely on theoretical reasoning to formulate a coherent model, thus checking the agreement with observations by means of simple but effective diagnostic plots.
We firstly check the role of the possible spatial dependence of OT occurrences, focusing on the distribution of the number of daily OT occurrences over the five regions introduced in Sect. (i.e., the world, North America, Eurasia, northwestern Europe, and Australia). A daily timescale is selected to isolate the effect of spatial dependence from that of temporal dependence as it is the finest timescale, and the counting procedure does not involve any aggregation over time. The occurrence of OT events over locations can be seen as the outcome of Bernoulli trials. Under dependence, the distribution of can be described by a distribution, where the parameter controlling over-dispersion can be expressed as the average of the off-diagonal terms of the lag-0 correlation matrix of the process describing the daily occurrence or non-occurrence of OT events at each spatial location (see Sect. S4 in the Supplement). In other words, if the spatial correlation is sufficient to describe the spatial clustering, we expect that the distribution with estimated as the average cross-correlation between binary time series of daily OT occurrences faithfully matches the empirical distribution of over any region.
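This construction is easy to verify numerically. The sketch below simulates a spatially correlated binary field through a shared large-scale driver (all parameter values are hypothetical, chosen for illustration), estimates the over-dispersion parameter as the average off-diagonal lag-0 correlation, and compares the resulting beta-binomial variance with the binomial one:

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical setting: m sites, daily event probability p, pairwise
# correlation rho induced by a shared large-scale driver.
m, p, rho, days = 200, 0.05, 0.04, 20000
q = np.sqrt(rho)                  # mixing weight; pairwise corr = q**2

shared = rng.random(days) < p                     # common driver events
local = rng.random((days, m)) < p                 # site-specific events
use_shared = rng.random((days, m)) < q
X = np.where(use_shared, shared[:, None], local)  # correlated binary field

S = X.sum(axis=1)                 # daily number of events over the region

# over-dispersion parameter: average off-diagonal lag-0 correlation
C = np.corrcoef(X.T.astype(float))
rho_hat = (C.sum() - m) / (m * (m - 1))

var_binomial = m * p * (1 - p)                    # spatial independence
var_bb = var_binomial * (1 + (m - 1) * rho_hat)   # beta-binomial variance
```

Even a modest average pairwise correlation multiplies the variance of the regional count by a factor of roughly 1 + (m − 1)ρ, which is the large impact of spatial dependence visible in the comparison between the binomial and beta-binomial distributions.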
Figure shows that the distribution reproduces accurately the above-mentioned empirical distributions for any threshold and region. Note that North America, Eurasia, Australia, and northwestern Europe are nested regions of the world, and northwestern Europe is also a nested region of Eurasia. Therefore, the remarkable fit of indicates that the spatial correlation is sufficient to describe the spatial clustering both globally and locally. In other words, the simultaneous occurrence of daily OT events in North America or northwestern Europe, for instance, is consistent with a stationary spatially correlated process. Figure also reports the binomial distribution that would be valid under independence, thus highlighting the huge (but too often neglected) impact of spatial dependence on the distribution of
Figure 11
ECDFs of number of OT events (for the 95 % and 99.5 % thresholds) occurring at a daily timescale over different regions, along with binomial and CDFs.
[Figure omitted. See PDF]
Since the aim of tests for trends should be the detection of “deterministic” temporal patterns, we checked the possible temporal evolution of the distribution of over the five regions. This information is summarized in Fig. in terms of boxplots of aggregated at a decadal scale to better visualize temporal patterns along the century. Figure also shows the 95 % prediction intervals from binomial and distributions reported in Fig. . Of course, these prediction intervals are constant as the binomial and distributions are unique under the assumption of stationarity. The empirical distributions of do not show any evident temporal evolution throughout the 10 decades, and more importantly, any possible fluctuation is well within the range of values allowed by the distribution. The comparison of the 95 % prediction intervals from binomial and models further highlights the huge effect of spatial dependence, which can yield prediction intervals from up to times wider than those corresponding to spatial independence.
Figure 12
Boxplots summarizing the decadal distributions of the number of OT events (for the 95 % and 99.5 % thresholds) occurring at a daily timescale over different regions, along with the 95 % prediction intervals corresponding to binomial and distributions.
[Figure omitted. See PDF]
5.3.3 Space and time: the non-negligible effects of spatio-temporal dependence
While the analysis at a daily scale allowed us to focus on spatial dependence, here we focus on the annual number of OT events over multiple locations. Studying the spatial clustering of such data implies aggregation in space and time. In other words, the occurrence of OT events can be thought of as a set of Bernoulli trials over locations (i.e., the number of stations in each geographic region) and time steps (i.e., the 365 d in a 1-year time interval), and we are interested in the distribution of resulting from Bernoulli trials. This case is analogous to that concerning daily OT occurrences. The distribution is still a theoretically consistent model for , and its over-dispersion parameter can be expressed as the average of all lagged auto- and cross-correlation values of the generating binary process up to time lag (see Sect. S4 in the Supplement). As with the case of spatial dependence, we use probability plots and boxplots to assess the validity of the distribution and, therefore, its underlying assumption of spatio-temporal dependence.
Figure 13 shows that the beta-binomial model faithfully describes the ECDFs of the annual counts for any threshold and region. This means that the local spatio-temporal correlation is sufficient to describe the differences among all regions and sub-regions without introducing ad hoc local models involving, for instance, physically unjustified linear trends or generic links with local exogenous factors. We stress again that the over-dispersion parameter is not estimated from the 100 values of the annual count in each region; instead, it comes from the spatio-temporal correlation values of the generating binary process. Therefore, the goodness of fit of the beta-binomial distribution is not related to the minimization of some distance metric for 100-size samples but depends on the agreement of the observed binary time series with the hypothesized stationary spatio-temporal stochastic process.
Figure 13
ECDFs of the number of OT events (for the 95 % and 99.5 % thresholds) occurring at an annual timescale over different regions, along with the binomial and beta-binomial distributions.
[Figure omitted. See PDF]
For any threshold and region, Fig. 14 confirms that the temporal fluctuations of the distribution of annual counts are well within the range of values expected from a stationary stochastic process characterized by the observed spatio-temporal correlation structure. In this case, the 95 % prediction intervals from beta-binomial models under spatio-temporal dependence are several times wider than those yielded by the binomial distribution under spatio-temporal independence. Of course, an increasing pattern along the decades is evident in the regions of the Northern Hemisphere. However, accounting for spatio-temporal correlation dramatically changes its interpretation. Such fluctuations are obviously inconsistent with the assumption of independence and, therefore, the binomial model. This explains the high rejection rate of the trend tests performed under independence. On the other hand, low-frequency fluctuations evolving over wide spatial scales and over timescales comparable to or longer than the observation period are the expected behavior of spatio-temporally dependent processes. Therefore, we should ask ourselves whether such fluctuations look unexpected or surprising because they are actually unusual or just because humans tend to systematically underestimate the uncertainty characterizing the surrounding environment, thus looking at hydro-climatic processes from a too-anthropocentric point of view, one that is inherently uncertainty-averse or behaviorally biased towards known outcomes.
Figure 14
Boxplots summarizing the decadal distributions of the number of OT events (for the 95 % and 99.5 % thresholds) occurring at an annual timescale over different regions, along with the 95 % prediction intervals corresponding to the binomial and beta-binomial distributions.
[Figure omitted. See PDF]
6 General discussion and concluding remarks
6.1 Statistical tests for trend detection: unfit for purpose!
Disagreements between model-based and test-based methods are mainly related to inherent problems affecting statistical hypothesis tests. These are statistical methods developed for the evaluation of differences in repeatable experiments that “have been misused to create an illusion of a scientific analysis of unrepeatable hydrologic events”. Logical, conceptual, and practical inconsistencies in statistical tests have been widely discussed in both the theoretical and applied literature.
One of the key drawbacks of statistical tests is the error of the “transposed conditional” (also known as the “converse inequality argument” or “inverse-probability problem”). This consists of confusing conditional and conditioning events: we are interested in the probability of the null hypothesis H0 given the observational evidence (data), yet we end up calculating the probability of the observational data when H0 is assumed to be true. This is like confusing the probability that a man is a UK citizen given that he is the King of the UK with the probability that a man is the King of the UK given that he is a UK citizen.
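The transposed-conditional error can be made concrete with a small Bayes' theorem computation. The numbers below are illustrative assumptions, not from the paper: they show that a small P(data | H0), i.e., a "significant" p-value, does not imply a small P(H0 | data).

```python
# Sketch: a small P(data | H0) does not imply a small P(H0 | data).
# All numbers are illustrative assumptions.
p_data_given_h0 = 0.04   # "significant" by the usual 5 % convention
p_data_given_h1 = 0.10   # the data are not much more likely under H1 either
prior_h0 = 0.5           # equal prior plausibility for H0 and H1

# Bayes' theorem: P(H0 | data)
posterior_h0 = (p_data_given_h0 * prior_h0) / (
    p_data_given_h0 * prior_h0 + p_data_given_h1 * (1 - prior_h0)
)
print(round(posterior_h0, 3))  # ≈ 0.286: H0 remains far from implausible
```

Despite rejection at the 5 % level, the null hypothesis still carries a posterior probability of roughly 29 % under these assumptions, which is exactly the gap between the two conditionals that the text describes.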
In the context of statistical tests for trend detection applied to hydro-climatic data, the rejection of the null hypothesis H0 (e.g., no trend) does not provide information about the likelihood of H0 given the observations. Rejection does not allow any statement about possible deterministic trends because deterministic trends are not a property of the model assumed to perform the test. In other words, rejection can be due to something that is unknown and different from deterministic trends. Similarly, failure to reject might be related to the violation of implicit assumptions of the model, thus introducing spurious effects due to exogenous factors. One of these factors is spatio-temporal dependence, whose effects on statistical inference have been widely studied in the literature.
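The inflation of rejection rates under temporal dependence can be demonstrated with a short simulation. This is a generic sketch, not the paper's procedure: it applies an ordinary least-squares trend test (whose standard error assumes independent errors) to stationary AR(1) series with no deterministic trend, using assumed parameter values.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_sim, phi = 100, 500, 0.8  # assumed illustrative values
t_crit = 1.98                  # approximate two-sided 5 % t critical value
t = np.arange(n)

rejections = 0
for _ in range(n_sim):
    # Stationary AR(1) series: no deterministic trend by construction
    e = rng.standard_normal(n)
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = phi * x[i - 1] + e[i]
    # OLS slope and its classical standard error (valid only under independence)
    slope, intercept = np.polyfit(t, x, 1)
    resid = x - (slope * t + intercept)
    se = np.sqrt(resid.var(ddof=2) / ((t - t.mean()) ** 2).sum())
    if abs(slope / se) > t_crit:
        rejections += 1

rejection_rate = rejections / n_sim
print(rejection_rate)  # far above the nominal 0.05 level
```

Although every simulated series is stationary, the test rejects "no trend" at a rate several times the nominal 5 %, which is the mechanism behind the spurious detections discussed above.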
6.2 Models, tests, and their interpretation
The aim of most of the literature applying statistical tests for trend detection in hydro-climatic processes is to find the answer to a question that can generally be summarized as “are these processes stationary or nonstationary?”. However, such a question is scientifically ill-posed as natural processes cannot be either stationary or nonstationary. Only mathematical models used to describe physical processes can be one or the other.
It can be argued that “statistical trend testing attempts to assess whether the natural process has manifested in a stationary or nonstationary fashion during the period of observation to ultimately support decision-making in the future”. However, this type of statement confuses sampling fluctuations, which may or may not look monotonic, with a population property such as stationarity. Statistical trend tests can only infer the latter, as theoretical (population) properties and assumptions are the only things any statistical method addresses. Like every statistical method, statistical tests for trend detection make inferences about “population stationarity” and not about sampling fluctuations, which can result from a variety of stationary and nonstationary processes. These tests attempt to establish what kind of population behavior is compatible with the observed sampling fluctuations. Otherwise, we would not need any test to state that an observed sample shows a given (monotonic or non-monotonic) temporal pattern, as we would just need to look at the diagrams of the time series. We infer the population properties because they allow us to assume a model and to make out-of-sample predictions. We argue that the vague use of the term “stationarity”, overlooking its formal definition and its consequences, is one of the main reasons for the widespread misuse and misinterpretation of the output of statistical tests.
The comparison of test-based and model-based approaches discussed in this study attempts to clarify the foregoing concepts. For the same physical process (i.e., the OT occurrences), we showed two options. On the one side, we can choose to model the number of OT occurrences by nonstationary (nonhomogeneous) Poisson distributions (NHP). In this way, (i) we neglect the fact that Poisson distributions are theoretically unsuitable to describe the counts and, therefore, do not reproduce the observed marginal distribution of the process, and (ii) we assume that the rate of occurrence in NHP models changes in time according to linear (or nonlinear) laws that have no physical justification. The aim of these types of regression models is exactly to follow sampling fluctuations, and they hardly ever provide information about the underlying population properties. This also explains why extrapolation with these types of models is always deprecated in textbooks and introductory courses in statistics and why, when it is done, it might yield paradoxical results. On the other side, we can attempt to preliminarily understand the general theoretical properties of spatio-temporal OT processes, look for appropriate models reproducing such expected population properties, and check whether these models are general enough to reproduce the observations at various spatio-temporal scales. Using this approach, we ended up with the conclusion that the spatio-temporal correlation structure of a stationary stochastic process provides a good description of the behavior of the observed OT frequencies at various spatio-temporal scales. Thus, the actual question is not about the (non)stationarity of natural processes or the “(non)stationary behavior of observed samples” but rather which kind of model we deem more suitable in terms of generality, reliability, and parsimony.
Contrary to what is often repeated in the literature, accurate statistical trend analyses of observed and modeled time series are not key to validating hypotheses regarding the underlying physical mechanisms, and they do not improve our ability to predict the magnitude of changes. On the contrary, the foregoing discussion shows that statistical tests for trend detection might generate confusion, potentially concealing model misspecification.
6.3 Confirmatory versus disconfirmatory empiricism
Why do our results contrast with those reported in most of the existing literature on trend detection? The reason is that most of this literature resorts to methods based on the same unrealistic assumption of independence and on correspondingly trivial models such as those discussed in this study. On the other hand, when dependence is accounted for, its true consequences for the entire inference procedure are commonly underestimated, partially missed, or neglected.
Why does most of the literature on trend detection rely on the same methods? There are several reasons. We argue that the main one is a too-superficial approach to probability and statistics, along with insufficient exposure to the epistemological principles of science. Using similar methods based on the same assumptions always gives similar results. However, “a million observations of white swans do not confirm the non-existence of black swans”, and “a million confirmatory observations count less than a single disconfirmatory one… What is called `evidence based' science, unless rigorously disconfirmatory, is usually interpolative, evidence-free, and unscientific”. Since most of the literature on trend detection is just an iterative application of test-based approaches and, eventually, statistical tests under the assumption of independence or ill-defined dependence, one should wonder whether the general agreement is related to the common misinterpretation of the output of the same inappropriate methodologies rather than to physical properties. In this study, we offered an alternative point of view, which is nothing but the specialization, for data analysis, of the scientific method used for centuries and seemingly forgotten in recent decades in some research areas. Obviously, according to the scientific method, the content of this study should also be taken critically, and interested readers should independently assess which approach (test-based, model-based, or something else) looks more general, reliable, parsimonious, and, eventually, consistent with the epistemological principles of scientific inquiry.
Data availability
Data are freely available from the Global Historical Climatology Network repository.
The supplement related to this article is available online at:
Competing interests
The author has declared that there are no competing interests.
Disclaimer
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.
Acknowledgements
The author acknowledges the support from the Willis Research Network. The author thanks the handling editor, Nadav Peleg (University of Lausanne), and the eponymous reviewers, Panayiotis Dimitriadis (National Technical University of Athens), Demetris Koutsoyiannis (National Technical University of Athens), and Giuseppe Mascaro (Arizona State University), for their critical remarks that helped substantially to improve the content and presentation of the original draft of the paper. The author also thanks Hans von Storch for his feedback on a previous version of this paper. The analyses were performed in R.
Review statement
This paper was edited by Nadav Peleg and reviewed by Giuseppe Mascaro, Demetris Koutsoyiannis, and Panayiotis Dimitriadis.
© 2024. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Statistics is often misused in hydro-climatology, thus causing research to get stuck on unscientific concepts that hinder scientific advances. In particular, neglecting the scientific rationale of statistical inference results in logical and operational fallacies that prevent the discernment of facts, assumptions, and models, thus leading to systematic misinterpretations of the output of data analysis. This study discusses how epistemological principles are not just philosophical concepts but also have very practical effects. To this aim, we focus on the iterated underestimation and misinterpretation of the role of spatio-temporal dependence in statistical analysis of hydro-climatic processes by analyzing the occurrence process of extreme precipitation.