1 Introduction
1.1 The big one: dream or nightmare for the forecaster?
In many countries, operational flood forecasting services (FFS) issue forecasts routinely throughout the year and during rare or critical events. End users are mostly concerned with the largest and most damaging floods, when critical decisions have to be made. For such events, operational flood forecasters must be prepared to deal with extrapolation, i.e. to work on events of a magnitude that they and their models have seldom or never encountered before.
The relevance of simulation models and their calibration in evolving conditions, such as contrasted climate conditions and climate change, has been studied by several authors: some explored the transferability of hydrological model parameters from one period to another and assessed the uncertainty associated with this parameterisation, while others proposed a generalisation of the differential split-sample test. In spite of its importance in operational contexts, only a few studies have addressed the extrapolation issue for flow forecasting, to the best of our knowledge, with the notable exception of data-driven approaches.
Addressing the extrapolation issue involves a number of methodological difficulties. Some are specific to the data used for hydrological modelling, such as rating curve reliability. Other well-known issues are related to the calibration process: are the parameters, which are calibrated on a limited set of data, representative of, or at least somewhat adapted to, other contexts? A robust modelling approach for operational flood forecasting, i.e. a method able to provide relevant forecasts in conditions not met during the calibration phase, requires paying special attention to the behaviour of hydrological models and to the assessment of predictive uncertainty in an extrapolation context.
1.2 Obtaining reliable forecasts remains a challenging task
Even if significant progress has been made and implemented in operational flood forecasting systems, obtaining reliable forecasts remains a challenging task.
The uncertainty associated with operational forecasts is most often described by a predictive uncertainty distribution. Assessing a reliable predictive uncertainty distribution is challenging because hydrological forecasts yield residuals that show heteroscedasticity (i.e. an increase in the uncertainty variance with discharge), time autocorrelation, skewness, etc. Several studies have documented these characteristics of forecast errors.
Various approaches have been developed to assess the uncertainty in hydrological predictions.
1.2.1 Modelling each source of uncertainty
A first approach aims to model each source of uncertainty separately and to propagate these uncertainties through the modelling chain.
In practice, it is necessary to consider the meteorological forecast uncertainty to issue hydrological forecasts. Ensemble approaches intend to account for this source of uncertainty and are increasingly popular in the research and operational forecasting communities. An increasing number of hydrological ensemble forecasting systems are in operational use and have proved their usefulness, e.g. the European Flood Awareness System (EFAS).
Multi-model approaches can be used to assess modelling uncertainty. While promising, this approach requires the implementation and maintenance of a large number of models, which can be burdensome in operational conditions. Moreover, there is no evidence that such an approach ensures that the heteroscedasticity of the predictive uncertainty distribution is correctly assessed.
In forecasting mode, data assimilation schemes based on statistical modelling are commonly used to reduce and assess the predictive uncertainty. Some algorithms, such as particle filters or the ensemble Kalman filter, provide an assessment of the predictive uncertainty as a direct result of data assimilation (“in the loop”). Some of these approaches can explicitly account for the desired properties of the predictive uncertainty distribution, such as heteroscedasticity, through the likelihood formulation.
1.2.2 Post-processing approaches
Alternatively, numerous post-processors of deterministic or probabilistic models have been developed to account for the uncertainty from sources that are not modelled explicitly. They differ in several aspects.
1.2.3 Combining different approaches
The approaches presented in Sect. 1.2.1 and 1.2.2 are not mutually exclusive. Even when future precipitation is the main source of uncertainty, post-processing is often required to produce reliable hydrological ensembles. Thus, many operational flood forecasting services use post-processing techniques to assess the hydrological modelling uncertainty, while the meteorological uncertainty is taken into account separately. Post-processors are then trained with “perfect” future rainfall (i.e. equal to the observations). Moreover, even for assessing modelling uncertainty, combining the strengths of several methodologies may improve the results.
Note that many of these approaches use a variable transformation to handle the heteroscedasticity and more generally the evolution of the predictive distribution properties with respect to the forecasted discharge. Some have no calibrated parameter, while others encompass a few calibrated parameters, allowing more flexibility in the predictive distribution assessment. More details on commonly used variable transformations are presented in Sect. 2.1.4.
1.3 Scope
In this article, we focus on uncertainty assessment with a post-processing approach based on residuals modelling. While the operational goal is to improve hydrological forecasting, this study does not consider the meteorological forecast uncertainty: it only focuses on the hydrological modelling uncertainty, as these two main sources of uncertainty can be considered independently. The aims of this article are threefold:

- to present a framework aimed at testing hydrological modelling and uncertainty assessment in the extrapolation context;

- to assess the ability and the robustness of a post-processor to provide a reliable predictive uncertainty assessment for large floods when different variable transformations are used;

- to provide guidance for operational flood forecasting system development.
We attempt to answer three questions: (a) can we improve residuals modelling with an adequate variable transformation in an extrapolation context? (b) Do more flexible transformations, such as the log–sinh transformation, help in obtaining a more reliable predictive uncertainty assessment? (c) If the performance decreases when extrapolating, is there any driver that can help operational forecasters to predict this performance loss and question the quality of the forecasts?
Section 2 describes the data, the forecast model, the post-processor and the testing methodology chosen to address these questions. Section 3 presents the results of the numerical experiments that are then discussed in Sect. 4. Finally, a number of conclusions and perspectives are proposed.
2 Data and methods
2.1 Data and forecasting model
2.1.1 Catchments and hydrological data
We used a set of 154 unregulated catchments spread throughout France (Fig. ), covering various hydrological regimes and forecasting contexts, to provide robust answers to our research questions. They represent a large variability of climate, topography and geology in France (Table ), although their hydrological regimes are little or not at all influenced by snow accumulation. Hourly rainfall, potential evapotranspiration (PE) and streamflow data series were available over the 1997–2006 period. PE was estimated using a temperature-based formula. Rainfall and temperature data come from a radar-based reanalysis produced by Météo-France. Discharge data were extracted from the national HYDRO streamflow archive. When a hydrological model is used to issue forecasts, it is often necessary to compare the lead time to a characteristic time of the catchment (Sect. ). For each catchment, the lag time (LT) is estimated as the lag maximising the cross-correlation between the rainfall and discharge time series.
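The lag-time estimation described above can be sketched as a simple search over candidate lags. This is an illustrative reconstruction only (the function name and arguments are ours, not those of the study's toolchain), assuming hourly series held in numpy arrays:

```python
import numpy as np

def lag_time(rainfall, discharge, max_lag=48):
    """Estimate the catchment lag time (in time steps) as the lag that
    maximises the cross-correlation between rainfall and discharge.
    Illustrative sketch; names and the max_lag default are assumptions."""
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        if lag == 0:
            r, q = rainfall, discharge
        else:
            r, q = rainfall[:-lag], discharge[lag:]
        corr = np.corrcoef(r, q)[0, 1]
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# Synthetic check: discharge responds to rainfall 6 time steps later.
rng = np.random.default_rng(0)
rain = rng.gamma(2.0, 1.0, size=2000)
q = np.concatenate([np.zeros(6), rain[:-6]]) + 0.01 * rng.standard_normal(2000)
print(lag_time(rain, q))  # → 6
```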
Table 1
Characteristics of the 154 catchments, computed over the 1997–2006 data series.
Quantiles

| | 0 | 0.05 | 0.25 | 0.50 | 0.75 | 0.95 | 1 |
|---|---|---|---|---|---|---|---|
| Catchment area (km²) | 9 | 27 | 79 | 184 | 399 | 942 | 3260 |
| Average altitude (m a.s.l.) | 64 | 92 | 188 | 376 | 589 | 897 | 1050 |
| Average slope (%) | 2 | 3 | 6 | 9 | 18 | 32 | 39 |
| Lag time (h) | 3 | 5 | 9 | 12 | 19 | 29 | 33 |
| Mean annual rainfall (mm yr⁻¹) | 639 | 727 | 876 | 1003 | 1230 | 1501 | 1841 |
| Mean annual potential evapotranspiration (mm yr⁻¹) | 549 | 549 | 631 | 659 | 700 | 772 | 722 |
| Specific mean annual discharge (mm yr⁻¹) | 53 | 142 | 262 | 394 | 583 | 1114 | 1663 |
| Mean annual discharge (m³ s⁻¹) | 1 | 1 | 1 | 2 | 5 | 17 | 53 |
| Maximum hourly rainfall (mm h⁻¹) | 10 | 12 | 16 | 20 | 26 | 41 | 61 |
| Quantile 0.99 of the hourly discharge (m³ s⁻¹) | 1 | 2 | 6 | 15 | 33 | 115 | 296 |
Figure 1
The set of 154 unregulated catchments used in this study. Average altitude is given in metres above sea level (m a.s.l.).
[Figure omitted. See PDF]
2.1.2 Hydrological model
We used discharge forecasts computed by the GRP rainfall–runoff model. The GRP model is designed for flood forecasting and is currently used by the FFS in France in operational conditions. It is a deterministic lumped storage-type model that uses catchment areal rainfall and PE as inputs. In forecasting mode (Appendix A), the model also assimilates the discharge observations available when issuing a forecast to update the main state variable of the routing function and to update the output discharge. In this study, it is run at an hourly time step, and forecasts are issued for lead times ranging from 1 to 72 h. More details about the GRP model can be found in Appendix A.
Since only the ability of the post-processor to extrapolate the uncertainty quantification is studied here, the model is fed with observed rainfall only (no precipitation forecasts), in order to reduce the impact of the input uncertainty. For the same reason, the model is calibrated in forecasting mode over the 10-year series by minimising the sum of squared errors for a lead time taken as the LT. The results will be presented for four lead times – LT / 2, LT, 2 LT and 3 LT – to cover the different behaviours that can be seen when data assimilation is used to reduce errors in an operational flood forecasting context.
2.1.3 Empirical hydrological uncertainty processor (EHUP)
We used the empirical hydrological uncertainty processor (EHUP), a data-based, non-parametric approach to estimating the conditional predictive uncertainty distribution. This post-processor has been compared to other post-processors in earlier studies and proved to provide relevant results. It is now used by the operational FFS in France within the operational tool called OTAMIN. The main difference with many other post-processors (such as the MCP, the meta-Gaussian processor, the BJP or the BFS) is that no assumption is made about the shape of the uncertainty distribution, which brings more flexibility to represent the various forecast error characteristics encountered in large-sample modelling. We will discuss the impact of this choice in Sect. .
The basic idea of the EHUP is to estimate empirical quantiles of errors stratified by different flow groups to account for the variation of the forecast error characteristics with forecast magnitude. Since forecast error characteristics also vary with the lead time when data assimilation is used, the EHUP is trained separately for each lead time.
For each lead time separately, the following steps are used:

- Training:

  - The flow groups are obtained by first ordering the forecast–observation pairs according to the forecasted values and then stratifying the pairs into a chosen number of groups (in this study, 20 groups), so that each group contains the same number of pairs.

  - Within each flow group, errors are calculated as the difference between the two values of each forecast–observation pair, and several empirical quantiles (we used 99 percentiles) are calculated in order to characterise the distribution of the error values.

- Application:

  - The predictive uncertainty distribution associated with a given (deterministic) forecasted value is defined by adding this forecasted value to the empirical quantiles that belong to the same flow group as the forecasted value.
Since this study focuses on the extrapolation case, the validation is achieved with deterministic forecasts higher than the highest one used for the calibration. Therefore, only the highest-flow group of the calibration data is used to estimate the error distribution (to be used on the control data). This highest-flow group contains the top 5 % of the pairs of the whole training data, ranked by forecasted values. This threshold is chosen as a compromise between focusing on the highest values and using a sufficiently large number of forecast–observation pairs when estimating the empirical quantiles of errors. In extrapolation, i.e. when the forecasted discharge is higher than the highest value of the training period, the predictive distribution of the error is kept constant, i.e. the same values of the empirical quantiles of errors are used, as illustrated in Fig. .
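The training and application steps above, including the constant error distribution in extrapolation, can be sketched as follows. This is a minimal reconstruction in Python/numpy; the function names and group bookkeeping are ours, not the operational implementation:

```python
import numpy as np

def train_ehup(forecasts, observations, n_groups=20, probs=np.arange(1, 100) / 100):
    """Train the EHUP: stratify forecast-observation pairs into equal-size
    flow groups (ordered by forecast), then compute empirical error
    quantiles within each group (99 percentiles by default)."""
    order = np.argsort(forecasts)
    f, o = forecasts[order], observations[order]
    errors = o - f
    groups = np.array_split(np.arange(f.size), n_groups)
    bounds = [f[g[-1]] for g in groups]              # upper forecast bound of each group
    quantiles = [np.quantile(errors[g], probs) for g in groups]
    return bounds, quantiles

def apply_ehup(forecast, bounds, quantiles):
    """Predictive quantiles for a deterministic forecast: add the error
    quantiles of its flow group. Beyond the training range, the top group
    is reused, i.e. the error distribution is kept constant in extrapolation."""
    i = int(np.searchsorted(bounds, forecast))
    i = min(i, len(quantiles) - 1)                   # extrapolation: keep top group
    return forecast + quantiles[i]
```

With a constant additive error in the training data, every predictive quantile is simply the forecast shifted by that error, both inside and beyond the training range.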
The EHUP can be applied after a preliminary data transformation, and by adding a final step to back-transform the predictive distributions obtained in a transformed space. In previous work, we used the log transformation because it ensures that no negative values are obtained when estimating the predictive uncertainty for low flows . When estimating the predictive uncertainty for high flows, the data transformation has a strong impact in extrapolation, because the variation of the extrapolated predictive distribution, which is constant in the transformed space, is controlled in the real space by the behaviour of the inverse transformation, as explained below.
2.1.4 The different transformation families
Many uncertainty assessment methods mentioned in the Introduction use a variable transformation to handle the heteroscedasticity of the residuals and account for the variation of the predictive distributions with the magnitude of the predicted variable. Here, we briefly recall a number of variable transformations commonly used in hydrological modelling. Let Qo and Qf be the observed and forecasted variables (here, the discharge), respectively, and e = Qo − Qf the residuals. When using a transformation T, we consider the residuals eT = T(Qo) − T(Qf).
Three analytical transformations are often met in hydrological studies: the log, Box–Cox and log–sinh transformations. The log transformation is commonly used:

T(Q) = log(Q + c), (1)

where c is a small positive constant to deal with values close to 0. It can be taken as equal to 0 when focusing on large discharge values. This transformation has no parameter to be calibrated. Applying a statistical model to residuals computed on log-transformed variables may be interpreted as using a corresponding multiplicative error model (e.g. assuming a Gaussian model for the residuals of log-transformed discharges is equivalent to a log-normal model of the multiplicative errors). Therefore, it may be adapted to strongly heteroscedastic behaviours. It has been used successfully to assess hydrological uncertainty.
The Box–Cox transformation is a classic transformation that is quite popular in the hydrological community:

Tλ(Q) = ((Q + c)^λ − 1) / λ for λ ≠ 0, and T0(Q) = log(Q + c), (2)

where λ is a parameter to be calibrated.
The inverse transformation explains the final effect on the uncertainty assessment: the constant probability distribution in the transformed space (provided by the EHUP) results in a distribution in the untransformed space whose evolution depends on the behaviour of the inverse data transformation. Here, the Box–Cox transformation provides different behaviours, depending on its parameter value (λ). It ranges from an affine transformation (equivalent to no transformation; red dashed line) to the log transformation (thick green dashed line). With a single parameterisation, the log–sinh transformation can be equivalent to the log transformation for discharge values much smaller than a characteristic threshold set by its parameters and equivalent to an affine transformation for much larger values (see Appendix B).
[Figure omitted. See PDF]
More recently, the log–sinh transformation has been proposed. It is a two-parameter transformation:

T(Q) = (1/b) log(sinh(a + bQ)), (3)

with parameters a and b. This transformation provides more flexibility. Indeed, for large values of a + bQ it tends to an affine transformation, i.e. it is equivalent to no transformation, while for small values of a + bQ it is equivalent to the log transformation (Fig. ). Thus, with the same parameterisation, it can result in very different behaviours depending on the magnitude of the discharge. Applying no transformation may be intuitively attractive when modelling the distribution of residuals for large discharge values, when the variance is no longer expected to increase (homoscedastic behaviour). It is then particularly attractive when modelling predictive uncertainty in an extrapolation context, in order to avoid an excessively “explosive” assessment of the predictive uncertainty for large discharge values.
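For concreteness, the transformations of Eqs. (1)–(3) and the inverse of the log–sinh case can be written as follows; this is a sketch assuming the standard formulations, with symbol names following the text:

```python
import numpy as np

def log_tr(q, c=0.0):
    """Log transformation, Eq. (1); c is the small positive constant."""
    return np.log(q + c)

def box_cox(q, lam, c=0.0):
    """Box-Cox transformation, Eq. (2); lam -> 0 recovers the log case."""
    return np.log(q + c) if lam == 0 else ((q + c) ** lam - 1.0) / lam

def log_sinh(q, a, b):
    """Log-sinh transformation, Eq. (3): near-log for small a + b*q,
    near-affine (no transformation) for large b*q."""
    return np.log(np.sinh(a + b * q)) / b

def inv_log_sinh(z, a, b):
    """Inverse of Eq. (3), used to back-transform predictive quantiles."""
    return (np.arcsinh(np.exp(b * z)) - a) / b
```

The inverse is what shapes the back-transformed predictive distribution: a constant error distribution in the transformed space widens in real space according to the local slope of the inverse transformation.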
In addition to the log transformation used in previous work, in this study we tested the Box–Cox and the log–sinh transformations to explore more flexible ways to deal with the challenge of extrapolating predictive uncertainty distributions (Fig. ). The impacts of the data transformations used in this study are illustrated in Fig. .
Figure 3
Predictive 0.1 and 0.9 quantiles when assessed with no transformation, the log transformation, the Box–Cox transformation with its parameter equal to 0.5 and the log–sinh transformation with its two parameters equal to 0.1 and 8, on the Ill River at Didenheim (668 km²) for a lead time equal to LT, (a) as a function of the deterministic forecasted discharge and (b) for the flood event of 3 March 2006. The heteroscedasticity strongly differs from one variable transformation to another.
[Figure omitted. See PDF]
Another common variable transformation is the normal quantile transformation.
2.2.1 Testing framework
The EHUP is a non-parametric approach based on the characteristics of the distribution of residuals over a training data set, whereas the Box–Cox and log–sinh transformations are parametric and require a calibration step. Therefore, the methodology adopted for this study is a split-sample test inspired by the differential split-sample scheme and based on three data subsets: a data set for training the EHUP, a data set for calibrating the parameters of the variable transformation and a control data set for evaluating the predictive distributions when extrapolating high flows.
2.2.2 Event selection
To populate the three data subsets with independent data, separate flood events were first selected by an iterative procedure similar to those detailed in previous studies: (1) the maximum forecasted discharge of the whole time series was selected; (2) within a 20 d period before (after) the peak flow, the beginning (end) of the event was placed at the preceding (following) time step closest to the peak at which the streamflow is lower than 20 % (25 %) of the peak flow value; and (3) the event was kept if there were fewer than 10 % missing values, if the discharges at the beginning and end of the event were lower than 66 % of the peak flow value and if the peak value was higher than the median value of the time series. The process was then iterated over the remaining data to select all events. A minimum time lapse of 24 h was enforced between two events, ensuring that consecutive events do not overlap and that the autocorrelation between the time steps of two separate events remains limited.
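A strongly simplified sketch of this iterative selection is given below (steps 1 and 2 only: the 20 d search window, the missing-data and 66 % checks, and the 24 h spacing are omitted for brevity, and all names are ours):

```python
import numpy as np

def select_events(q, frac_start=0.20, frac_end=0.25, min_peak=None):
    """Iterative event selection sketch: repeatedly take the largest
    remaining peak, walk backwards (forwards) to the closest time step
    falling below 20 % (25 %) of the peak, mask the event and iterate
    until no peak exceeds the series median."""
    q = np.asarray(q, float)
    if min_peak is None:
        min_peak = np.median(q)
    available = np.ones(q.size, bool)
    events = []
    while True:
        masked = np.where(available, q, -np.inf)
        p = int(np.argmax(masked))
        if masked[p] <= min_peak:
            break
        start = p
        while start > 0 and available[start - 1] and q[start - 1] > frac_start * q[p]:
            start -= 1
        end = p
        while end < q.size - 1 and available[end + 1] and q[end + 1] > frac_end * q[p]:
            end += 1
        events.append((start, end))
        available[start:end + 1] = False   # remove the event before iterating
    return events
```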
The number of events and their characteristics vary greatly among catchments, as summarised in Table . Note that the events selected for one catchment can slightly differ for the four different lead times considered in this study, because the selection was made using the forecasted discharge and not the observed discharge.
Table 2
Characteristics of the events selected for the lead time (LT) over the 1997–2006 data series.
Quantiles

| | 0 | 0.05 | 0.25 | 0.50 | 0.75 | 0.95 | 1 |
|---|---|---|---|---|---|---|---|
| Total length of events (d) | 663 | 1178 | 1299 | 1505 | 1808 | 2061 | 2762 |
| Number of events in G1 | 28 | 65 | 114 | 141 | 193 | 276 | 434 |
| Number of events in G2 | 8 | 16 | 23 | 32 | 43 | 58 | 170 |
| Number of events in G3 | 7 | 12 | 19 | 26 | 35 | 46 | 162 |
| Specific median value of the peak discharges, G1 (mm h⁻¹) | 0.009 | 0.022 | 0.041 | 0.062 | 0.098 | 0.184 | 0.266 |
| Specific median value of the peak discharges, G2 (mm h⁻¹) | 0.026 | 0.064 | 0.140 | 0.217 | 0.342 | 0.706 | 1.042 |
| Specific median value of the peak discharges, G3 (mm h⁻¹) | 0.050 | 0.131 | 0.295 | 0.439 | 0.701 | 1.896 | 3.071 |
2.2.3 Constitution of the data subsets
The selected events were then gathered into three event sets – G1, G2 and G3 – based on the magnitude of their peaks and the number of useful time steps for each test phase (training of the EHUP post-processor, calibration of the variable transformations and evaluation of the predictive distributions): G1 contains the lowest events, while the highest events are in G3.
The selection of the data subsets was tailored to study the behaviour of the post-processing approach in an extrapolation context. The control data subset had to encompass only time steps with simulated discharge values higher than those met during the training and calibration steps. Similarly, the calibration data subset had to encompass time steps with simulated discharge values higher than those of the training subset.
To achieve these goals, only the time steps within flood events were used. We distinguished four data subsets, as illustrated in Fig. . The subset D1 gathered all the time steps of the events of the G1 group. Then, the set of time steps of the events of the G2 group was split into two subsets: the first (denoted D2-high here) gathered all the time steps with forecasted discharge values higher than the maximum met in D1, and the second (D2-low) was filled with the other time steps. Finally, D3 was similarly filled with all the time steps of the G3 events with forecasted discharge values higher than the maximum met in D2-high.
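The threshold-based construction of the subsets can be sketched as follows (the subset labels are ours, standing for the nested parts of the G1–G3 event time steps):

```python
import numpy as np

def build_subsets(fc_g1, fc_g2, fc_g3):
    """Populate the data subsets from the forecasted-discharge time steps
    of the event groups G1-G3 (simplified sketch). The upper subsets only
    contain time steps exceeding the previous subsets' maxima, which
    guarantees a genuine extrapolation range for calibration (d2_high)
    and control (d3)."""
    d1 = np.asarray(fc_g1, float)
    t1 = d1.max()                                   # extrapolation threshold after D1
    g2 = np.asarray(fc_g2, float)
    d2_high = g2[g2 > t1]
    d2_low = g2[g2 <= t1]
    t2 = d2_high.max() if d2_high.size else t1      # threshold after D2
    g3 = np.asarray(fc_g3, float)
    d3 = g3[g3 > t2]
    return d1, d2_low, d2_high, d3
```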
Figure 4
Illustration of the selection of the data subsets for the Ill River at Didenheim (668 km²). First, the events are selected (grey highlighting). Then, the four data subsets are populated according to the thresholds (horizontal dashed lines). See Sect. for more details.
[Figure omitted. See PDF]
The discharge thresholds used to populate the subsets from the events belonging to the G1, G2 and G3 groups were chosen to ensure a sufficient number of time steps in every subset. We chose to set the minimum number of time steps in D3 and D2-high to 720 as a compromise between having enough data to evaluate the methods and keeping the extrapolation range sufficiently large. We lowered this limit to 500 for the top 5 % pairs of D1, since this subset was only used to build the empirical distribution by estimating the percentiles during the training step, and not for evaluating the quality of the uncertainty assessment.
2.2.4 Calibration and evaluation steps
Since there is only one parameter for the Box–Cox transformation and two parameters for the log–sinh transformation, a simple calibration approach was chosen: the parameter space was explored by testing several parameter sets. For the Box–Cox transformation, 17 values of the parameter λ were tested: from 0 to 1 with a step of 0.1 and with a refined mesh for the sub-intervals 0–0.1 and 0.9–1. For the log–sinh transformation, a grid of 200 parameter pairs was designed for each catchment, based on the maximum value of the forecasted discharge in the calibration subset, as explained in greater detail in Appendix B.
Note that the hydrological model was calibrated over the whole set of data (1997–2006) to make the best use of the data set, since this study focuses on the effect of extrapolation on the predictive uncertainty assessment only.
We used a two-step procedure, as illustrated in Fig. . The first step was to calibrate the parametric transformations. For each transformation and for each parameter set, the empirical residuals were computed over D1 and D2-high: the EHUP was trained on the highest-flow group of D1 (see Sect. ) and the calibration criterion was computed on D2-high. Indeed, the data transformations have almost no impact on the uncertainty estimation by the EHUP for events of the same magnitude as those of the training subset; therefore, the calibration subset has to encompass events of a larger magnitude (D2-high). The parameter set obtaining the best criterion value was selected. In the second step, the EHUP was trained on the highest-flow group of D1, D2-low and D2-high combined, using the parameter set obtained during the calibration step. Then, the predictive uncertainty distribution was evaluated on the control data set D3. Training the EHUP on the highest-flow group of the union of D1, D2-low and D2-high allows the uncertainty assessment to be controlled from small to large degrees of extrapolation (on D3). Indeed, if we had kept the training on D1 only, we would not have been able to test small degrees of extrapolation on independent data for every catchment (see the discussion in Sect. ).
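The calibration step can be sketched as a grid search that trains constant error quantiles on the top 5 % of training pairs and scores reliability on the calibration set. This is a strongly simplified reconstruction, not the study's implementation: the α-index approximation, the Box–Cox-only grid and all names are our assumptions.

```python
import numpy as np

def box_cox(q, lam):
    """Box-Cox transformation; lam = 0 recovers the log transformation."""
    return np.log(q) if lam == 0 else (q ** lam - 1.0) / lam

def alpha_index(pit):
    """Reliability index: 1 minus twice the mean deviation between sorted
    PIT values and the uniform quantiles (a PIT-diagram-based score)."""
    pit = np.sort(np.asarray(pit, float))
    th = np.arange(1, pit.size + 1) / (pit.size + 1)
    return 1.0 - 2.0 * np.mean(np.abs(pit - th))

def calibrate_lambda(f_train, o_train, f_cal, o_cal, grid=np.linspace(0, 1, 11)):
    """Grid search of the Box-Cox parameter: train the error quantiles on
    the top 5 % of training pairs (ranked by forecast), score the alpha
    index on the calibration set, keep the most reliable parameter."""
    best_lam, best_alpha = None, -np.inf
    top = np.argsort(f_train)[-max(1, f_train.size // 20):]   # top 5 % by forecast
    probs = np.arange(1, 100) / 100
    for lam in grid:
        err = box_cox(o_train[top], lam) - box_cox(f_train[top], lam)
        qerr = np.quantile(err, probs)
        # PIT of each calibration residual under the constant error distribution
        zc = box_cox(o_cal, lam) - box_cox(f_cal, lam)
        pit = np.searchsorted(qerr, zc) / (probs.size + 1)
        a = alpha_index(pit)
        if a > best_alpha:
            best_lam, best_alpha = lam, a
    return best_lam, best_alpha
```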
Figure 5
Residuals as a function of the forecasted discharges in the transformed space. The horizontal dashed lines represent the 0.1, 0.25, 0.5, 0.75 and 0.9 quantiles of the residuals computed during the training phase of the EHUP post-processor for the highest-flow group (top 5 % of the pairs of the training data, ranked by forecasted values). The straight horizontal lines represent their use in assessing the predictive uncertainty in extrapolation during (a) the calibration step of the variable transformation parameters and (b) the evaluation step of the predictive uncertainty. During the calibration step (a), many parameter sets are tested, while only the calibrated set of transformation parameters is used during the evaluation step (b). The vertical dashed lines show the beginning of the extrapolation range. The grey dots are the data pairs that are not selected in D1, D2-low, D2-high or D3: they are not used during the calibration step or the evaluation step. Illustration from the Ill River at Didenheim, 668 km². The data used for the EHUP training at each step are sketched by a thick rectangle. In this study, only the top 5 % of the pairs of the training data, ranked by forecasted values, are used to estimate the residual distribution, since we focus on the extrapolation behaviour of the EHUP.
[Figure omitted. See PDF]
2.3 Performance criteria and calibration
2.3.1 Probabilistic evaluation framework
Reliability was first assessed by a visual inspection of the probability integral transform (PIT) diagrams. Since this study was carried out over a large sample of catchments, two standard numerical criteria were used to summarise the results: the α index, which is directly related to the PIT diagram, and the coverage rate of the 80 % predictive intervals (bounded by the 0.1 and 0.9 quantiles of the predictive distributions), used by the French operational FFS (a perfect uncertainty assessment would reach a value of 0.8). The α index is equal to 1 − 2A, where A is the area between the PIT curve and the bisector, and its value ranges from 0 to 1 (perfect reliability).
The overall quality of the probabilistic forecasts was evaluated with the continuous ranked probability score (CRPS):

CRPS = (1/N) Σₜ ∫ [Fₜ(x) − H(x − xₜ)]² dx, (4)

where N is the number of time steps, Fₜ is the predictive cumulative distribution, H the Heaviside function and xₜ the observed value at time step t. We used a skill score (CRPSS) to compare the mean CRPS to a reference, here the mean CRPS obtained from the unconditional climatology, i.e. from the distribution of the observed discharges over the same data subset. Values range from −∞ to 1 (perfect forecasts): a positive value indicates that the modelling performs better than the climatology.
For operational purposes, the sharpness of the probabilistic forecasts was checked by measuring the mean width of the 80 % predictive intervals. A dimensionless relative-sharpness index was obtained by dividing the mean width by the mean runoff:

S = (1/N) Σₜ (qₜ(0.9) − qₜ(0.1)) / Q̄, (5)

where qₜ(0.9) and qₜ(0.1) are the upper and the lower bounds of the 80 % predictive interval for each forecast, respectively, and Q̄ is the mean observed discharge. The sharper the forecasts, the closer this index is to 0.
In addition to the probabilistic criteria presented above, the accuracy of the forecasts was assessed using the Nash–Sutcliffe efficiency (NSE) calculated with the mean values of the predictive distributions (best value: 1).
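Minimal sketches of these criteria are given below, assuming the predictive distribution is summarised by quantiles. The CRPS uses the standard pinball-loss identity rather than a direct integration of Eq. (4), and all names are ours:

```python
import numpy as np

def crps(quantiles, probs, obs):
    """CRPS approximated from predictive quantiles via the identity
    CRPS = 2 * E_tau[pinball loss]; an approximation of Eq. (4)."""
    q, tau = np.asarray(quantiles, float), np.asarray(probs, float)
    return float(np.mean(2.0 * (obs - q) * (tau - (obs < q))))

def coverage_80(q10, q90, obs):
    """Coverage rate of the 80 % predictive interval (target: 0.8)."""
    obs = np.asarray(obs, float)
    return float(np.mean((obs >= q10) & (obs <= q90)))

def relative_sharpness(q10, q90, obs):
    """Mean 80 % interval width divided by mean runoff, Eq. (5)
    (dimensionless; smaller is sharper under this definition)."""
    return float(np.mean(np.asarray(q90) - np.asarray(q10)) / np.mean(obs))

def nse(pred_mean, obs):
    """Nash-Sutcliffe efficiency of the predictive means (best value: 1)."""
    obs = np.asarray(obs, float)
    return 1.0 - np.sum((obs - pred_mean) ** 2) / np.sum((obs - obs.mean()) ** 2)
```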
2.3.2 The calibration criterion
Since the calibration step aims at selecting the most reliable description of the residuals in extrapolation, the α index was used to select the parameter set that yields the highest reliability for each catchment, each lead time and each transformation. While other choices were possible, we followed the paradigm that reliability has to be ensured before sharpness. Note that the CRPS could have been chosen, since it can be decomposed as the sum of two terms, reliability and sharpness; however, in the authors' experience, the latter is the controlling factor. Moreover, the CRPS values were often quite insensitive to the values of the log–sinh transformation parameters.
In cases where an equal value of the α index was obtained, we selected the parameter set that gave the best sharpness index. For the log–sinh transformation, there were still a few cases where an equal value of the sharpness index was obtained, revealing the lack of sensitivity of the transformation in some areas of the parameter space. For those cases, we chose to keep the parameter set closest to a neutral parameterisation.
Figure 6
Distributions of the α-index values on the calibration data set D2-high, obtained with different transformations for four lead times (the filled box plots represent the calibrated distributions). Box plots (5th, 25th, 50th, 75th and 95th percentiles) synthesise the variety of scores over the catchments of the data set. The optimal value is represented by the horizontal dashed lines.
[Figure omitted. See PDF]
Figure 7
Distribution over the basins of the values of the Box–Cox transformation parameter obtained during the calibration step for the four different lead times.
[Figure omitted. See PDF]
3 Results
3.1 Results with the calibration data set D2-high
Figure shows the distributions of the α-index values obtained with different transformations on the calibration data set (D2-high) for lead times LT / 2, LT, 2 LT and 3 LT. The distributions are summarised with box plots. Clearly, not using any transformation leads to poorer reliability than using any tested transformation. In addition, we note that the calibrated transformations provide better results (although not perfect) than the uncalibrated ones on the calibration data set, as expected, and that no noticeable difference can be seen in Fig. between the calibrated Box–Cox transformation (d), the calibrated log–sinh transformation (e) and the best-calibrated transformation (f). Nevertheless, the uncalibrated log transformation and the Box–Cox transformation with its parameter set at 0.2 (BC0.2) yield quite reliable forecasts. Comparing the results obtained for the different lead times reveals that less reliable predictive distributions are obtained for longer lead times, in particular for the transformations without a calibrated parameter.
Figures and show the distributions of the parameter values obtained for the Box–Cox and the log–sinh transformations during the calibration step. The distributions vary with lead time. While the log-transformation behaviour is frequently chosen for LT / 2 and LT, the additive behaviour (corresponding to the use of no transformation; see Sect. ) becomes more frequent for 2 LT and 3 LT. A similar conclusion can be drawn for the log–sinh transformation: parameter sets yielding a multiplicative behaviour are frequently chosen for all lead times, but less often for 2 LT and 3 LT than for LT / 2 and LT. This explains in particular the loss of reliability that can be seen for the log transformation for 3 LT in Fig. . These results reveal that the extrapolation behaviour of the distributions of residuals is complex: it varies among catchments and with lead time because of the strong impact of data assimilation.
Figure 8Distribution over the basins of the values of the log–sinh transformation parameters obtained during the calibration step for the four different lead times. and (see Appendix B).
[Figure omitted. See PDF]
3.2 Results with the D3 control data set
3.2.1 Reliability
First, we conducted a visual inspection of the PIT diagrams, which convey an evaluation of the overall reliability of the probabilistic forecasts (examples in Fig. ). In some cases, the forecasts are reliable (e.g. the Thérain River at Beauvais, 755 km², except if no transformation is used). Alternatively, these diagrams may show quite different patterns, highlighting bias (e.g. the Meu River at Montfort-sur-Meu, 477 km²) or under-dispersion (e.g. the Aa River at Wizernes, 392 km², or the Sioulet River at Miremont, 473 km²; for the latter, the calibration on D2 leads to log–sinh and Box–Cox transformations equivalent to no transformation, which turns out not to be relevant for the control data set, where the log and the Box–Cox transformations are more reliable).
Figure 9
Examples of PIT diagrams obtained with the control data set D3, with different transformations at four locations.
[Figure omitted. See PDF]
The distributions of the -index values in Fig. reveal a significant loss of reliability compared to the values obtained with the calibration data set (Fig. ). We note that the log transformation is the most reliable approach for LT/2 and is comparable to the Box–Cox transformation (BC) for LT. With increasing lead time, the BC transformation becomes slightly better than the other transformations, including the calibrated ones. In addition, comparing the results obtained for the different lead times confirms that it is challenging to produce reliable predictive distributions when extrapolating at longer lead times. Overall, this means that the added value of the flexibility brought by the calibrated transformations is not transferable to an independent extrapolation situation, as illustrated in Fig. .
Figure 10Distributions of the -index values on the control data set D3, obtained with different transformations for four lead times (the filled box plots represent the calibrated distributions). Box plots (5th, 25th, 50th, 75th and 95th percentiles) synthesise the variety of scores over the catchments of the data set. The optimal values are represented by the horizontal dashed lines. Option “g” gives the best performance that could be achieved with this model and this post-processor for these catchments (see the discussion in Sect. ).
[Figure omitted. See PDF]
Figure 11Scatter plots of the reliability index obtained with the log transformation and the log–sinh transformation (a) in D2 on the calibration step and (b) on D3 in the control step.
[Figure omitted. See PDF]
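The reliability diagnostics used above (PIT diagrams and the reliability index) can be sketched as follows. This is one common formulation (empirical-CDF PIT from forecast quantiles, and an index comparing sorted PIT values with uniform quantiles), assumed rather than taken from the paper's code:

```python
def pit_value(quantile_forecast, obs):
    # PIT: value of the predictive CDF at the observation, here read from a
    # discrete set of forecast quantiles (empirical-CDF approximation).
    return sum(1 for q in quantile_forecast if q <= obs) / len(quantile_forecast)

def alpha_index(pit_values):
    # Reliability index: 1 minus twice the mean deviation between the sorted
    # PIT values and the quantiles of a uniform distribution.
    # 1 = perfectly reliable forecasts; values near 0 = worst case.
    n = len(pit_values)
    uniform = [(i + 1) / (n + 1) for i in range(n)]
    mean_dev = sum(abs(p - u) for p, u in zip(sorted(pit_values), uniform)) / n
    return 1.0 - 2.0 * mean_dev
```

A perfectly uniform sample of PIT values yields an index of 1, while PIT values piled up at one end (strong bias) drive the index towards 0.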
In operational settings, the non-exceedance frequencies of the quantiles of the predictive distribution, which are the lower and upper bounds of the predictive interval communicated to the authorities, are of particular interest. The 80 % predictive interval (bounded by the 0.1 and 0.9 quantiles) is the one most commonly used in France. For a reliable predictive distribution, the non-exceedance frequency of the lower bound and the exceedance frequency of the upper bound are expected to remain close to 10 %. Deviations from these frequencies indicate biases in the estimated quantiles. Figure reveals that, on average, the 0.1 quantile is generally better assessed than the 0.9 quantile, though the latter is generally more sought after for operational purposes. More importantly, the lack of reliability of the log transformation for the 3 LT lead time seen in Fig. appears to be related to an underestimation of the 0.1 quantile, which is stronger than for the other tested transformations, while the 0.9 quantile is less underestimated than with the other transformations. These results highlight that reliability has different complementary facets and that some parts of the predictive distribution can be more or less reliable than others. In a flood forecasting context, particular attention should be given to the upper part of the predictive distribution.
Figure 12Distributions over the catchment set of (a) the non-exceedance frequency of the 0.1 quantile and (b) the exceedance frequency of the 0.9 quantile on the control data set D3, obtained with the different transformations tested (the filled box plots are related to calibrated transformations). The optimal values are represented by the horizontal dashed lines.
[Figure omitted. See PDF]
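The two bound frequencies discussed above are straightforward to compute; a minimal sketch (function and variable names are illustrative, not from the paper):

```python
def interval_bound_frequencies(observations, lower_bounds, upper_bounds):
    # Non-exceedance frequency of the 0.1 quantile (observations falling
    # below the lower bound) and exceedance frequency of the 0.9 quantile
    # (observations above the upper bound). Both should stay near 10% for
    # the 80% predictive interval to be reliable.
    n = len(observations)
    below = sum(1 for o, lo in zip(observations, lower_bounds) if o < lo) / n
    above = sum(1 for o, hi in zip(observations, upper_bounds) if o > hi) / n
    return below, above
```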
3.2.2 Other performance metrics
In addition to reliability, we examined other qualities of the probabilistic forecasts, namely the overall performance (measured by the CRPSS) and the accuracy (measured by the NSE). We also checked their sharpness (relative-sharpness metric). The distributions of four performance criteria are shown for LT in Fig. . We note that the log transformation has the median coverage ratio closest to the target, at the expense of a lower median relative-sharpness value, because of the larger predictive interval widths caused by its multiplicative behaviour. In addition, the CRPSS and NSE distributions show limited sensitivity to the variable transformation (also shown by for the CRPS), even if not using any transformation yields slightly better results. This confirms that the CRPSS by itself is not sufficient to evaluate the adequacy of the uncertainty estimation. Similar results were obtained for the other lead times (Supplement).
Figure 13
Distributions of coverage rate, relative-sharpness, CRPSS and NSE values over the catchment set on the control data set D3, obtained with the different transformations tested (the filled box plots are related to calibrated transformations). The optimal values are represented by the horizontal dashed lines.
[Figure omitted. See PDF]
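For reference, the CRPS underlying the CRPSS can be sketched for a quantile or ensemble forecast using the well-known energy form; the skill-score wrapper and the reference forecast (e.g. climatology) are generic assumptions, not the paper's exact set-up:

```python
def crps_energy(members, obs):
    # Empirical CRPS for one forecast-observation pair, energy form:
    # mean|X - obs| - 0.5 * mean|X - X'|. Lower is better; it reduces to
    # the absolute error for a deterministic forecast.
    m = len(members)
    spread = sum(abs(x - y) for x in members for y in members) / (2.0 * m * m)
    return sum(abs(x - obs) for x in members) / m - spread

def crpss(crps_forecast, crps_reference):
    # Skill score against a reference forecast; 1 = perfect,
    # 0 = no better than the reference, negative = worse.
    return 1.0 - crps_forecast / crps_reference
```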
3.3 Investigating the performance loss in an extrapolation context
For operational forecasters, it is important to be able to predict when they can trust the forecasts issued by their models and when their quality becomes questionable. We therefore investigated whether the reliability and the reliability loss observed in an extrapolation context were correlated with some properties of the forecasts. First, Fig. shows the relationship between the -index values obtained with D2 and those obtained with D3 for three representative transformations. The results indicate that it is not possible to anticipate the -index values obtained when extrapolating high flows in D3 from the -index values obtained when extrapolating high flows in D2.
Figure 14
Comparison of the -index values obtained with D2 and D3. One point for each catchment. The optimal values are represented by the dashed lines. Similar results were obtained for the three other transformations (not shown).
[Figure omitted. See PDF]
In addition, two indices were chosen to describe the degree of extrapolation: the ratio of the median of the forecasted discharges in D3 over the median of the forecasted discharges in D2, and the ratio of the median of the forecasted discharges in D3 over the discharge for a return period of 20 years (for catchments where the assessment of the vicennial discharge was available in the national database: . Neither index was significantly correlated with the reliability loss.
Finally, we found no correlation with the relative accuracy of the deterministic forecasts either. The goodness of fit during the calibration phase cannot be used as an indicator of the robustness of the uncertainty estimation in an extrapolation context (see Supplement for figures).
4 Discussion
4.1 Do more complex parametric transformations yield better results in an extrapolation context?
Overall, the results obtained for the control data set suggest that the log transformation and the fixed Box–Cox transformation (BC) can yield relatively satisfactory index and coverage ratio values given their multiplicative or near-multiplicative behaviour in extrapolation. More tapered behaviours that can be obtained with the calibrated Box–Cox or log–sinh transformations do not show advantages when extrapolating high flows on an independent data set. In other words, what is learnt during the calibration of the more complex parametric transformations does not yield better results in an extrapolation context.
These results could be explained by the calibration not identifying the most relevant parameter set. To investigate whether another calibration strategy could yield better results, we compared the performance on the control data set D3 when the calibration is performed on the D2 data set (“f: best calibrated”) and directly on the D3 data set (“g: best reliability”). The results shown in Fig. reveal that, even when the best parameter set is chosen among the 217 possibilities tested in this study (17 for the Box–Cox transformation and 200 for the log–sinh transformation), the -index distributions are far from perfect and reliability decreases with increasing lead time. This suggests that the stability of the distributions of residuals when extrapolating high flows might be a greater issue than the choice of the variable transformation. Nonetheless, the gap between the distributions of the transformations without a calibrated parameter (“b” and “c”), the best-calibrated transformation (“f”) and the best performance that could be achieved (“g”) suggests that it might be possible to obtain better results with a more advanced calibration strategy. This is, however, beyond the scope of this study and is left for further investigation.
4.2 Empirically based versus distribution-based approaches: does the distribution shape choice impact the uncertainty assessment in an extrapolation context?
Besides the reduction of heteroscedasticity, many studies use post-processors which are explicitly based on the assumption of a Gaussian distribution and use data transformations to fulfil this hypothesis . Examples are the MCP or the meta-Gaussian model; the NQT was designed precisely to achieve this. In their study on autoregressive error models used as post-processors, showed that error models with an empirical distribution for the description of the standardised residuals perform better than those with a normal distribution. We first checked whether the variable transformation helped to reach a Gaussian distribution of the residuals computed with the transformed variables. Then we investigated whether better performance can be achieved using empirical transformed distributions of residuals or using Gaussian distributions calibrated on these empirical distributions.
We used the Shapiro–Francia test, where the null hypothesis is that the data are normally distributed. For each parametric transformation, we selected the parameter set of the calibration grid which obtains the highest p value. For more than 98 % of the catchments, the p value is lower than 0.018 (0.023) when the Box–Cox transformation (the log–sinh transformation) is used. This indicates that there are only a few catchments for which the normality assumption cannot be rejected. In a nutshell, the variable transformations can stabilise the variance, but they do not necessarily ensure the normality of the residuals. It is important not to overlook this frequently encountered issue in hydrological studies.
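The Shapiro–Francia statistic itself is simple to sketch: it is the squared correlation between the ordered sample and the expected normal order statistics. The p value, which requires a published normalisation of the statistic, is omitted here; the Blom approximation for the order statistics is a standard choice, assumed rather than taken from the paper:

```python
from statistics import NormalDist

def shapiro_francia_w(sample):
    # Shapiro-Francia W': squared correlation between the ordered sample and
    # the expected normal order statistics (Blom scores). W' close to 1
    # supports normality; the test's p value needs a further normalisation
    # of W' (not reproduced here).
    n = len(sample)
    x = sorted(sample)
    nd = NormalDist()
    m = [nd.inv_cdf((i + 1 - 0.375) / (n + 0.25)) for i in range(n)]
    mx = sum(x) / n  # the Blom scores themselves are symmetric around zero
    sxx = sum((v - mx) ** 2 for v in x)
    smm = sum(w * w for w in m)
    sxm = sum((v - mx) * w for v, w in zip(x, m))
    return (sxm * sxm) / (sxx * smm)
```

A sample that already matches the normal order statistics gives W' of 1, while a strongly skewed sample (e.g. geometric growth) gives a clearly lower value.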
Even if there is no theoretical advantage to using the Gaussian distribution calibrated on the transformed-variable residuals rather than the empirical distribution to assess the predictive uncertainty, we tested the impact of this choice. For each transformation, the predictive uncertainty assessment obtained with the empirical transformed-variable distribution of residuals is compared to the assessment based on the Gaussian distribution whose mean and variance are those of the empirical distribution. Figure shows the -index distributions obtained over the catchments for both options on the control data set D3. No clear conclusion can be drawn. The identity (no) transformation, which does not reduce the heteroscedasticity at all, benefits from the use of the Gaussian distribution at all lead times. In contrast, with the log transformation, the predictive uncertainty assessment based on the empirical distribution is more reliable than the one based on the Gaussian distribution. For short lead times, it is slightly better to use the empirical distributions for the calibrated transformations (Box–Cox and log–sinh), but we observe a different behaviour for longer lead times: for these, assessing the predictive uncertainty with the Gaussian distribution fitted to the empirical distributions of transformed residuals obtained with the calibrated log–sinh or Box–Cox transformations is the most reliable option. It is better than using the log transformation with the empirical distribution, but not very different from using the BC transformation.
Figure 15
Distributions of the -index values on the control data set D3, obtained with different transformations for four lead times, when using the empirical distributions of residuals (straight box plots) and the Gaussian distributions (dashed box plots). The optimal values are represented by the horizontal dashed lines.
[Figure omitted. See PDF]
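The two options compared above differ only in how a quantile is read from the transformed-variable residual sample; a minimal sketch (the nearest-rank empirical rule is an assumption, the paper's interpolation rule may differ):

```python
from statistics import NormalDist, fmean, pstdev

def residual_quantile(residuals, tau, use_gaussian=False):
    # Quantile of the transformed-variable residual distribution: either the
    # empirical quantile or the quantile of a Gaussian fitted by moments
    # (the two options compared in Sect. 4.2). A simple nearest-rank rule is
    # used for the empirical option.
    if use_gaussian:
        return NormalDist(fmean(residuals), pstdev(residuals)).inv_cdf(tau)
    r = sorted(residuals)
    idx = min(len(r) - 1, max(0, int(tau * len(r))))
    return r[idx]
```

For a symmetric residual sample the two options agree at the median; they diverge in the tails, which is exactly where the reliability differences reported above appear.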
Investigations on the impact of the choice between the empirical and the Gaussian distributions on the post-processor performance are shown in the Supplement. They show that the choice of the distribution is not the dominant factor.
4.3 A need for a focus change?
In most modelling studies, several methodological steps depend on the range of the observations. First, calibration is designed to limit the residual errors in the available historical data. However, the largest residuals are often associated with the highest discharge values. It is well known that removing the largest flood events from a data set can significantly modify the resulting calibrated parameter set. This is particularly true with some common criteria, such as quadratic criteria, which strongly emphasise the largest errors . Conversely, it is likely that “unavailable data”, such as a physically realistic but (so far) unseen flood, would significantly change the calibration results if they could be included in the calibration data set. Moreover, model conceptualisation (building) itself is often based on the understanding of how a catchment behaves “on average”. In some studies, outliers may even be considered disturbing and be discarded .
However, to provide robust models for operational purposes, we also need to focus on rare (rarely observed) events, while keeping in mind all the well-known issues associated with working with (too) few data . For predictive uncertainty assessment, this issue is exacerbated by the seasonality of hydrological extremes for most approaches that rely heavily on data (not only data-driven approaches but all models that need to be calibrated). There is therefore an urgent need to gather and compile data on extreme events . Nevertheless, operational forecasters must still prepare themselves to work in an extrapolation context, as pointed out by .
5 Conclusions
Even if major floods are rare, it is of the utmost importance that the forecasts issued during such events are reliable, to facilitate efficient crisis management. Like Lieutenant Drogo in The Tartar Steppe, who spent his entire life fulfilling his day-to-day duties while waiting in his fortress for the invasion by foes , many forecasters expect and prepare for a major event, even if their routine involves only minor events. That is why a strong concern for the extrapolation context should be encouraged in all modelling and calibration steps. This article proposes a control framework focusing on forecasting performance in an extrapolation context.
We use this framework to test the predictive uncertainty assessment provided by a statistical post-processor of a rainfall–runoff model based on a variable transformation. The transformation has to handle the heteroscedasticity and the evolution of the other properties of the predictive uncertainty distribution with the discharge magnitude in order to yield reliable uncertainty assessments, which is particularly problematic in an extrapolation context. As pointed out by , the choice of the heteroscedastic error modelling approach has a significant impact on the predictive performance. This holds in an extrapolation context as well.
5.1 Main findings
Using the proposed framework for an evaluation in an extrapolation context, we showed the following:
a. Using an appropriate variable transformation can significantly improve the predictive distribution and its reliability. However, a performance loss still remains in an extrapolation context with any of the three transformations we tested.

b. The transformations with more calibrated parameters do not achieve significantly better results than the transformations with no calibrated parameter:

- while it allows a flexibility which can theoretically be very attractive in an extrapolation context, the log–sinh transformation is not more reliable in such a context;

- the uncalibrated log transformation and the Box–Cox transformation with the parameter set to 0.2 are robust and compare favourably.

c. We did not find any variable significantly correlated with the performance loss in an extrapolation context.
The findings reported herein corroborate the results of within the context of flood forecasting and extrapolation: calibrating the Box–Cox or log–sinh transformation can be counter-productive. We therefore suggest that operational flood forecasting services could consider the less flexible but more robust options: using the log transformation or the Box–Cox transformation with its parameter set close to 0.2.
Importantly, these results reveal significant performance losses in some catchments when it comes to extrapolation, whatever variable transformation is used. Even if the scheme tested yields satisfying results in terms of reliability for the majority of catchments, it fails in a significant number of catchments, and further investigations are needed to gain a deeper understanding of when and why failures occur.
5.2 Limitations and perspectives
We used the framework designed by , which has already been applied in various studies and separates the input uncertainty (mainly from the observed and forecasted rainfall) and the hydrological uncertainty. Our study focuses only on the effect of the degree of extrapolation on the hydrological uncertainty when using the best available rainfall product. Future work should combine input (rainfall) uncertainty and hydrological uncertainty to evaluate the impact of using uncertain (forecasted) rainfall in a forecasting context.
Though no variable was found to be correlated with the performance loss, these investigations should be continued with a wider set of variables. First, this may open new perspectives to explain the losses and improve our understanding of the flaws of the hydrological model and of the EHUP. Furthermore, it would be very useful to help operational forecasters detect the hydrological situations in which their forecasts have to be questioned (in particular during major events, when forecasts are made in an extrapolation context).
Furthermore, improving the calibration strategy and regionalising the predictive distribution assessment, as proposed in and , could help build more robust uncertainty quantification when forecasting high flows.
Finally, more studies focusing on the extrapolation context may help to better elucidate the limitations of the modelling chain (hydrological model structure, calibration, post-processing, etc.) and their consequences for practical matters. Such work should be encouraged as a key step towards better and more reliable flood forecasting.
Appendix A GRP model
The GRP model belongs to the suite of GR models . These models are designed to be as simple as possible yet efficient for hydrological studies and various operational needs, resulting in parsimonious model structures. The GRP model combines:

- a production function, which is the same as in the well-known GR4J model developed by ;

- a routing function, which is a simplified version of GR4J's routing function, since it counts only one flow branch composed of a unit hydrograph and a quadratic routing store. Tests showed that the performance of the GRP and GR4J structures was similar in forecasting mode.
Figure A1
The GRP model flow chart. After an interception step, the production function splits the net rainfall (), according to the level of the production store. The effective rainfall () is the sum of the direct flow and the percolation from this store. A corrective multiplicative coefficient (CORR) is then applied. Then the flow runs through a unit hydrograph (time base: TB) and the routing store (capacity: ROUT).
[Figure omitted. See PDF]
A snow module may be implemented on top of the model if necessary.
Like any GR model, it is parsimonious. It has only three parameters: (a) an adjustment factor of effective rainfall, which contributes to finding a good water balance; (b) the unit hydrograph time base used to account for the time lag between rainfall and streamflow; and (c) the capacity of the routing store, which temporally smooths the output signal.
Its main difference with the other GR models is the implementation “in the loop” of two data assimilation schemes:

- a state-updating procedure, which modifies the main state of the routing function as a function of the last discharge values;

- output updating, based on an autoregressive model of the multiplicative error or an artificial neural network (multi-layer perceptron) whose inputs are the last discharge value and the two last forecasting errors. In this study, the autoregressive model was used.
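The autoregressive output updating can be illustrated with a deliberately simplified sketch: the last observed/simulated error ratio is propagated to the new forecast with a damping coefficient. The coefficient value and the absence of lead-time dependence are assumptions for illustration, not the operational GRP calibration:

```python
def ar1_output_update(raw_forecast, last_obs, last_sim, rho=0.8):
    # Illustrative AR(1) model of the multiplicative error: the last
    # observed/simulated ratio is propagated to the raw model forecast with
    # a damping coefficient rho (value assumed here; the operational scheme
    # also makes rho depend on lead time).
    last_error = last_obs / last_sim
    return raw_forecast * last_error ** rho
```

When the model matched the last observation (ratio of 1), the forecast is left unchanged; a persistent over- or under-estimation is partially corrected.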
The parameters are calibrated in forecasting mode, i.e. with the application of the updating procedures. This model is used by the French flood forecasting services, some hydroelectricity suppliers and canal managers at an hourly time step in order to issue real-time forecasts for lead times ranging from a few hours to a few days at several hundred sites. Recently, paved the way to a multi-time-step GRP version.
Appendix B The log–sinh transformation behaviours
B1 Log–sinh transformation formulations
In this study, we used the formulations of the log–sinh transformation chosen by :
(Eq. B1). It is strictly equivalent to the formulation used by (Eq. B2).
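Since the equations themselves were lost in extraction, the two standard formulations of the log–sinh transformation found in the literature are recalled here for reference. The symbol names are assumptions, and readers should check the cited originals:

```latex
% Reparameterised form, with a and b in discharge units (presumably Eq. B1):
z(y) = b \,\ln\!\left[\sinh\!\left(\frac{a + y}{b}\right)\right]
% Original form, with \alpha = a/b and \beta = 1/b (presumably Eq. B2):
z(y) = \frac{1}{\beta}\,\ln\!\left[\sinh\!\left(\alpha + \beta y\right)\right]
```

Substituting \(\alpha = a/b\) and \(\beta = 1/b\) into the second form recovers the first, so the two parameterisations are strictly equivalent, as the text states.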
B2 Different behaviours
Depending on the relative values of and and on the value range of , compared to and , the log–sinh transformation can reduce to an affine transformation (i.e. ) or to the log transformation. The former is equivalent to no transformation (identity function; additive error model), whereas the latter is equivalent to a multiplicative error model (Table ).
Table B1
Summary of the different behaviours of the log–sinh transformation.
Cases | Behaviours |
Additive error model | |
Additive error model | |
Multiplicative error model | |
Additive error model | |
Otherwise if | Multiplicative error model (with an additive constant) |
These behaviours follow from the asymptotic properties of the hyperbolic sine: sinh(x) is approximately x when x is small, and approximately exp(x)/2 when x is large (Eq. B3).
B2.1 Cases where the log–sinh transformation is equivalent to an affine transformation
If and , then . Therefore when , the log–sinh transformation is equivalent to an affine transformation. In such cases, .
This happens when
-
;
-
the value is chosen to be large enough so that for any value in the discharge range .
B2.2 Cases where the log–sinh transformation is equivalent to a log transformation
As pointed out by McInerney et al. (2017), when , Eq. () gives the limiting expression; the log–sinh transformation is then equivalent to a mere log transformation.
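Both limiting behaviours can be checked numerically. The sketch below assumes the b·ln(sinh((a + y)/b)) formulation with parameters in discharge units; the symbols a and b stand in for the paper's elided parameter names:

```python
import math

def log_sinh(y, a, b):
    # Assumed formulation with a and b in the same units as the discharge y.
    return b * math.log(math.sinh((a + y) / b))

# Additive limit: when (a + y)/b >> 1, sinh(x) ~ exp(x)/2, so
# z ~ a + y - b*ln(2): an affine function of y (additive error model).
# Multiplicative limit: when a ~ 0 and b >> y, sinh(x) ~ x, so increments
# of z track b*ln(y2/y1): a log transformation up to scale and offset.
```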
B3 Calibration
The and parameters have the same physical dimension as the variable. Since this study is dedicated to the extrapolation context, we used the following dimensionless parameterisation to calibrate the variable transformation across catchments: and are compared to the maximum forecasted discharge in the D2 data subset (Eq. B4). In the calibration step, the parameter space is explored on a (, ) grid: 18 values of from 0.01 to 100 and 15 values of from 0.1 to 100 were tested, excluding combinations leading to the very same behaviours, such as , which is equivalent to no transformation (additive error model). A total of 200 (, ) combinations were tested for each calibration.
Data availability
Data for this work were provided by Météo France (climatic data) and the hydrometric services of the French government (flow data). Readers can freely access streamflow observations used in this study at the Banque HYDRO website (
The supplement related to this article is available online at:
Author contributions
All co-authors contributed to the study and the article. LB and FB contributed equally and share co-first authorship.
Competing interests
The authors declare that they have no conflict of interest.
Acknowledgements
The authors thank the editor Dimitri Solomatine, Kolbjorn Engeland and one anonymous reviewer. Both reviewers provided insightful comments which helped to greatly improve this text. Thanks are also due to Météo-France for providing the meteorological data and Banque HYDRO database for the hydrological data. The contribution of the authors from Université Gustave Eiffel and INRAE was financially supported by SCHAPI (Ministry of the Environment).
Financial support
This research has been supported by SCHAPI (Ministry of the Environment) (grant nos. 21367400 (2018), 2201132931 (2018) and 2102615443 (2019)).
Review statement
This paper was edited by Dimitri Solomatine and reviewed by Kolbjorn Engeland and one anonymous referee.
© 2020. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
An increasing number of flood forecasting services assess and communicate the uncertainty associated with their forecasts. While obtaining reliable forecasts is a key issue, it is a challenging task, especially when forecasting high flows in an extrapolation context, i.e. when the event magnitude is larger than what was observed before. In this study, we present a crash-testing framework that evaluates the quality of hydrological forecasts in an extrapolation context. The experiment set-up is based on (i) a large set of catchments in France, (ii) the GRP rainfall–runoff model designed for flood forecasting and used by the French operational services and (iii) an empirical hydrologic uncertainty processor designed to estimate conditional predictive uncertainty from the hydrological model residuals. The variants of the uncertainty processor used in this study differ in the data transformation they use (log, Box–Cox and log–sinh) to account for heteroscedasticity and the evolution of the other properties of the predictive distribution with the discharge magnitude. Different data subsets were selected based on a preliminary event selection. Various aspects of the probabilistic performance of the variants of the hydrologic uncertainty processor, reliability, sharpness and overall quality were evaluated. Overall, the results highlight the challenge of uncertainty quantification when forecasting high flows. They show a significant drop in reliability when forecasting high flows in an extrapolation context and considerable variability among catchments and across lead times. The increase in statistical treatment complexity did not result in significant improvement, which suggests that a parsimonious and easily understandable data transformation such as the log transformation or the Box–Cox transformation can be a reasonable choice for flood forecasting.
Details

1 DREAL Centre-Val de Loire, Loire Cher & Indre Flood Forecasting Service, Orléans, France
2 GERS-LEE, Univ Gustave Eiffel, IFSTTAR, 44344 Bouguenais, France; Université Paris-Saclay, INRAE, UR HYCAR, 92160 Antony, France
3 Université Paris-Saclay, INRAE, UR HYCAR, 92160 Antony, France
4 Ministry for the Ecological and Inclusive Transition, SCHAPI, Toulouse, France