Rigorous evaluations of land surface models (LSMs) are necessary for guiding the development and application of the models, but the widely adopted verification‐oriented evaluations do not fully meet the intended objectives. There are always known knowns, known unknowns, and unknown unknowns. Models represent the known knowns well but suffer from known unknowns (e.g., the uncertain parameterization of known land surface processes and known subscale heterogeneity) and unknown unknowns (e.g., unrepresented boundary conditions, unresolved subscale processes, and unknown biogeochemical processes). Solving the model equations requires that the modeled system be closed, but the existence of unknown unknowns constantly challenges this assumption. As a consequence, rigorous verification of LSMs is impossible (Oreskes et al., 1994): Even if a model prediction is consistent with all the known observations in all the known criteria, the model still cannot be completely verified. The interpretations of the verification‐oriented evaluation results are inevitably inconclusive (Hill et al., 2017; Kirchner, 2006).
Observations also involve known unknowns and unknown unknowns; the errors in existing large‐scale land surface water budget observations can be shown to be significant. Assuming a sufficiently long time and no lateral flow of groundwater (detailed in section 3.1), the long‐term average precipitation should be reasonably balanced by evapotranspiration plus runoff (Kauffeldt et al., 2013; Thornthwaite, 1948; Wilm et al., 1944). However, the analysis of available 30‐year multisource data sets over the continental United States (CONUS) shows that the magnitude of the imbalance is troubling (Figure 1): >10% of precipitation in most of the CONUS and >60% of precipitation in the West. Such an imbalance can be attributed to unmeasured lateral flow of groundwater and the observational errors in precipitation (Adam et al., 2006; Henn et al., 2018), evapotranspiration (Long et al., 2014), and runoff (Wilby et al., 2017). The observational errors stem from insufficient spatiotemporal sampling, instrumental limitations, site changes, human activities, and erroneous data archiving and postprocessing, thereby making it infeasible to close the terrestrial water budget (Pan et al., 2012; Sahoo et al., 2011; Sheffield et al., 2009) across various databases (Kauffeldt et al., 2013).
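The imbalance statistic described above is simple to reproduce. The sketch below (with illustrative basin values, not the actual NLDAS/FLUXNET/USGS numbers) computes the imbalance as a fraction of precipitation from long‐term mean annual totals:

```python
def imbalance_fraction(p_mm, et_mm, r_mm):
    """Water budget imbalance (P - ET - R) as a fraction of precipitation,
    computed from long-term mean annual totals in mm/yr."""
    return (p_mm - et_mm - r_mm) / p_mm

# Hypothetical humid eastern basin: nearly closed budget (within 10% of P).
print(imbalance_fraction(1200.0, 800.0, 350.0))  # ~0.04

# Hypothetical mountainous western basin: gauges miss much precipitation,
# so ET + R exceed measured P and the imbalance is strongly negative.
print(imbalance_fraction(500.0, 450.0, 380.0))   # -0.66
```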
Figure 1. Water budget imbalance among the NLDAS precipitation (P), FLUXNET MTE evapotranspiration (ET), and USGS runoff (R) at each HUC8 basin over the CONUS. The water budget imbalance is calculated as the difference between the 30‐year (1982–2011) average annual precipitation and the sum of the annual evapotranspiration and runoff. Detailed descriptions of the data sets are given in section 4.1.
Observational errors greatly complicate evaluations. If multiple data sets are not physically consistent (e.g., Figure 1), the observations of different water budget components may present conflicting information (Beven & Westerberg, 2011; Kauffeldt et al., 2013; Pan et al., 2012; Sheffield et al., 2009). Evaluating a model against such data sets is unavoidably controversial. In general, observations are used to drive models and to evaluate model predictions. If the driving observations contain errors, then even a perfect model can generate problematic predictions. If the observations used for the evaluations contain errors, then consistency between the observations and the model predictions is itself a sign of modeling errors. In either case, the consistency between model predictions and observations should not be considered a solid indicator of accurate predictions.
In recognition of the observational and model prediction errors, the likelihood of a model being true is often estimated. However, the validity of this approach is conditional on a subjective assumption: Unknown unknowns can be neglected. It may be reasonable to neglect unknown unknowns at a well‐controlled reference site, but in the strictest sense, unknown unknowns exist at all times. When unknown unknowns cannot be neglected, the estimation of likelihood inevitably involves an infinite logical regress (Popper, 2002). A priori knowledge about the truth is always necessary (e.g., an a priori value, an a priori error between the initial guess and the truth, or an a priori error distribution), but our prior knowledge always involves unknown unknowns, which are logically impossible to know a priori.
As mentioned above, rigorous verification‐oriented evaluation is impossible as a result of the existence of unknown unknowns. The difficulty is deeply rooted in our epistemological foundations, impeding our advances in scientific understanding and modeling capability. How, then, can a rigorous evaluation be performed? Signature‐based (Gupta et al., 2008) hypothesis testing (Beven, 2001, 2018) subject to falsification (Popper, 2002) appears to be a viable approach.
In an attempt to apply this approach, section 2 introduces several key aspects of falsification‐oriented signature‐based evaluation, showing how the difficulties in verification can be avoided. Section 3 proposes a practical framework for evaluating long‐term land surface water budgets. As an application of the framework, section 4 presents an experiment over the CONUS. Section 5 gives our results, which are discussed in section 6, followed by our conclusions in section 7.
Verification is “an assertion or establishment of truth” (Refsgaard & Henriksen, 2004), whereas falsification is an assertion or establishment of falsity. Verification‐ and falsification‐oriented evaluations aim to test the consistency and inconsistency, respectively, between model predictions and observations.
As discussed in section 1, the impossibility of rigorous verification has long been recognized (Oreskes et al., 1994; Popper, 2002). The inconclusiveness of verification‐oriented evaluations plays a notable part in the coexistence of numerous, but different, models. The falsification‐oriented approach has therefore attracted increasing attention (Baker, 2017; Beven, 2018; Blöschl, 2017; Linde, 2014; McKnight, 2017; Neuweiler & Helmig, 2017; Pfister & Kirchner, 2017). However, these opinion papers have overlooked the asymmetry between verification and falsification, and there is a lack of practical frameworks and successful applications of this approach.
It is essential to recognize the asymmetry between falsification and verification: Every consistency between model predictions and observations cannot completely verify the model as a result of the existence of unknown unknowns, whereas one single inconsistency always indicates that there must be something wrong. The inconsistency is a solid indicator of errors of any kind, regardless of whether they stem from known unknowns or unknown unknowns.
Popper (2002) summarized the philosophy of falsification as “all our knowledge grows only by correcting our mistakes.” “Learning from mistakes” (Beven, 2001, 2018) is a defining characteristic of falsification. With the falsification‐oriented approach, the reasons for the inconsistencies (mistakes) between the model predictions and the observations are investigated. Better predictions and observations are the result of addressing these “mistakes” (Niu et al., 2005, 2007, 2011; Yang et al., 2011). Son and Sivapalan (2007) provided an excellent example of how models can be improved by learning from “wrong” predictions. However, with the verification‐oriented approach, “good” predictions are selected by showing their consistencies with the observations. The likelihood of a scientific hypothesis being true is often estimated by using the “good” predictions that are consistent with the observations. The “wrong” predictions, which are rich in information about how to improve the models and observations further, are largely abandoned.
The inconsistency between model predictions and observations may stem from known unknowns and unknown unknowns. Falsification‐oriented evaluation can detect unknown unknowns by considering known unknowns. Figure 2 shows that a comparison between model predictions and observations results in three possible outcomes: (1) not falsified, (2) partially falsified, and (3) fully falsified. In the case of the fully falsified situation, if the known unknowns have been carefully considered, then unknown unknowns must exist.
Figure 2. Diagram showing the falsification‐oriented signature‐based evaluation paradigm. The paradigm is divided into several phases, as shown at the top. The hypothesis under test is that the prediction signature is inconsistent with the observation signature. The three outcomes of the falsification‐oriented hypothesis testing are shown in the dashed box. Further details can be found in section 2.
The consideration of known unknowns (Beven, 1993, 2004, 2009; Beven & Binley, 1992, 2014; Binley et al., 1991; Sivapalan et al., 2003) can be iteratively refined with a “try‐fail‐refine” strategy. In an experiment, known unknowns are specified when the observations, models (i.e., parameterizations and parameters), and evaluation methods are chosen. If the model predictions are asserted as inconsistent with the observations, then it is possible that too few known unknowns have been considered. The experiment can be refined by considering more known unknowns. The refinement can be performed in sequence, as demonstrated by Son and Sivapalan (2007). Each iteration provides additional insights into the unknowns. If all the feasible known unknowns have been considered, then there must be unknown unknowns. A fully falsified test in this case is a strong stimulus for pursuing new theories, modeling new processes, and observing new phenomena (Popper, 2002; Sherwood, 2011).
As discussed in section 1, the existence of unknown unknowns means that any a priori estimates or approximations of the truth cannot be reliable. They should be avoided entirely in rigorous evaluations.
The falsification‐oriented evaluation is built on the test of scientific hypotheses. The term “scientific hypotheses” is used here to exclude mathematical and logical hypotheses. It is worth noting that scientific hypotheses are not approximations of the truth. Instead, they represent our growing understanding of the truth and compete to produce fewer “mistakes” (i.e., fewer inconsistencies between the predictions and the observations). Figure 2 shows that scientific hypotheses are involved in models, observations, and evaluation methods.
A numerical LSM is a scientific hypothesis. A model assembles a set of parameterizations and parameters. A parameterization is a scientific hypothesis about a relationship between different natural phenomena (e.g., precipitation, soil moisture, and runoff). A parameter is an adjustable coefficient in the relationship (Clark et al., 2011) and represents a scientific hypothesis about the properties of natural systems. In practice, parameters can be assigned hypothetically, either with observations or with other scientific hypotheses such as pedotransfer functions (Van Looy et al., 2017). Model predictions are made by numerically solving the mathematical equations that represent parameterizations (e.g., infiltration and runoff), parameters (e.g., soil hydraulic conductivity), and observations (e.g., precipitation and solar radiation). The solving process is a form of rigorous deduction.
“Observation always involves theory” (i.e., a set of scientific hypotheses) (Hubble, 2013). Observations of terrestrial water fluxes inevitably involve scientific hypotheses about the relationships between the phenomena intended to be observed (e.g., streamflow or evapotranspiration) and the phenomena that can be observed (e.g., water level or soil moisture). Scientific hypotheses also have to be made about subscale processes and heterogeneity, which always contain known unknowns and unknown unknowns. If a model prediction is inconsistent with the observations, then errors may stem from the scientific hypotheses that underlie the observations.
Evaluation methods also involve scientific hypotheses. Gupta et al. (2008) proposed signature‐based evaluations, in which the signatures of model predictions and observations are compared. A signature is information extracted from the model predictions and observations based on some theories (i.e., a set of scientific hypotheses) and is used to measure functional behavior (Wagener & Montanari, 2011). The scientific hypotheses of evaluation methods or signature extraction should also be tested.
We propose that a signature should be rigorously deduced from model predictions and observations based on a set of scientific hypotheses. With a rigorous deduction, the logical rule that a false conclusion must have at least one false premise becomes useful. If model predictions and observations disagree in their signatures, the inconsistency can stem from (1) errors in the observations, (2) the model that gives the predictions (i.e., parameterizations and parameters), or (3) the invalidity of the evaluation method (i.e., signature extraction).
A practical framework for evaluating long‐term land surface water budgets is proposed. The framework extracts signatures based on three scientific hypotheses: (1) The mass conservation of water at the land surface is satisfied; (2) there is no horizontal water exchange between adjacent basins/grids; and (3) the length of time is sufficiently long.
Three facets of the framework are described. First, we introduce the three scientific hypotheses of the framework and illustrate the framework using a diagram. Second, we explore how the signatures are extracted based on the three scientific hypotheses. Third, we describe the rules of falsifying the consistency between model predictions and observations.
For a control volume at the land surface (e.g., catchments or subcatchment units), water inputs, outputs, and storage change should be subject to the physical principle of mass conservation (the first scientific hypothesis) (Eagleson, 1978; Reggiani et al., 2000, 2001; Reggiani & Schellekens, 2003). With the assumption that there is no horizontal water exchange between adjacent control volumes (the second scientific hypothesis), a lumped water balance equation (Beven et al., 2011) can be classically written as follows:

p − et − r = ds (1)

where p is the precipitation rate (kg m−2 s−1), et is the evapotranspiration rate (kg m−2 s−1), r is the runoff rate (kg m−2 s−1), and ds is the internal change rate of the storage at all points and at all levels in the control volume (kg m−2 s−1).
Equation 1 can be integrated over a period of time Δt (s):

(P − ET − R) Δt = ΔS (2)

where P, ET, and R are the time‐averaged precipitation rate (kg m−2 s−1), evapotranspiration rate (kg m−2 s−1), and runoff rate (kg m−2 s−1) over the time period, respectively, and ΔS is the change in water storage over the time period (kg m−2). Note that the lower case letters (p, et, and r) denote instantaneous values and that the upper case letters (P, ET, and R) denote the time‐averaged values over Δt.
As the length of the time period approaches infinity (Δt → ∞), the terrestrial water storage can be neither replenished to infinity nor drained to negative infinity (∞ > ΔS > −∞). Thus, if the time period is sufficiently long (the third scientific hypothesis), then the terrestrial water storage term in Equation 2 can be neglected (ΔS/Δt → 0):

P = ET + R (3)

Dividing Equation 3 by P gives ET/P + R/P = 1.
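A back‐of‐the‐envelope calculation, with assumed but physically plausible magnitudes, illustrates why the storage term becomes negligible over multidecadal averaging periods:

```python
# Illustrative check that the storage term dS/dt vanishes relative to P as the
# averaging period grows. The magnitudes are assumptions for illustration:
# a bounded storage change of 500 kg m^-2 (500 mm) and precipitation of 800 mm/yr.
SECONDS_PER_YEAR = 365.25 * 24 * 3600.0

delta_s = 500.0                        # bounded storage change, kg m^-2
p_rate = 800.0 / SECONDS_PER_YEAR      # precipitation rate, kg m^-2 s^-1

for years in (1, 30, 300):
    storage_term = delta_s / (years * SECONDS_PER_YEAR)  # kg m^-2 s^-1
    print(years, storage_term / p_rate)
# Over 1 year the storage term is 62.5% of P and cannot be neglected;
# over 30 years it falls to ~2% of P, and over 300 years to ~0.2%.
```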
In Figure 3, Equation 3 is represented by the line CD (black dashed line) that connects the value of 1 on the x axis and the value of 1 on the y axis. Line CD is defined as the water balance line. If all three scientific hypotheses of the framework hold (i.e., water mass conservation, no horizontal flow, and a sufficiently long time period), then the point (ET/P, R/P) must lie on line CD.
Figure 3. Falsification‐oriented signature‐based framework for evaluating land surface water balance. CD denotes the water balance line. As detailed in section 3.1, line CD is rigorously deduced from three scientific hypotheses: water mass conservation, no lateral flow of groundwater, and a sufficiently long time period. The signature in this framework is extracted from a set of precipitation (P), evapotranspiration (ET), and runoff (R) under the constraint of the three scientific hypotheses. The signatures must therefore lie on line CD, representing the long‐term‐averaged partitioning of precipitation between evapotranspiration and runoff. (a) Estimation of the range of the signature of a point Q(ET/P, R/P). Points S1, S2, and S3 denote the feasible estimates of the signature given the three scientific hypotheses (line CD). They are obtained by combining P, ET, and R. Points A and B are set as the two ends of the three estimates. In this estimation, range AB is linearly proportional to the water budget imbalance (√2·|P − ET − R|/P). (b) How to assert that a pair of observations and model predictions are inconsistent. AoBo and ApBp denote the ranges of the feasible signatures extracted from the observations (point O) and model predictions (point P), respectively. The consistency between the observations and the model predictions under the constraint of the three scientific hypotheses (line CD) is fully falsified if, and only if, the two signature ranges AoBo and ApBp do not overlap. (c) Falsifying rules for ensembles of observations and model predictions. The ensemble of model predictions is inconsistent with the observations if the two signature ranges AOBO and APBP do not overlap.
As signatures are extracted based on these three scientific hypotheses, they must lie on the water balance line. The position on the line indicates the long‐term partitioning of precipitation between evapotranspiration and runoff at the land surface.
Figure 3a shows how to extract the signature from point Q (ET/P, R/P). Point Q denotes a set of precipitation (P), evapotranspiration (ET), and runoff (R), which can be either observations or model predictions. Point Q may not lie on line CD. When point Q does not lie on line CD, its signature has a range. The range is estimated as follows. First, all feasible estimates of point Q's signature are obtained by combining the three scientific hypotheses (i.e., must lie on line CD) and two of the three water budget components (i.e., precipitation and evapotranspiration, evapotranspiration and runoff, and runoff and precipitation), which are points S1, S2, and S3. Second, the upper and lower boundaries of these feasible estimates are obtained and denoted as points A and B. Note that range AB encompasses exactly all the feasible signatures extracted based on the three scientific hypotheses.
From Figure 3a, the precise positions of points A and B can be calculated as follows: (1) If point Q is above line CD (P − ET − R < 0), then points A and B are at (1 − R/P, R/P) and (ET/P, 1 − ET/P), respectively; (2) if point Q is below line CD (P − ET − R > 0), then points A and B are at (ET/P, 1 − ET/P) and (1 − R/P, R/P), respectively; and (3) if point Q is on line CD (P − ET − R = 0), then points A and B coincide with point Q (ET/P, R/P). As a result, the distance between points A and B is √2·|P − ET − R|/P, which is linearly proportional to the water budget imbalance normalized by precipitation.
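The endpoint formulas above translate directly into code. This minimal sketch (the function names are ours, not from the paper) returns the feasible signature range and its length for any (P, ET, R) triple:

```python
import math

def signature_range(p, et, r):
    """Endpoints A and B of the feasible signature range on the water balance
    line ET/P + R/P = 1, following the three cases described in the text."""
    a = (1 - r / p, r / p)       # estimate from P and R, projected onto line CD
    b = (et / p, 1 - et / p)     # estimate from P and ET, projected onto line CD
    if p - et - r > 0:           # point Q below line CD: A and B swap
        a, b = b, a
    return a, b                  # if P - ET - R == 0, A and B coincide with Q

def range_length(p, et, r):
    """Distance |AB| = sqrt(2) * |P - ET - R| / P."""
    return math.sqrt(2) * abs(p - et - r) / p
```

For a balanced triple the two endpoints coincide; as the imbalance grows, the range widens linearly and the signature becomes less informative.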
Figure 3b shows the falsifying rules for a pair of model predictions and observations. Point P denotes the model predictions (ETsim/Pobs, Rsim/Pobs), and point O denotes the observations (ETobs/Pobs, Robs/Pobs). ApBp and AoBo denote the range of feasible signatures extracted from the model predictions (point P) and the observations (point O), respectively. If the two ranges (ApBp and AoBo) do not overlap, then, given the three scientific hypotheses of the framework, the model predictions cannot be consistent with the observations. This is the fully falsified situation shown in Figure 2. There must be something wrong with the model predictions, the observations, and/or the scientific hypotheses for extracting signatures.
Figure 3c shows the falsifying rules for a pair of ensembles of model predictions and observations. Points P1 to Pm denote the ensemble of model predictions, whereas points O1 to On denote the ensemble of observations. For each model prediction and observation, feasible signatures are obtained following the steps described in section 3.2. Range APBP denotes the range of the feasible signatures of all model predictions, and AOBO denotes the range of the feasible signatures of all observations. The two ensembles of model predictions and observations are asserted as inconsistent if APBP and AOBO do not overlap.
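Both falsifying rules reduce to a one‐dimensional interval test, because every point on line CD is determined by its ET/P coordinate alone. A minimal sketch of the ensemble rule (the helper names are ours):

```python
def cd_interval(p, et, r):
    """Feasible signature range of one (P, ET, R) set, expressed as an
    interval of ET/P positions along the water balance line CD."""
    x1, x2 = et / p, 1 - r / p
    return (min(x1, x2), max(x1, x2))

def ensemble_falsified(obs_sets, pred_sets):
    """Fully-falsified test of Figure 3c: the prediction ensemble is asserted
    inconsistent with the observation ensemble if and only if the hulls AOBO
    and APBP of the per-member signature intervals do not overlap."""
    def hull(sets):
        intervals = [cd_interval(*s) for s in sets]
        return min(lo for lo, _ in intervals), max(hi for _, hi in intervals)
    (olo, ohi), (plo, phi) = hull(obs_sets), hull(pred_sets)
    return phi < olo or plo > ohi
```

A single prediction‐observation pair (Figure 3b) is simply the special case of one‐member ensembles.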
The framework is applied at each United States Geological Survey (USGS) eight‐digit Hydrologic Unit (HUC8) basin over the CONUS. The observations used are described in section 4.1. The LSM used and its configurations are presented in section 4.2. The execution of the model simulations is described in section 4.3.
The precipitation data at a spatial resolution of 1/8th degree over the CONUS are from the North American Land Data Assimilation System (NLDAS). The NLDAS precipitation data (Xia, Mitchell, Ek, Cosgrove, et al., 2012; Xia, Mitchell, Ek, Sheffield, et al., 2012) are derived from the gauge‐only daily Climate Prediction Center (CPC) analysis data and adjusted for orographic effects based on the Parameter‐elevation Regressions on Independent Slopes Model (PRISM) climatology. The daily precipitation is disaggregated into hourly values based on the hourly weights from the Stage II Doppler radar derivations, the CPC MORPHing technique (CMORPH) satellite‐based analyses, the CPC Hourly Precipitation Data Base (CPC HPD), and the North American Regional Reanalysis (NARR). The hourly data are used to drive the LSM.
The evapotranspiration data at a spatial resolution of 0.5° are derived by upscaling eddy covariance measurements of a global network (FLUXNET) using a multi‐tree ensemble (MTE) method (Jung et al., 2009). The runoff data for each HUC8 are derived by the USGS. The evapotranspiration and runoff data have been widely used to evaluate NLDAS simulations over the CONUS (Xia et al., 2016). We upscaled the precipitation, evapotranspiration, and runoff observations at different spatial resolutions to the same USGS HUC8 basins.
Figure 1 shows the water budget imbalance in the three observational data sets, which is linearly proportional to the range of the signature of the observations. The water budgets are well balanced in the eastern United States, where the imbalance is within ±10% of precipitation. This balance reflects the dense coverage of FLUXNET MTE training sites, rain gauges, and streamflow gauges in the region. There are no FLUXNET MTE training sites in eastern New Mexico, northwestern Texas, western Oklahoma, and western Kansas (Jung et al., 2009), where the imbalance is significantly positive, at about 25% of precipitation. In the mountainous areas of the West, the imbalance is significantly negative, even beyond −60% of precipitation, reflecting the sparsity of rain gauges and FLUXNET towers in the complex terrain and the significant measurement errors in the harsh environment.
Noah‐MP (Niu et al., 2011; Yang et al., 2011) hosts multiple alternative options for several key process parameterizations and is therefore able to account for the known unknowns in these parameterizations. A 48‐member ensemble is configured by combining four runoff options, three β‐factor options (representing the control of soil moisture on transpiration), two stomatal conductance options, and two turbulence options. The four runoff options are divided into two groups. In the first group, there are two options with a groundwater component (Niu et al., 2005, 2007) as used in the Community Land Model (CLM) (Oleson et al., 2004). In the second group, the two options follow the Noah LSM (Chen & Dudhia, 2001) and the Biosphere Atmosphere Transfer System (BATS) (Dickinson et al., 1993), respectively. The three β‐factor options are adopted from CLM (Oleson et al., 2004), Noah (Chen & Dudhia, 2001), and the Simplified Simple Biosphere (SSiB) model (Xue et al., 1991). The two turbulence options (Brutsaert, 1982; Chen et al., 1997) are based on Monin‐Obukhov similarity theory, which is commonly used in many LSMs. The two stomatal conductance options, the Jarvis (Chen et al., 1996) and Ball‐Berry (Ball et al., 1987; Collatz et al., 1991, 1992) schemes, are used in second‐ and third‐generation LSMs (Sellers et al., 1997), respectively. These parameterization options represent a spectrum of widely used LSMs and, as reported by Niu et al. (2011) and Yang et al. (2011), dominate the hydrological simulations. Details of these options are described in Zheng et al. (2019).
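The 48‐member design is the full factorial combination of the four varied parameterizations. The sketch below enumerates it; the option labels are illustrative shorthand, not the actual Noah‐MP namelist values:

```python
from itertools import product

# Hypothetical labels for the four varied parameterizations.
runoff_opts = ["SIMGM", "SIMTOP", "Noah", "BATS"]      # 4 runoff/groundwater schemes
beta_opts = ["CLM", "Noah", "SSiB"]                     # 3 beta-factor schemes
stomatal_opts = ["Jarvis", "Ball-Berry"]                # 2 stomatal conductance schemes
turbulence_opts = ["M-O (Brutsaert)", "M-O (Chen)"]     # 2 turbulence schemes

ensemble = list(product(runoff_opts, beta_opts, stomatal_opts, turbulence_opts))
print(len(ensemble))  # 4 * 3 * 2 * 2 = 48 members
```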
All the Noah‐MP parameters (Cuntz et al., 2016) use default values. The values have not been rigorously calibrated. With falsification‐oriented evaluation, parameters do not need to be rigorously precalibrated. Misspecified parameter values may result in inconsistencies between the model predictions and the observations, which can be detected and corrected in future experiments.
The observational data sets and models have been used for monitoring drought.
The atmospheric forcing, soil, and vegetation data at a spatial resolution of 1/8th degree are from NLDAS. The vegetation type data are the MODIS (Moderate Resolution Imaging Spectroradiometer) land cover type product classified using the International Geosphere‐Biosphere Program scheme. The soil type data are derived from the State Soil Geographic database. For reference, the vegetation and soil types are shown geographically in Figure S1 in the supporting information.
The initial states of each simulation were obtained from a 102‐year spin‐up. The spin‐up was performed in two steps. The models were run repeatedly 100 times over the year 1979 and then run through the 2 years from 1980 to 1981 with a time step of 15 min. The simulations for the following 30‐year period from 1982 to 2011 were analyzed in this study.
Figure 4 shows the evaluation results over the CONUS. Specifically, emphasis is placed on three aspects: how the results help to guide future enhancements of the observations; insights for future improvements of the model as informed by the spatial patterns; and where the ensemble can and cannot outperform a single prediction.
Figure 4. Evaluation of the Noah‐MP ensemble over the continental United States. (a) AOBO denotes the range of the signature of observations, whereas APBP denotes the range of the signature of the model predictions. The range AOBO is linearly proportional to the water budget imbalance, which is shown geographically in Figure 1. All six possible outcomes from the comparison of predictions and observations are shown in different colors: (i) not falsified with large observational uncertainty (green), (ii) fully falsified because all ensemble predictions overestimate the evapotranspiration fraction in precipitation (dark red), (iii) fully falsified because all ensemble predictions overestimate the runoff fraction in precipitation (dark blue), (iv) not falsified with a large modeling uncertainty (gray), (v) partially falsified because some predictions overestimate the evapotranspiration fraction, whereas none overestimates the runoff fraction (light red), and (vi) partially falsified because some overestimate the runoff fraction, whereas none overestimates the evapotranspiration fraction (light blue). The geographical pattern of the evaluation results is shown in panel (b). The data for producing this figure can be found in the supporting information.
Figure 4a shows the six types of outcomes from the evaluation. According to Figure 2 (section 2.2), they can be divided into three categories: (1) not falsified (types i and iv), (2) fully falsified (types ii and iii), or (3) partially falsified (types v and vi).
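One way to operationalize the six types is sketched below. This is our reading of Figure 4a, not an algorithm given in the paper: a member is individually falsified when its signature interval falls entirely to one side of the observation interval, and the i/iv split compares range widths:

```python
def classify(obs_int, pred_ints):
    """Classify an evaluation outcome into the six types of Figure 4a.
    Intervals are (lo, hi) positions along line CD in the ET/P coordinate,
    so a larger position means a larger evapotranspiration fraction."""
    olo, ohi = obs_int
    over_et = [iv for iv in pred_ints if iv[0] > ohi]  # members entirely above obs
    over_r = [iv for iv in pred_ints if iv[1] < olo]   # members entirely below obs
    if len(over_et) == len(pred_ints):
        return "ii: fully falsified, ET fraction overestimated"
    if len(over_r) == len(pred_ints):
        return "iii: fully falsified, runoff fraction overestimated"
    if over_et and not over_r:
        return "v: partially falsified, ET side"
    if over_r and not over_et:
        return "vi: partially falsified, runoff side"
    if over_et and over_r:
        return "partially falsified on both sides (outside the six types)"
    plo = min(iv[0] for iv in pred_ints)
    phi = max(iv[1] for iv in pred_ints)
    if ohi - olo >= phi - plo:
        return "i: not falsified, large observational uncertainty"
    return "iv: not falsified, large modeling uncertainty"
```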
The nonfalsified category consists of two types: types i and iv. For type iv, the signature range of the predictions is larger than that of the observations. For type i, the signature range of the observations is larger than that of the predictions. Figure 4b shows that the type i nonfalsified situation occurs in western Texas, which is arid, and in the western United States, which is mountainous. As the signature range is linearly proportional to the water budget imbalance (section 3.2), the spatial patterns closely coincide with that of the water budget imbalance shown in Figure 1. The water budget imbalance (known unknowns in the observations) is too large to falsify any of the ensemble predictions. A top priority for the type i nonfalsified situation is to obtain physically consistent observations of precipitation, evapotranspiration, runoff, and lateral flows to close the water budget. Targeted observations should be pursued to address the measurement and processing errors.
The fully falsified situation could be characterized by either overestimating evapotranspiration (type ii) or runoff (type iii). Because all the predictions are inconsistent with the observations, the ensemble members share a common error. The ensemble prediction cannot outperform a single prediction in terms of matching the observations.
Figure 4b shows that all the ensemble predictions overestimate evapotranspiration and underestimate runoff (type ii) in the middle Ohio river basins and the Yellowstone river basin. The overestimation of evapotranspiration in the Ohio river basins may be attributable to model failure. Noah‐MP introduced an explicit canopy layer, enabling the modeling of canopy interception and evaporation; this is a major structural advance over the Noah LSM. The Ohio river basins are covered by deciduous broadleaf forest (Figure S1b), the canopy of which is highly capable of intercepting precipitation. Canopy interception and evaporation are strong over the middle Ohio river basins and significantly different across the Noah‐MP ensemble (Zheng et al., 2019). The canopy interception and evaporation may be too strong, leading to an overall overestimation of evapotranspiration and an underestimation of runoff. The Noah‐MP parameterizations and parameters for canopy interception and evaporation should therefore be scrutinized.
The overestimation of evapotranspiration in the Yellowstone river basin (Wyoming) may be linked to groundwater recharge driven by the geothermal plumbing. The Yellowstone river basin exhibits complex geothermal activity and has the largest collection of geysers on Earth. Groundwater seeps down into the geothermal plumbing, is heated by the Yellowstone supervolcano, and then flows into rivers and lakes through hot springs, geysers, and mud pots. While increasing runoff, the geothermal plumbing process drives groundwater recharge from soil moisture and lowers evapotranspiration. Noah‐MP does not represent these processes and therefore overestimates evapotranspiration and underestimates runoff.
Figure 4b shows that all the ensemble predictions underestimate evapotranspiration and overestimate runoff (type iii) in central Florida, eastern Texas, the Niobrara river basin (Nebraska), and the Salton Sea river basin (southern California). Interestingly, all these basins are in groundwater discharge areas. The majority of the water in the Niobrara river comes from groundwater seepage from the High Plains, or Ogallala, aquifer. Central Florida is occupied by the lower Floridan aquifer, the water of which comes from the upper Floridan aquifer in Alabama and Georgia and may also be intruded by seawater. The Texas basins are precisely downstream (east) of the Balcones fault zones. The Salton Sea in southern California is one of the world's largest inland lakes. The surface elevation of the Salton Sea is about 70 m below sea level, allowing the inflow of water from the surrounding mountains.
There are several possible reasons for the falsification in these basins. First, groundwater discharge provides additional soil moisture for evapotranspiration, which is captured by the evapotranspiration observations but not represented in Noah‐MP; all the Noah‐MP ensemble predictions therefore underestimate evapotranspiration. Second, all these basins coincide with the sand soil type (Figure S1), so Noah‐MP sets their permeability to that of sand. However, this permeability is probably too low for the permeable surfaces in the groundwater discharge zones (e.g., macropores), causing an overestimation of runoff. For instance, the fully falsified Texas basins are rich in vertisols, which can develop deep and wide cracks that act as preferential pathways for water flow (Kurtzman et al., 2016). Third, groundwater discharge zones often exhibit heavy human activity. Humans extract water from aquifers and rivers and add it to the soil moisture pool for agriculture. These activities increase evapotranspiration and decrease runoff, effects that Noah‐MP does not consider.
Despite the challenges of pinpointing the causes of the fully falsified situations, Figure 4 shows that the falsification‐oriented evaluation is powerful in identifying areas with modeling and observational challenges. These results can stimulate the development of new scientific hypotheses and observations. Until these missing processes (i.e., unknown unknowns) are correctly represented by the model and measured by the observations, ensemble predictions are of little help because all the ensemble members have been falsified. In addition, verification‐oriented calibration cannot be expected to be a robust remedy for these basins, because the responsible processes are as yet unknown.
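The typing of evaluation outcomes described above (nonfalsified, fully falsified in either direction, or partially falsified) amounts to a simple decision rule. The function below is an illustrative sketch, not the study's actual signature‐extraction procedure; the `tol` threshold, the error sign convention, and the function name are hypothetical.

```python
def classify_basin(imbalance_frac, et_errors, tol=0.1):
    """Classify a basin's falsification type (simplified sketch).

    imbalance_frac : observed water-budget imbalance as a fraction of
                     precipitation, |P - ET - R| / P.
    et_errors      : signed evapotranspiration errors (model minus
                     observation) for each ensemble member; by long-term
                     water balance, an ET overestimate implies a matching
                     runoff underestimate.
    tol            : illustrative threshold beyond which the observational
                     imbalance is too large to falsify any prediction.
    """
    if imbalance_frac > tol:
        return "i: imbalance too large to falsify any prediction"
    if all(e > 0 for e in et_errors):
        return "ii: all members overestimate ET / underestimate runoff"
    if all(e < 0 for e in et_errors):
        return "iii: all members underestimate ET / overestimate runoff"
    return "partially falsified (types v/vi)"
```

Under this sketch, a western mountainous basin with a 30% imbalance falls into type i regardless of the ensemble errors, while a basin whose members disagree in error sign is only partially falsified.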
Interestingly, these nonfalsified and fully falsified basins have a remarkable overlap with the locations of Critical Zone Observatories (CZOs) (Dybas, 2013) (Figure S4). The Southern Sierra (Holbrook et al., 2014; Visser et al., 2019), Reynolds Creek (Seyfried et al., 2018), and Catalina‐Jemez (McIntosh et al., 2017) CZOs are in basins where the water budget imbalance is too large to falsify any model predictions (type i). The Clear Creek, Susquehanna Shale Hills (Jin et al., 2011; Liu et al., 2020), and Christina river basin CZOs are in basins where all the model predictions are falsified by overestimating evapotranspiration and underestimating runoff (type ii). The evaluation results also highlight the need for strengthening observatories in Texas and Florida, where all the Noah‐MP predictions are falsified by overestimating the runoff and underestimating the evapotranspiration (type iii). The Texas Water Observatory infrastructure has already been installed over these areas. New theoretical advancements and observational data (Fan, 2015) from these observatories are expected to resolve the water budget imbalance and the inconsistencies between the model predictions and the observations.
The partially falsified situations (types v and vi) hint at the importance of model intercomparison and statistical postprocessing. The ensemble predictions are distinguishable in terms of how well they match the observations. By intercomparing the poorly performing and well‐performing predictions, known unknowns can be resolved, and the body of known knowns grows. The range of the signatures of the predictions can thus be reduced, which is important for future falsification‐oriented evaluations (Figure 2). Statistical methods are also beneficial in these situations: by selecting and weighting the ensemble members, an ensemble can produce better predictions for various applications.
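As a minimal sketch of such selecting and weighting (not the method used in this study), ensemble members could be combined with weights inversely proportional to their squared historical errors. The function name, weighting scheme, and inputs below are illustrative assumptions only.

```python
import numpy as np

def weighted_ensemble(predictions, errors, eps=1e-9):
    """Combine ensemble members by inverse-squared-error weighting.

    predictions : array of shape (n_members, n_points), one row of
                  predictions per ensemble member.
    errors      : array of shape (n_members,), each member's historical
                  error against observations (e.g., an RMSE).
    eps         : small constant guarding against division by zero.
    """
    predictions = np.asarray(predictions, dtype=float)
    # Members with smaller historical error receive larger weight.
    weights = 1.0 / (np.asarray(errors, dtype=float) ** 2 + eps)
    weights /= weights.sum()  # normalize so the weights sum to one
    return weights @ predictions
```

With equal historical errors this reduces to the plain ensemble mean; as one member's error grows, its contribution vanishes and the combination approaches the best member, which is the "selecting" limit of the weighting.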
Falsification‐ and verification‐oriented evaluations have the same target: guiding the development of models and the enhancement of observations. The difference is that falsification exposes “mistakes” in, or inconsistencies between, the model predictions and the observations. The “mistakes” should be carefully identified, reported, analyzed, and fixed. This study identified and reported such “mistakes” in the modeled and observed climatology over the CONUS with a newly developed framework.
The proposed framework assumes that the time period is long enough to neglect the change in terrestrial water storage. In this study, limited by the observations, the period spans 30 years (1982–2011). This period should be sufficiently long because the terrestrial water storage is characterized by generally stable annual cycles (Lorenz et al., 2014). A 30‐year period is long enough even for decadal drought events. During the Millennium Drought from 2001 to 2009 in southeast Australia (van Dijk et al., 2013), terrestrial water storage decreased by roughly 60 mm, while the annual average precipitation during the drought was approximately 450 mm. The ratio of the change in terrestrial water storage to the product of the annual average precipitation and 30 years is ≤0.5%, which can be neglected. Using satellite observations from the Gravity Recovery and Climate Experiment (GRACE) and the Noah‐MP model predictions over the CONUS, Figures S2 and S3 also show that the terrestrial water storage term is negligible compared with the precipitation term in Equation 2. We therefore argue that a time period of 30 years is generally sufficient to accumulate enough precipitation to neglect the change in terrestrial water storage in Equation 2.
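The negligibility argument reduces to simple arithmetic with the Millennium Drought figures quoted above:

```python
# Ratio of the change in terrestrial water storage to the precipitation
# accumulated over a 30-year evaluation period, using the Millennium
# Drought figures quoted in the text.
delta_storage_mm = 60.0    # storage decline during the drought
annual_precip_mm = 450.0   # annual average precipitation during the drought
period_years = 30

ratio = delta_storage_mm / (annual_precip_mm * period_years)
print(f"storage change / accumulated precipitation = {ratio:.2%}")
# -> storage change / accumulated precipitation = 0.44%
```

Even this unusually large, decade-scale storage change amounts to well under 0.5% of the 30‐year precipitation total, supporting its neglect in Equation 2.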
However, because the change in terrestrial water storage is neglected, this framework is limited to evaluating the climatology (i.e., the long‐term‐averaged partitioning of precipitation between evapotranspiration and runoff). More advanced signatures should be developed to evaluate hydrometeorological variations (e.g., droughts and floods). These signatures should consider both the terrestrial water storage itself and its change to fully characterize the trajectory of the terrestrial water system in phase space (i.e., state and change). We also expect such evaluations to be problem‐oriented (e.g., the frequency, magnitude, and onset of extreme events) and to differ across time scales (e.g., daily, monthly, or yearly). This warrants further investigation.
Human activities do not affect the deduction of the framework described in section 3. However, human activities can undermine the validity of the three scientific hypotheses and thus require careful attention when interpreting a falsified result. First, the third scientific hypothesis of the framework, which states that 30 years is sufficiently long to neglect the change in terrestrial water storage, is very likely to hold. From Figure S2b, the human‐induced change in terrestrial water storage is unlikely to be comparable with that caused by drought, so the change in terrestrial water storage can be safely neglected in comparison with the long‐term accumulated precipitation. Falsified results are therefore unlikely to be attributable to the failure of this assumption. Second, groundwater withdrawal can drive horizontal replenishment from the surrounding areas, violating the second scientific hypothesis of the framework. The lateral flow of groundwater can increase or decrease the water budget imbalance, depending on the existing error‐induced imbalance in the observations, and may play a part in the falsified results. Third, the human use of water can redistribute water locally among the components of the water budget. Agricultural water use, for example, may reduce runoff and increase evapotranspiration as compensation. If the redistribution is sufficiently significant, then model predictions that do not consider the human use of water should be inconsistent with the observations. A failure to model the human use of water should therefore be considered a candidate explanation for a falsified result.
The consideration of known unknowns can be improved further. This study only considered the water budget imbalance resulting from all the observable terms as a whole (Beven & Westerberg, 2011). In fact, other known unknowns exist in each of the observational terms. Sun et al. (2018) provided a comprehensive overview of 30 precipitation products and concluded that their reliability is mainly limited by the number and spatial coverage of surface stations, the satellite algorithms, and the data assimilation models. Wang and Dickinson (2012) provided a comprehensive review of evapotranspiration observations and showed that the unknowns in the underlying theory (i.e., the Monin‐Obukhov similarity theory), the spatial coverage of surface stations, satellite algorithms, and data postprocessing techniques (e.g., gap filling) have significant impacts. Di Baldassarre and Montanari (2009) argued that the unknowns in large‐scale runoff observations are also far from negligible. The terrestrial water cycle is also influenced by human activities, which may not be reflected in the evapotranspiration and runoff observations. However, compared with the spread among multiple observations of precipitation (Xia et al., 2016) and evapotranspiration (Long et al., 2014) over the CONUS, the water budget imbalance among them (Figure 1) is much more significant over most areas. Neglecting the known unknowns within each observation should therefore be generally valid, except in areas with a negligible water budget imbalance. The method for signature extraction (section 3.2) is extendable to include these known unknowns.
There are always known knowns, known unknowns, and unknown unknowns in the modeling and observation of land surface processes. Large‐scale land surface modeling provides a baseline of “modeling everywhere as a learning process” (Beven, 2007; Beven & Alcock, 2012; Beven & Cloke, 2012; Wood et al., 2011) and is vital to advance Earth system modeling (Archfield et al., 2015; Bierkens, 2015; Clark et al., 2015). This study proposes the falsification‐oriented signature‐based evaluation. This method recognizes the asymmetry between verification and falsification and “learns from mistakes” through a “try‐fail‐refine” strategy. With this approach, not only models but also observations and evaluation methods can be evaluated simultaneously.
Verification is valuable, but verification‐oriented evaluation results should be interpreted with caution against overconfidence. Verification‐oriented evaluation often implicitly assumes that if the model predictions are consistent with every known observation in every known aspect, then everything is correct. This assumption is conditional on the negligibility of unknown unknowns: only if unknown unknowns are negligible can the likelihood of a model being true be estimated. In the practice of verification‐oriented evaluation, unknown unknowns are often lumped into known unknowns (e.g., a lumped a priori error distribution). Overconfidence in the evaluation results may hinder scientific exploration of unknown unknowns through new observations and theories.
The test of the consistency between model predictions and observations is asymmetrical between verification‐ and falsification‐oriented approaches. One single inconsistency always indicates that there must be something wrong. Inconsistency is a solid indicator of errors in any characteristic (i.e., known unknowns and unknown unknowns) and from any source (e.g., observations, parameterizations, parameters, evaluation methods, and coding bugs). The solidity of the inconsistency as an indicator of “mistakes” is important for making concrete progress in both model developments and observation enhancements. Falsification‐oriented evaluation values and learns from these inconsistencies.
The proposed framework is shown to be powerful in revealing the areas with modeling and observational challenges. In the mountainous and arid areas of the western CONUS, the water budget imbalance among observations is too large to falsify a model prediction. The imbalance results from the unobserved lateral flow and the observational errors in precipitation, evapotranspiration, and runoff. In these areas, enhancements of the observations are more urgently needed than model improvements. The imbalance in the eastern United States is much smaller than that in the West. The underestimation of evapotranspiration and overestimation of runoff occur simultaneously in the HUC8 basins of central Florida on the lower Floridan aquifer, the eastern Texas basins just downstream of the Balcones fault zones, the Niobrara river basin at the north tip of the Ogallala aquifer, and the Salton Sea river basin. The middle Ohio river basins and the Yellowstone river basin are falsified by overestimating evapotranspiration and underestimating runoff. The falsified results may be attributable to the failure of Noah‐MP in representing region‐specific processes and human activities. A top priority for these fully falsified situations is to develop new scientific hypotheses and observations to close the gap between the model predictions and the observations.
Interestingly, a substantial portion of the CZOs is located in the regions with nonfalsified or fully falsified evaluation results. These established CZOs include Southern Sierra, Reynolds Creek, Catalina‐Jemez, Clear Creek, Susquehanna Shale Hills, and Christina River Basin. New theoretical and observational advancements from these observatories are expected to resolve the water budget imbalance and the inconsistencies between model predictions and observations. Our results also highlight the need for strengthening the Texas Water Observatory and for extending observatories in Florida.
We thank the reviewers and editors for their detailed and insightful comments. We thank Dr. Peili Wu at the Met Office Hadley Centre for his generous help with structuring and writing this paper. This work is jointly supported by the National Key Research and Development Program of China grants 2018YFA0606004 and 2016YFA0600403, the National Natural Science Foundation of China grants 41605062 and 41375088, and the Science and Technology Project of State Grid Corporation of China.
The NLDAS static data and meteorological forcing data were obtained from the Goddard Earth Sciences Data and Information Services Center (
© 2020. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
We develop a novel framework for rigorously evaluating land surface models (LSMs) against observations by recognizing the asymmetry between verification‐ and falsification‐oriented approaches. The former approach cannot completely verify LSMs even though it exhausts every case of consistency between the model predictions and observations, whereas the latter only requires a single case of inconsistency to reveal that there must be something wrong. We argue that it is such an inconsistency that stimulates further development of the models and enhancement of the observations. We therefore propose a falsification‐oriented signature‐based evaluation framework to identify cases of inconsistency between model predictions and observations by extracting signatures based on a set of key assumptions. We apply this framework to evaluate an ensemble of simulations from the Noah‐MP LSM against observations over the continental United States under the three assumptions of water mass conservation, no lateral water flow, and a sufficiently long period of time. Regions showing inconsistencies between the Noah‐MP ensemble simulations and the observations are located in the western mountainous areas, the Yellowstone river basin, the lower Floridan aquifer, the Niobrara river basin at the north tip of the Ogallala aquifer, and the basins downstream of the Balcones fault zones in Texas. These regions coincide with the sites where both advances in theoretical modeling and new observational data (e.g., from the Critical Zone Observatories) have emerged.
Author Affiliations
1 Key Laboratory of Regional Climate‐Environment Research for Temperate East Asia, Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing, China
2 Department of Geological Sciences, The John A. and Katherine G. Jackson School of Geosciences, University of Texas at Austin, Austin, TX, USA
3 Department of Geological Sciences, The John A. and Katherine G. Jackson School of Geosciences, University of Texas at Austin, Austin, TX, USA; Now at Department of Civil and Environmental Engineering, Princeton University, Princeton, NJ, USA
4 Collaborative Innovation Center on Forecast and Evaluation of Meteorological Disasters/Key Laboratory of Meteorological Disaster, Ministry of Education/International Joint Research Laboratory on Climate and Environment Change, Nanjing University of Information Science and Technology, Nanjing, China
5 School of Geographical Sciences, Southwest University, Chongqing, China
6 State Key Laboratory of Operation and Control of Renewable Energy and Storage Systems, China Electric Power Research Institute, Beijing, China