1 Introduction
Explicitly modeling the terrestrial water cycle at the global scale and at high resolution has recently been referred to as a “grand challenge in hydrology” (Bierkens et al., 2015; Wood et al., 2011), an undertaking that has excited the hydrologic community and encouraged the development of large-scale modeling efforts, workshops, and working groups. These “everywhere and locally relevant” hydrologic models (Bierkens et al., 2015) differ from land surface models (LSMs) and general circulation models (GCMs) by providing spatially ubiquitous and hyper-resolution, physically based hydrologic simulations. While LSMs and GCMs may provide water balance estimates at regional, continental, or global scales, their hydrologic schemes can be coarse-resolution, simplified, or highly parameterized (Wood et al., 2011). A process-based and mechanistic (rather than empirical) representation of both the large-scale and local water cycle is necessary to address hydrologic problems surrounding society, agriculture, resource management, biodiversity, and climate (Clark et al., 2015).
Therefore, high-resolution, large-scale, and physically based hydrologic modeling offers profound and multifaceted benefits. From a societal perspective, these models enable operational forecasting and planning in regions where water balance estimates are unavailable or poorly constrained by scarce or nonexistent observations, such as developing countries (Group on Earth Observations, 2009). As Beven and Cloke (2012) point out, hyper-resolution hydrologic model outputs (as opposed to coarse-resolution global hydrologic model – GHM – results) can be more accessible and logical to local water managers by providing locally relevant and detailed information. High-resolution hydrologic modeling could also be used to inform, initialize, or downscale LSMs and GCMs. Clark et al. (2015) identify spatial heterogeneity, organization, and integration of soil moisture and groundwater to be a major missing link in LSMs, meteorological models, and climate models. Further, large-scale hydrologic models could be used to better understand or constrain results from remote sensing. For instance, LSMs may be used in forward modeling approaches to estimate signal attenuation in remote sensing of total water storage change (Landerer and Swenson, 2012).
These motivating factors have catalyzed the development of several hyper-resolution, continental- or global-scale modeling efforts over the last decade. Some fine examples include physically based platforms, such as the Terrestrial Systems Modeling Platform (TerrSysMP), a fully integrated soil–vegetation–atmosphere model employed over the European CORDEX domain (Keune et al., 2016), and integrated groundwater–surface water modeling over the continental United States with ParFlow v3 (Maxwell et al., 2015). Others have used a global water balance approach, like WaterGAP (Döll et al., 2003) and PCR-GLOBWB (Sutanudjaja et al., 2018), which was recently coupled to MODFLOW at 1 resolution globally (de Graaf et al., 2017). High-resolution land surface modeling has begun to include topographically informed routing of surface or subsurface water storage: for example, the Land Information System software group (Zaitchik et al., 2010) or Noah-MP (Niu et al., 2011) and operational flood forecasting from the National Water Model (NWM) v2.0 (NOAA Office of Water Prediction: the NWM,
While global and continental hydrologic representation continues to improve, the extreme-scale hydrologic modeling community still faces many challenges, and models can struggle to close the water balance with certainty. Given the lack of spatially and temporally continuous hydrologic measurements across the globe, as well as their associated computational demand, parameter calibration at these scales is often problematic or infeasible (Maxwell et al., 2015). Distributed macroscale hydrology models must often rely on a priori information and datasets informed by field measurements or hydrologic theory (which may be unavailable, especially in underdeveloped regions); or, less commonly, they can employ regionalization approaches to transfer calibrated parameters from gauged to ungauged catchments (Beck et al., 2016). Validation can also be problematic in that large gaps exist in space or time for in situ measurements, and remote sensing products often depend on hydrologic algorithms and parameterization (Archfield et al., 2015).
Studies assessing model performance suggest that while continental and global hydrologic modeling is promising, there is considerable room for improvement when it comes to model skill, and most of these performance assessments only evaluate one or two output variables at one time. For instance, Sutanudjaja et al. (2018) evaluated streamflow and total water storage performance of a 5 resolution simulation of PCR-GLOBWB relative to the Global Runoff Data Center (GRDC) and remote sensing from the Gravity Recovery and Climate Experiment (GRACE). Although PCR-GLOBWB was able to acceptably reproduce terrestrial water storage (TWS) anomalies for major global river basins, only 40 % of discharge locations exhibited a Kling–Gupta efficiency coefficient (KGE, a measure of performance; Bai et al., 2016; Moriasi et al., 2007) of 0.3, suggesting that the large majority of GRDC stations show unsatisfactory performance. Recent streamflow results from WaterGAP2.2d are encouraging (Schmied et al., 2020), with a median KGE of 0.79 and a near-optimum bias measure; however, the model underestimated TWS amplitude and trend in the majority of basins. Salas et al. (2018) evaluated the National Flood Interoperability Experiment (NFIE-Hydro), which leverages the WRF-Hydro framework and the Noah-MP LSM. They identify several regions for model improvement, including a positive bias of flow in the southern US and Central Plains and a negative bias in the Rocky Mountains, suggesting several potential sources for bias depending on the area, including snowpack formulation, precipitation bias, soil column draining dynamics, or failure of lateral redistribution to attenuate flow. These results reiterate that acceptable performance of one model output does not necessarily translate to appropriate simulations of the full water balance, and evaluating multiple output parameters simultaneously (such as snow water equivalent, soil moisture, evapotranspiration, and many others) could help confidently attribute sources of bias.
We argue that validation and performance assessment should continue to be highly prioritized for uncalibrated, high-resolution, and large-scale hydrologic models, and validation studies that evaluate several output variables are paramount to guiding and improving model development. It has been well established that calibration methods utilizing multiple types of observational datasets result in overall better model skill (e.g., Finger et al., 2015); additionally, understanding the relationships between multiple output variables (e.g., evaporative fraction and soil saturation; Rakovec et al., 2019) is imperative to diagnosing performance deficiencies. Multivariate model validation can help attribute sources of bias and increase certainty in water balance components; this is especially true for the physically based hydrologic community and at continental scales and above.
In this study, we present a rigorous, multivariable evaluation of a hyper-resolution continental-scale hydrologic simulation by comparing model results to state-of-the-art monitoring networks and remote sensing products. We focus on performance of the CONUS version 1.0 model, a ParFlow–CLM integrated groundwater–surface water simulation configured across the continental United States (hereby referred to as PFCONUSv1) (Maxwell et al., 2015). Since its construction, the PFCONUSv1 model has been updated to a ParFlow–CLM simulation, in which ParFlow is coupled to the Common Land Model to capture surface energy partitioning and land surface fluxes (Maxwell and Miller, 2005). Recent publications have used the PFCONUSv1 model to (1) diagnose mechanistic relationships between water table depth, topography, recharge, and evapotranspiration at a range of scales (Condon et al., 2015; Condon and Maxwell, 2015, 2017); (2) characterize groundwater controls on evapotranspiration partitioning (Maxwell and Condon, 2016); (3) explore anthropogenic impacts on the water and energy balances, such as impacts on evapotranspiration, streamflow, and groundwater from aquifer depletion (Condon et al., 2020; Condon and Maxwell, 2019); and (4) estimate water residence times and their sensitivity to climate and geology (Maxwell et al., 2016).
To our knowledge, this is the most rigorous evaluation of an integrated, physically based hydrology–land surface model at this resolution and scale. We present comparisons of model results and observations or remote sensing products over 4 simulation years (water years 2003 through 2006) for several water balance components, including streamflow, water table depth, soil moisture, snow water equivalent, evapotranspiration, and total water storage, as well as atmospheric forcing (precipitation and temperature). We discuss sources of error in the model and prioritize areas for improvement, with careful attention to error propagation from atmospheric forcing datasets and terrain processing algorithms. These results provide a benchmark for forthcoming PFCONUS iterations and should be used to guide future model development. Most importantly, this study implicates the improvement of atmospheric forcing datasets and topographic processing algorithms to advance the field of continental-scale hydrology, and it highlights the importance of evaluating the continental-scale water balance as a whole for a process-based understanding of model performance and bias.
2 Methods
The PFCONUSv1 model was simulated using the coupled hydrology–land surface platform ParFlow–CLM. In this section, we describe the governing equations for ParFlow–CLM-formulated water balance, PFCONUSv1 configuration and inputs, datasets for model validation, and performance metrics.
2.1 Modeling the integrated water and energy balance with ParFlow–CLM
The full water balance for a given hydrologic unit can be generally expressed as , where and represent the hydrologic inflows and outflows to some control volume, and is the change in water storage within the control volume. More specifically, the full water budget for a watershed under natural (non-anthropogenic) conditions can be written as
1
Figure 1
A conceptual model of the (a) complete and (b) simplified water budget for a hydrologic control volume, corresponding to Eqs. () and ().
[Figure omitted. See PDF]
In Eq. (), inflows to the watershed are precipitation in the form of rain or snow (), surface runoff entering the basin from upstream areas (), or subsurface influx (). Water may leave the watershed in the form of surface runoff (), evapotranspiration from transpiration () or evaporation from bare surfaces (), or as groundwater flux to downstream basins () or deeper reservoirs (). These fluxes have a net impact on yield increases or decreases in sources of basin water storage, such as soil and groundwater reservoirs ( and ), surface water ponding (), or storage as snow water equivalent (). Components in Eq. () are typically expressed as units of equivalent water height or volume per unit of time. This description of the water budget equation (Eq. ) is illustrated in Fig. 1a, and it may be amended to incorporate other components particular to a watershed; these could include anthropogenic fluxes and storage like irrigation, dam storage or pumping, or they could be unique traits of the basin such as fractured flow, lacustrine groundwater discharge, or seawater intrusion. Equation () may also be simplified by lumping precipitation, evapotranspiration, and storage components and also by ignoring surface and subsurface inputs external to watershed divides, which, for large enough control volumes, will be negligible (Fig. 1b). The water balance may then be simply expressed as 2 for precipitation , evapotranspiration ET, surface runoff , and total change in all storage sources .
In this study, the complete water balance (Eq. , Fig. 1a) is modeled using ParFlow–CLM (Kollet and Maxwell, 2006; Maxwell and Miller, 2005), an integrated groundwater–surface water model which uses the mixed form of the Richards equation to simulate three-dimensional variably saturated flow. The Richards equation is given as 3 for specific storage (), relative saturation (–), pressure head (L), saturated hydraulic conductivity tensor (), relative permeability (–), and porosity of the medium (–) at depth (L) and time (T). In Eq. (), relative permeability varies with pressure head through time based on relationships established by van Genuchten (1980), and is a source–sink term (). A free-surface overland-flow boundary condition for continuity of pressure and flux applies to the groundwater flux term across the land surface and subsurface interface. The kinematic wave approximation of the momentum equation is used to solve overland flow, which is a function of ponded depth given by Manning's equation: 4 where is the Manning roughness coefficient () and is the ponding depth (surface pressure head) (L). Note that the friction slope (–) in Eq. () is used to approximate the bed slope (–) in the kinematic wave approximation.
ParFlow is coupled with the Common Land Model (CLM) (Dai et al., 2003), a land surface model which balances energy and calculates evapotranspiration at the land surface, in order to simulate the coupled water and energy budgets. CLM requires atmospheric conditions (precipitation, temperature, specific humidity, wind speed, and longwave and shortwave radiation) in order to provide hourly partitioning of net radiation into sensible, latent, and ground heat. Shown in Eq. (), CLM calculates direct evaporation from the ground using the gradient between specific humidity at the ground surface (MM) and at a reference height (MM), along with air density (ML), atmospheric resistance (TL), and a soil resistance term (–). 5
To calculate actual transpiration, CLM adjusts potential transpiration by stomatal and aerodynamic resistance terms as follows.
Potential transpiration (Eq. ) is a function of leaf and stem area index and (–), boundary layer resistance (TL), air density (ML), and the gradient of specific humidity between foliage and canopy (MM). Transpiration (Eq. ) only occurs from the dry fraction of the canopy () and further depends on stomatal resistance (TL). In Eq. (), the sunlit (sun) and shaded (sha) fractions of the dry canopy are separately defined with their own LAI and stomatal resistance values. Note that the leaf and stem area index and stomatal resistance terms are parameterized by plant functional types and defined per cell, without fractional vegetation, and for a single canopy layer. For a further explanation of ET calculations in ParFlow–CLM, see Jefferson et al. (2017).
2.2 PFCONUSv1 configuration, parameters, and inputsThe PFCONUSv1 model represents the first integrated groundwater–surface water model employed at the continental scale at hyper-resolution (1 ). A full description of the model configuration and inputs can be found in Maxwell et al. (2015) and Maxwell and Condon (2016), but a brief summary is given below.
Spanning roughly 6.3 at 1 lateral grid spacing, the PFCONUSv1 model encompasses the majority of eight major river basins in the United States at high resolution, including the Ohio, Missouri, Arkansas, Mississippi, and Colorado River basins. The model is composed of 3442 cells in the (east–west) direction and 1888 cells in the direction (north–south). Its five vertical layers of variable thickness provide a cumulative vertical depth of 102 . From the top, soil layers are 0.1, 0.3, 0.6, and 1 , respectively. Topographic slopes were calculated using the Barnes et al. (2016) algorithm, applied to the HydroSHEDS (Hydrological data and maps SHuttle Elevation Derivatives at multiple Scales; Lehner et al., 2008) digital elevation model, to guarantee a connected drainage network. Vegetation classes for characterization of plant functional parameters were provided by the IGBP land cover classifications and the USGS land cover dataset. Distributed, heterogeneous soil parameters, including saturated hydraulic conductivity, porosity, and van Genuchten parameters, were assigned to spatial soil units described by the Soil Survey Geographic database (SSURGO). Geologic units for the bottom 100 thick layer of the PFCONUSv1 model were developed from the Gleeson et al. (2011) national permeability map. Estimates from Gleeson et al. (2011) were adjusted using the -folding relationship described in Fan et al. (2013), which accounts for topographic complexity, and variance in permeability was also reduced. No-flow boundary conditions were imposed at the bottom of the model domain (assuming impermeable bedrock) and on the sides. Note that with a model depth of just over 100 , the model may more appropriately be considered a shallow aquifer storage model. Deeper contributions are not resolved, which may not represent deeper hydrologic flow paths of thick and expansive aquifers such as the Ogallala, the saturated thickness of which can exceed 300 (McGuire et al., 2012); however, as Maxwell et al. (2015) explain, the current model thickness and vertical discretization are limited not by computational expense but by data availability, with a lack of detailed depth-to-bedrock and aquifer thickness estimates at meaningful resolution.
Initial conditions were provided by an intensive spin-up process. First, a steady-state ParFlow groundwater configuration was run continuously without CLM; this model was forced by an average surface recharge flux derived from Maurer et al. (2002) and run continuously until the difference between outflow and recharge rates was less than 3 % of total water storage change. A full description of this steady-state model and its performance can be found in Maxwell et al. (2015). Second, and using the initial condition provided by the steady-state model, a transient system was simulated with the fully coupled ParFlow–CLM for water year 1985, the most climatologically average water year within the past 30 years. As described in Maxwell and Condon (2016), atmospheric forcing was bilinearly interpolated from the North American Land Data Assimilation System Phase 2 (NLDAS-2) (Cosgrove, 2003; Xia et al., 2012a, b) For spin-up purposes, the transient simulation was run continuously for 4 years of repeated 1985 atmospheric forcing to provide an initial condition for the simulation in this study. Thus, the initial condition provided here represents pressure head, soil moisture, and surface energy balance conditions that would be present during the most climatologically average water year in recent history. Since the model does not incorporate anthropogenic abstractions in the form of pumping, injections, irrigation, or surface water diversions and dam storage, the initial conditions provided also represent a pre-development scenario.
For this study, PFCONUSv1 was run for modern-day water years using initial conditions provided by the transient spin-up process described above. The simulation here was run at hourly temporal resolution for water years 2003 through 2006. Atmospheric forcing originated from the 12 NLDAS-2 product (Xia et al., 2012a, b); however, finer-resolution products were blended in where available and elevation effects were incorporated to produce higher-resolution, more physically realistic meteorological variables. Such products included the 4 Stage IV and Stage II radar and gauge products and Level 2 shortwave radiation from the GOES Surface and Insolation Products (GSIP). These adjustments to the 12 NLDAS data and the finer-resolution products are described, for example, in Pan et al. (2016) and include the following: gap-filling and daily rescaling procedure to ensure the Stage IV hourly data match daily totals from NLDAS-2; adjustments to timing for the GSIP Level 2 data based on solar angles; and elevation-dependent downscaling of 12 NLDAS-2 products, such as hydrostatic effects for atmospheric pressure and lapse rates for specific humidity, air temperature, and longwave radiation. The final atmospheric variables were interpolated using bilinear interpolation to the 1 PFCONUSv1 grid.
Table 1
Products used to evaluate PFCONUSv1-simulated water balance component performance.
Water balance component | Data product | Spatial scale | Product type |
---|---|---|---|
Comparisons to modeled water budget components | |||
Surface runoff, | USGS stream gauges | Aggregate of upstream area, 2392 locations | Point observation |
Evapotranspiration, ET | MODIS | 1 resolution, global scale | Remote sensing |
SSEBop | 1 resolution, global scale | Remote sensing | |
FLUXNET | Local, 30 locations | Point observation | |
Storage, | SNOTEL | Local, 556 locations | Point observation |
GRACE (5 products) | 0.25 to 1 resolution, 3 basis function, global scale | Remote sensing | |
ESACCI active/passive | 0.25 resolution, global scale | Remote sensing | |
USGS wells | Local 41 269 locations static, 2486 locations temporal | Point observation | |
Comparisons to atmospheric forcing | |||
Precipitation, temperature | GHCND | Local, 9139 locations | Point observation |
Precipitation, temperature | SNOTEL | Local, 556 locations | Point observation |
Temperature, vapor pressuredeficit, wind speed | FLUXNET | Local, 30 locations | Point observation |
An important consideration when attempting high-resolution integrated models of this kind is of course the computational demand. ParFlow–CLM solves the globally implicit solutions to nonlinear and coupled equations in Eq. () through Eq. () with a Newton–Krylov parallel solver (Jones and Woodward, 2001); the associated significant computational challenge is tackled with a multigrid preconditioner and highly scaled parallel efficiency (Kollet et al., 2010; Reed M. Maxwell, 2013; Osei-Kuffuor et al., 2014). The simulations presented here were run on 3456 processors distributed to 72 and 48 units in the and directions, respectively, on the Cheyenne high-performance computing system managed by the National Center for Atmospheric Research (NCAR) Computational and Information Systems Lab. Required core hours for a single water year averaged over 300 000 core hours for this processor topology; however, the scaled parallel efficiency even at this decomposition is over 60 %. The hourly outputs generated over 11 of information per water year, while the required storage for the interpolated atmospheric forcing alone was over 3 per water year.
2.3 Datasets for comparisonSimulated runoff, evapotranspiration, and sources of storage change from the PFCONUSv1 model were compared against available point-scale measurements and coarse-resolution remote sensing products in order to identify locations of relatively better or worse performance, major sources of model bias, and regions most in need of improvement. Table 1 provides a summary of all data products compared to PFCONUSv1 outputs. It is important to note here that while we use absolute error metrics common to calibrated models developed specifically for prediction, calibration of the PFCONUSv1 model is not a goal of this study, nor is it feasible given the computational demands posed by such a highly parallelized platform. Rather, the intent is to evaluate the model's ability to demonstrate realistic behavior, to identify regions, times, and sources of uncertainty, and to prioritize areas of improvement for future model development.
2.3.1
Surface water runoff,
Modeled surface water runoff ( in Eq. ) was compared to daily observations at 2392 US Geological Survey (USGS) stream gauges containing observations over the simulation period (1 October 2002 through 30 September 2006) within the PFCONUSv1 domain (Table 1) (obtained from
Evapotranspiration, ET
For evapotranspiration (ET in Eq. ), three datasets are used to evaluate PFCONUSv1 results (Table 1). Observations from FLUXNET, an international network of meteorological towers that rely on the eddy covariance method to estimate evapotranspiration, were used to evaluate the temporal performance in ET. FLUXNET data were obtained from the FLUXNET 2015 online data portal (
Storage,
To evaluate PFCONUSv1 storage change ( in Eq. ), four products are used to compare to individual storage components, including total water storage, snow water storage, and soil water storage. Modeled snow water equivalent was compared to Snow Telemetry (SNOTEL) station data, a network maintained by the Natural Resources Conservation Service (NRCS). SNOTEL data were accessed from the NRCS online report generator 2.0 (
PFCONUSv1 total water storage anomalies (an aggregate of all subsurface, snow water, and surface water storage components) were also compared to terrestrial water storage anomalies provided from remote sensing products from the Gravity Recovery and Climate Experiment (GRACE). The GRACE products are derived from slight fluctuations in Earth's gravity caused by changes in mass and measured by twin satellites launched in 2002; these gravity field changes over land may be attributable to terrestrial water storage change. GRACE solutions are provided by three processing centers: the NASA Jet Propulsion Laboratory (JPL), the GeoforschungsZentrum Potsdam (GFZ), and the Center for Space Research at University of Texas, Austin (CSR). In this study, PFCONUSv1 total water storage changes were compared to the Release-06 gravity field solutions (RL06) at 1, calculated using the spherical harmonic approach (Landerer and Swenson, 2012) with varying degrees and orders, spherical harmonic coefficients, and filtering processes. We also compare PFCONUSv1 to the mass concentration block (mascon) solutions provided by JPL at 0.5 and CSR at 0.25 (Save et al., 2016; Wiese et al., 2016), which eliminate much of the need for empirical post-processing and filtering required in the spherical harmonic solutions. The GRACE products listed above are hereafter referred to as JPL, GFZ, and CSR for the RL06 spherical harmonic solutions and JPLm and CSRm for the mascon solutions. For both the ESACCI soil moisture product and the GRACE total water storage anomalies, PFCONUSv1 estimates are aggregated to the coarse-resolution product by area-weighted mean prior to comparisons.
Finally, PFCONUSv1-calculated depths to water table are compared with water levels from 41 269 USGS groundwater wells across the continental United States; like streamflow, these data are freely available for download from the USGS National Water Information System (
One important source of bias is that of atmospheric forcing; to evaluate the impact of meteorological performance on simulated water balance variables, we compare the interpolated NLDAS product to observed daily precipitation () and observed averaged daily temperature () at meteorological stations maintained by the Global Historical Climatology Network (GHCND) (Table 3.1) (data accessed via Climate Data Online portal,
2.4 Performance metrics
Performance metrics for evaluating PFCONUSv1 include percent annual bias or total annual bias, Spearman rank correlation coefficient, and the ratio of root mean squared error to the standard deviation of observations (RSR). While these were not calculated for all validation datasets and the temporal resolution at which they were evaluated differed between datasets (e.g., daily, weekly, or monthly), they are each used at some point in our analysis, so we define them here.
As a measure of average magnitude accuracy with an optimal value of 0, percent bias is given by
8 where and are simulated and observed values. Percent bias in PFCONUSv1 outputs was calculated using daily observations in Eq. () such that days during which observations were unavailable were excluded for both simulated and observed annual totals. Percent bias is an effective metric for evaluating long-term mean values, but it cannot be used to evaluate timing or shorter temporal events; further, if the model underpredicts and overpredicts with similar magnitudes, PBIAS can be deceivingly low.
For these reasons, we also calculate for each stream gauge Spearman's rank correlation coefficient, or Spearman's , given by Eq. (). 9
Unlike the coefficient of determination , which describes the degree of collinearity between the data, Spearman's independently ranks the simulated and observed values, with in Eq. () being the difference in ranks for a given value , and is the number of values in the series. Unlike other metrics describing temporal correlation, such as or Nash–Sutcliffe efficiency, is less restrictive; it does not assume linearity and instead tests for monotonic correlation. The optimal value for is 1, and the cutoff for good performance is likely analogous to that of , which varies in the literature but is generally around 0.6.
A final performance metric, the RMSE–observation standard deviation ratio (RSR), is also provided. RSR is given by Eq. () and describes root mean squared error (RMSE) relative to the standard deviation of the observations. 10
In Eq. (), is the mean of observations. While RSR is less widely used than PBIAS and , its benefit lies in its normalization of the common error index statistic RMSE; the ratio describes error relative to natural variability in the true system such that an RSR of 1 suggests that the mean daily error is equal to 1 standard deviation of observed values and thus comparable to what we may expect from noise or intra-annual variability. An RSR value of 0 is optimal, while values under 0.5 (RMSE is less than half of the standard deviation of observations) are considered to be excellent (Moriasi et al., 2007).
Figure 2
Mean annual water balance components from PFCONUSv1 at 1 resolution: (a) interpolated precipitation from atmospheric forcing inputs, (b) simulated mean annual evapotranspiration (ET), (c) simulated mean annual runoff , and (d) simulated mean annual total water storage amplitude (combined seasonality of snow water equivalent, groundwater, soil water, and surface water). Total water storage amplitude is the peak-to-peak seasonal storage anomaly rather than annual storage trend; seasonality (rather than inter-annual variability) explained the majority of the variance in total . Dotted lines are states, while thicker solid lines are major US river basin outlines, which are labeled in (a).
[Figure omitted. See PDF]
Together, performance metrics (Eq. through Eq. ) are quantitative indicators of model realism, representing a model's ability to capture long-term states (PBIAS) and timing (, as well as its error relative to expected system variability (RSR). However, many other statistical criteria are popular (Waseem et al., 2017), and the target values used to indicate unacceptable, acceptable, or excellent performance can vary because criteria for evaluation necessarily depend upon model purpose (i.e., a regional surface water model that has been well calibrated for operational forecasting will represent spatiotemporal patterns of streamflow with higher accuracy than a continental-scale land surface model can plausibly achieve). Further, performance is expected to decrease with increasingly higher temporal resolution: for instance, criteria may be more lenient across all error metrics when moving from monthly to daily timescales at the watershed scale (Moriasi et al., 2015) as well as from seasonal to monthly timescales at the global scale (Krysanova et al., 2020). As a physically based, high-resolution (spatially and temporally), and uncalibrated continental-scale model, a primary purpose of the PFCONUS, and others like it (Gleeson et al., 2021), is to understand process interactions between groundwater, surface water, and ecohydrological fluxes. In this study, a PFCONUS-simulated water balance component in Eq. () is generally judged to be excellent for this purpose with the following measures: RSR 0.6, 0.7, or 20 %. Locations that indicate unacceptable or poor performance are those with RSR 1.2, 75 %, and 0.5. However, error metrics are reported with the primary goal of intercomparison across locations (interpretation of metrics should be paired with visual inspection of spatial patterns and time series provided) or, where discussed, relative to the performance of other continental-scale hydrologic or land surface models. Gleeson et al. (2021) caution against the use of model evaluation to indicate a “finished” product and instead recommend open-ended evaluation and model improvement. Metrics (Eq. through Eq. ) are therefore used to identify where future development of PFCONUS can be focused to improve upon timing, volume, and variability of fluxes. Performance metrics reported in this study are also supplemented by plots of probability of exceedance or non-exceedance where appropriate (see Figs. S1 through S8 in the Supplement), which should help regional-scale modelers identify relative performance of major basins at various thresholds. Since there are many other commonly used performance metrics particular to streamflow, we also report Nash–Sutcliffe efficiency and Kling–Gupta efficiency for simulated flows at USGS gauges (Fig. S9 in the Supplement) (Gupta et al., 2009).
3 ResultsBy providing detailed partitioning of the water and energy budgets at high spatial and temporal resolution and at continental spatial extent, the PFCONUSv1 ParFlow–CLM model offers an unprecedented opportunity to study large-scale nonlinear relationships and to provide hydrologic process estimates at locations remote from observation networks. The 2003–2006 water year simulations in this study estimate hourly pressure head and saturation at each of the approximately 31.5 million 1 three-dimensional model cells; the simulations also provide evapotranspiration and energy balance estimates at each of the 6.3 million land surface grid cells. Figure 2 shows the PFCONUSv1 model extent, mean annual precipitation from interpolated atmospheric forcing, and mean annual simulated components of Eq. (). Below, these water balance components, their performance, and their relative bias sources are discussed in detail. Note that different performance metrics were discussed for model components based on their temporal and spatial coverage, continuity, resolution, and uncertainty. For instance, the sheer amount of temporal and spatial coverage provided by the USGS stream gauge network allowed for several different error metrics to evaluate long-term behavior, hydrograph shape, and flashiness. Comparisons of model results with remote sensing products and well observations were more limited by higher uncertainty and lower temporal and spatial resolution and continuity, but they were still valuable in identifying regions for model improvement and analyzing error propagation between water balance components.
3.1
Runoff,
The ability to accurately simulate overland flow at the major basin or continental scale and above has for several years been a topic of much interest in the hydrologic community. Continental or global streamflow estimates could be coupled to general circulation models to provide predictions of surface water resource vulnerability to climate change (e.g., Koirala et al., 2014); large-scale runoff models could additionally provide flood forecasts to regions lacking developed surface water monitoring networks (Kauffeldt et al., 2016). While integrated groundwater–surface water modeling is computationally demanding, results from PFCONUSv1 represent a rare opportunity to evaluate streamflow performance (1) because the integrated system platform resolves shallow aquifer, vadose zone, and surface water transfer, and (2) streams form naturally as surface water is routed by topography without requiring pre-defined stream reaches.
PFCONUSv1 streamflow was evaluated against 2392 USGS stream gauges which are well distributed across the United States. We analyze model performance using percent bias, Spearman rank correlation, and RSR. However, Gleeson et al. (2021) suggest that while the use of error metrics and direct comparison of observations with simulated values are valuable for evaluation, they should be supplemented with hydrologically meaningful diagnostic signatures to better understand system dynamics. Further, PBIAS can be sensitive to precipitation provided by the interpolated NLDAS atmospheric forcing product. Since is an input to the PFCONUSv1 rather than a model result, runoff ratio () was also calculated to extract model performance independent of precipitation bias and to better represent a diagnostic measurement of watershed response to rainfall. RR measures the amount of precipitation partitioned to runoff, with lower RR values generally indicating a greater portion of precipitation lost to infiltration or evapotranspiration. “True” runoff ratios were estimated by first identifying all GHCND precipitation gauges upstream of a USGS stream gauge. The mean annual precipitation was then calculated and applied over the drainage area defined by the Geospatial Attributes of Gages for Evaluating Streamflow (GAGES-II) dataset (Falcone, 2002). RR is equal to the ratio of total USGS gauge flow to GHCND precipitation. A similar process was done for simulated RR using NLDAS-interpolated precipitation, simulated flow at the gauge cell, and model drainage area from the input digital elevation model derived from HydroSHEDS. Note that while the interpolated NLDAS precipitation, unlike the GHCND gauge network, is continuous in space, only modeled cells which matched the nearest-neighbor GHCND gauge network were used to estimate upstream precipitation in order to create as controlled a comparison as possible. Runoff ratios were not calculated for USGS stream gauges with fewer than three upstream GHCND precipitation gauges.
Figure 3
(a) Observed annual streamflow from the USGS gauge network, (b) PBIAS for simulated PFCONUSv1 streamflow, (c) runoff ratio calculated from USGS stream gauges and GHCND precipitation gauges, (d) simulated minus observed runoff ratio, (e) of simulated daily flows, and (f) RSR of simulated daily flows.
[Figure omitted. See PDF]
Observed total annual flow during the simulation period is shown in Fig. 3a; annual streamflow varies by several orders of magnitude across US major basins, with higher flows in the east and the Pacific Northwest and the lowest flows in the Great Plains. Runoff ratios are generally highest in the east; the majority of the arid west exhibits RR of less than 0.1, with the exception of topographically complex regions and headwater watersheds of the Rocky Mountains (Fig. 3c).
PFCONUSv1 reproduces point-scale annual flows across the United States with a median annual PBIAS of 7.7 % and with 25th and 75th percentiles of 26.2 % and 77.4 %, respectively (Fig. 3b). Shown in Fig. 3e and f, the 25th, 50th, and 75th percentiles for daily Spearman's are 0.42, 0.65, and 0.76, while the same for RSR are 0.86, 1.2, and 2.5. The median PFCONUSv1 minus USGS difference in RR is 0.016 (Fig. 3d), which corresponds to a mean percent bias in runoff ratio of 8.3 %. The PFCONUSv1 model simulates observed streamflow with RSR 0.6, 0.7, and 20 % at 54 gauges (approximately 2 % of available sites). An additional 97 locations (4 % of gauges) exhibit RSR 0.7, 0.65, and 30 %. An additional 382 locations (15.7 % of gauges) show RSR 1, 0.6, and 50 %. And, finally, an additional 268 gauges (11 % of gauges) show RSR 1.2, 0.5, and 75 %. As has been shown in previous literature (Waseem et al., 2017), different performance metrics do not always indicate the same closeness of fit: while 2099 gauges (86 % of the dataset) show either RSR 1.2, 75 %, or 0.5, only 801 gauges (34 % of all gauges) fit all those criteria.
Figure 4
Time series of PFCONUSv1-modeled and USGS-observed streamflow time series at representative gauge locations for each major basin.
[Figure omitted. See PDF]
Streamflow performance varies widely across major basins. For instance, median PBIAS, , and RSR for the Ohio River Basin are 7.8 %, 0.79, and 0.84, respectively, and the median of simulated RR values is within 6 % of the median estimate of RR 0.42 from observations. Simulated flows in the Tennessee River Basin also appropriately simulate observed flows: mean PBIAS, , and RSR are 11.9 %, 0.69, and 0.89, respectively; 60 % of the gauges in the basin perform with RSR 1.2, 0.5, and 75 %, and observed and simulated mean RR values are 0.49 and 0.53, respectively, for a percent bias in RR of 9 %. Conversely, the majority of the upper Missouri River Basin shows weak timing performance (median of 0.49) and higher overall bias: the median PBIAS for Missouri is 65 % and median RSR is 2.2, indicating that the majority of Missouri gauges exhibit daily RMSE that is twice the volume of expected daily variability. The Great Plains region is certainly the region with the worst streamflow performance: PFCONUSv1 percent bias in the majority of these gauges is greater than 300 %, and in some cases, simulated flow is greater than 10 times the volume observed. While the mean difference in runoff ratio in this region is only 0.04, this is on average 4 times larger than RR estimated from observations. Results in Fig. 3 therefore suggest that in the arid Great Plains region, a very small change in runoff ratio can result in dramatic error in streamflow bias, and the PFCONUSv1 struggles to capture low flows in this region. There is evidence that continental-scale hydrologic models commonly share this struggle to capture streamflow dynamics in the Great Plains region. The first phase of the Continental Hydrologic Intercomparison Project has shown that a NOAA US National Water Model configuration of WRF-Hydro shares the same streamflow performance category (poor hydrograph shape and/or timing and high bias) as PFCONUSv1, run for the single water year 1985, at many gauges in the Great Plains region (Tijerina et al., 2021). While the intercomparison project is in its infancy and this comparison was primarily proof-of-concept, such results may stress the importance of representing groundwater abstractions and irrigation over the Ogallala in continental-scale hydrologic models.
Note that in Fig. 3, no filtering was done for these metrics in order to eliminate gauges with incorrect drainage area from topographic processing discrepancies, nor have we removed sites proximate to dams, influenced by nearby pumping or irrigation, or affected by bias in atmospheric forcing. As an example of PFCONUSv1 performance in ideal conditions, we show in Fig. 4 selected examples of individual gauge comparisons for each major basin in the PFCONUSv1 domain. Gauges chosen for Fig. 4 were those that tended to be minimally impacted by bias from anthropogenic effects or by errors in basin delineation by topographic processing. Such gauge attributes were determined based on geospatial stream properties obtained from the Geospatial Attributes of Gages for Evaluating Streamflow (Gages-II) dataset (Falcone, 2002), as well as the National Hydrography Dataset (see the supplemental information in Maxwell and Condon, 2016, for a detailed description of geospatial stream gauge attributes). Streamflow time series examples in Fig. 4 include gauges with the following properties: (1) represented greater than 300 of upstream drainage area, (2) PFCONUSv1 drainage area differed from actual drainage area by less than 20 %, (3) total dam storage was less than 3 % of total annual flow for the closest upstream dam, (4) total withdrawals for the previous 5 years were less than 3 % of total annual flow, (5) total irrigated area in 2002 constituted no more than 15 % of the total drainage area, and (6) upstream area Spearman's for precipitation performance must be greater than 0.5. The examples in Fig. 4 therefore represent naturalized gauges with minimal bias in a priori inputs, low anthropogenic impact, and good performance potential. We also compared domain-wide PFCONUSv1 performance at reference gauges identified by Maxwell et al. (2015) (locations with the least human influence and best representing “natural” ecohydrologic conditions) with non-reference gauges. As a whole, PFCONUSv1 performed better at reference locations regardless of the error metric used (Fig. S1). However, the difference in performance between reference and non-reference locations varies considerably between basins (Figs. S4 and S5). The Pacific Northwest, Missouri, Lower Colorado, and Arkansas–Red–White basins exhibit much greater performance at reference gauges across all error metrics, while other basins show mixed signals or poorer performance at reference gauges (Figs. S1 and S2), indicating sources of bias other than anthropogenic effects.
Figure 5
(a) Cumulative annual ET observed at 30 FLUXNET sites across the contiguous United States, (b) percent bias of PFCONUSv1 daily simulated ET at FLUXNET locations, (c) Spearman of PFCONUSv1 daily simulated ET at FLUXNET locations, (d) RSR of PFCONUSv1 simulated daily ET, and (e–g) examples of observed and simulated daily ET at three FLUXNET sites with complete observation periods during the simulation timeframe.
[Figure omitted. See PDF]
3.2Evapotranspiration, ET
Evapotranspiration is a major component of the water balance, accounting for roughly 60 % partitioning of land precipitation into the atmosphere annually (Oki and Kanae, 2006); however, it is also widely considered to be an incredibly difficult value to constrain (Senay et al., 2013; Velpuri et al., 2013; Xu and Singh, 2005) and is often estimated simply as the residual of other components of the water balance. Unlike streamflow and precipitation, direct point measurement methodologies are limited, costly, and difficult to maintain. Direct estimates can be inferred from sap flux measurements; lysimeters, which weigh plant and soil mass to track temporal fluctuations in water storage; or chemical tracers, such as deuterium (Wilson et al., 2001). Currently, the method which likely provides the most defensible direct measurements of ET is the eddy flux or eddy correlation method. Eddy flux towers relate observed turbulent heat fluxes at the surface (latent and sensible heat) to the covariance between instantaneous fluctuations of vertical wind speed, humidity, and temperature (Baldocchi, 2003; Swinbank, 1951). The PFCONUSv1-simulated daily ET was compared to observations from 30 eddy covariance towers managed by the FLUXNET mission. Locations of these sites and their relative performance are shown in Fig. 5, along with time series examples from three FLUXNET sites with complete observations during the entire observation period.
Results in Fig. 5 demonstrate the ability of the PFCONUSv1 model to simulate daily and seasonal ET across difference climatic zones. The mean 25th, 50th, and 75th percentiles for PFCONUSv1-simulated daily ET PBIAS are 3 %, 26 %, and 55 %, respectively. Given that remote sensing estimates regularly exhibit uncertainty of 50 %–60 % for point-scale ET estimates, or 20 % uncertainty in ET at the basin scale (Velpuri et al., 2013), PFCONUSv1 ET results are promising, especially for an uncalibrated model. For daily time series, 25th, 50th, and 75th percentiles are 0.6, 0.72, and 0.81 for and 0.69, 0.92, and 1.33 for RSR, respectively. Because the metric is an indicator of monotonic agreement, the high overall Spearman's values are particularly telling because is sensitive not only to seasonal trends which dominate the time series variance but also the influential day-to-day (sub-seasonal) noise. Out of 30 FLUXNET sites with observations during the simulation time period, the PFCONUSv1 model performs with RSR 1.2, 75 %, or 0.5 at 19 locations (63 % of locations); at 29 out of 30 sites, the PFCONUSv1-simulated ET fits one of these criteria.
The spatial discontinuity of FLUXNET certainly limits ET performance evaluation across the remaining PFCONUSv1 model domain. Eddy covariance ET estimates are applicable within the fetch of the prevailing winds, which is generally on the order of 1 radius surrounding towers (Wilson et al., 2001), and statistical interpolation is generally not recommended without considerable parameterization of atmospheric and vegetative conditions to inform upscaling (Jung et al., 2009).
Figure 6
PFCONUSv1 ET estimates compared to results from MODIS remote sensing and thermal imaging algorithms. (a–c) Annual cumulative ET across HUC8 watersheds, (d–f) differences in annual ET, (g–i) PBIAS of monthly ET, (j–l) Spearman's of monthly ET, and (m–o) RSR of monthly ET for PFCONUSv1 and MODIS products (MOD16A2 and SSEBop algorithms).
[Figure omitted. See PDF]
To evaluate performance at larger spatial scales, the PFCONUSv1 model has also been compared to the MOD16A2 and SSEBop algorithms for MODIS thermal imagery processing. These data, along with PFCONUSv1, have been aggregated to HUC8 spatial scale and monthly temporal resolution to help reduce uncertainty associated with cloud cover in the 8 product. Cumulative annual evapotranspiration for MOD16A2, SSEBop, and PFCONUSv1 are shown in Fig. 6a–c. Both MOD16A2 and SSEBop algorithms should be considered evapotranspiration modeling techniques produced from remote sensing observations rather than observations themselves. However, regions where PFCONUSv1 comparisons to the MOD16A2 and SSEBop agree establish greater confidence in the model's bias or timing of ET estimates. We have therefore used PBIAS, , and RSR error metrics, with PFCONUSv1 monthly ET observations as simulated and MODIS datasets as observed values. Multiple studies to date have compared MOD16A2 and SSEBop performance over a range of geophysical characteristics, vegetative types, and aridity indices by comparing to Penman–Monteith-based estimates (Knipper et al., 2016), lysimetric observations (Senay et al., 2014), FLUXNET observations or upscaled information from FLUXNET sites, and vegetative indices (Senay et al., 2013; Velpuri et al., 2013), with results showing good general agreement and within 50 % error for annual ET totals at point observations. Despite its considerably more simplistic approach for estimating ET from MODIS thermal land imaging, SSEBop performs nearly as well as or, in the case of the western US, better than MOD16A2 (Velpuri et al., 2013). Therefore, we also show MOD16A2 performance relative to SSEBop for reference (Fig. 6c, f, i, l, and o), where observations are MOD16A2 and observations are MOD16A2 monthly ET (Eqs. 8 and 10).
PFCONUSv1 shows similar overall agreement with MOD16A2 and SSEBop algorithms in annual ET, with differences within 30 across the domain. All products provide similar spatial signatures of ET, with overall higher ET in the southwest and the lowest ET in the Great Basin and Colorado River basins. PFCONUSv1 estimates tend to agree more with SSEBop with regards to timing and residual variation, and they are more similar to MOD16A2 with regards to PBIAS (particularly in the western United States) (Fig. 6). The 50th percentiles for PFCONUSv1 PBIAS against MOD16A2 and SSEBop are 7.5 % and 8 %, respectively; 25th and 75th percentiles of PBIAS are 4.4 % and 24 % for MOD16A2 and 4.4 % and 35 % for SSEBop. In several regions, PFCONUSv1 shows similar comparisons with both MODIS products. For instance, in the Upper Mississippi, both products suggest that PFCONUSv1 overpredicts ET in the north and underpredicts in the south, and both products suggest PFCONUSv1 underpredicts ET in the Rocky Mountain headwaters and across most of the Ohio River Basin (Fig. 6d and e). The approximately 30 % underestimation of ET in the CO headwaters further agrees with the PFCONUSv1 performance relative to FLUXNET observations at the Niwot Ridge site in Colorado. However, most of the Missouri and the Arkansas–Red–White basins show opposite behavior between PFCONUSv1-MOD16A2 and PFCONUSv1-SSEBop comparisons; in these regions, we can be less certain of model bias as described by remote sensing of ET. Broadly, across the western US, PFCONUSv1 shows better agreement with MOD16A2 with regards to ET magnitude (PBIAS) because SSEBop estimates negligible ET in the Basin and Range region (Fig. 6e); however, PFCONUSv1 shows dramatically better performance relative to SSEBop in terms of Spearman's and RSR (Fig. 6g and h). The 25th, 50th, and 75th percentiles of are 0.38, 0.85, and 0.92 for monthly PFCONUSv1 compared to MOD16A2; the quantiles for PFCONUSv1 compared to SSEBop are 0.85, 0.91, and 0.93. Similarly, 25th, 50th, and 75th percentiles of RSR are 0.41, 0.85, and 2.2 for performance relative to MOD16A2 and 0.38, 0.47, and 0.62 for performance relative to SSEBop.
Despite differences between PFCONUSv1 comparisons to the two MODIS algorithms, results shown in Fig. 6 suggest that PFCONUSv1 appropriately estimates the magnitude and temporal progression of ET compared to the performance of other LSMs. In a study comparing LSM-based recharge estimates in the western United States, Niraula et al. (2017) showed that the LSMs Mosaic, VIC, and Noah simulated spatially distributed ET with 0.87, 0.77, and 0.75 Pearson's correlation relative to MODIS. Pearson's correlations between PFCONUSv1 and MOD16A2, and between PFCONUSv1 and SSEBop, are 0.9 and 0.95, respectively, which motivates future comparisons of PFCONUSv1 performance relative to other LSMs. However, it is important to again note that MODIS ET estimates are themselves models, and as such they are susceptible to epistemic errors in input data (e.g., inaccuracies in leaf area index or other parameterizations), measurement and remote sensing errors (e.g., cloud cover), and other uncertainties.
3.3Storage,
Terrestrial water storage represents all components of the water balance stored on and below the Earth's surface. As such, total storage is an aggregate of water stored on land in surface water bodies or in the canopy, as well as snow water equivalent, soil moisture in the vadose zone, and shallow or deep saturated aquifer storage. Estimates of overall could simply involve measuring and combining individual components, but these calculations (1) require highly developed monitoring networks and an impressive amount of in situ observations and (2) can have large margins of error if not all of the assorted hydrological stores are accurately resolved (Troch et al., 2007). The most common method for estimation calculates storage as a remainder of other water balance terms: , ET, and in Eq. (), (e.g., Rodell et al., 2011; Tang et al., 2010).
Recent advances in remote sensing have granted hydrologists an estimate of changes to total , without partitioning out storage sources, by measuring fluctuations of Earth's gravity fields as a proxy for mass change. Since water represents the greatest fluctuation of terrestrial mass, gravity anomalies can be translated to variability in . The newly available GRACE twin satellite mission provides approximately monthly values of total water storage at the global scale and at coarse ( 10 ) resolution (e.g., Strassberg et al., 2009). Storage anomaly estimates are based on K-band microwave measurements of the distance between the two low-flying satellites, which varies as a function of gravity field fluctuations (as well as atmospheric, oceanic, and solid Earth tides, which must be corrected to resolve the global water budget; Dahle et al., 2019).
Figure 7
Summary of GRACE and PFCONUSv1 comparisons. Seasonal storage amplitude for (a) the JPL mascon solution and (b) PFCONUSv1 total water storage, with darker red areas indicating a high degree of seasonality and white areas indicating no sub-annual storage fluctuation. (c–h) Time series of total water storage anomalies for five GRACE products and for PFCONUSv1 across complete major basins in the PFCONUSv1 domain. Shaded regions indicate uncertainty in the JPLm product based on leakage and measurement error.
[Figure omitted. See PDF]
The PFCONUSv1 total water storage anomalies (calculated as a sum of all simulated surface and subsurface hydrologic stores) were compared to five monthly gravity field solutions: the RL06 spherical harmonic solutions provided by JPL, GFZ, and CSR, as well as the mascon solutions JPLm and CSRm. Figure 7 shows seasonal storage amplitude in space and basin-aggregate storage change through time by comparing PFCONUSv1 and GRACE solutions for six major river basins. Some basins have been left out due to incompleteness in the model domain or due to size: the basis function for GRACE solutions is generally on the order of 300 000 such that storage anomaly estimates for smaller basins (e.g., the Tennessee River Basin) are not well resolved.
Seasonal storage amplitude represents the average peak-to-peak storage gain or loss over the course of a water year, and it is therefore a depiction of seasonality or intra-annual signal. The GRACE solution shown in Fig. 7 is the JPL mascon solution provided at 0.5 resolution, and amplitude for other GRACE products shows similar spatial signals; however, note that mascon solutions are calculated given a 3 equal-area basis function and subsequently downscaled using forward modeling techniques to account for leakage errors (Wiese et al., 2016). GRACE mascons are not independent of each other, and uncertainty increases dramatically with decreasing basin size. However, qualitative comparisons between GRACE and PFCONUSv1 amplitude indicate several regions of agreement for high or low seasonality. Topographic highs in the Rocky Mountains, where the snow water equivalent signal likely dominates overall storage variance and is entirely seasonally dependent, show high amplitude for both PFCONUSv1 and GRACE (Fig. 7a and b). The Upper and Lower Colorado River basins, in particular, show very similar spatial patterns for overall amplitude. Another area of agreement is the comparably high amplitude in the Lower Mississippi River basin. In both GRACE and PFCONUSv1, the Arkansas–Red–White region sees higher seasonality of total water storage in the east and lower in the west; the locations of highest amplitude for both GRACE and our model lie in the Pacific Northwest region. However, broadly speaking the PFCONUSv1 amplitude is lower than GRACE for the majority of the domain and particularly in the east. Other continental- or global-scale land surface models have also underpredicted seasonal storage amplitude across global river basins relative to GRACE; for example, the WaterGAP (Water Global Assessment and Prognosis) hydrologic model consistently underpredicted amplitude for most of the global land area (Döll et al., 2014), and a validation of four LSMs and global hydrologic models found that the numerical models reproduced GRACE storage signals only to a very limited degree (Zhang et al., 2017). However, LSM tendency for GRACE mismatch is likely attributed to insufficient groundwater representation, which is not as likely to be the cause for PFCONUSv1 and GRACE disparities.
Temporal progression of storage was calculated with the area-weighted mean of the Colorado, Arkansas, Ohio, Missouri, and Upper Mississippi River basins (Fig. 7c–h). Uncertainty (shaded regions) shown indicates the leakage error associated with downscaling 3 basis functions to 0.5 solutions for the JPL mascon product. We show only the JPLm uncertainty, simply because uncertainty estimates for the RL06 products were not yet available at the time of this analysis. The CSR mascon product is suggested to have an error of approximately 2 that is more or less constant through time and space.
The PFCONUSv1 model shows good agreement in the timing storage anomalies for most basins, with Spearman's rank correlation ranging from 0.43 to 0.94 relative to the mean of all GRACE solutions: individual values for major basins are 0.43 (Missouri), 0.63 (Upper Colorado), 0.76 (Pacific Northwest), 0.79 (Great Basin), 0.81 (Lower Colorado), 0.86 (Upper Mississippi), 0.88 (Ohio), and 0.93 (Arkansas–Red–White). However, correlation is not necessarily the best predictor of adequate model performance; for instance, the Upper Mississippi has the third-highest value out of six major basins, but more than 80 % of the total anomaly time series lies within the uncertainty bars provided for the JPLm product. Further, several discrepancies exist between PFCONUSv1 and GRACE trends and amplitude. For example, despite its monotonic agreement with GRACE storage amplitude for the Ohio River Basin, the PFCONUSv1 model simulates a seasonal storage amplitude that is, on average, more than 30 % lower than what GRACE observes. The Upper Colorado River basin captures seasonal timing, but the overall storage gain over the simulation period is roughly 3 times that of what GRACE observes.
Differences between the PFCONUSv1 and GRACE storage water anomaly estimates can come from various sources: (1) model error and uncertainty in PFCONUSv1 model parameters and configuration, error and uncertainty associated with GRACE measurement error, or error associated with the intensive post-processing and filtering on the raw spherical harmonic GRACE solutions; (2) hydrologic stores unaccounted for in the PFCONUSv1 model, such as deep ( 100 ) aquifer storage; or (3) anthropogenic impacts, particularly from groundwater withdrawals from municipal and agricultural aquifer depletions (Chen et al., 2016).
3.4Storage partitioning: , , and
Total water storage anomalies were also validated based on their partitioned components: , , and . First, PFCONUSv1 water table depth (WTD) was compared to USGS well observations across the United States in Sect. 3.4.1. As discussed below, WTD does not necessarily translate to , but it is still a very informative hydrologic state. PFCONUSv1 soil moisture was compared to a combined active/passive remote sensing product in Sect. 3.4.2, and PFCONUSv1 snow water equivalent was compared to snow telemetry measurements in Sect. 3.4.3.
Figure 8
(a) Observed water table depth ( 41 269), (b) difference in observed and PFCONUSv1-simulated WTD, (c) difference in observed and PFCONUSv1-simulated WTD at filtered locations (), and (d) Spearman values at filtered locations using at least 10 instantaneous (daily) observations.
[Figure omitted. See PDF]
3.4.1 Water table depthFigure 8 shows observed WTD across the model domain, as well as difference in observed and modeled heads and correlation for available locations. As a caveat to the results shown in Fig. 8, while WTD is a visibly appealing metric for modeled groundwater performance, it alone is not translatable to total storage or for storage change for several reasons. First, without information regarding aquifer storativity or in the absence of pumping tests, change in water table depth does not equate to total water storage fluctuation in an aquifer of uncertain depth and hydraulic characteristics. Second, water flow is governed by hydraulic head rather than water table depth; therefore, a bias in WTD of tens of meters within a continental model that spans thousands of meters of hydraulic head does not necessarily speak to the model's ability to laterally move water through the saturated subsurface. Finally, perched and confined aquifer systems can completely disconnect anomalies in total subsurface hydrologic stores and measurable WTD fluctuations. However, WTD does indicate vadose-zone-saturated zone connectivity, and for unconfined aquifers it is a good indicator for loss or gain in aquifer storage, so we briefly compare observed and simulated WTDs here.
Observed WTD from over 41 000 aquifers across the contiguous United States spans multiple orders of magnitude and is shown in Fig. 8. The PFCONUSv1 model demonstrates a fairly consistent shallow WTD bias across the domain, with “hot spots” of over 50 depth difference in the southern reaches of the Ogallala aquifer, in the southern Pacific Northwest region, and in the Lower Colorado River basin. However, many of these wells represent locations impacted by extractions (wells are preferentially drilled in regions prioritizing municipal or agricultural groundwater resources), wells tapping confined aquifers, or WTDs that simply cannot be captured by a shallow aquifer model of 102 depth. In a 1985 transient simulation of PFCONUSv1, Maxwell and Condon (2016, supplemental information) found that while no strong connection exists between water table depth bias and the model's geologic properties, WTD bias was aquifer-dependent, with the greatest positive WTD biases occurring in the High Plains aquifer, which has experienced depletions in the last several decades.
Further, WTD is only informative as an indicator of positive or negative if multiple observations are provided through time. Therefore, the available USGS wells have been filtered by excluding (1) locations where the observed minimum WTD was greater than 60 (PFCONUSv1 estimates pressure at cell centers, with the center of the deepest layer at 52 ), (2) locations providing fewer than 10 observations during the simulation timeframe, (3) locations flagged by the USGS as a confined or mixed aquifer system (aquifer type code aqfr_type_cd in the Groundwater Levels for the Nation dataset provided by USGS NWIS,
WTD bias for the remaining 2486 locations is shown in Fig. 8c. WTD agreement is considerably improved at these locations, but a shallow WTD bias is still present, with 25th, 50th, and 75th quantiles for simulated minus observed difference in total water level being 2.5, 5.8, and 13.5 , respectively. However, values suggest that despite PFCONUSv1 shallower water tables, the model is still able to capture temporal fluctuations in depth to saturation (and by association, groundwater ) at almost half of the filtered well sites (Fig. 8d). Quantiles for at the filtered locations are 0.14 (25th), 0.46 (50th), and 0.7 (75th); 46 % of wells show greater than 0.5, 37 % of wells show greater than 0.6, and 25 % of wells show greater than 0.7.
Figure 9
Summary of ESACCI and PFCONUSv1 soil moisture comparisons. Seasonal SM amplitude for (a) the ESACCI solution and (b) PFCONUSv1–ESACCI amplitude difference. Stippling in (b) indicates that the ESACCI product time series was less than 50 % complete during the simulation period (fewer than 750 available observation days). Grey areas (excluded) indicate that the average ESACCI annual cycle had at least 3 months with zero available observations (and therefore the annual amplitude is uncertain). (c–h) Time series of weekly SM anomalies across complete major basins in the CONUSv1 domain. Shaded regions indicate 1 standard deviation taken spatially across the basin.
[Figure omitted. See PDF]
3.4.2Soil moisture,
Soil moisture (SM) anomalies (analogous to ) at the top layer of PFCONUSv1 (up to 0.1 depth) were compared to the ESA CCI soil moisture product at 0.25 resolution and aggregated to weekly totals. Results are shown in Fig. 9. As in GRACE comparisons, we compared seasonal amplitude spatial signals across the PFCONUSv1 domain and basin-scale aggregates through time. The ESA CCI record is a state-of-the-art, multi-decadal, global satellite observation of SM created from combining single-sensor active and passive microwave sensors; since its release, the literature has shown good agreement between the ESA CCI product and spatial and LSM-modeled temporal SM patterns of soil moisture, and the harmonized product has shown better performance than any of its individual single-sensor inputs (Dorigo et al., 2017; Gruber et al., 2019). Because we are interested in over time rather than the total water stored in the soil at any one moment, comparisons were made to SM anomalies, or relative change in soil moisture with respect to the mean value.
Broadly speaking, the PFCONUSv1 shows overall lower amplitude in the west and higher amplitude in the east relative to the CCI product (Fig. 9a and b). While this could be a result of PFCONUSv1 bias in evapotranspiration or other fluxes in which seasonal signal is dominant, it is also possible that amplitude differences are simply a result of temporal coverage or blending algorithms in the ESA CCI product. For instance, for the combined SM product, blending weights are higher for active microwave sensors in the eastern US and high-elevation Rockies, while the rest of the southwest and the northern Great Plains region favored passive microwave sensors (Dorigo et al., 2017). Further, ESA CCI SM is limited by temporal coverage; note that in the majority of the eastern PFCONUSv1 domain, fewer than 365 observation days are available (most likely a product of high humidity and cloud cover) (Fig. 9b), which makes us less confident in amplitude estimates.
Figure 10
Summary of PFCONUSv1-modeled snow water equivalent performance relative to SNOTEL sites. Shown are observed peak SWE at SNOTEL sites (a), percent bias for peak SWE (b) and 1 April SWE (c), daily spatial fraction of stations with snow coverage (d), and mean daily SWE (e). In (e), shaded regions indicate 1 standard deviation in space.
[Figure omitted. See PDF]
At the aggregated basin scale, however, temporal progression of SM shows temporal agreement between PFCONUSv1 and CCI SM for most major basins: individual values for major basins are 0.25 (Upper Colorado), 0.79 (Lower Colorado), 0.75 (Arkansas–Red–White), 0.75 (Ohio), 0.43 (Missouri), 0.65 (Great Basin), 0.72 (Tennessee), and 0.55 (Upper Mississippi). The very weak correlation in the Upper Colorado basin may be indicative of large uncertainties in the ESA CCI SM product that have been observed with particular surface conditions: for regions of dense vegetation, topographic complexity, snow cover, or frozen soils, uncertainty in ESA CCI SM is very high (Dorigo et al., 2017), and we therefore have low confidence in ESACCI comparisons in Rocky Mountain headwater regions.
3.4.3Snow water equivalent,
Finally, the modeled PFCONUSv1 storage component was validated against snow telemetry data in the mountainous west of the model domain (Fig. 10). An important caveat to note is that point-measured snow water equivalent (SWE) is likely to consistently overestimate gridded land surface model products, given that coarse-resolution model cells (in our case, 1 lateral discretization) represent an aggregate of highly heterogenous SWE and canopy interference across a wide spatial area. Telemetry stations are frequently situated in clearings or in breaks in canopy density in order to maximize throughfall. For instance, Molotch and Bales (2005) characterized the distribution of SWE depth in 16, 4, and 1 grid elements surrounding SNOTEL stations in Rio Grande headwaters using a combination of field observations, remote sensing products, and snowpack mass balance modeling. They found that at the majority of the sites, the SNOTEL station represented high percentiles of SWE relative to the surrounding land area and that SNOTEL site conditions (such as vegetation density, solar radiation index, and terrain indices) were not representative of the vast majority of grid element space. In some regions, SNOTEL SWE was more than 200 % greater than the mean grid element value.
As would therefore be expected, the 1 resolution PFCONUSv1 model underestimates annual peak SWE (snow water equivalent at maximum accumulation) and 1 April SWE (snow water equivalent during ablation). PBIAS for annual peak SWE was 50.6 %, 33.5 %, and 14.7 % at the 25th, 50th, and 75th percentiles, respectively. 1 April SWE PBIAS was similar, with some individual SNOTEL stations showing more than double the SWE compared to PFCONUSv1 simulations (Fig. 10c). However, the PFCONUSv1 model clearly captures timing for snow accumulation and ablation, with the fraction of snow-covered sites tracking almost identically between SNOTEL and PFCONUSv1 (Fig. 10d). Percentiles for Spearman's values for cool season daily SWE (Fig. 10d) are 0.85 (25th percentile), 0.92 (50th), and 0.96 (75th).
4 Discussion: known and unknown sources of model biasIn Sect. 3, outputs from an integrated surface water–groundwater hydrologic model, PFCONUSv1, were compared to available point-scale monitoring networks and remote sensing products in an effort to evaluate the model's ability to reliably reproduce components of the water budget listed in Eq. ().
Table 2
Examples of potential sources of bias acting on PFCONUSv1 results.
Bias category | Bias source examples | Components directly affected |
---|---|---|
Topographic processing | Watershed drainage area | Surface flow volume |
Topographic relief | Surface flow volume and timing | |
Stream network mapping | Surface flow volume and timing | |
Atmospheric forcing | Precipitation volume | Surface flow volume and SWE |
Precipitation timing and intensity | Storm hydrographs | |
Temperature trends and diurnal, seasonal cycles | Evapotranspiration, snowmelt amount and timing | |
Humidity | Evapotranspiration | |
Wind speed | Evapotranspiration | |
Anthropogenic | Dams and reservoirs | Surface flow volume and timing |
Groundwater extractions | Groundwater storage | |
Land disturbance | Evapotranspiration, snow accumulation | |
Model parameters | Hydraulic conductivity | Infiltration, recharge |
Porosity | Subsurface storage | |
Manning's | Surface flow timing and hydrograph | |
Land and vegetation (albedo, LAI) | Evapotranspiration, snow accumulation and melt | |
Aquifer model depth | Groundwater storage | |
Initial conditions of pressure and saturation | Groundwater depth | |
Epistemic uncertainty | Scalability of model physics | All/unknown |
Vertical and lateral parameter aggregation | ||
Process interaction; groundwater–surface water and land–atmosphere exchange at various spatial and temporal scales |
Broadly, results suggest that PFCONUSv1 shows promising ability to reproduce the timing, mean states, and inter- and intra-annual variability of continental-scale water balance components. However, the PFCONUSv1 model should be considered a work in progress; with approximately 31 million cells in the domain, PFCONUSv1 bias can originate from errors associated with model physics, inputs, process representation, or epistemic uncertainty (Table 2). The best publicly available datasets were used to populate and drive this simulation, but such inputs are certainly subject to their own errors and uncertainties and must be continuously revisited to improve their fidelity. In this section, we discuss identifiable errors in model inputs and implications for future model development.
4.1 Meteorological forcing errors and topographic processingMajor biases exist in preprocessing of PFCONUSv1 meteorological forcing and topography, which are peripheral to but act simultaneously with all other sources of bias (Table 2). In this way, isolating the effects of a single bias source can be challenging. Streamflow itself is sensitive to errors in drainage area, topographic relief, and precipitation or temperature bias, and the errors in surface and subsurface moisture flux can propagate downstream to impact moisture availability and evapotranspiration in areas remote from the original bias source.
4.1.1 Terrain processing and drainage area
Topographic slopes were defined from a digital elevation model (DEM) generated by HydroSHEDS and subsequently subjected to a hydrologic processing algorithm, which adjusted drainage networks to remove true and artificial pits, depressions, and barriers and ensure complete river network connectivity (Barnes et al., 2016). However, both loss of resolution in DEMs and the topographic processing can result in loss of topographic relief and change in drainage area. Therefore, PFCONUSv1 streamflow percent bias should in theory reflect fidelity of upstream watershed area. We compared PFCONUSv1 drainage area with “true” drainage area determined based on geospatial stream properties obtained from the GAGES-II (Falcone, 2002).
Figure 11
(a) Percent difference between observed and simulated annual flow volume as a function of percent difference in true and PFCONUSv1 drainage area, colored by annual runoff ratio. (b) Locations where error in simulated flow volume is greater than, less than, or expected from drainage area bias. Expected behavior was defined as locations that lie within the 30 % dashed error bars shown in (a).
[Figure omitted. See PDF]
Figure 11 shows the relationship between percent difference in observed and simulated streamflow versus percent difference in observed and processed drainage area for all 2392 USGS stream gauges. There are three primary conclusions to be drawn from this relationship (Fig. 11a): (1) a clear, linearly proportional correlation exists between percent difference in drainage and percent difference in streamflow. For streamflow percent difference from observed ranging from 200 % to 200 %, we find that 977 out of 2392 stream gauges fall within 30 % of this flow–drainage relationship. Essentially, this means that for 41 % of gauges in PFCONUSv1, the percent bias in annual flow can be primarily attributed to errors in topographic processing. (2) A considerable number of gauges exhibit positive percent difference between observed and simulated annual streamflow, and these gauges are typically those with very low runoff ratios. Such a finding is not surprising in that streams with low RR will be particularly sensitive to external drivers. And (3) a certain amount of noise exists in these drainage–flow relationships, with many locations exhibiting higher or lower error in annual flow than that expected by drainage errors regardless of runoff ratio.
Figure 12
Observed precipitation and temperature at GHCND meteorological stations compared to interpolated NLDAS at their nearest-neighbor PFCONUSv1 cell. (a) Observed cumulative annual precipitation, (b) percent bias in annual precipitation, (c) Spearman's between simulated and observed daily precipitation. Also shown are observed average daily minimum (d), average (g), and maximum (j) temperature, the total bias in minimum (e), average (h), and maximum daily temperature (k), and the Spearman correlation for minimum (f), average (i), and maximum (l) daily temperature.
[Figure omitted. See PDF]
Figure 11b shows locations where the flow–drainage relationship was expected or unexpected. For the majority of the eastern United States, bias in streamflow is simply a function of drainage area bias from topographic processing. The mountainous west was considerably noisier, exhibiting in somewhat equal parts lower, higher, or expected annual flow behavior from drainage bias. We expect that much of the noise in annual flow bias is a function of precipitation and temperature bias and timing and, subsequently, snowpack. However, in the Great Plains region, the considerable positive annual flow bias shown in Fig. 3 cannot be attributed to the error in drainage area. In fact, for 600 gauges in the Great Plains area ( 20 % of all locations), the percent difference between PFCONUSv1 and true drainage area is near 0, but percent difference in streamflow is between 30 and 200 %. We believe that the greatest driver of this bias is the lack of groundwater extractions in the PFCONUSv1 model. Note that not only is the PFCONUSv1 model naturalized for the 2002–2006 simulation period, but its initial condition is also informed by 1985 naturalized spin-up, which does not include at least 50 years of groundwater depletion. However, some of the positive annual flow bias behavior in this region could be attributed to some biases in cumulative precipitation, which is detailed in Sect. 4.1.2.
4.1.2 Atmospheric forcing biasThe NLDAS meteorological forcing, which is described in Sect. 2.2, was bilinearly interpolated across the PFCONUSv1 domain; biases in precipitation, evaporation, wind speed, humidity, and radiation can therefore come from either the NLDAS product or its statistical downscaling. We compared daily total precipitation and average daily air temperature from the interpolated NLDAS product at 9139 () and 1678 (temperature) GHCND meteorological stations across the PFCONUSv1 domain by calculating relative bias and Spearman's at each location. Figure 12 summarizes these comparisons. Broadly, we can identify several examples for which NLDAS biases are potential drivers of the bias in timing and volume of hydrologic fluxes.
PFCONUSv1 annual precipitation over the Kansas–Nebraska border in the Great Plains region is 10 %–25 % greater than observed (Fig. 12b). This bias could be one source of positive flow bias at USGS stream gauges east of the High Plains aquifer.
Figure 13
Meteorological forcing, SWE, and ET bias at SNOTEL and FLUXNET stations. At SNOTEL locations, colored by elevation: (a) mean NLDAS mean cool season temperature versus observed cool season temperature, (b) mean NLDAS annual temperature versus observed cool season temperature, (c) NLDAS annual cumulative precipitation versus observed, and (d) PFCONUSv1 annual peak SWE versus observed. At FLUXNET locations, colored by the major basin location of the FLUXNET site: (e) daily NLDAS vapor pressure deficit versus observed, (f) daily NLDAS near-surface lateral wind speed versus observed, (g) daily NLDAS mean air temperature versus observed, and (h) PFCONUSv1 daily ET versus observed. Lines show linear regression with in all cases.
[Figure omitted. See PDF]
Fidelity in streamflow timing will of course be a function of accurate precipitation timing and intensity. A hydrologic model cannot be expected to perform considerably better than its recharge forcing, or results could be considered spurious. Areas with the weakest correlation between observed and NLDAS daily precipitation are in the Rocky Mountain headwater region (Fig. 12c). In the Upper Colorado watershed as a whole, the 50th percentile value for daily precipitation is 0.56, or the lowest of all other major basins. The Upper Colorado is also the basin with poorest overall daily streamflow timing, with 0.33.
Our interpolated NLDAS product underestimates the diurnal temperature fluctuations, primarily by considerably overestimating minimum (nighttime) daily temperature (Fig. 12e), which is likely a considerable driver of underestimated SWE. Further, maximum daily temperature is underestimated over the Rockies (Fig. 12h). Given that ET is largely dependent upon available radiative forcing, this could explain some of PFCONUSv1 negative bias at FLUXNET stations over the Rockies.
Annual temperature errors could also explain several regions where PFCONUSv1 comparisons to MOD16A2 and to SSEBop MODIS algorithms agree. For example, warm temperature biases and positive ET biases (relative to both MODIS algorithms; Fig. 6g and h) are seen in much of the lower elevations of the mountainous west and in the majority of the Pacific Northwest. Spatial patterns of ET biases (Fig. 7g and h) in the Upper Mississippi and Ohio River basins seem to instead follow the spatial pattern of precipitation bias (Fig. 11b), with regions receiving higher precipitation also experiencing higher ET.
NLDAS-simulated daily temperature timing is excellent. However, temperature was not deseasonalized before correlation was calculated, and the seasonal signal will certainly account for the majority of temperature variance.
More specifically, we can verify specific impacts of NLDAS bias on SWE and ET at individual SNOTEL and FLUXNET sites. Figure 13 shows observed and simulated (or interpolated) meteorological conditions and water balance components for snow and evapotranspiration.
SWE bias at SNOTEL sites is preferentially low at higher elevations (Fig. 13d). While this difference, as discussed above, can to a certain extent be attributed to differences in heterogeneous land and vegetation between the point and grid scale (Molotch and Bales, 2005), we also find that biases in temperature and precipitation likely drive the PFCONUSv1 low bias in snowpack. PFCONUSv1 SWE experiences a low bias in cumulative annual precipitation at SNOTEL sites (Fig. 13c), and lower elevations exhibit a warm cool season and annual temperature bias (Fig. 13a and b), both of which would contribute to low accumulation and high ablation rates.
While NLDAS shows good agreement with observed FLUXNET temperature (Fig. 13g), comparisons between NLDAS and observed vapor pressure deficit and wind speed both exhibit a considerable amount of scatter (Fig. 13e and f). Overall, the poorest-performing sites for vapor pressure (those in the Upper Colorado River basin) also exhibited the highest-magnitude ET biases (Fig. 13h). PFCONUSv1 underestimates (overestimates) relatively high (low) daily evapotranspiration rates (Fig. 13h). For FLUXNET locations and days exhibiting ET rates over (under) 4 , mean daily bias is 1.2 (0.3 ). Biases in NLDAS vapor pressure and wind speed could be a contributing factor. Lower vapor pressure deficits (0 to 20 ) and lower wind speeds (0 to 6 ) have an overall positive bias and could explain PFCONUSv1 overpredicting low ET days. Similarly, we believe the bias on high-evapotranspiration days (ET 4 ), which PFCONUSv1 preferentially underpredicts, could be attributed to NLDAS underpredicting wind speeds greater than 10 .
Errors in atmospheric forcing products often necessitate statistical bias correction before simulations are run (Piani et al., 2010). NLDAS has been specifically validated in its ability to reproduce meteorological conditions for streamflow (Xia et al., 2012a), soil moisture (Xia et al., 2015a), and evapotranspiration (Xia et al., 2015b) prediction by LSMs. While long-term spatial patterns and seasonal signals were captured for soil moisture and evapotranspiration, NLDAS fidelity at daily or weekly timescales is less certain. In this study, it is difficult to directly attribute the portion of streamflow, SWE, or ET errors that occur from atmospheric forcing bias, but these water balance components would certainly improve with continued progress in meteorological forcing datasets. The ParFlow–CLM water budget has been shown to be particularly sensitive to uncertainty in both precipitation and temperature forcings in mountainous regions, largely due to their additive influence on snow accumulation, melt, and subsequent mountain block recharge (Schreiner-McGraw and Ajami, 2021). Other studies have also highlighted the persistent biases in precipitation and temperature estimates from continental or global meteorological products, which can propagate into hydrologic model predictions (e.g., Ashfaq et al., 2010; Sperna Weiland et al., 2015).
4.2 Anthropogenic process representation and other epistemic errorsPlenty of sources of uncertainty can contribute to biases not discussed in Sect. 4.1 (Table 2). We have chosen to address meteorological forcing and topographic processing errors above simply because they are somewhat readily quantifiable, while parameter values and other epistemic uncertainties, such as simplification or scaling of model physics, are poorly constrained or simply unknown. Other biases include population of model parameter fields.
While we do not discuss model parameter uncertainty, such as conductivity, porosity, van Genuchten parameters, Manning's surface roughness, land and vegetation parameters, or model horizontal and vertical discretization, these are also areas for improvement. For example, Maxwell et al. (2015) show that national geologic and soil parameter datasets are prone to errors via political boundaries, such as state lines, and the PFCONUSv1 model oversimplifies deeper geology, with a 100 vertically homogenous layer.
However, as mentioned above, these fields are poorly constrained at the continental scale, and simulations are necessarily limited by availability of appropriate distributed products. Model parameter population is often addressed through calibration, but population of parameter fields becomes increasingly difficult as resolution increases and calibration becomes more computationally demanding. Future model iterations may need to take advantage of methods that allow transfer of parameters (e.g., conductivity) from coarse-resolution, efficient models to high-resolution ones (Foster and Maxwell, 2019). Model discretization is another concern. Coarsening of vertical and lateral resolution is a necessity at the continental scale, but aggregating to 1 resolution certainly comes with inherent uncertainties and loss of information. DEMs in particular will lose topographic drivers with scale (Wu et al., 2008), resulting in loss of relief; on the other hand, coarse-resolution cells could result in inappropriately steep pressure gradients as a function of Richards' equation parameterization and pressure-dependent permeability (Maxwell and Condon, 2016), and suitable vertical length scales for Richards' equations generally do not exceed several meters (Or et al., 2015). This calls into question the scalability of model physics; as Beven and Cloke (2012) rightly point out, whether or not governing partial differential equations will scale linearly is a concern. However, the current governing equations for PFCONUSv1 are simply the best currently known representation of hydrologic processes at this scale.
Finally, process representation is certainly a concern. Transient anthropogenic modules, such as urban hydrology models, pumping and injections, or surface water diversions, are currently possible but add to computational demand and require detailed historical data on water use with temporal and spatial coverage simply not yet available. As a naturalized model, the PFCONUSv1 simulations presented here will necessarily overpredict water tables and baseflow in regions where extractions are apparent. For instance, Maxwell and Condon (2016, supplemental information) show streamflow examples at the Lees Ferry USGS gauge, where timing and volume of streamflow are entirely governed by dam hydraulics. Condon and Maxwell (2019) show that incorporating a century of groundwater depletion across the PFCONUSv1 domain considerably decreases streamflow, with sensitivity to pumping concentrating downstream; more specifically, they found that long-term depletions over the High Plains aquifer resulted in a swap of discharging to recharging groundwater. However, naturalized continental models with high fidelity in non-anthropogenic settings could be used to estimate impact from human influence simply by examining the difference between observed and simulated conditions.
5 Conclusions and implications
In this study, we present the detailed evaluation of a transient, coupled hydrologic–land surface simulation at the continental scale and at hyper-resolution using a diverse set of monitoring networks and state-of-the-art remote sensing products. We found that PFCONUSv1 reproduced temporal patterns for continental-scale water budget components with an accuracy of at least 0.5. The following are 50th percentile (in space, over the entire domain) Spearman's rank correlations for individual water balance components: 0.65 for , with evaluation against daily USGS stream gauge observations; 0.72 for ET, with evaluation against daily FLUXNET eddy covariance observations (for monthly HUC8-aggregated remote sensing products, 0.85 for ET relative to MOD16A2 algorithm, and 0.91 for ET relative to SSEBop algorithm); 0.80 for major basin-aggregate , with evaluation against monthly GRACE remote gravity field sensing; 0.46 for filtered USGS well observations, which are related but not equivalent to ; 0.69 for major basin-aggregate , with evaluation against ESA CCI weekly SM from active/passive microwave sensors (Upper Colorado is not excluded, but note the uncertainties in ESA CCI over snow-covered, densely forested, and topographically complex land area); and 0.92 for , with evaluation against daily SNOTEL point observations. In terms of temporally aggregated annual fluxes, which represent long-term water budget component states, PFCONUSv1 simulates 50th percentile of 41.3 % for streamflow relative to USGS gauges: 14.2 % for ET relative to MOD16A2 and 13.2 % for ET relative to SSEBop at the aggregated monthly and HUC8 scales, 37.9 % for ET relative to FLUXNET sites, and 35.3 % for peak annual SWE relative to SNOTEL locations. We also found RSR for PFCONUSv1 performance at point locations, with 0.92 at FLUXNET sites and 1.2 at USGS streams, while RSR values for PFCONUSv1 relative to MODIS products aggregated at the monthly and HUC8 scale are 0.85 and 0.47 for MOD16A2 and SSEBop, respectively. Performance varies widely across the model domain, with the eastern United States showing better overall performance at USGS stream gauges and relative to MODIS remote sensing products than the western US. In terms of , PFCONUSv1-simulated SM is best for the Tennessee, Ohio, and Lower Colorado River basins relative to the spatially aggregated ESACCI soil moisture product; total water storage performance is best for the Upper Mississippi River basin relative to the GRACE TWS anomaly products. Further, we discussed three primary sources of model bias: terrain processing, errors in atmospheric forcing, and lack of anthropogenic influence.
The results presented here provide benefits to the high-resolution, continental-scale (and above) hydrologic community. First, our level of agreement with observations and remote sensing products suggests great promise for extreme-scale and high-resolution modeling to become a reality. We argue that PFCONUSv1 and similar models are feasible and will certainly see improvements in the near future with increased availability of open-access and distributed datasets, remote sensing advancements, improved monitoring networks, and advancements in highly parallelized computing.
Second, these results provide a guide for PFCONUS development. Some areas for model improvement that were immediately identified in this study include the following: (1) the source of high positive bias in the Central Plains should be further addressed. While we propose that this bias is largely attributed to the lack of groundwater pumping in the model (we estimate that at least 25 % of stream gauges are impacted by High Plains aquifer depletions), other potential sources of error could include inappropriate soil or geology hydraulic conductivity or van Genuchten parameters, the lack of spatially distributed Manning's coefficient (Maxwell et al., 2015), or loss of topographic relief associated with 1 lateral resolution. (2) We show that topographic processing has resulted in considerable error in drainage area for approximately 40 % of stream gauges. Accessible improvements could be made to streamflow bias with improved topographic processing algorithms. And (3) interpolated atmospheric forcing from NLDAS reanalysis has two primary biases that, if corrected with statistical bias correction or other methods, would immediately benefit streamflow, ET, and snow water equivalent. First, precipitation timing is lacking in many areas of the domain, particularly over the Rocky Mountain region. Second, mean nighttime air temperature exhibits a high temperature bias, resulting in an underestimated diurnal temperature fluctuation for the majority of the domain. Average daytime maximum temperature is also underestimated over the Rocky Mountains.
Finally, we argue that model fidelity can only be reliably understood at a process-based level if all water balance components available in the model outputs are evaluated. While single-parameter validation may be effective for operational forecasts, we do risk the equifinality dilemma of arriving at the right answer for the wrong reasons (Kirchner, 2006). The value in the type of validation exercise presented here is clearly a mechanistic understanding of model performance and a higher level of confidence in overall water balance modeling skill. Further work should be done to continue to incorporate additional observational and remote sensing networks as they become available. Also, while explicit comparisons with observations and simulations, like those presented here, are valuable, comparisons with other models are equally an asset used to build confidence in the plausibility of parameterization, outputs, and their mechanistic relationships. Gleeson et al. (2021) stress the importance of model intercomparison projects, such as IH-MIP2 (Kollet et al., 2017), as tools for model evaluation for global groundwater simulations. Impressive model validation toolkits that exist in the land surface community, such as the Land surface Validation Toolkit (LVT) (Kumar et al., 2012) and the International Land Model Benchmarking (ILAMB) system (Collier et al., 2018), as well as nascent model comparisons in the continental hydrology community, such as the Continental Hydrologic Intercomparison Project (CHIP) (Tijerina et al., 2021), are inspiring collaborative efforts to streamline and standardize model evaluation. We hope to take advantage of the LVT and ILAMB platforms in the future to compare model performance to other similar continental- and global-scale simulations, to standardize our model evaluation, and to add to our existing observation datasets. Further work is also needed to assess the scale gaps prevalent in observation and remote sensing data. Specifically, point-scale observations sensitive to small-scale heterogeneity, such as in situ soil moisture observations, are unlikely to be applicable to the 1 scale, resulting in commensurability errors (Gleeson et al., 2021); conversely, we cannot guarantee that PFCONUSv1 outputs scale linearly to coarser-resolution products and models. Improved understanding of how model bias scales with loss of spatial or temporal resolution is a vital area of research.
Code and data availability
ParFlow–CLM is an open-source, parallel, modular hydrologic model that is freely available on GitHub at
The supplement related to this article is available online at:
Author contributions
MMFO'N prepared and ran PFCONUSv1 simulations, processed model outputs, analyzed model performance, and prepared the paper with contributions from coauthors. RMM and LEC provided some processing codes for spatiotemporal averaging, offered guidance for interpreting integrated hydrologic simulations and managing data, and aided in paper preparation and editing. DTT contributed to paper preparation, editing, and review.
Competing interests
The authors declare that they have no conflict of interest.
Disclaimer
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Acknowledgements
The authors gratefully acknowledge the high-performance computing resources provided by the National Center for Atmospheric Research (NCAR) Computational and Information Systems Laboratory (CISL) Cheyenne computing platform. The authors also thank the editor and two anonymous referees for their expertise and constructive feedback during the review process.
Financial support
This research has been supported by the National Science Foundation Office of Advanced Cyberinfrastructure, Cyberinfrastructure for Sustained Scientific Innovation (CSSI) project (grant no. 1835903) and the US Department of Energy Interoperable Design of Extreme-scale Application Software (IDEAS) project (grant no. DE-AC02-05CH11231).
Review statement
This paper was edited by Richard Mills and reviewed by two anonymous referees.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2021. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Recent advancements in computational efficiency and Earth system modeling have awarded hydrologists with increasingly high-resolution models of terrestrial hydrology, which are paramount to understanding and predicting complex fluxes of moisture and energy. Continental-scale hydrologic simulations are, in particular, of interest to the hydrologic community for numerous societal, scientific, and operational benefits. The coupled hydrology–land surface model ParFlow–CLM configured over the continental United States (PFCONUS) has been employed in previous literature to study scale-dependent connections between water table depth, topography, recharge, and evapotranspiration, as well as to explore impacts of anthropogenic aquifer depletion to the water and energy balance. These studies have allowed for an unprecedented process-based understanding of the continental water cycle at high resolution. Here, we provide the most comprehensive evaluation of PFCONUS version 1.0 (PFCONUSv1) performance to date by comparing numerous modeled water balance components with thousands of in situ observations and several remote sensing products and using a range of statistical performance metrics for evaluation. PFCONUSv1 comparisons with these datasets are a promising indicator of model fidelity and ability to reproduce the continental-scale water balance at high resolution. Areas for improvement are identified, such as a positive streamflow bias at gauges in the eastern Great Plains, a shallow water table bias over many areas of the model domain, and low bias in seasonal total water storage amplitude, especially for the Ohio, Missouri, and Arkansas River basins. We discuss several potential sources for model bias and suggest that minimizing error in topographic processing and meteorological forcing would considerably improve model performance. Results here provide a benchmark and guidance for further PFCONUS model development, and they highlight the importance of concurrently evaluating all hydrologic components and fluxes to provide a multivariate, holistic validation of the complete modeled water balance.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details




1 Department of Geology and Geological Engineering, Colorado School of Mines, Golden, CO, USA; now at: Hydrological Sciences Laboratory, NASA Goddard Space Flight Center, Greenbelt, MD, USA; now at: Earth System Science Interdisciplinary Center, University of Maryland, College Park, Greenbelt, MD, USA
2 Department of Civil and Environmental Engineering and High Meadows Environmental Institute, Princeton University, Princeton, NJ, USA
3 Department of Hydrology and Atmospheric Sciences, The University of Arizona, Tuscon, AZ, USA