Introduction
The impact of urbanization on the local climate and hydrology has sparked scientists' interest and inspired research for centuries (e.g., Howard, 1833; Oke, 1982; Fletcher et al., 2013; Hamdi et al., 2020). With the increasing population in cities (United Nations, 2018) more people are impacted by increased heat stress and flooding (Botzen et al., 2020; Gasparrini et al., 2017; Heaviside et al., 2016; Zhou et al., 2019). Spatial morphological heterogeneity and human interactions make understanding the urban climate challenging (Demuzere et al., 2022; Koopmans et al., 2020; Kotthaus & Grimmond, 2014a; Sun et al., 2018), but weather and climate models need to include the effects of urban areas, as they locally exacerbate extreme events (Hertwig et al., 2020; Oleson et al., 2008; Ronda et al., 2017). Examples are increased flooding due to high impervious fractions (Zhou et al., 2019) and increased heat stress during heat waves resulting from reduced evaporation (Lemonsu et al., 2015; Li et al., 2019). Therefore, models need to capture the impact of urban areas on their climate.
Researchers have developed, evaluated, and improved Urban Land Surface Models (ULSMs) simulating the interaction of the urban surface with the atmosphere. Coupled with a numerical weather prediction or climate model, ULSMs serve as a lower boundary condition and improve the model performance for urban environments (Tewari et al., 2007). ULSMs make different simplifying assumptions regarding urban geometry: a single homogeneous, impervious slab; multiple, individually homogeneous slabs; two-dimensional canyons; or 3D streets with individual buildings (Grimmond et al., 2009). These models also differ in whether and how they include physical processes like anthropogenic heat, irrigation, and snow processes (Lipson et al., 2024). To evaluate their performance, individual models are compared with observations (e.g., Grimmond & Oke, 2002; Hamdi & Schayes, 2007; Krayenhoff & Voogt, 2007; Porson et al., 2010; Ross & Oke, 1988). Although these individual evaluations were sometimes based on the same observations (Grimmond et al., 2009), the lack of a systematic approach prevented consistent comparison of the schemes. To compare the wide variety of models, two successive comparison projects applied a systematic approach. The first systematic comparison of ULSMs generally followed the PILPS protocol (Project for Intercomparison of Land surface Parameterization Schemes, Henderson-Sellers et al. (1996)), hence PILPS-Urban (Grimmond et al., 2010, 2011). Individual modelers received meteorological input and surface characteristics to enable them to run their models. In total, 32 models completed simulations for a site in Vancouver and one in Melbourne. Grimmond et al. (2011) concluded that increased model complexity did not necessarily benefit model performance.
The second intercomparison, Urban-PLUMBER (Lipson et al., 2024), assesses 30 models initially at the PILPS-Urban Melbourne site and adopts benchmarks following the PLUMBER project (Best et al., 2015). Benchmarks serve as a relative reference, to which models are compared to assess whether a cohort performs better (or not) than the benchmark and if input information is utilized effectively. Urban-PLUMBER is extended to the 20 sites presented by Lipson et al. (2022b) in the second phase (Lipson et al., 2023). The Urban-PLUMBER models outperform the PILPS-Urban ones for the sensible and latent heat flux. Some models representing two-dimensional canyons now perform nearly as well as one and two-tile models after efforts to improve hydrology and vegetation representation. However, models with complex urban geometry often still have relatively simple hydrology and vegetation and perform less well overall suggesting the representation of hydrology and vegetation requires more attention (Lipson et al., 2024).
Although PILPS-Urban and Urban-PLUMBER conclude vegetation and hydrology are important for model performance, neither project evaluates the water balance explicitly. The water balance satisfies the conservation of mass (Lavoisier, 1789) in the same way the energy balance satisfies the conservation of energy (Châtelet, 1740). The conservation of energy is forced in many ULSMs to prevent the energetic state of the model from drifting and the consequential, long-term bias in the modeled surface fluxes (Grimmond et al., 2010). Closure is achieved by either updating the surface temperatures based on the residual energy or restricting the turbulent heat fluxes to the available energy (Grimmond et al., 2010). Both PILPS-Urban and Urban-PLUMBER test whether models close the energy balance, but neither has verified the numerical closure of the water balance. Similar to the energy balance, an unclosed water balance can result in model biases and consequential drifting. These biases may in turn affect the energy balance, as the energy and water balances are linked through evapotranspiration (ET), the mass counterpart of the latent heat flux. This direct link implies errors and/or biases in one balance will affect the model's skill for the other balance. Recently, Yu et al. (2022) showed the hydrology in a coupled ULSM has the potential to improve the latent heat flux, humidity, and air temperature with impacts up into the boundary layer (1 km). The latent heat flux, and with it ET, has been amongst the most challenging fluxes for ULSMs from the first assessment (Ross & Oke, 1988) until now (Lipson et al., 2024). Given the link to the energy balance, we hypothesize closing the water balance will improve model performance for the energy balance fluxes.
However, the water balance cannot currently be assessed directly because observations at the appropriate spatiotemporal scales are lacking. While precipitation is measured routinely in many urban locations with rain gauges and rain radars, runoff, irrigation, and changes in water storage are not. Latent heat flux observations from eddy-covariance systems have substantial gaps introduced in the quality control process (Feigenwinter et al., 2012), which rejects more data close to rain events (Grimmond, 2006). Runoff is occasionally measured in urban catchments (Berthier et al., 1999; Walsh et al., 2005), but the source areas of runoff observations and eddy-covariance observations differ, which poses a challenge (Grimmond & Oke, 1986, 1991; Hellsten et al., 2015). External water use, often irrigation, further complicates the water balance in cities, as it mainly occurs at the micro-scale (e.g., garden irrigation). Water use at this scale can only be inferred from neighborhood piped water supply observations and water use surveys or estimated from weather, vegetation, and soil type (Grimmond & Oke, 1986; Kokkonen et al., 2018; Mitchell et al., 2001; Zeisl et al., 2018). Tree roots penetrate (sewer) pipes, causing damage (Randrup et al., 2001) and simultaneously taking out water, which is an unobserved term. Lastly, measuring the water storage change is logistically difficult, as this requires the state of each individual element contributing to water storage in the city, such as soil moisture, interception, groundwater, and surface water. Thus, a direct comparison with a full set of water balance observations is extremely challenging and an alternative approach is needed.
Here, we develop an alternative approach to evaluate the representation and dynamics of the water balance in ULSMs. To examine the water balance representation, we propose a UWBR (urban water balance representation) score. The score combines seven indicators assessing: water balance closure (1 indicator), evapotranspiration (ET, 2), water storage dynamics (2), and surface runoff (2). Given the lack of observations, the UWBR score is applied to rank models' capability to accurately capture different aspects of the water balance. Assessing the score of 19 Urban-PLUMBER ULSMs with a complete water balance representation helps to identify model improvement possibilities. The water balance representation is compared with the model skill for the turbulent heat fluxes, since we expect a better water balance representation to improve simulated latent heat fluxes.
Methods
Urban Water Balance Representation (UWBR) Score
The UWBR score is a linear sum of seven indicators of a good water balance, each of which is assigned a value of one if a specified threshold is passed (Table 1), except the water storage range indicator, for which each of the two sub-metrics is assigned 0.5 if passed. No weights are assigned, as these cannot be determined objectively. The UWBR score is compared with the model performance for the latent heat flux assessed with metrics capturing different characteristics (Willmott, 1982) that are not entirely independent:
- Absolute mean bias error (MBE) assesses the bias, providing insight into how well the magnitude of the latent heat flux is modeled.
- Coefficient of determination (R²) captures the consistency of the timing, as R² decreases with a shift in a quasiperiodic signal like the latent heat flux.
- Normalized standard deviation (modeled standard deviation divided by observed standard deviation) compares the variability, which is dominated by the daily cycle in the case of the latent heat flux.
- Systematic Mean Absolute Error indicates the average error. The systematic error is separated from the unsystematic error similar to the approach presented by Willmott (1982) for the root mean square error. This separation allows us to distinguish between systematic and random errors.
- Unsystematic Mean Absolute Error assesses how well the erratic behavior is captured.
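The five metrics listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: the function name `evaluation_metrics` is hypothetical, the coefficient of determination is taken as the squared Pearson correlation, and the systematic/unsystematic split uses a least-squares line of model against observations in the spirit of Willmott (1982).

```python
from statistics import mean, pstdev

def evaluation_metrics(model, obs):
    """Return five skill metrics for paired model/observation flux series."""
    m_bar, o_bar = mean(model), mean(obs)
    sd_m, sd_o = pstdev(model), pstdev(obs)
    # Absolute mean bias error
    abs_mbe = abs(m_bar - o_bar)
    # Coefficient of determination, assumed here to be the squared correlation
    cov = mean((m - m_bar) * (o - o_bar) for m, o in zip(model, obs))
    r2 = (cov / (sd_m * sd_o)) ** 2
    # Normalized standard deviation: modeled over observed variability
    sd_norm = sd_m / sd_o
    # Least-squares line model = a + b * obs, used to split the error
    b = cov / sd_o ** 2
    a = m_bar - b * o_bar
    fitted = [a + b * o for o in obs]
    # Systematic part: distance of the fitted line from the observations
    mae_sys = mean(abs(f - o) for f, o in zip(fitted, obs))
    # Unsystematic part: scatter of the model around its own fitted line
    mae_unsys = mean(abs(m - f) for m, f in zip(model, fitted))
    return {"abs_mbe": abs_mbe, "r2": r2, "sd_norm": sd_norm,
            "mae_sys": mae_sys, "mae_unsys": mae_unsys}

# Hypothetical latent heat fluxes (W m-2): the model is a scaled, offset copy
obs = [0.0, 50.0, 100.0, 150.0, 100.0, 50.0]
model = [10.0 + 1.2 * o for o in obs]
metrics = evaluation_metrics(model, obs)
```

In this constructed case the model error is purely systematic, so the unsystematic error vanishes while the bias and the normalized standard deviation both flag the mismatch.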
Table 1 Overview of the Seven Indicators That are Linearly Combined in the UWBR Score, Which is Used to Evaluate the Urban Water Balance Representation in ULSMs
Water balance flux | Indicator | Description | Timescale | Criterion | Equation |
All | | Closure of the annual water balance, assessed relative to the precipitation plus irrigation | Annual | < 3% of P + I | (Equation 3) |
ET | | Modeled cumulative ET normalized by the benchmark over the whole model period | Modeled period | Within benchmark uncertainty* | |
ET | | Similarity of recession timescale distribution between model and observations from the whole model run | Modeled period | Kolmogorov-Smirnov test (Chakravarti et al., 1967) | |
Water storage | | Range over the whole model run in stored water for both the modeled explicit and implicit water storage compared to water storage capacity | Modeled period | Within storage capacity (50% of soil volume + 3 mm interception) | (Equation 2) and (Equation 1) |
Water storage | | Coefficient of determination between changes in explicit and implicit modeled water storage over the whole model period | Modeled period | > 0.9 | 1 − … |
Surface runoff | | Curve number from modeled runoff events and from site characteristics | Event | Within uncertainty* | (Section 2.1.4) |
Surface runoff | | Mean lag (hours) between center of mass of precipitation and surface runoff of all events | Event | < 1 hour | |
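The way the indicators in Table 1 combine into one score can be sketched as follows. The function and argument names are hypothetical and chosen only for readability; the half points for the two storage range checks follow the description of the score above.

```python
def uwbr_score(closure, et_cumulative, et_recession,
               storage_explicit_in_range, storage_implicit_in_range,
               storage_change_r2, curve_number, lag_time):
    """Unweighted sum of the seven indicators (maximum 7).

    Each argument is True when the corresponding criterion is passed.
    The storage range indicator consists of two sub-checks worth 0.5 each.
    """
    full_point = [closure, et_cumulative, et_recession,
                  storage_change_r2, curve_number, lag_time]
    half_point = [storage_explicit_in_range, storage_implicit_in_range]
    return (sum(1.0 for passed in full_point if passed)
            + sum(0.5 for passed in half_point if passed))

# Hypothetical model run: closes the balance, passes the recession check and
# only the implicit half of the storage range check
example_score = uwbr_score(True, False, True, False, True, False, False, False)
perfect_score = uwbr_score(True, True, True, True, True, True, True, True)
```

Because no weights are applied, a run can reach the same score through very different combinations of indicators, which is why the pass rates per indicator are also worth inspecting.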
Before the individual indicators are introduced, we define two ways to calculate water storage from the model output, based on either the water storage term (explicit, Equation 2) or the other terms of the water balance combined (implicit, Equation 1). Assuming that the net change in water stored in a “catchment” or a model grid can be derived from the difference between the incoming and outgoing water fluxes, the implicit water storage change equals the sum of the incoming fluxes minus the sum of the outgoing fluxes.
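As a sketch of these two definitions, the notation below is assumed for illustration and is not taken from the original equations: $P$ is precipitation, $I$ irrigation, $ET$ evapotranspiration, $R$ total runoff, and $S_k$ the individual stores a model reports (for example soil moisture, interception, or snow). The implicit storage change follows from the fluxes, while the explicit one is read from the reported states:

```latex
\Delta S_{\mathrm{implicit}} = P + I - ET - R
\qquad
\Delta S_{\mathrm{explicit}} = \sum_{k} \Delta S_{k}
```

When a model conserves mass and reports all fluxes and stores, both quantities coincide over any period.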
Water Balance Closure
Water balance closure assumes that all fluxes add up to zero for the time and space under consideration (here 1 and 1 year):
The water balance closure indicator (Table 1) assesses whether the total sum of all fluxes (including storage) is less than 3% of P + I. The 3% threshold allows for non-closure due to interception storage data not being provided in the model output, errors arising in latent heat flux unit conversion, or numerical model errors. According to the literature, interception storage amounts to 0.5–3 mm, explaining a non-closure of up to 0.5% when it is not provided (Carlyle-Moses et al., 2020; Klaassen et al., 1998; Wouters et al., 2015). Converting the latent heat flux to ET can result in variations up to 2% depending on temperature and snow effects (Bringfelt, 1986; Petrucci et al., 2010). Not all models correct for these effects. To account for numerical model errors arising from discretization and time stepping (MacKay et al., 2022), we allow deviations of up to 0.5%.
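The closure check described above reduces to a residual expressed as a fraction of the water input. The sketch below assumes annual totals in mm and a single lumped runoff term; the function names are hypothetical, and the 3% default reflects the 0.5% + 2% + 0.5% allowance explained in the text.

```python
def closure_residual_fraction(precip, irrigation, et, runoff, storage_change):
    """Absolute water balance residual as a fraction of P + I (inputs in mm)."""
    water_input = precip + irrigation
    residual = water_input - et - runoff - storage_change
    return abs(residual) / water_input

def passes_closure(precip, irrigation, et, runoff, storage_change,
                   threshold=0.03):
    """True when the residual stays below the allowed fraction of P + I."""
    fraction = closure_residual_fraction(precip, irrigation, et, runoff,
                                         storage_change)
    return fraction < threshold

# Hypothetical annual totals (mm): a 10 mm residual on 800 mm of input
closed_run = passes_closure(800.0, 0.0, 400.0, 350.0, 40.0)
# Same fluxes, but the storage change is not reported: 50 mm residual
unclosed_run = passes_closure(800.0, 0.0, 400.0, 350.0, 0.0)
```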
Evapotranspiration
The two ET indicators address its magnitude and timing. The non-randomly distributed gaps in ET observations prevent direct comparison of total modeled ET over a model period. Thus, we use one of the Lipson et al. (2024) benchmark models, which allows a total ET to be obtained without gaps. The benchmark model is derived using multivariate ordinary least squares regressions with a K-means clustering approach. The K-means clustering is trained in-sample using 81 clusters on four variables: incoming shortwave radiation, air temperature, relative humidity, and wind speed (KM4-IS-SWdown-Tair-RH-Wind in Lipson et al., 2024). To reduce the hourly MBE, wind speed is omitted at both Helsinki sites. At all sites, the MBE is below 1 W m⁻² and at most sites below 0.1 W m⁻², evaluated against available data.
Therefore, the benchmark is assumed to provide a reasonable estimate of the total flux over the model run for the cumulative ET indicator (Table 1). We compare the fluxes in energy units rather than water units, eliminating unit conversions, and calculate the cumulative flux uncertainty from the benchmark based on (a) the benchmark MBE multiplied by the run duration, and (b) the lack of energy balance closure associated with eddy-covariance observations (Foken et al., 2012; Franssen et al., 2010; Mauder et al., 2020). The lack of energy balance closure is calculated as the net all-wave radiation minus the sum of the turbulent heat fluxes. The storage and anthropogenic heat fluxes are not observed, which prevents constraining the turbulent heat fluxes with energy balance closure. If a lack of closure occurs, the unexplained energy over the whole model run is split between the latent and sensible heat fluxes according to the Bowen ratio based on the benchmark fluxes (Hirschi et al., 2017; Mauder et al., 2020; Twine et al., 2000).
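Splitting an energy residual according to the Bowen ratio can be illustrated as below. This is a sketch of the general Bowen-ratio partitioning idea, with hypothetical names and run-mean fluxes in W m⁻²; how the resulting share is combined with the MBE-based term into the final uncertainty is not specified here and is left out.

```python
def partition_residual(q_star, q_h, q_e):
    """Split unexplained energy between Q_H and Q_E, keeping their ratio.

    q_star: net all-wave radiation, q_h: sensible heat flux,
    q_e: latent heat flux (same units, e.g. run means in W m-2).
    Returns the amounts attributed to (Q_H, Q_E).
    """
    residual = q_star - (q_h + q_e)
    bowen = q_h / q_e                 # Bowen ratio of the benchmark fluxes
    add_qe = residual / (1.0 + bowen)
    add_qh = residual - add_qe
    return add_qh, add_qe

# Hypothetical run-mean fluxes: 100 W m-2 is unexplained, Bowen ratio is 1.5
extra_qh, extra_qe = partition_residual(300.0, 120.0, 80.0)
```

After adding the two shares, the adjusted fluxes keep the original Bowen ratio, which is the defining property of this partitioning.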
The timing of modeled ET is assessed assuming exponential recession after rainfall, based on the recession timescale estimated following the Jongen et al. (2022) methodology. This methodology considers only the first 10 days to exclude the influence of longer dry periods and irrigation. A daily timescale analysis circumvents observational gaps. Whether the modeled and observed recession timescales have the same distribution is assessed with a Kolmogorov-Smirnov test (Chakravarti et al., 1967). The indicator is assigned a value of 1 when the p-value is below 0.05.
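Both steps of this paragraph can be sketched with the standard library only. The code below is an illustration, not the Jongen et al. (2022) procedure: the timescale is obtained from a simple log-linear least-squares fit of strictly positive daily ET, and the two-sample Kolmogorov-Smirnov p-value uses a common asymptotic series approximation. All function names are hypothetical.

```python
import math

def recession_timescale(daily_et, max_days=10):
    """Fit ET(t) = ET0 * exp(-t / tau) to a drydown; return tau in days."""
    values = daily_et[:max_days]          # only the first days after rainfall
    days = list(range(len(values)))
    logs = [math.log(v) for v in values]  # assumes strictly positive ET
    t_bar = sum(days) / len(days)
    y_bar = sum(logs) / len(logs)
    slope = (sum((t - t_bar) * (y - y_bar) for t, y in zip(days, logs))
             / sum((t - t_bar) ** 2 for t in days))
    return -1.0 / slope

def ks_two_sample(sample_a, sample_b):
    """Two-sample K-S statistic D and an asymptotic p-value."""
    n, m = len(sample_a), len(sample_b)
    d = 0.0
    for x in list(sample_a) + list(sample_b):
        # Empirical cumulative distributions evaluated at x
        cdf_a = sum(1 for v in sample_a if v <= x) / n
        cdf_b = sum(1 for v in sample_b if v <= x) / m
        d = max(d, abs(cdf_a - cdf_b))
    n_eff = math.sqrt(n * m / (n + m))
    lam = (n_eff + 0.12 + 0.11 / n_eff) * d
    if lam < 0.2:                         # series is effectively 1 here
        return d, 1.0
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                  for j in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Hypothetical drydown with a 4-day timescale
tau = recession_timescale([2.0 * math.exp(-t / 4.0) for t in range(10)])
# Hypothetical timescale samples (days): identical and fully separated
d_same, p_same = ks_two_sample([2, 3, 4, 5, 6], [2, 3, 4, 5, 6])
d_diff, p_diff = ks_two_sample(list(range(1, 11)), list(range(11, 21)))
```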
Water Storage
The water storage range indicator evaluates the water storage by comparing the modeled explicit and implicit water storage ranges (Section 2.1) over the analysis period with the estimated water storage capacity. According to the literature, soil water storage capacity is maximally half the soil depth for all soil types (Saxton et al., 1986). This maximum is set as a storage capacity that models should not exceed rather than as a realistic value. As urban soils are frequently disturbed, making them spatially heterogeneous, reliable maps are rarely available (Van de Vijver et al., 2020). As the modeled soil depth depends on the model run, the soil water storage capacity is calculated for each run separately. To account for interception storage, 3 mm is added to the estimated water storage capacity based on tree and impervious interception observations (Carlyle-Moses et al., 2020; Klaassen et al., 1998; Wouters et al., 2015). The two models not including soil moisture do not pass the first check of this indicator and are only evaluated based on the implicit water storage (Table 2). Other models receive a score of 0.5 when either the modeled explicit or implicit water storage range falls within the estimated water storage capacity (or 1 when both do).
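The capacity estimate and the two half-point checks can be written down directly. The sketch assumes storage series in mm and a soil depth in m; the function names are hypothetical, and passing `None` for the explicit series mimics a model without soil moisture output.

```python
def storage_capacity_mm(soil_depth_m, interception_mm=3.0):
    """Upper bound on stored water: half the soil depth plus interception."""
    return 0.5 * soil_depth_m * 1000.0 + interception_mm

def storage_range_score(explicit_series, implicit_series, soil_depth_m):
    """Score 0.5 for each storage series whose range fits the capacity."""
    capacity = storage_capacity_mm(soil_depth_m)
    score = 0.0
    for series in (explicit_series, implicit_series):
        if series is None:            # no output available: no half point
            continue
        if max(series) - min(series) <= capacity:
            score += 0.5
    return score

# Hypothetical run with 2 m of soil: capacity is 1003 mm
explicit = [100.0, 250.0, 400.0, 300.0]    # range of 300 mm, within capacity
implicit = [0.0, 600.0, 1200.0, 900.0]     # range of 1200 mm, exceeds capacity
score_mixed = storage_range_score(explicit, implicit, 2.0)
score_no_soil = storage_range_score(None, explicit, 2.0)
```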
The storage-change consistency indicator quantifies the internal temporal consistency between the change in explicit (Equation 2) and implicit (Equation 1) water storage, which should represent the same flux. The coefficient of determination (Willmott, 1982) is calculated from storage changes in the 30-min (or 60-min) model output, depending on the site forcing data. This metric equals one if the timing of the two fluxes is similar, independent of the flux bias, unlike other indicators. The two models without soil moisture output are assigned a value of 0 for this indicator, as their performance could not be evaluated.
Surface Runoff
The curve number indicator assesses the surface runoff magnitude by relating total event precipitation to total event surface runoff (Figure 1a). Without runoff observations, curve numbers (CN) are derived to evaluate modeled total event surface runoff (Cronshey et al., 1985), based on the relation between the total event precipitation and the total event surface runoff (Equation 5).
[IMAGE OMITTED. SEE PDF]
For each site, the CN is estimated using a linear interpolation of a look-up table considering the impervious fraction within the eddy-covariance footprint (Cronshey et al., 1985). Given that soil texture influences the CN, the sand fraction (Brakensiek & Rawls, 1983; Nachtergaele, 2001) obtained from a global data set (OpenLandMap; Hengl, 2018) is used to constrain the CN. Given the uncertainty of urban soil maps, using sand fraction is a repeatable way to assign the most uncertainty to the look-up tables, assuming a one-third change of CN from a one-level change in soil texture in either direction. If the site CN, including its uncertainty, overlaps with the model CN, including its uncertainty, the indicator is assigned a value of 1.
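The relation between event precipitation, event runoff, and the curve number can be sketched with the standard SCS curve number formulation in metric units. The initial abstraction ratio of 0.2 is the conventional default and is an assumption here, as are the function names; the site look-up tables and their interpolation are not reproduced.

```python
import math

def scs_runoff(p_mm, cn, ia_ratio=0.2):
    """Event runoff (mm) from event precipitation (mm) and a curve number."""
    s = 25400.0 / cn - 254.0          # potential maximum retention (mm)
    ia = ia_ratio * s                 # initial abstraction (mm)
    if p_mm <= ia:
        return 0.0                    # all rainfall is retained
    return (p_mm - ia) ** 2 / (p_mm - ia + s)

def curve_number_from_event(p_mm, q_mm):
    """Invert the relation for a single event, valid for ia_ratio = 0.2."""
    s = 5.0 * (p_mm + 2.0 * q_mm
               - math.sqrt(4.0 * q_mm ** 2 + 5.0 * p_mm * q_mm))
    return 25400.0 / (s + 254.0)

# Hypothetical 50 mm event on a surface with CN = 85
q_event = scs_runoff(50.0, 85.0)
cn_back = curve_number_from_event(50.0, q_event)
# A 5 mm event stays below the initial abstraction and yields no runoff
q_small = scs_runoff(5.0, 85.0)
```

A model that produces less runoff than expected for the same event precipitation yields a larger retention and therefore a lower curve number.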
The lag time indicator addresses the rainfall-runoff response times (Leopold, 1968). The lag time is calculated as the difference between the centroids of rainfall and surface runoff for the same events as in the CN calculations (Figure 1a). Long-tail rainfall events are excluded when the runoff centroid comes before the rainfall centroid. As eddy-covariance systems have a footprint on the sub-square-kilometer scale (Feigenwinter et al., 2012), lag time is expected to be much shorter than 30–60 min (Berne et al., 2004; Morin et al., 2001; Yao et al., 2016), which is the model output resolution (Lipson et al., 2024). Therefore, the mean lag time needs to be less than 1 hour. The mean is preferred over the median to also pinpoint models that occasionally have long lag times that would not affect the median. Lag times of intermittent precipitation-runoff events will only decrease, as storages are already (partly) filled by earlier precipitation. Events containing dry periods of less than 5 hours should therefore also have lag times of less than 1 hour.
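The centroid-based lag time is a flux-weighted mean time difference. The sketch below uses hypothetical function names and assumes each event is given as equally indexed lists of time (hours), rainfall, and surface runoff.

```python
def centroid_time(times_h, values):
    """Flux-weighted mean time (center of mass) of an event series."""
    return sum(t * v for t, v in zip(times_h, values)) / sum(values)

def lag_time_hours(times_h, rain, runoff):
    """Lag between the runoff centroid and the rainfall centroid (hours)."""
    return centroid_time(times_h, runoff) - centroid_time(times_h, rain)

def passes_lag_indicator(event_lags, threshold_h=1.0):
    """The mean lag over all events must stay below the threshold."""
    return sum(event_lags) / len(event_lags) < threshold_h

# Hypothetical event on a 30-min time axis
times = [0.0, 0.5, 1.0, 1.5, 2.0]
rain = [2.0, 6.0, 2.0, 0.0, 0.0]      # centroid at 0.5 h
runoff = [0.0, 1.0, 3.0, 1.0, 0.0]    # centroid at 1.0 h
lag = lag_time_hours(times, rain, runoff)
```

Using the mean rather than the median means a single event with a very long lag can make a run fail, which matches the intent stated in the text.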
Models
The present study anonymously analyzes the water balance outputs from 19 Urban-PLUMBER ULSMs (Table 2). Other Urban-PLUMBER ULSMs did not submit the necessary outputs to allow for a water balance assessment. The outputs are for 20 sites covering a range of climates, impervious fractions, and observational periods (Table 3). As two models did not run all sites, 377 runs are analyzed.
Table 2 Overview of the 19 Urban Land Surface Models in the Water Balance Analysis Based on Lipson et al. (2024)
Model | Urban geometry | Vegetation | Soil hydrology | Snow accumulation | Irrigation | Water balance closure check | Reference |
ASLUMv2.0 | Canyon | Grass | Multi-layer | No | No^b | No^c | Z.-H. Wang et al. (2013); C. Wang et al. (2021) |
ASLUMv3.1 | Canyon | Grass + trees | Multi-layer | No | No^b | No^c | Z.-H. Wang et al. (2013); C. Wang et al. (2021) |
CABLE | Non-urban | Separate tiles | Multi-layer | Veg. | No | Yes | Kowalczyk et al. (2006); Y. P. Wang et al. (2011) |
ECLand | Non-urban | Separate tiles | Multi-layer | Veg. | No | No^d | Boussetta et al. (2021) |
ECLand-U | Two-tile | Separate tiles | Multi-layer | Veg. + urban | No | No^d | McNorton et al. (2021); Boussetta et al. (2021) |
CLMU5 | Canyon | Grass + shrubs | Multi-layer | Urban | No | Yes | Oleson and Feddema (2020) |
JULES 1T | One-tile | Separate tiles | Multi-layer | Veg. + urban | No | Yes | Best et al. (2011) |
JULES 2T | Two-tile | Separate tiles | Multi-layer | Veg. + urban | No | Yes | Best et al. (2011) |
JULES MOR | Two-tile | Separate tiles | Multi-layer | Veg. + urban | No | Yes | Best et al. (2011) |
Lodz-SUEB | One-tile | Lumped with urban | Multi-layer^a | Veg. + urban | No | No | Fortuniak (2003) |
Manabe 1T | One-tile | Manabe bucket | One-layer | Veg. + urban | No | No | Best et al. (2011); Manabe (1969) |
Manabe 2T | Two-tile | Manabe bucket | One-layer | Veg. + urban | No | No | Best et al. (2011); Manabe (1969) |
NOAH-SLAB | One-tile | Separate tiles | Multi-layer | Veg. + urban | No | No | Kusaka et al. (2001); Ek et al. (2003) |
NOAH-SLUCM | Canyon | Separate tiles | Multi-layer | Veg. + urban | No | No | Kusaka et al. (2001); Ek et al. (2003) |
SNUUCM | Canyon | Separate tiles | Multi-layer^a | Veg. | No | No | Ryu et al. (2011); Ek et al. (2003) |
SUEWS | Two-tile | Separate tiles | One-layer | Veg. + urban | No^b | Yes | Järvi et al. (2011); Ward et al. (2016) |
TERRA 4.11 | One-tile | Separate tiles | Multi-layer | Veg. | No | No | Wouters et al. (2015); Schulz and Vogel (2020) |
UCLEM | Canyon | Grass + shrubs | One-layer | Veg. + urban | Yes | No | Thatcher and Hurley (2012); Lipson et al. (2018) |
UT&C | Canyon | Grass + shrubs + trees | Multi-layer | No | Yes | Yes | Meili et al. (2020) |
Table 3 Model (Table 2) Outputs Are Analyzed for 20 Sites (Lipson et al., 2022b)
Country | City (site) | Name | Lat. (°) | Lon. (°) | Observed period (days) | Köppen-Geiger climate | LCZ | Impervious fraction | (m) | (m) | Reference |
Australia | Melbourne (Preston) | AU-Preston | −37.73 | 145.01 | 475 | Cfb | 6 | 0.62 | 8 | 40 | Coutts et al. (2007a); Coutts et al. (2007b) |
Australia | Melbourne (Surrey Hills) | AU-SurreyHills | −37.83 | 145.10 | 148 | Cfb | 6 | 0.54 | 8 | 38 | Coutts et al. (2007a); Coutts et al. (2007b) |
Canada | Vancouver (Sunset) | CA-Sunset | 49.23 | −123.08 | 1,827 | Csb | 6 | 0.68 | 3 | 25 | Christen et al. (2011); Crawford and Christen (2015) |
Finland | Helsinki (Kumpula) | FI-Kumpula | 60.20 | 24.96 | 1,096 | Dfb | Mix | 0.46 | 6 | 31 | Karsisto et al. (2016) |
Finland | Helsinki (Torni) | FI-Torni | 60.17 | 24.94 | 1,096 | Dfb | 2 | 0.77 | 15 | 60 | Nordbo et al. (2013); Järvi et al. (2018) |
France | Toulouse (Capitole) | FR-Capitole | 43.60 | 1.45 | 375 | Cfa | 2 | 0.90 | 11 | 48 | Masson et al. (2008); Goret et al. (2019) |
Greece | Heraklion | GR-HECKOR | 35.34 | 25.13 | 367 | Csa | 3 | 0.92 | 17 | 27 | Stagakis et al. (2019) |
Japan | Tokyo (Yoyogi) | JP-Yoyogi | 35.66 | 139.68 | 1,461 | Cfa | 2 | 0.92 | 28 | 52 | Hirano et al. (2015); Ishidoya et al. (2020) |
South Korea | Seoul (Jungnang) | KR-Jungnang | 37.59 | 127.08 | 825 | Dwa | 3 | 0.97 | 15 | 42 | J.-W. Hong et al. (2020); S.-O. Hong et al. (2023) |
South Korea | Cheongju (Ochang) | KR-Ochang | 36.72 | 127.43 | 780 | Dwa | 5 | 0.47 | 4 | 19 | J.-W. Hong et al. (2019); J.-W. Hong et al. (2020) |
Mexico | Mexico City (Escandon) | MX-Escandon | 19.40 | −99.18 | 470 | Cwb | 2 | 0.94 | 8 | 37 | Velasco et al. (2011); Velasco et al. (2014) |
Netherlands | Amsterdam | NL-Amsterdam | 52.37 | 4.89 | 652 | Cfb | 2 | 0.68 | 10 | 40 | Steeneveld et al. (2020) |
Poland | Łódź (Lipowa) | PL-Lipowa | 51.76 | 19.45 | 1,827 | Dfb | 2 | 0.76 | 7 | 37 | Pawlak et al. (2011); Fortuniak et al. (2013) |
Poland | Łódź (Narutowicza) | PL-Narutowicza | 51.77 | 19.48 | 1,827 | Dfb | 2 | 0.65 | 11 | 42 | Fortuniak et al. (2006); Fortuniak et al. (2013) |
Singapore | Singapore (Telok Kurau) | SG-TelokKurau | 1.31 | 103.91 | 366 | Af | 3 | 0.85 | 7 | 24 | Roth et al. (2017) |
UK | London (King's College) | UK-KingsCollege | 51.51 | −0.12 | 638 | Cfb | 2 | 0.79 | 15 | 50 | Kotthaus and Grimmond (2014a); Kotthaus and Grimmond (2014b); Bjorkegren et al. (2015) |
UK | Swindon | UK-Swindon | 51.58 | −1.80 | 715 | Cfb | 6 | 0.49 | 4 | 13 | Ward et al. (2013) |
USA | Baltimore (Cub Hill) | US-Baltimore | 39.41 | −76.52 | 1,826 | Cfa | 6 | 0.31 | 4 | 37 | Crawford et al. (2011) |
USA | Minneapolis | US-Minneapolis1 | 45.00 | −93.19 | 1,093 | Dfa | 6 | 0.21 | 3 | 40 | Peters et al. (2011); Menzer and McFadden (2017) |
USA | Phoenix (West) | US-WestPhoenix | 33.48 | −112.14 | 382 | BWh | 6 | 0.48 | 3 | 22 | Chow et al. (2014); Chow (2017) |
For each site, modelers were provided with the site characteristics and meteorological forcing with 10-year spin-up data (Lipson et al., 2022b). The spin-up period required to reach equilibrium varies per model, with some requiring many years to come to hydrological equilibrium with the forcing meteorology (Best & Grimmond, 2016; Yang et al., 1995). The 10 years of spin-up before the evaluation observations allowed the soil moisture stores to equilibrate with local conditions prior to analysis. ERA5 reanalysis data (Hersbach et al., 2020) are used to derive hourly forcing with bias-correction including diurnal and seasonal effects for each site (Lipson et al., 2022b).
Depending on site data, evaluation is undertaken with 30- or 60-min fluxes for periods varying between 148 and 1,827 days (average 912 days, Table 3). Similar to the Urban-PLUMBER protocol, to minimize human errors, modelers received a preliminary analysis of the water balance to help identify major issues and were encouraged to update their results. This eliminated unit errors, added missing variables, and removed inactive soil moisture layers.
For this study, we harmonize the hydrological model output. If a model only provided the latent heat flux (unit: W m⁻²), it is converted to ET (unit: kg m⁻² s⁻¹) using the latent heat of vaporization accounting for air temperature (Bringfelt, 1986). When snow is present, the latent heat of fusion is added to the latent heat of vaporization to obtain the latent heat of sublimation (Petrucci et al., 2010). In the forcing, precipitation is split into snowfall and rainfall. At only 30% of the sites, snowfall amounts to more than 10% of the precipitation. Snowfall is added as rainfall for one model without snow hydrology, while the two other models without snow hydrology do not account for this input. Irrigation is simulated in two models. For all other models, irrigation is assumed to be zero.
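The unit conversion described above can be sketched as follows. The linear temperature dependence of the latent heat of vaporization and the constant latent heat of fusion are common textbook approximations assumed here for illustration; they are not necessarily the expressions of Bringfelt (1986) or Petrucci et al. (2010), and the function names are hypothetical.

```python
LATENT_HEAT_FUSION = 3.34e5  # J kg-1, assumed constant

def latent_heat_of_vaporization(t_air_c):
    """Approximate latent heat of vaporization (J kg-1) at t_air_c (deg C)."""
    return 2.501e6 - 2361.0 * t_air_c

def et_from_latent_heat(q_e, t_air_c, dt_s, snow_present=False):
    """Convert a latent heat flux (W m-2) to a water depth (mm) per time step.

    One kg of water per square meter corresponds to a depth of one mm.
    """
    latent_heat = latent_heat_of_vaporization(t_air_c)
    if snow_present:
        # Sublimation: fusion plus vaporization
        latent_heat += LATENT_HEAT_FUSION
    return q_e / latent_heat * dt_s

# Hypothetical hour with 100 W m-2 at 20 deg C
et_warm_mm = et_from_latent_heat(100.0, 20.0, 3600.0)
# Same flux over snow at 0 deg C yields less water for the same energy
et_snow_mm = et_from_latent_heat(100.0, 0.0, 3600.0, snow_present=True)
```

The difference between the two results illustrates why ignoring temperature and snow effects can shift the converted ET by a few percent.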
Results
The 19 ULSMs show a wide spread in the average yearly water fluxes at all 20 sites based on all 377 model runs (Figure 2). Overall, the model spread (whiskers, Figure 2) is often wider than the modeled ensemble mean flux (bars, Figure 2). Models show more variation in ET than in runoff. Sites with higher annual water input have more variability in model output fluxes, for example, the relatively high fluxes in KR-Jungnang and SG-TelokKurau compared to the lower yearly fluxes in PL-Lipowa and US-WestPhoenix.
[IMAGE OMITTED. SEE PDF]
Water Balance Closure
Although the annual mean model ensemble almost closes the water balance at most sites (Figure 2), most individual models do not close the water balance (Figure 3). Here, closure is assumed when the sum of all fluxes (Equation 3) is less than 3% of P + I. This occurs in 57% of the model runs (Figure 4). In 25% of the model runs, non-closure exceeds 10% of P + I. Closure is model-related, as the bias is similar across sites for each model (Figure 3). Five models close the water balance in all runs, whereas four models account for 48% of unclosed model runs. Three models pass their internal water balance closure check but do not always pass this closure check, possibly due to unreported modeled water fluxes or inconsistencies in the way fluxes were reported. To assess the impact of model run length, the analysis is repeated with sites with more than 2 years of observations, yielding similar results.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Evapotranspiration
Comparison of the modeled mean diurnal cycle of ET (Figure 5) shows the highest inter-model spread at the peak of the diurnal cycle, with a range of 10%–600% of the model ensemble-mean flux. Across three sites with contrasting precipitation regimes (US-WestPhoenix, AU-Preston, and SG-TelokKurau), ET increases as expected at wetter sites. At US-WestPhoenix, all models but one underestimate peak ET. This underestimation likely results from the absence of irrigation in nearly all models, while irrigation is common at US-WestPhoenix (Templeton et al., 2018). The one overestimating model does not include irrigation. At the other two sites, around half the models underestimate ET (Figure 5). Although the model medians are better for these sites, the difficulty of capturing the correct flux magnitude is evident, as the cumulative ET indicator is passed by only 26% of the model runs (Figure 4). No model passes this indicator at more than half of the sites.
[IMAGE OMITTED. SEE PDF]
After different rainfall events, daily ET decreases with varying timescales in both the observations and the models (Figure 6). The variation is higher amongst the modeled than the observed drydowns. In contrast with the ET magnitude, the recession timescale shows no link with annual precipitation. The recession timescale indicator shows the recession timescale is captured correctly in 87% of the cases (Figure 4).
[IMAGE OMITTED. SEE PDF]
Water Storage
Not all models have explicit water storage values (Equation 2) that are equal to the implicit values (Equation 1, Figure 7), which is seen across all sites (not shown). However, the explicit water storage should reflect the implicit storage, as the explicit storage change is equal to the net of all water fluxes. For five models, the explicit storage change is equal to the implicit storage change at all sites. Minor differences occur in six models and large differences in six others. Two models have no differences at sites without snowfall (e.g., AU-Preston) but large differences at sites with snowfall (e.g., CA-Sunset). As these models do not account for the snowfall in the input we see an increasing difference between the explicit and implicit water storage. The models with larger differences follow a seasonal cycle likely caused by non-restricted implicit water storage combined with restricted explicit water storage by soil storage capacity.
[IMAGE OMITTED. SEE PDF]
The range of modeled water storage exceeds the estimated site water storage capacity in 64% of cases (Figure 4). Models 1 and 5 have the lowest score for this indicator, because they have an inconsistency between the inputs and outputs (Equation 3) causing non-closure of the water balance at nearly all sites. Three models never exceed the estimated water storage capacity.
How explicit relates to implicit water storage is linked to the individual models, given the consistent results across sites (Figure 8). With magnitude represented by water balance closure, we focus on the timing by assessing the explicit relative to the implicit water storage (Figures 9a–9c). Model runs can have comparable directions but different patterns, for example, model 11 (Figure 9a), comparable patterns but different magnitudes of change, for example, model 9 (Figure 9b), or virtually no differences (e.g., model 18, Figure 9c). The explicit and implicit water storage changes (Figures 9d–9f) emphasize the difference in timing, which is why the indicator uses the coefficient of determination of these derivatives. Only five models have virtually no differences and thus a coefficient of determination of 1 (Figure 4). Over half of the models have a coefficient of determination greater than 0.9, indicating timing consistency (Figure 4).
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Surface Runoff
All models have surface runoff triggered by precipitation, but the precipitation event size causing runoff events differs between models (Figure 10). The model rather than the site seems to explain the triggering event size, despite the variation amongst sites in impervious fractions and precipitation regimes. This suggests that surface runoff parameterization may be critical. Thus, we find a large inter-model spread in the cumulative modeled surface runoff (Figure 2). One model is excluded as it does not output surface runoff separately. Ten models show the expected increase of cumulative surface runoff with increasing site impervious fraction (p < 0.05, Wald test; Wald, 1943), whereas nine models do not (Figure S2 in Supporting Information S1).
Only in 43 of the 337 model runs is the curve number (CN; Section 2.1.4) captured correctly, passing (Figure 4), so all other model runs have no overlap with the site estimates (see Section 2.1.4). Three models capture the CN correctly for at least half of their model runs and are responsible for 32 of the successful model runs. Most models do not match the relation between event precipitation and surface runoff. Most models underestimate the CN relative to the site estimate (Figure S3 in Supporting Information S1). Underestimating the CN indicates a model is overestimating surface interception and/or soil infiltration, reducing surface runoff (Equation 5).
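To illustrate why an underestimated curve number implies less runoff, the sketch below uses the widely used SCS curve number relation in metric units with an initial abstraction ratio of 0.2. This standard form is an assumption here; the formulation in Equation 5 of the study may differ in detail, and the function name and example values are hypothetical.

```python
def scs_runoff(event_precip_mm, curve_number, ia_ratio=0.2):
    """Event runoff (mm) from event precipitation (mm) for a given curve number.

    Assumes the standard SCS-CN form: potential retention S = 25400/CN - 254,
    initial abstraction Ia = ia_ratio * S, runoff Q = (P - Ia)^2 / (P - Ia + S).
    """
    if not 0 < curve_number <= 100:
        raise ValueError("curve number must be in (0, 100]")
    retention = 25400.0 / curve_number - 254.0
    initial_abstraction = ia_ratio * retention
    if event_precip_mm <= initial_abstraction:
        # All precipitation is intercepted or infiltrated; no runoff.
        return 0.0
    excess = event_precip_mm - initial_abstraction
    return excess ** 2 / (excess + retention)


# The same 50 mm event under a high and a low curve number.
runoff_high_cn = scs_runoff(50.0, 90)
runoff_low_cn = scs_runoff(50.0, 70)
# A small event that stays below the initial abstraction for CN = 70.
runoff_small_event = scs_runoff(5.0, 70)
# A fully impervious surface (CN = 100) converts all precipitation to runoff.
runoff_impervious = scs_runoff(50.0, 100)
print(runoff_high_cn, runoff_low_cn, runoff_small_event, runoff_impervious)
```

A lower curve number raises both the retention and the event size needed to trigger runoff, which matches the interpretation that models underestimating the CN retain too much water at the surface or in the soil.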
One in four model runs accurately captures the fast response in the lag time (Figure 4) with passed by 25% of the model runs. Because very short lag times are expected, all deviations are overestimates. Most lag times averaged per model run are less than 5 hr, but exceptionally they exceed 100 hr. Average lag times per model run are shown in Figure S4 of Supporting Information S1.
Urban Water Balance Representation (UWBR) Score
Across all model runs, the mean UWBR score amounts to 3.3 out of the possible 7 (Figure 4). Although the overall pass rate across all indicators and models is 47%, pass rates vary strongly per indicator. Notably, 87% pass , while only 11% pass . Pass rates also differ among models from 28% to 72%. Only one model run passes all indicators, while 10 model runs have a score of 6 out of 7. Model 19 accounts for five of these 11 high-scoring runs. If a model closes the water balance , it generally scores better on both storage indicators. In contrast, models with a high passing percentage for one indicator do not systematically score better for the other indicator. Overall, the timing is captured better than its cumulative magnitude . A similar pattern is seen in the indicators with the timing captured slightly better than magnitude .
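Because the score is an unweighted count of passed indicators, the mean score and the overall pass rate are directly linked: with seven indicators, a 47% pass rate corresponds to a mean score of about 0.47 × 7 ≈ 3.3. The following sketch shows this bookkeeping for a few invented model runs; the run labels and pass/fail values are hypothetical.

```python
N_INDICATORS = 7

# Hypothetical pass (True) / fail (False) results per model run.
indicator_passes = {
    ("model_a", "site_x"): [True, True, False, True, False, False, True],
    ("model_b", "site_x"): [True, False, False, False, True, False, False],
    ("model_c", "site_y"): [True, True, True, True, True, True, False],
}


def uwbr_score(passes):
    """Unweighted score: the number of indicators a model run passes."""
    if len(passes) != N_INDICATORS:
        raise ValueError("expected one result per indicator")
    return sum(bool(p) for p in passes)


scores = [uwbr_score(p) for p in indicator_passes.values()]
mean_score = sum(scores) / len(scores)
overall_pass_rate = sum(scores) / (N_INDICATORS * len(scores))

# Pass rate per indicator across all runs (column-wise mean).
pass_rate_per_indicator = [
    sum(p[k] for p in indicator_passes.values()) / len(indicator_passes)
    for k in range(N_INDICATORS)
]
print(scores, mean_score, overall_pass_rate, pass_rate_per_indicator)
```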
Generally, pass rates per indicator show a dependence on the model (Figure 4). This dependence is not found for sites (Figure S5 in Supporting Information S1). There is no relation evident between UWBR score and model approach (e.g., built surface, soil hydrology, Table 2), but the model is more influential than the site on UWBR score. As the Lipson et al. (2024) classification (Table 2) was not developed with the water balance representation as its original goal, further work would be needed to identify which model attributes are key to a better UWBR score.
Linking the Water and Energy Balance
Surprisingly, models do not appear to capture any aspect of the latent heat flux more accurately if their UWBR score is higher. The UWBR score does not significantly correlate with better ranking on any of the four metrics evaluating the (half-)hourly modeled latent heat flux: the , , , and (p > 0.05, Wald test, Figure S6 in Supporting Information S1). These correlations remain absent if one of the indicators is omitted from the analysis. The lack of correlation may be the result of the low number (11) of runs with a UWBR score higher than 5 (Figure 4), which effectively reduces the UWBR score range. Given the lack of relations between the UWBR score and these metrics, the latent heat flux is not better captured in model runs that pass more indicators of a realistic water balance representation, thus refuting our hypothesis that the urban water balance skill positively impacts simulated energy fluxes.
Discussion and Conclusions
This study assesses the water balance representation in 19 ULSMs from the Urban-PLUMBER project. It appears the water balance is not closed (within 3%) in 57% of the model-site runs. The spread in water fluxes is considerable, being as wide as the absolute flux magnitude at all sites. For both and , the timing is captured better than the flux magnitude. Modeled explicit water storage dynamics (Equation 2) are inconsistent with the implicit water storage (Equation 1) in 44% of the models. Refuting our hypothesis, a better water balance representation does not result in more accurate latent heat fluxes. However, it is clear that the urban water balance is imperfectly incorporated into ULSMs and that more physically based representations are required.
Five models close the water balance at all sites (Models 6, 13, 15, 18, and 19), while three never reach closure (Models 1, 3, and 5). The other models close the water balance at some sites. For several non-closing models, we identify the causes. One model implicitly assumes an infinite source or sink of soil moisture: when the modeled soil moisture exceeds hard-coded limits, it adds or removes water to remain within these limits (Model 11). Two other models do not fully couple all processes, such as runoff and evaporation calculations occurring without water availability feedback between processes (Models 1 and 5). Such uncoupled processes may also explain inconsistent water storage dynamics. Three models pass their internal water balance closure check but do not provide the modeled groundwater flux in the model output (Models 8, 16, and 17). We call on the modeling community to include all fluxes required to diagnose water balance closure in the model output. Three models without a snow module disregarded all snowfall, creating a mismatch between real and modeled input (Models 2, 7, and 12). For one model, we suspect a very shallow soil layer causes large numerical errors resulting in an unclosed water balance (Model 4). Fortunately, model improvements should be able to eliminate these issues for most models.
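A closure check of this kind can be sketched from annual flux totals. The example below assumes closure when incoming and outgoing totals differ by at most 3% of the incoming total; the choice of denominator, the set of outgoing terms, and all names and values are assumptions for illustration. It also shows how omitting one flux from the output, such as groundwater drainage, makes an internally closed model appear unclosed.

```python
def closes_water_balance(precip, irrigation, evaporation, runoff,
                         drainage, storage_change, tolerance=0.03):
    """Return True if annual incoming and outgoing totals agree within tolerance.

    All arguments are annual totals in mm. The imbalance is expressed
    relative to the incoming total (an assumption of this sketch).
    """
    incoming = precip + irrigation
    outgoing = evaporation + runoff + drainage + storage_change
    if incoming <= 0:
        raise ValueError("incoming water must be positive")
    return abs(incoming - outgoing) / incoming <= tolerance


# Hypothetical annual totals (mm) for one model run.
totals = dict(precip=800.0, irrigation=50.0, evaporation=300.0,
              runoff=350.0, drainage=180.0, storage_change=15.0)

closed_with_all_fluxes = closes_water_balance(**totals)

# The same run when the drainage flux is missing from the output.
reported = dict(totals, drainage=0.0)
closed_without_drainage = closes_water_balance(**reported)
print(closed_with_all_fluxes, closed_without_drainage)
```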
We find evidence that the models would benefit from reevaluating their runoff parameterizations. The runoff volumes are poorly captured, resulting in having the poorest overall pass rate (Figure 4). Runoff has not been evaluated in previous ULSM comparisons and suffers here from a lack of direct observations and small areas being modeled . The lack of correlation between modeled cumulative surface runoff and the impervious fraction is worrying given the well-documented relation (Jacobson, 2011; Shuster et al., 2005). However, many models use relatively simple approaches, such as a constant fraction of rainfall that runs off independent of site characteristics, rainfall intensity, or soil moisture state. Others use poorly constrained parameters, such as how much water is routed between sub-grid tiles. Future work could help to constrain such parameters, while the simple approaches could be improved relatively straightforwardly.
Despite the lack of evidence showing a link between the UWBR score and latent heat flux performance, the incomplete representation of the water balance may contribute to the poor latent heat flux performance of the ULSMs. The design of the UWBR score may not be successful in revealing an existing link between the UWBR score and latent heat flux performance, as the UWBR score indicators assess the water balance based on physical realism and expectations derived from the literature. While a higher UWBR score indicates a more physically consistent water balance, it may still be an incorrect simulation. The opposite is also true, as, without physical constraints, machine learning approaches show good results for the latent heat flux (Vulova et al., 2021). Apart from that, a potential link between the water balance representation and the latent heat flux performance may be hidden by other elements affecting performance. These elements could be other components of the model (e.g., the energy balance representation) or human errors (e.g., erroneous parameters, assuming northern-hemisphere vegetation, and results reported in wrong units). Yet, we do find a poor performance for the latent heat flux, consistent with the literature showing it is among the most challenging fluxes to model (Grimmond et al., 2011; Lipson et al., 2024). As the energy and water balance are directly connected, we hypothesize that potential errors in the water balance are causing, and not being caused by, the poor performance of the latent heat flux, as the short runoff timescales in urban areas on a neighborhood scale dictate the water availability for the latent heat flux and not the other way around. Hence, good model performance for the latent and sensible heat flux cannot be achieved without properly representing both balances. Thus, we believe an improved representation of the water balance will improve the simulation of the latent heat flux and other energy fluxes.
This first systematic analysis of urban water balance modeling is an opportunistic study taking advantage of model outputs, model characterizations, and observations gathered for the Urban-PLUMBER project (Lipson et al., 2022b, 2024). The Urban-PLUMBER setup affects this study via (a) the diversity of model outputs linked to their range of modeling approaches, and (b) a lack of observations for all the water balance terms. Intentionally, a wide range of modeling approaches is analyzed with both default parameters and provided parameters implemented by modelers (Lipson et al., 2024), impacting the model results and performance. For example, numerical discretization of soil layers can cause a flawed, reduced moisture drydown linked to irregular soil layer depths that enhance evaporation (MacKay et al., 2022). Ongoing land surface model developments to capture and link more processes increase both their scope and complexity, but the number of differing aspects complicates a systematic analysis aiming to attribute performance to certain aspects (Blyth et al., 2021; Fisher & Koven, 2020). To minimize human error, Urban-PLUMBER allowed resubmission of model outputs after web-based and manual checks. As these checks did not address the water balance, we provided an additional basic analysis of the water balance results to catch other human errors with encouragement to resubmit updated outputs. Unfortunately, resubmission reduces but does not eliminate human errors. All differences other than the water balance representation hinder the attribution of the model performance to the water balance concept as they explain the large variety in model performance amongst models that capture the water balance equally accurately. Ideally, these differences would be eliminated by developing a multi-model framework in the future (Sadegh et al., 2019) and characterizing model types based on water balance approaches. Such a characterization could allow for teasing out more detailed strengths and weaknesses of water balance representations.
Lack of observations (e.g., runoff, soil moisture) prevents direct assessment for many water balance terms. These observations are challenging as both energy and water balance closure need to be considered, so observations need to cover a relatively large uniform area that also constrains the natural and anthropogenic water flows (Grimmond & Oke, 1986, 1991). A large uniform area is needed as eddy-covariance footprints vary continuously (Feigenwinter et al., 2012; Grimmond & Oke, 1991), while catchment boundaries are static. Hence, we develop a new alternative using quantitative indicators. Each indicator addresses a water balance process and checks whether it complies with physical limits, the model itself, or previous research. We refrain from weighting the indicators to minimize the score subjectivity and prevent one indicator from controlling the outcome. The systematic removal of one of the seven indicators allows us to confirm the UWBR score is not driven by one indicator.
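The robustness check described above, removing one indicator at a time, can be sketched as follows. The example recomputes the unweighted score with each indicator omitted and checks whether the ordering of model runs changes; the pass/fail values are invented, and comparing run orderings is only one possible way to judge whether a single indicator drives the score.

```python
# Hypothetical pass (True) / fail (False) results for three model runs.
passes = [
    [True, True, False, True, False, False, True],
    [True, False, False, False, True, False, False],
    [True, True, True, True, True, True, False],
]


def scores_without(passes, omitted):
    """Unweighted scores with one indicator (by index) left out; None keeps all."""
    return [sum(p for k, p in enumerate(run) if k != omitted) for run in passes]


def ordering(scores):
    """Run indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


full_order = ordering(scores_without(passes, omitted=None))
n_indicators = len(passes[0])
leave_one_out_orders = [
    ordering(scores_without(passes, omitted=k)) for k in range(n_indicators)
]

# The score is not driven by one indicator if the ordering never changes.
ordering_is_stable = all(order == full_order for order in leave_one_out_orders)
print(full_order, ordering_is_stable)
```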
Here, we show ULSMs produce a wide range of water balance results but often do not realistically represent important hydrological processes. Output reporting errors may cause part of the low performance. Although our results are for offline ULSMs, we expect the identified issues will persist in a coupled setting on any scale (e.g., with mesoscale and global models). ULSMs could be improved by ensuring they close the water balance and by updating runoff parameterizations. Ideally, future energy-water-carbon studies will gather both a wider range of observations and a wider range of modeled processes. This will aid improvement of model processes and their feedbacks. However, the complexity of the urban landscape (e.g., the different definitions of eddy covariance footprints and runoff catchments) will require nested model runs and observations to ensure that all are consistent. We recommend routine assessment of water balance closure during the ULSM development phase by applying the indicators of the UWBR score. In a broader context, both model evaluations and comparisons should extend beyond the target variables of the model to all processes that directly influence these variables. This will benefit the broader delivery of integrated urban services (WMO, 2019) and facilitate urban resilience across time scales.
Acknowledgments
We acknowledge the Urban-PLUMBER project team and all observation and modeling participants providing the data set for this research. We would like to thank Judith Boekee, Andrew Frost, and Valentina Marchionni for the fruitful discussions. We want to express our appreciation to the three anonymous reviewers who took the time and effort to review and help improve the manuscript. Harro Jongen acknowledges this research was supported by the WIMEK PhD Grant 2020. Mathew Lipson acknowledges support from the Australian Research Council (ARC) Centre of Excellence for Climate System Science (Grant CE110001028), National Computational Infrastructure (NCI) Australia and the Bureau of Meteorology, Australia. Gert-Jan Steeneveld acknowledges support from the Amsterdam Institute for Advanced Metropolitan Solutions (AMS Institute, project VIR16002) and the Netherlands Organization for Scientific Research (NWO, Project 864.14.007). Sue Grimmond acknowledges support from ERC Urbisphere (Grant 855055). Matthias Demuzere was supported by the ENLIGHT project, funded by the German Research Foundation (DFG) under Grant number 437467569. Ting Sun is supported by UKRI NERC Independent Research Fellowship (NE/P018637/2). Ruidong Li is supported by a CSC scholarship. Keith Oleson's contribution is based upon work supported by the NSF National Center for Atmospheric Research, which is a major facility sponsored by the U.S. National Science Foundation under Cooperative Agreement No. 1852977. Chenghao Wang acknowledges support from the National Science Foundation (NSF) under Grant numbers OIA-2327435 and CNS-2301858 and the National Oceanic and Atmospheric Administration (NOAA) under Grant number NA21OAR4590361.
Data Availability Statement
All observation data from this study are openly available at Zenodo via (Lipson et al., 2022a). Model results and benchmarks (Lipson & Best, 2022) for AU-Preston are archived at Zenodo. Model results for the other sites are visualized at and will be published together with Urban-PLUMBER Phase 2.
© 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Urban Land Surface Models (ULSMs) simulate energy and water exchanges between the urban surface and atmosphere. However, earlier systematic ULSM comparison projects assessed the energy balance but ignored the water balance, which is coupled to the energy balance. Here, we analyze the water balance representation in 19 ULSMs participating in the Urban‐PLUMBER project using results for 20 sites spread across a range of climates and urban form characteristics. As observations for most water fluxes are unavailable, we examine the water balance closure, flux timing, and magnitude with a score derived from seven indicators, expecting better scoring models to capture the latent heat flux more accurately. We find that the water budget is only closed in 57% of the model‐site combinations, assuming closure when annual total incoming fluxes (precipitation and irrigation) are within 3% of the outgoing (all other) fluxes. Results show the timing is better captured than magnitude. No ULSM has passed all water balance indicators for any site. Models passing more indicators do not capture the latent heat flux more accurately, refuting our hypothesis. While output reporting inconsistencies may have negatively affected model performance, our results indicate models could be improved by explicitly verifying water balance closure and revising runoff parameterizations. By expanding ULSM evaluation to the water balance and relating it to latent heat flux performance, we demonstrate the benefits of evaluating processes with direct feedback mechanisms to the processes of interest.
1 Hydrology and Environmental Hydraulics, Wageningen University, Wageningen, The Netherlands, Meteorology and Air Quality, Wageningen University, Wageningen, The Netherlands
2 Bureau of Meteorology, Canberra, ACT, Australia
3 Hydrology and Environmental Hydraulics, Wageningen University, Wageningen, The Netherlands
4 Department of Meteorology, University of Reading, Reading, UK
5 School of Earth and Environmental Sciences, Seoul National University, Seoul, South Korea
6 Met Office, Exeter, UK
7 Department of Geography, Urban Climatology Group, Ruhr‐University Bochum, Bochum, Germany, B‐Kode, Ghent, Belgium
8 Department of Meteorology and Climatology, Faculty of Geographical Sciences, University of Łódź, Łódź, Poland
9 School of Meteorology, University of Oklahoma, Norman, OK, USA
10 School of Biological Sciences, University of Bristol, Bristol, UK
11 Institute for Risk and Disaster Reduction, University College London, London, UK, Department of Hydraulic Engineering, Tsinghua University, Beijing, China
12 European Centre for Medium‐Range Weather Forecasts (ECMWF), Reading, UK
13 Department of Civil and Environmental Engineering, National University of Singapore, Singapore, Singapore, Future Cities Laboratory Global, Singapore‐ETH Centre, Singapore, Singapore
14 U.S. National Science Foundation National Center for Atmospheric Research (NSF NCAR), Boulder, CO, USA
15 School of Environmental Engineering, University of Seoul, Seoul, South Korea
16 Institute for Risk and Disaster Reduction, University College London, London, UK
17 Meteorology and Air Quality, Wageningen University, Wageningen, The Netherlands, European Centre for Medium‐Range Weather Forecasts (ECMWF), Bonn, Germany
18 Faculty of Geography/Research Computing Center, Lomonosov Moscow State University, Moscow, Russia
19 School of Meteorology, University of Oklahoma, Norman, OK, USA, Department of Geography and Environmental Sustainability, University of Oklahoma, Norman, OK, USA
20 School of Sustainable Engineering and the Built Environment, Arizona State University, Tempe, AZ, USA
21 Meteorology and Air Quality, Wageningen University, Wageningen, The Netherlands