Introduction

Next-day streamflow predictions at the outlets of river basins are important for understanding water availability for consumption, irrigation, energy generation, flood and drought risk, recreation, and more. Reliable streamflow estimates are particularly valuable in the Colorado River Basin (CRB), a basin that provides water to over 40 million people. The CRB has diverse river basins whose streamflow is driven by precipitation or snowpack and can be heavily altered by upstream human activities like reservoir operations, agricultural irrigation, water exports, and changes in land use.

Despite the importance of streamflow estimates in diverse regions such as the CRB, there is little research in the field of regional streamflow forecasting for locations altered by human activity. Recently, Long Short-Term Memory models (LSTMs) have become a popular method of forecasting streamflow due to their ability to retain important information and discard irrelevant information from input time sequences to make accurate predictions (Kratzert, Klotz, et al., 2019). These studies achieved state-of-the-art results, but often avoid studying river basins with high anthropogenic impacts, especially basins containing reservoirs.

Streamflow prediction in basins containing reservoirs is challenging because the outflow of reservoirs depends heavily on human decisions. Reservoirs are one of the most direct forms of human alteration to streamflow, but basins can also be altered indirectly when humans change landscapes through withdrawals for irrigation, deforestation, infrastructure construction, and more. In basins with anthropogenic impacts, it is difficult to develop a single model that addresses all of the possible human alterations and is also generalizable. Many studies using LSTMs also struggle to predict in ephemeral river basins that experience long periods of low flow due to high aridity (Kratzert et al., 2018a). Streamflow in ephemeral basins is driven by precipitation events, which can be difficult to model due to the high uncertainty in precipitation inputs.

A variety of studies have attempted to model streamflow under a range of anthropogenic impacts, and others have modeled streamflow in ephemeral basins. Some studies directly address reservoir alterations to streamflow by training an LSTM using reservoir inputs such as inflow, volume, and power to simulate reservoir release (Fan et al., 2023; García-Feal et al., 2022; Zhang et al., 2019; Özdogan Sarıkoç et al., 2023). These methods require reservoir inflow and volume information for the entire study region, which is not available in all regions of the globe. Additionally, these methods often include upstream streamflow data, which is similar to providing the previous-day streamflow as a model input; both are forms of data assimilation that achieve high accuracy but are detrimental to model prediction in ungauged basins. Others found that alterations to streamflow (droughts, floods, and reservoir levels) can be detected with satellite imagery data such as land cover type, radar altimetry water level, and irrigation data (Chen et al., 2020; Vu et al., 2022). The growing volume of satellite imagery makes geospatial data sets widely available, which can provide the means to model streamflow altered by human influence.

With the wide success of streamflow prediction using LSTMs, different frameworks for training regional models have emerged. One approach investigated the use of land cover characteristics both as static catchment attributes and as dynamic inputs over time, in addition to meteorological inputs (Althoff et al., 2021). This study concluded that the dynamic inputs increased the performance of the LSTM model compared to static attributes. A different approach trained regional LSTMs on pools of data organized by the size of the reservoirs (Ouyang et al., 2021). That study found that regional LSTMs perform well when trained on basins with small reservoirs $(\text{Degree}\,\text{of}\,\text{Regulation}< 0.1)$ , but streamflow in basins with large reservoirs $(\text{Degree}\,\text{of}\,\text{Regulation}\ge 0.1)$ is better represented in a separate regional LSTM. While these results suggest that grouping basins may be a viable approach for LSTM streamflow modeling, they do not provide strong guidelines on the optimal ways to classify basins.

An open question also remains about the use of auxiliary, relevant data sets that may improve LSTM performance. For example, can information about direct water withdrawals or land use result in improved streamflow predictions in human-altered river basins? To better understand the contributions of the auxiliary data as well as gain confidence in the model predictions, scientists apply feature importance methods. In the field of deep learning, interpreting how complex models make decisions is not straightforward. To measure the value of auxiliary data to an LSTM, methods are needed that measure each predictor's relative contribution to the prediction. Previous studies have presented some techniques for assessing feature importance such as integrated gradients (Kratzert, Herrnegger, et al., 2019), expected gradients (Jiang et al., 2022), and Shapley additive explanations (SHAP) (Fan et al., 2023). Each of these is an attribution method that measures the magnitude and direction of a predictor variable's influence on the prediction at each time step. SHAP requires more computational power than the gradient-based methods but does not require assumptions about a baseline input or about the model complexity.

Our research aims to fill the gap in streamflow modeling for human-altered basins, such as those with reservoirs or considerable land use alterations, by understanding how to make a model more accurate and exploring methods that incorporate open-source geospatial data sets to improve the accuracy of streamflow forecasts in the diverse CRB. This research proposes a framework exploring river basin classification scenarios to answer the question of how to use machine learning to model next-day streamflow in a hydrologically diverse region that is heavily altered by human activity. Specifically, we aim to answer the following research questions: (a) Are watersheds in the CRB too diverse to generalize in a single, regional LSTM model? (b) Which basins have poor performance and require a different modeling approach, and why? (c) Can satellite-derived land cover and hydroclimate data improve prediction accuracy in diverse watersheds in the CRB?

Methods

Study Area

The CRB is a 250,000 ${\text{mi}}^{2}$ watershed, extending over parts of Colorado, Wyoming, Utah, Arizona, Nevada, New Mexico, and California (Figure 1). The CRB originates in the headwaters of the Rocky Mountains and drains to the Gulf of California. This region is a vital source of water to 40 million people (including 29 recognized indigenous tribes (U.S. Bureau of Reclamation & Ten Tribes Partnership, 2018)), providing water for consumption, power, crop irrigation, and recreation. Water is also exported outside the basin, supplying water to cities including Los Angeles, San Diego, Denver, Salt Lake City, and Albuquerque, along with irrigation for approximately 5.5 million acres of agricultural land.

[IMAGE OMITTED. SEE PDF]

River watersheds, or basins, in the CRB can be characterized as either snow-dominated or precipitation-dominated (Solander et al., 2019; Talsma et al., 2022). The upper CRB, located in parts of Wyoming, Utah, and Colorado, has snow-dominated basins where the temperature is below freezing in the winter months of December to February (Livneh et al., 2015; Lukas & Payton, 2020). In snow-dominated basins, snow accumulates and melts in the spring and early summer (April–July). Basins in the lower CRB—spanning parts of Arizona, New Mexico, and Nevada—tend to contain ephemeral rivers for which streamflow is nearly absent during the majority of the year except following large-enough precipitation events. These systems are most dynamic during monsoonal rains that occur in the summer and early fall (June through September).

Human activities greatly alter the natural flow of water in the CRB. The most direct human alterations to CRB streamflow are in the form of operating reservoirs, water exports, and irrigation canals. Reservoirs in this region are built to store water for irrigation and municipal, commercial, and industrial needs, but also prevent and reduce flooding, temper drought events, generate electricity, promote populations of fish and wildlife, and provide recreation (Richter et al., 2024). Reservoir operation typically acts to reduce or increase downstream flow, leading to changes in the timing of peak flow, which also prevents extremely high or low streamflow events. Water exports and irrigation canals similarly reduce streamflow in some locations while increasing flow in other locations.

Data

Nearly all the data used for this study was accessed through Veins of the Earth (VotE), an AI-ready, river-centric data platform that provides functionality to obtain paired streamflow, meteorologic, and characteristics data for basins in an organized format (Schwenk et al., 2021). Veins of the Earth provides global watershed delineations at the highest resolution of ${1\,\text{km}}^{2}$ based on the MERIT Hydro (Yamazaki et al., 2019) suite of topographical data sets. Veins of the Earth fuses hydrology data onto a common, global river and watershed network, including dams and streamgage data dating from 1950 to the present. For this study, all streamgage data was provided by the USGS (USGS, 2016). In total, 1,185 USGS streamflow gages exist or have existed within the CRB. Veins of the Earth also contains pre-sampled, daily meteorological variables that temporally match streamgage observations; these are sampled from ECMWF's fifth generation atmospheric reanalysis land model (ERA5 Land Hourly) (Muñoz-Sabater et al., 2021). Meteorological variables were spatially aggregated by their mean values and temporally aggregated to daily statistics (mean, minimum, or maximum), accounting for time zone offsets between each gage and ERA5 Land Hourly's "clock" (Kratzert et al., 2023). In addition to watershed delineations at gages and daily streamflow and meteorological time series, VotE also provides an API called rabpro (Schwenk et al., 2022) that was used for sampling additional geospatial rasters hosted on Google Earth Engine (Gorelick et al., 2017) over watershed polygons.

This study uses daily streamflow normalized to mm/day by basin area as the response variable from gauge data, focused on the years of 2000–2020 to ensure the gauges have sufficient data for training. Basins containing reservoirs are filtered to ensure that the reservoir construction date occurs before the training period. After filtering for these criteria, 296 basins are selected for this study (see Figure 1).

The input variables used to predict streamflow via deep learning models are meteorological forcings, basin attributes, and Climate/Anthropogenic variables. The initial experiments are trained on the meteorological forcings and basin attributes. Then, our best experiment is trained with the same previous variables in addition to Climate/Anthropogenic data with the aim to improve predictions in basins altered by human activities.

Table 1 provides details about the inputs used to train LSTMs in this study along with their units, time scale, resolution, source, and availability. Variables that are at a monthly or yearly resolution are resampled to a daily scale. Some variables, such as WorldClim precipitation, do not vary from year to year because they are long-term averages for each calendar month (i.e., 12 values per basin). For these inputs, the monthly values are repeated annually and resampled to a daily scale.
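The resampling of a 12-value monthly climatology to a daily series can be sketched as follows. This is a minimal illustration rather than the study's code: the function name `monthly_climatology_to_daily` is hypothetical, and it assumes a step-wise fill in which every day simply takes the value of its calendar month, with no interpolation between months.

```python
from datetime import date, timedelta


def monthly_climatology_to_daily(monthly_values, start, end):
    """Repeat 12 monthly values over [start, end] at a daily step.

    Assumption: each day receives the value of its calendar month
    (step-wise fill, no interpolation).
    """
    if len(monthly_values) != 12:
        raise ValueError("expected one value per calendar month")
    daily = []
    day = start
    while day <= end:
        # Months are 1-based, list indices are 0-based.
        daily.append((day, monthly_values[day.month - 1]))
        day += timedelta(days=1)
    return daily


# Hypothetical climatology: the value equals the month number.
series = monthly_climatology_to_daily(
    list(range(1, 13)), date(2000, 1, 1), date(2001, 12, 31)
)
```

Because the same 12 values are reused every year, the resulting daily series carries seasonal but no interannual information.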

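Table 1. Inputs Used for Training LSTMs in This Study

Type	Source	Variable	Units	Frequency	Resolution	Start	End
Meteorological forcing	ERA5 Land (Muñoz-Sabater et al., 2021)	Total precipitation	kg m⁻² s⁻¹	Daily	~31 km	1940	Present
Meteorological forcing	ERA5 Land (Muñoz-Sabater et al., 2021)	Average surface net solar radiation	J m⁻²	Daily	~31 km	1940	Present
Meteorological forcing	ERA5 Land (Muñoz-Sabater et al., 2021)	Average temperature at 2 m	K	Daily	~31 km	1940	Present
Meteorological forcing	ERA5 Land (Muñoz-Sabater et al., 2021)	Max dew point temperature at 2 m	K	Daily	~31 km	1940	Present
Meteorological forcing	ERA5 Land (Muñoz-Sabater et al., 2021)	Average surface pressure	Pa	Daily	~31 km	1940	Present
Meteorological forcing	ERA5 Land (Muñoz-Sabater et al., 2021)	Average snow depth water equivalent (SWE)	m	Daily	~31 km	1940	Present
Meteorological forcing	ERA5 Land (Muñoz-Sabater et al., 2021)	Total potential evaporation	kg m⁻² s⁻¹	Daily	~31 km	1940	Present
Meteorological forcing	ERA5 Land (Muñoz-Sabater et al., 2021)	Average volumetric soil water layer	m³ m⁻³	Daily	~31 km	1940	Present
Index	N/A	Day of year	days	Daily, repeats annually	N/A	N/A	N/A
Climate/anthropogenic (C/A)	Level 1 Land Cover Classes (LULC) (Brown et al., 2020)	Developed cover	%	Yearly	30 m	1985	2021
Climate/anthropogenic (C/A)	Level 1 Land Cover Classes (LULC) (Brown et al., 2020)	Cropland cover	%	Yearly	30 m	1985	2021
Climate/anthropogenic (C/A)	Level 1 Land Cover Classes (LULC) (Brown et al., 2020)	Grass and shrub cover	%	Yearly	30 m	1985	2021
Climate/anthropogenic (C/A)	Level 1 Land Cover Classes (LULC) (Brown et al., 2020)	Tree cover	%	Yearly	30 m	1985	2021
Climate/anthropogenic (C/A)	Level 1 Land Cover Classes (LULC) (Brown et al., 2020)	Water cover	%	Yearly	30 m	1985	2021
Climate/anthropogenic (C/A)	Level 1 Land Cover Classes (LULC) (Brown et al., 2020)	Wetland cover	%	Yearly	30 m	1985	2021
Climate/anthropogenic (C/A)	Level 1 Land Cover Classes (LULC) (Brown et al., 2020)	Ice and snow cover	%	Yearly	30 m	1985	2021
Climate/anthropogenic (C/A)	Level 1 Land Cover Classes (LULC) (Brown et al., 2020)	Barren cover	%	Yearly	30 m	1985	2021
Climate/anthropogenic (C/A)	Monthly Water Recurrence (Pekel et al., 2016)	Water recurrence	%	Monthly, repeats annually	30 m	1984	2021
Climate/anthropogenic (C/A)	Rangeland Production (Agro) (Matthew O. Jones, 2020)	Annual forbs and grasses cover	%	Yearly	30 m	1984	Present
Climate/anthropogenic (C/A)	Rangeland Production (Agro) (Matthew O. Jones, 2020)	Perennial forbs and grasses cover	%	Yearly	30 m	1984	Present
Climate/anthropogenic (C/A)	Rangeland Production (Agro) (Matthew O. Jones, 2020)	Shrubs cover	%	Yearly	30 m	1984	Present
Climate/anthropogenic (C/A)	Rangeland Production (Agro) (Matthew O. Jones, 2020)	Trees cover	%	Yearly	30 m	1984	Present
Climate/anthropogenic (C/A)	Rangeland Production (Agro) (Matthew O. Jones, 2020)	Bare ground cover	%	Yearly	30 m	1984	Present
Climate/anthropogenic (C/A)	Global Soil-Water Balance (Trabucco and Zomer, 2019b)	Actual evapotranspiration	mm	Monthly, repeats annually	~920 m	1950	2000
Climate/anthropogenic (C/A)	Global-PET (Trabucco and Zomer, 2019a)	Potential evapotranspiration	mm	Monthly, repeats annually	~1 km	1970	2000
Climate/anthropogenic (C/A)	WorldClim (Hijmans et al., 2005)	Precipitation	mm	Monthly, repeats annually	~1 km	1950	2000
Climate/anthropogenic (C/A)	WorldClim and Global-PET	Climate moisture index	N/A	Monthly, repeats annually	~1 km	1950	2000
Climate/anthropogenic (C/A)	EarthStat (Ramankutty and Foley, 1999)	Potential natural vegetation extent	%	Monthly, repeats annually	~9 km	1700	1992
Climate/anthropogenic (C/A)	Random value	Random float number between 0.0 and 1.0	N/A	Daily	N/A	N/A	N/A
Basin attribute	MERIT-Hydro (Yamazaki et al., 2019)	Average elevation	m	N/A	~90 km	1987	2017
Basin attribute	EarthEnv-DEM90 (Robinson et al., 2014)	Average slope	degrees	N/A	90 m	N/A	N/A
Basin attribute	VotE (Schwenk et al., 2022)	Drainage area	km²	N/A	~90 km	1987	2017
Basin attribute	GLC2000 (Bartholomé and Belward, 2005)	Forest cover extent	%	N/A	1 km	2000	2000
Basin attribute	SoilGrids1km (Tomislav Hengl and Gonzalez, 2014)	Average depth to bedrock	cm	N/A	1 km	N/A	N/A
Basin attribute	SoilGrids1km (Tomislav Hengl and Gonzalez, 2014)	Proportion of sand particles (>0.05 mm) in the fine earth fraction at 5 cm depth	%	N/A	1 km	N/A	N/A
Basin attribute	SoilGrids1km (Tomislav Hengl and Gonzalez, 2014)	Proportion of silt particles (≥0.002 mm and ≤0.05 mm) in the fine earth fraction at 5 cm depth	%	N/A	1 km	N/A	N/A
Basin attribute	SoilGrids1km (Tomislav Hengl and Gonzalez, 2014)	Proportion of clay particles (<0.002 mm) in the fine earth fraction at 5 cm depth	%	N/A	1 km	N/A	N/A
Basin attribute	SoilGrids1km (Tomislav Hengl and Gonzalez, 2014)	Soil organic carbon content in the fine earth fraction at 5 cm depth	%	N/A	1 km	N/A	N/A
Basin attribute	ERA5 Land (Muñoz-Sabater et al., 2021)	Soil moisture index	N/A	N/A	~31 km	1940	Present
Basin attribute	ERA5 Land (Muñoz-Sabater et al., 2021)	Average daily precipitation	kg m⁻² s⁻¹	N/A	~31 km	1940	Present
Basin attribute	ERA5 Land (Muñoz-Sabater et al., 2021)	Average daily potential evapotranspiration	kg m⁻² s⁻¹	N/A	~31 km	1940	Present
Basin attribute	ERA5 Land (Muñoz-Sabater et al., 2021)	Average potential evapotranspiration over average precipitation ratio	%	N/A	~31 km	1940	Present
Basin attribute	ERA5 Land (Muñoz-Sabater et al., 2021)	Fraction of precipitation days with temperatures below 0 °C	%	N/A	~31 km	1940	Present
Basin attribute	ERA5 Land (Muñoz-Sabater et al., 2021)	Number of days with precipitation ≥ 5 × average daily precipitation	days	N/A	~31 km	1940	Present
Basin attribute	ERA5 Land (Muñoz-Sabater et al., 2021)	Number of consecutive days with precipitation ≥ 5 × average daily precipitation	days	N/A	~31 km	1940	Present
Basin attribute	ERA5 Land (Muñoz-Sabater et al., 2021)	Number of days with precipitation <1 mm/day	days	N/A	~31 km	1940	Present
Basin attribute	ERA5 Land (Muñoz-Sabater et al., 2021)	Number of consecutive days with precipitation <1 mm/day	days	N/A	~31 km	1940	Present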

Meteorological forcings are aggregated as spatial means across each watershed and to the daily timestep derived from ERA5 Land Hourly data (Muñoz-Sabater et al., 2021). Following the success of Kratzert et al. (2018b) and considering common practices in streamflow prediction using deep learning, we selected the following meteorological variables: total precipitation, total potential evaporation, average surface net solar radiation, average temperature at 2 m, maximum dewpoint temperature at 2 m, average surface pressure, average snow-water equivalent (SWE), and average volumetric soil water layer. Basin-aggregated variables like precipitation, evaporation, and SWE are closely related to the amount of water exiting the basin in the form of streamflow. Variables such as temperature and soil moisture help to understand how the water is stored in the basin and when the water leaves. For example, temperature helps the model to identify when the basin conditions are warm enough for snow to start melting into streamflow. Soil moisture, in turn, can affect the speed of water passing through the system and can indicate drought or flooding.

In addition to time-varying predictors, we also included static basin attributes that capture basin characteristics that do not vary, or vary slowly, in time. These variables were used in a previous study and include total basin drainage area $\left({\text{km}}^{2}\right)$ , average basin elevation (m), soil moisture index, proportion of sand/silt/clay particles, average precipitation $\left({\text{kgm}}^{-2}{\mathrm{s}}^{-1}\right)$ , number/duration of days with low precipitation, and number/duration of days with high precipitation (Kratzert et al., 2018b).

The final type of inputs in this study are variables related to Climate/Anthropogenic (C/A) impacts derived from geospatial data sets. We refer to these variables as "C/A data". The C/A data includes land cover type and rangeland production cover type geospatial data sets. These variables are included to capture gradual shifts in streamflow as, for example, irrigation changes, infrastructure is built, forests are cleared, and cropland increases. Monthly water recurrence is included as both a climate and an anthropogenic proxy. This measure of water pixels within a basin can reflect seasonal changes in the storage of reservoirs and lakes. We also included some climate variables such as evapotranspiration, precipitation, and vegetation extent data sets. These have a coarser temporal resolution than the ERA5 inputs; however, their spatial resolution is finer and could assist a model in predicting streamflow in an environment with heavy human alterations. Finally, a random variable is included in the C/A data as a baseline measure of how the model treats variables with no relationship to streamflow.

Experiments

This study proposes different methods to group basins, which we refer to as "experiments." The experiments allow us to investigate the relationships between basin diversity, additional data and its utility, and model performance across a wide range of basins in the CRB. We initially theorized that C/A inputs would affect streamflow dynamically because of their temporal variability, making them well suited for direct input. In contrast, characteristics like water balance and the number of non-zero flow days reflect fundamental, structural differences in hydrological behavior across basins. We hypothesized that these latter features could be better utilized as grouping criteria, as they capture the baseline flow and storage conditions unique to certain basin types (e.g., ephemeral vs. reservoir-altered basins), which may vary in response to broad environmental conditions rather than in direct relationship to daily or seasonal variability. Our experiments imply that a single, regional model using all available data is unable to capture the diverse hydrologic dynamics found across all basins within the CRB (see Section 3.1). Specifically, we hypothesize that the relationship between streamflow and our model inputs in basins that are largely unaffected by human activities differs from the relationship in basins where human activities have significantly modified the hydrological system. If that is the case, the experiments organize the basins so that separate models are able to learn different functions for basins with varying impacts. Figure 2 visualizes the five experiments used in this study.

[IMAGE OMITTED. SEE PDF]

Each subgroup within an experiment is modeled with a single, regional LSTM model that is trained and tuned to the optimal parameters (details found in Section 2.5). Each experiment (except for experiment A) separates the 296 study basins into three subgroups, resulting in subgroups of approximately 100 basins, although actual subgroup sizes are determined by quantitative analyses. We decided to separate the study basins into a total of three subgroups per experiment following the methodology in the Ouyang et al. (2021) study, which separated basins into zero-degree-of-regulation (dor), small-dor, and large-dor basins. Kratzert et al. (2024) found that training a regional LSTM on a large number of basins improves Nash-Sutcliffe Efficiency (NSE) scores significantly (increasing the training size from approximately 11 basins to 53 basins increased the median NSEs from 0.75 to 0.77), but at some point, the NSEs increase marginally (with 106 basins the median NSE increased to 0.78). The experiments and their partitions are described in more detail in the following sections.

Experiment A: All Basins

Experiment A consists of a single subgroup containing all of the study basins. Experiment A was constructed to resemble the traditional approach to regional LSTM modeling and serves as a benchmark regional model against the partitioning experiments.

Experiment B: Random Assignment

Basins in experiment B are randomly sorted into one of three subgroups using the function "numpy.random.randint" from the Python NumPy package. This experiment is repeated 10 times, with the results per subgroup aggregated to a single experiment for analysis. Similar to experiment A, experiment B does not consider any basin characteristics when sorting into subgroups. This is another benchmark experiment to show the baseline results when training on smaller subsets of basins.
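A random three-way assignment of this kind could look like the sketch below. The helper `random_subgroups`, the basin identifiers, and the use of one seed per repeat are illustrative assumptions; the sketch calls `randint` on a seeded `numpy.random.RandomState`, which draws from the same legacy generator as `numpy.random.randint`.

```python
import numpy as np


def random_subgroups(basin_ids, n_groups=3, seed=0):
    """Assign each basin to one of n_groups subgroups uniformly at random."""
    rng = np.random.RandomState(seed)  # seeded so a repeat can be reproduced
    # randint draws integers from [0, n_groups); the upper bound is exclusive.
    labels = rng.randint(0, n_groups, size=len(basin_ids))
    groups = {g: [] for g in range(n_groups)}
    for basin_id, label in zip(basin_ids, labels):
        groups[int(label)].append(basin_id)
    return groups


# Hypothetical basin identifiers; ten repeats, each with its own seed.
basin_ids = [f"basin_{i:03d}" for i in range(296)]
repeats = [random_subgroups(basin_ids, seed=s) for s in range(10)]
```

Independent draws per basin do not guarantee equally sized subgroups, so group sizes will vary somewhat between repeats.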

Experiment C: Model Performance

Experiment C subgroups are constructed based on experiment A's model performance. After the regional model is trained in experiment A, its per-basin NSE scores are used to divide the basins into subgroups for experiment C.

Low: $\text{NSE}< 0.0$
Medium: $0.0\le \text{NSE}\le 0.5$
High: $\text{NSE} > 0.5$

This experiment hypothesizes that basins that are difficult to model with the traditional regional LSTM are hydrologically similar and vice versa. In particular, we expect ephemeral basins and basins with reservoirs to have lower model predictability and naturally cluster into the same subgroups.

Experiment D: Water Balance

Experiment D partitions basins based on the ratio of water going into and leaving the basin by computing the water balance. The water balance describes whether a basin loses water with respect to the amount of precipitation $(P)$ entering and total potential evapotranspiration $(PE)$ exiting. First, the ratio between the basin's average daily gauged streamflow $(Q)$ and ERA5 precipitation, both in mm/day, is computed to measure the fraction of precipitation that is converted to streamflow. If potential evapotranspiration were the only loss, this ratio would satisfy $\frac{Q}{P}=1-\frac{PE}{P}.$

Then, the water balance $(WB)$ for a basin is computed as the departure from this relationship, $WB=\frac{Q}{P}-\left(1-\frac{PE}{P}\right),$ where $\frac{Q}{P}$ is the runoff coefficient and $\frac{PE}{P}$ is the aridity index, so $WB$ is a unitless hydrological signature. A basin with a negative water balance experiences losses, while a basin with a positive water balance increases storage. The magnitude of the water balance is interpreted to be the magnitude of water lost or gained (Salwey et al., 2023).

Low WB: $\text{WB}< -2000$
Medium WB: $-2000\le \text{WB}\le -100$
High WB: $\text{WB} > -100$

The thresholds were determined to achieve a balance between group size and hydrological distinction. The chosen values provided a reasonable separation of basins based on their water balance amounts. A previous study found that this measure of water balance can distinguish reservoir impacts from natural variability as reservoirs often cause losses in a basin's water balance (Salwey et al., 2023). Our hypothesis is that these WB subgroups will separate basins influenced by reservoirs from basins with fewer anthropogenic impacts.
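The water balance signature and the three subgroups can be sketched as follows. The function names are hypothetical, the inputs are assumed to be long-term basin means in the same units (e.g., mm/day), and the Medium group is assumed to include both threshold values.

```python
def water_balance(q_mean, p_mean, pe_mean):
    """WB = Q/P - (1 - PE/P), with Q, P, and PE in the same units."""
    if p_mean == 0:
        raise ValueError("mean precipitation must be nonzero")
    runoff_coefficient = q_mean / p_mean
    aridity_index = pe_mean / p_mean
    return runoff_coefficient - (1.0 - aridity_index)


def wb_subgroup(wb, low=-2000.0, high=-100.0):
    """Map a water balance value to an experiment D subgroup."""
    if wb < low:
        return "Low WB"
    if wb <= high:  # assumption: both thresholds belong to the Medium group
        return "Medium WB"
    return "High WB"
```

For example, `water_balance(1.0, 2.0, 1.0)` evaluates to 0.0, the case in which streamflow and potential evapotranspiration together account for all precipitation.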

Experiment E: Natural/Ephemeral/Reservoir

For experiment E, we partitioned basins into reservoir-altered, ephemeral, and natural subgroups. This categorization was made by grouping basins based on (1) the number of nonzero flow days and (2) whether a reservoir is present, as defined below.

Reservoir: The basin contains at least one major reservoir.
Ephemeral: The basin does not contain a reservoir. The number of nonzero flow days is less than 40% of the training period.
Natural: The basin does not contain any reservoirs. The number of nonzero flow days is greater than or equal to 40% of the training period.

Previous studies have discovered that reservoir-altered and ephemeral basins are difficult to model (Feng et al., 2020; Kratzert et al., 2018a), so we hypothesize that training models on these subgroups will allow the models to learn more about basins in each subgroup.
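The three experiment E definitions amount to a short decision rule, sketched below. The function name and arguments are hypothetical, and reservoir presence is reduced to a single boolean flag, which is an assumption about how the reservoir inventory is encoded.

```python
def classify_basin(has_major_reservoir, nonzero_flow_days, training_days,
                   threshold=0.40):
    """Return the experiment E subgroup for one basin."""
    if training_days <= 0:
        raise ValueError("training_days must be positive")
    # Reservoir presence takes priority over the flow-day criterion.
    if has_major_reservoir:
        return "Reservoir"
    nonzero_fraction = nonzero_flow_days / training_days
    # Fewer than 40% nonzero flow days marks the basin as ephemeral.
    return "Ephemeral" if nonzero_fraction < threshold else "Natural"
```

A reservoir-free basin with exactly 40% nonzero flow days falls in the Natural subgroup, matching the "greater than or equal to" wording of the definition.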

Long Short-Term Memory Model

A Recurrent Neural Network (RNN) is a type of neural network with a feedback loop that enables the network to learn relationships from sequences of inputs. An RNN cell contains weights and biases that are updated through backpropagation. The RNN is a network of connected RNN cells that, in total, equal the length of the input sequence. Each time step of the input sequence is fed through an RNN cell, which transforms the inputs through linear combinations and activation functions controlling how information is retained. The LSTM is a variation of the RNN with an additional series of gates to discard and retain information throughout the input sequence. These gates provide the ability to remove irrelevant information or update the memory with new information. Equations 1 through 6 define an LSTM as:

(1) ${f}_{t}={\sigma }_{g}\left({W}_{f}\times {x}_{t}+{U}_{f}\times {h}_{t-1}+{b}_{f}\right)$
(2) ${i}_{t}={\sigma }_{g}\left({W}_{i}\times {x}_{t}+{U}_{i}\times {h}_{t-1}+{b}_{i}\right)$
(3) ${o}_{t}={\sigma }_{g}\left({W}_{o}\times {x}_{t}+{U}_{o}\times {h}_{t-1}+{b}_{o}\right)$
(4) ${c}_{t}^{\prime }={\sigma }_{c}\left({W}_{c}\times {x}_{t}+{U}_{c}\times {h}_{t-1}+{b}_{c}\right)$
(5) ${c}_{t}={f}_{t}\cdot {c}_{t-1}+{i}_{t}\cdot {c}_{t}^{\prime }$
(6) ${h}_{t}={o}_{t}\cdot {\sigma }_{c}\left({c}_{t}\right),$

where ${f}_{t}$ represents the forget gate, ${i}_{t}$ is the input gate, ${o}_{t}$ is the output gate, ${c}_{t}^{\prime }$ is the candidate cell state, ${c}_{t}$ is the cell state or long term memory, and ${h}_{t}$ is the hidden state or short term memory (Gers et al., 2000; Hochreiter & Schmidhuber, 1997). These equations express the computations for time step $t$ and are repeated for every time step of the input sequence. The weight and bias matrices $W$ , $U$ , and $b$ remain the same for each time step. In these equations, ${\sigma }_{g}$ and ${\sigma }_{c}$ stand for the sigmoid and tanh functions respectively.
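The six LSTM equations translate almost line for line into NumPy, as in the sketch below. It is meant only to make the data flow concrete and is not the NeuralHydrology implementation; the dictionary layout for `W`, `U`, and `b` (keyed by gate name) and the zero initial states are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """Apply the six LSTM equations for a single time step."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])     # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])     # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])     # output gate
    c_cand = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate state
    c_t = f_t * c_prev + i_t * c_cand                          # cell state
    h_t = o_t * np.tanh(c_t)                                   # hidden state
    return h_t, c_t


def run_lstm(xs, W, U, b, hidden_size):
    """Run the cell over a sequence, reusing the same weights at every step."""
    h = np.zeros(hidden_size)  # assumption: zero initial hidden state
    c = np.zeros(hidden_size)  # assumption: zero initial cell state
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, U, b)
    return h, c
```

In a streamflow model, the final hidden state would typically be passed through a further output layer to produce the prediction; that layer is omitted here.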

Figure 3 displays the input/target pairs for a regional model trained on $X$ basins. The input data for each basin is divided into sequences with a length of $y$ days. While training the model, all of the input sequences for Basin 1 are fed into the LSTM, then the inputs for Basin 2 are fed into the LSTM. The aim of the LSTM is to interpret the basin characteristics along with the basin-aggregated time-varying meteorological forcings to predict the next-day streamflow value as close to the observed value as possible. The regional LSTMs in this study are trained over the training period on all of the basins within a subgroup (defined in Section 2.3). These subgroups were constructed to be balanced across all three subgroups within an experiment and each contains a minimum of 48 basins. Training a regional LSTM with a smaller number of basins prevents the model from being able to generalize well to ungauged basins. It is noted that previous-day streamflow is not an input into the LSTMs. Including previous-day streamflow can significantly improve the accuracy of next-day streamflow predictions, in a method similar to performing data assimilation (Feng et al., 2020). However, excluding previous-day streamflow can improve the model's ability to make accurate predictions further into the future (Sabzipour et al., 2023).

[IMAGE OMITTED. SEE PDF]

Model Training

This study uses the NeuralHydrology Python package, which provides methods to train, test, and compare the results of machine learning models common to hydrology research (Kratzert et al., 2022).

In machine learning, data is commonly separated into training, validation, and test data sets to train a model, determine optimal hyperparameters, and assess the model performance. For this study, the training period was 2000–2009, validation was 2009–2010, and testing was 2010–2020. NeuralHydrology normalizes the input and target data by removing the mean and scaling the values to range between zero and one. Normalization encourages the neural network to converge faster during gradient descent by enforcing a common range of input values.

A hyperparameter grid search is used to determine the optimal model parameters for each LSTM (details found in Appendix A). After finding the optimal parameters, the model performance on the test data is computed to understand how well the LSTMs generalize to unseen data.

Model Evaluation

The accuracy of the models was evaluated over the test period using NSE. Over the test period, the NSE is computed to be: $\text{NSE}=1-\frac{{\sum }_{t=1}^{T}{\left({Q}_{o}^{t}-{Q}_{m}^{t}\right)}^{2}}{{\sum }_{t=1}^{T}{\left({Q}_{o}^{t}-{\bar{Q}}_{o}\right)}^{2}}$ where ${Q}_{o}^{t}$ is the observed streamflow at time $t$ , ${Q}_{m}^{t}$ is the simulated streamflow at time $t$ , and ${\bar{Q}}_{o}$ is the mean observed streamflow over the $T$ time steps. NSE values range from negative infinity to one, where 1 indicates perfect streamflow prediction and a value of 0 or less indicates that the model does not perform better than using the average of the time series as the prediction. The squaring of differences in the NSE means that larger errors are penalized more than smaller ones; therefore, NSE is a useful metric to assess how well the model simulations predict peak streamflow.
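As a concrete reading of the NSE formula, the following sketch computes it for two equal-length series. The function name `nse` and the handling of a constant observed series (raising an error) are illustrative choices, not part of the study's tooling.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency of simulated against observed streamflow."""
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Numerator: squared model errors; denominator: variance about the mean.
    squared_error = sum((o - m) ** 2 for o, m in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - squared_error / variance
```

Predicting the observed mean at every time step makes the numerator equal the denominator, which gives an NSE of exactly 0; this is the benchmark referred to in the text.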

To compare two models, the NSE difference was computed for individual basins by subtracting the NSE of one model from the NSE of the other. Negative NSE values are clipped to zero before computing the NSE differences. The NSE difference values are classified as follows.

NSE Difference ${\le}$ −0.1 $\Rightarrow$ Significantly Worse
−0.1 ${< }$ NSE Difference ${\le}$ −0.05 $\Rightarrow$ Slightly Worse
−0.05 ${< }$ NSE Difference ${\le}$ 0.05 $\Rightarrow$ No Change
0.05 ${< }$ NSE Difference ${\le}$ 0.1 $\Rightarrow$ Slightly Better
NSE Difference ${ >}$ 0.1 $\Rightarrow$ Significantly Better

Basins that we considered to have improved are basins in the Slightly Better or Significantly Better categories with an NSE difference ${ >}$ 0.05. These threshold values are based on our experience and understanding of the typical range of NSE values. By considering the histogram of NSE differences as well as the visual results of the predicted and observed time series for each NSE difference, we determined that these thresholds provide a clear and interpretable way to categorize the NSE differences and facilitate comparisons between the models' performance across different basins and subgroups.
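The clipping and classification steps can be sketched as below. Two details are assumptions: the difference is taken as the new model minus the reference model, and a difference of exactly 0.1 is placed in the Slightly Better class.

```python
def nse_difference(nse_new, nse_reference):
    """Difference between two NSE scores after clipping negatives to zero."""
    return max(nse_new, 0.0) - max(nse_reference, 0.0)


def classify_difference(diff):
    """Map an NSE difference to one of the five comparison categories."""
    if diff <= -0.1:
        return "Significantly Worse"
    if diff <= -0.05:
        return "Slightly Worse"
    if diff <= 0.05:
        return "No Change"
    if diff <= 0.1:  # assumption: exactly 0.1 counts as Slightly Better
        return "Slightly Better"
    return "Significantly Better"


def improved(diff):
    """A basin is considered improved when the difference exceeds 0.05."""
    return diff > 0.05
```

Clipping keeps a basin that moves from one strongly negative NSE to another from registering as a large change, since both scores are treated as zero.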

Feature Importance

SHAP (SHapley Additive exPlanations) is an algorithm that was developed using game theory to measure the importance of features in ML models (Lundberg & Lee, 2017). SHAP is an additive feature attribution method, which uses an interpretable explanation model as an approximation of the original model. This is defined as $g({z}^{\prime })={\phi }_{0}+\sum\limits _{i=1}^{M}{\phi }_{i}{z}_{i}^{\prime },$ where ${z}^{\prime }\in {\left\{0,1\right\}}^{M}$ , ${\phi }_{i}\in \mathbb{R}$ , and $M$ is the number of input features. The explanation model, $g\left({z}^{\prime }\right)$ , is a linear function of binary variables with ${\phi }_{i}$ representing the Shapley value for the ${i}^{\text{th}}$ variable of the function. Original inputs are simplified through a mapping function $x={h}_{x}\left({x}^{\prime }\right)$ with the SHAP method aiming to approximate $g\left({z}^{\prime }\right)\approx f\left({h}_{x}\left({z}^{\prime }\right)\right)$ , where $f$ is the original model.

Figure 4 shows how the SHAP values were computed and aggregated for the feature importance analysis. For a 100-day period (e.g., June 2012 to October 2012), the SHAP values are computed for the time-varying input samples, so that there is a SHAP value for each predictor at each time step in an input sequence. These values are aggregated by computing the absolute sum over each sample sequence and tracking the index of the maximum SHAP value in each sequence. This removes the sliding window dimension and leaves a sequence of SHAP values for each input variable and a sequence of the corresponding time steps of highest SHAP value. Finally, these values are aggregated by taking the average of the previously computed sums and the average of the time steps for each basin.
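The aggregation can be sketched as follows, using only the Python standard library. This is a hedged illustration, not the study's code: it assumes the SHAP values for one basin are stored as a nested list indexed as [sample][time step][feature], that "absolute sum" means the sum of absolute SHAP values over the sequence, and that the maximum is taken over absolute values. The function name and the toy numbers are hypothetical.

```python
def aggregate_shap(shap_values):
    """Aggregate SHAP values indexed as [sample][time_step][feature].

    Returns two lists with one entry per feature:
    the mean (over samples) of the summed absolute SHAP values, and
    the mean (over samples) of the time step with the largest absolute value.
    """
    n_samples = len(shap_values)
    n_steps = len(shap_values[0])
    n_features = len(shap_values[0][0])

    mean_abs_sum = []
    mean_peak_step = []
    for f in range(n_features):
        abs_sums = []
        peak_steps = []
        for sample in shap_values:
            magnitudes = [abs(sample[t][f]) for t in range(n_steps)]
            abs_sums.append(sum(magnitudes))  # collapses the sequence dimension
            peak_steps.append(magnitudes.index(max(magnitudes)))
        mean_abs_sum.append(sum(abs_sums) / n_samples)
        mean_peak_step.append(sum(peak_steps) / n_samples)
    return mean_abs_sum, mean_peak_step


# Toy input: 2 samples, 3 time steps, 2 features (values are made up).
toy = [
    [[0.1, -0.2], [0.4, 0.0], [-0.1, 0.3]],
    [[0.0, 0.5], [-0.3, 0.1], [0.2, -0.1]],
]
impact, peak_step = aggregate_shap(toy)
print(impact, peak_step)
```

In the study the sample dimension would correspond to the 100 daily input sequences, so each basin ends up with one impact score and one average time step per predictor.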

[IMAGE OMITTED. SEE PDF]

Results

Regional LSTM Modeling Under Different Experiments

An LSTM was trained and tested on each subgroup and NSE values were computed for the 296 study basins in all five experiments. In Figure 5, the boxplots illustrate the median NSE value (middle horizontal lines) and the upper and lower bounds of the colored box illustrate the ${75}^{th}$ and ${25}^{th}$ percentiles. The overall median NSEs were computed for each experiment by pooling together the NSEs from each subgroup. For experiments A through E, the overall median NSEs were similar (median NSE = 0.46, 0.44, 0.48, 0.48, 0.51, respectively). According to a Mann-Whitney U test, the differences in median NSEs are not statistically significant at the 0.05 significance level. However, Figure 5 reveals the different ranges of NSE values across the individual subgroups.

[IMAGE OMITTED. SEE PDF]

Figure 5a examines the NSE distributions at the subgroup level, with each color representing a different experiment. Out of experiments B through E, B's subgroups, shown in light blue, were the least variable (median NSEs = 0.40, 0.56, 0.40, and IQRs = 0.78, 0.56, 0.68 moving from the leftmost to the rightmost subgroup) and C's subgroups, shown in teal, were the most variable (median NSEs = 0.0, 0.29, 0.81, and IQRs = 0.0, 0.39, 0.18). Experiments D and E, shown in green and gold, had similar amounts of variability (D: median NSEs = 0.33, 0.77, 0.02, and IQRs = 0.72, 0.26, 0.39; E: median NSEs = 0.55, 0.06, 0.72, and IQRs = 0.65, 0.16, 0.30). The three subgroups with the lowest median NSEs were the Low NSE (experiment C), High WB (experiment D), and Ephemeral (experiment E) subgroups. The three subgroups with the highest median NSEs were the High NSE (experiment C), Medium WB (experiment D), and Natural (experiment E) subgroups. See Figures S1 and S2 in Supporting Information S1 for a comparison using RMSE and KGE as the performance metrics.

Figure 5b plots the median subgroup NSE values from Figure 5a as a function of the average normalized standard deviation of streamflow, where higher values indicate more variable streamflow. The correlation between the normalized standard deviation and the NSE was −0.75. Basins in subgroups with higher variability (Low NSE, Medium NSE, Low WB, High WB, and Ephemeral subgroups) tended to have lower model performance (median NSE below 0.5). The subgroups with the lowest variability (High NSE, Medium WB, Reservoir, and Natural subgroups) had the highest median NSE values (above 0.5). Among the lowest median NSE subgroups (Low NSE, High WB, and Ephemeral subgroups), 4 basins were found in all three of the subgroups. Many of the High WB basins were classified as either Reservoir or Ephemeral in experiment E (57 Reservoir, 13 Ephemeral, 6 Natural), and the same was true for the Low WB basins (61 Reservoir, 34 Ephemeral, 23 Natural), while the Medium WB subgroup had few basins in the Ephemeral subgroup (81 Reservoir, 1 Ephemeral, 20 Natural). In the top three median NSE subgroups, 20 basins were found in all three of the subgroups.

Experiment E Versus Experiment A

We compared the results of experiment E with A to determine whether our partitioning strategy outperformed the traditional regional LSTM approach in the CRB. The NSE difference between model performance in experiment E and experiment A (experiment E NSE - experiment A NSE) was computed and classified according to the definitions in Section 2.6.

Figure 6 Panel (A) displays an example of a Reservoir basin that improved in experiment E (yellow line) compared to experiment A (purple line), and Panel (B) maps the NSE differences between the two experiments. The observed streamflow, in gray, had high annual peak flows ${ >}$ 400 mm/day. The NSEs reflect how the peak flow predictions of experiment E (NSE: 0.80) were closer to the observed peak streamflow values than those of experiment A (NSE: 0.58). See Figure S5 in Supporting Information S1 for more examples.

[IMAGE OMITTED. SEE PDF]

Figure 6 Panel (B) shows that, for all three subgroups in experiment E, No Change was the most common category relative to model performance in experiment A. In the Reservoir subgroup, 31 $\%$ of the basins performed Slightly Better (27 basins) or Significantly Better (35 basins) in experiment E. On the other hand, 17 $\%$ of the basins in the Reservoir subgroup performed worse (10 Slightly Worse and 24 Significantly Worse). In the Ephemeral subgroup, 25 $\%$ of the basins performed Slightly Better (4 basins) or Significantly Better (8 basins), while 17 $\%$ of the basins performed Slightly Worse (4 basins) or Significantly Worse (4 basins). For the Natural subgroup, 24 $\%$ of the basins performed Slightly Better (9 basins) or Significantly Better (3 basins) and 29 $\%$ performed Slightly Worse (5 basins) or Significantly Worse (9 basins).

In summary, the partitioning approach in experiment E, which categorized basins based on reservoir presence and the number of nonzero flow days, led to improved streamflow predictions compared to the traditional regional LSTM model (experiment A). The largest improvements were observed in the Reservoir subgroup, highlighting the benefits of separating basins with significant human alterations for targeted modeling efforts.

Impacts of Climate and Anthropogenic Predictors

In addition to studying different partitioning experiments to improve modeling, alternative data sets were explored to account for climate and anthropogenic influences that might alter streamflow in the CRB. The LSTMs from experiment E were trained with and without the C/A variables discussed in Section 2.2. The NSE difference (experiment E with C/A NSE - experiment E NSE) was computed for each study basin and displayed in Figure 7.

[IMAGE OMITTED. SEE PDF]

The averages of the positive NSE differences were 0.11, 0.12, and 0.07 for the Reservoir, Ephemeral, and Natural subgroups, respectively. The averages of the negative NSE differences were −0.18, −0.12, and −0.10 for the same respective subgroups. Most of the basins had little change with the C/A data compared to without the C/A data (98 Reservoir, 29 Ephemeral, and 27 Natural basins were classified as No Change). In the Reservoir subgroup, 36/199 basins were Slightly Better or Significantly Better, compared with 9/48 basins in the Ephemeral subgroup and 9/49 basins in the Natural subgroup. The basins that improved with C/A data were located throughout the CRB, but a majority (36 out of the 54 basins with an NSE difference above 0.05) were located in the lower CRB region (south of $37.3{}^{\circ}$ N).

Figure 8 displays an example of a Reservoir basin that improved with C/A data. The original experiment E model predictions, in gold, overestimated peak flows at the beginning of 2013, 2016, 2017, 2019, and 2020. After adding C/A data to train the LSTM, in purple, the predictions were closer to the observed annual peak flow, which is reflected in the higher NSE of 0.60 compared to 0.44. See Figure S6 in Supporting Information S1 for more examples.

[IMAGE OMITTED. SEE PDF]

Further analysis was conducted to interpret how much impact the C/A variables have on the LSTM outputs and at which time step in the input sequence that impact occurs. SHAP values were computed for the basins with high NSE differences in each subgroup. Due to computational limitations, only the five basins with the highest NSE differences in each subgroup were selected. Figure 9 displays the averaged absolute SHAP impact scores for the time-varying inputs, colored by the time step where the maximum SHAP value occurs. A random number vector was included while training the LSTMs to serve as a baseline for the SHAP model interpretations. Variables whose impact was equal to or lower than that of the random variable were not considered important to the LSTM prediction.
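The random-variable baseline amounts to a simple filter on the averaged SHAP impact scores. The sketch below is illustrative only: the function name, the key used for the random input, and the scores are all hypothetical.

```python
def important_features(mean_abs_shap, baseline_name="random"):
    """Return the features whose impact exceeds the random baseline, highest first."""
    baseline = mean_abs_shap[baseline_name]
    kept = {
        name: value
        for name, value in mean_abs_shap.items()
        if name != baseline_name and value > baseline  # equal impact is dropped
    }
    return sorted(kept, key=kept.get, reverse=True)


# Made-up impact scores for four inputs, including the random baseline.
scores = {
    "precipitation": 0.42,
    "temperature": 0.15,
    "tree_cover": 0.03,
    "random": 0.05,
}
print(important_features(scores))
```

Because the random input carries no information about streamflow, any predictor that does not exceed its score is indistinguishable from noise in this analysis.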

[IMAGE OMITTED. SEE PDF]

Common variables that were important to the model predictions across all three subgroups were: precipitation, evaporation, solar radiation, temperature, and dewpoint temperature. The model outputs were impacted by ERA5 precipitation from the previous day and solar radiation from the previous day (Reservoir and Natural subgroups) or the previous month (Ephemeral subgroup). The remaining variables were important to the model outputs at the time step 3 months prior. The important C/A variables in the Reservoir subgroup were Agro Percent Tree Cover, Agro Percent Shrub Cover, and HydroAtlas Monthly Precipitation. In the Ephemeral subgroup, the Agro Percent Perennial Forbs and Grasses Cover from 100 days and HydroAtlas Precipitation from 200 days before the current time step had some influence over the model output. In the Natural subgroup, monthly water recurrence from a month before had an impact on the model output similar to the level of impact of ERA5 evaporation. The C/A variables were important in some basins, but their impact was localized to specific catchments.

We observed varying levels of spread in feature importance (see the error bars of Figure 9 and Figures S7–S9 in Supporting Information S1). The reservoir basins exhibited the widest spread in SHAP values across input variables, likely due to the diverse range of influences these basins encounter. Reservoir basins are located all across the CRB, whereas the majority of natural and ephemeral basins are in the upper and lower CRB, respectively. Additionally, the reservoirs have different primary purposes that could drive the release decisions of reservoir operators. This variety confirms that reservoir basins present a challenge for model generalization. Natural basins, while also showing a broad range of SHAP values, had a narrower spread compared to reservoir basins. This may be attributed to their dependency on more stable hydrological processes, particularly in snowmelt-dominated regions where variables such as solar radiation, temperature, SWE, and evaporation have significant impact. Ephemeral basins displayed the lowest spread in SHAP values, indicating a reliance on fewer predictors, primarily ERA5 precipitation. This aligns with the limited water flow and arid conditions typically characterizing ephemeral basins, where precipitation strongly influences model accuracy.

Precipitation was one of the top three model predictors in the averaged SHAP results (Figure 9). Figure 10 compares the impact of ERA5 Precipitation on the model output with the NSEs across all 15 study basins used for the SHAP analysis. The Ephemeral subgroup results are shown in pink circles, the Natural subgroup in blue, and the Reservoir subgroup in yellow. Even though ERA5 Precipitation was the most important feature in the Ephemeral subgroup, and one of its few important features, this variable had low SHAP values compared to the Reservoir and Natural subgroups, which corresponds with the low NSE values in Figure 10. For the selected basins in each subgroup, as precipitation becomes more important, NSE improves.

[IMAGE OMITTED. SEE PDF]

Comparison to the Traditional Regional LSTM Approach

Figure 11 displays the results of our best model, experiment E with C/A data, against the traditional regional model, experiment A, with only meteorological forcings.

[IMAGE OMITTED. SEE PDF]

In Figure 11a, the CDFs for the overall set of basins and for the Reservoir subgroup differed little between experiment A and experiment E with C/A. The Ephemeral subgroup had higher accuracy in experiment E with C/A than in experiment A, with NSE values ranging from 0 to 0.68 compared to experiment A values ranging from 0 to 0.57. Most of the values occurred between 0.0 and 0.2 for both experiments. However, the ${75}^{th}$ percentile was 0.27 in experiment E with C/A and 0.19 in experiment A. The NSE values in the Natural subgroup were also higher in experiment E with C/A compared to experiment A. The ${75}^{th}$ percentile was 0.86 in experiment E with C/A compared to 0.83 in experiment A.

Figure 11b visualizes which basins had better NSE values in experiment E with C/A (blue marker) or in experiment A (red marker). In each subgroup, more than half of the basins improved (107/199 in the Reservoir subgroup, 28/48 in the Ephemeral subgroup, 28/49 in the Natural subgroup).

There was a high number of Significantly Better basins with C/A data in the Ephemeral subgroup (8 basins were classified as Significantly Better, 2 basins were Significantly Worse, and 25 basins had No Change). In the Natural subgroup, many basins had slight positive changes with C/A data (15 basins were Slightly Better, 10 basins were Slightly Worse, and 19 had No Change).

Figure 12 compares experiment A (benchmark experiment) with and without C/A data with experiment E (Reservoir/Natural/Ephemeral subgroups) with and without C/A data to determine whether clustering basins or including auxiliary geospatial data improves model performance the most. In experiment A, there are 78 basins (gray markers) that perform best with the benchmark model. There are 86 basins (green markers) that have the best performance under experiment E's grouping method. Adding auxiliary geospatial data improves model results in 66 basins (red markers) for experiment A with C/A data and 53 basins (blue markers) for experiment E with C/A data.
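A comparison of this kind can be sketched by picking, for each basin, the configuration with the highest NSE and counting the winners. The snippet below is a hedged illustration: the function name, basin identifiers, configuration labels, and NSE values are hypothetical, and ties are resolved in favor of the configuration listed first, which is an assumption.

```python
from collections import Counter


def count_best_configurations(nse_by_basin):
    """Count how often each configuration gives the highest NSE.

    nse_by_basin maps a basin id to a dict of {configuration name: NSE}.
    """
    winners = Counter()
    for scores in nse_by_basin.values():
        best = max(scores, key=scores.get)  # first listed wins on a tie
        winners[best] += 1
    return winners


# Made-up NSE values for three basins under the four configurations.
example = {
    "basin_1": {"A": 0.60, "A + C/A": 0.58, "E": 0.71, "E + C/A": 0.69},
    "basin_2": {"A": 0.45, "A + C/A": 0.52, "E": 0.40, "E + C/A": 0.50},
    "basin_3": {"A": 0.10, "A + C/A": 0.08, "E": 0.30, "E + C/A": 0.12},
}
print(count_best_configurations(example))
```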

[IMAGE OMITTED. SEE PDF]

Retraining the regional model using our grouping variables as static inputs did not significantly change the regional model's performance (see Figures S3 and S4 in Supporting Information S1).

Discussion

Previous work has shown that modeling streamflow is challenging in the hydrologically diverse CRB. Our partitioning experiments (C-E) aimed to reduce this hydrological diversity by grouping basins with similar characteristics. The study demonstrates that partitioning can enhance the ability of LSTM models to capture the underlying relationships and improve streamflow predictions in some cases. However, the degree of improvement varies depending on the specific characteristics of each basin and the complexity of the hydrological processes involved. We also tested the potential for auxiliary human impacts and climate information to improve model performances.

The major science questions addressed by this research are: (a) Are watersheds in the CRB too diverse to generalize in a single, regional model? (b) Which basins have poor performance and require a different modeling approach and why? (c) Can satellite-derived land cover and hydroclimate data improve prediction accuracy in diverse watersheds in the CRB? We then conclude with some caveats and next steps based on our findings.

1. Are watersheds in the CRB too diverse to generalize in a single, regional model?

Recent work with LSTM streamflow modeling has trended toward a one-model approach, where hydrologically diverse data are recommended to train a generalized model (Feng et al., 2020; Georgy Ayzel & Zhuravlev, 2020; James et al., 2013; Nearing et al., 2021), particularly in regions relatively unimpacted by humans (e.g., CAMELS or Caravan data sets (Newman et al., 2015)). Our work demonstrates that the traditional approach, using all study basins to train a single LSTM, performed well in snowmelt-dominated basins in the CRB, but may not be sufficient in ephemeral basins or basins with major dams (Cooley et al., 2021; Ouyang et al., 2021). Compared to our other experiments that reduced hydrological diversity, the single regional LSTM had high hydrological diversity and a lower median NSE, suggesting that basins in the CRB were too diverse to generalize in a single model.

2. Which basins have poor performance and require a different modeling approach and why?

The study identified basins with poor performance and the need for a different modeling approach through several key findings and analyses.

Consistent with previous studies, a single regional LSTM (experiment A) had high accuracy in snow-dominated basins and poor accuracy in ephemeral basins and basins altered by reservoirs (Cooley et al., 2021; Feng et al., 2020). The results from the experiment that grouped basins into Reservoir, Ephemeral, or Natural subgroups (experiment E) convey that ephemeral basins were more difficult to predict than basins that contained a reservoir. This was likely due to the fact that some small reservoirs do not significantly alter downstream flow. In the experiment that grouped basins based on their water balance (experiment D), we noticed that basins that contained a reservoir were distributed throughout the subgroups in the water balance experiment. In particular, basins in the medium water balance subgroup had high predictability, even basins that contained a reservoir, which suggests that these reservoirs had low impacts on the natural streamflow. This supports a previous study which determined that water balance could be a useful indicator of the level of impact of a reservoir (Ouyang et al., 2021). In general, basins that had water entering the system and leaving too quickly or too slowly were more difficult to predict compared to basins that did not experience these extreme conditions.

Understanding the predictability of the basins is important for learning whether grouping basins could improve LSTM predictions in low-performance basins. The results in experiment C (based on model predictability from the single regional LSTM) revealed that training an LSTM on the poorly performing basins provided no clear benefit. In other words, the single regional LSTM in experiment A was not degraded by the presence of Low NSE basins. On the other hand, we observed that grouping basins into Reservoir, Ephemeral, or Natural subgroups (experiment E) achieved the highest overall median NSE compared to the other experiments. The standard deviation of streamflow in these subgroups suggests that the improved performance came from training a separate LSTM on ephemeral basins, which had the highest within-subgroup variability. This is consistent with previous findings that ephemeral basins are difficult to predict, and that LSTMs have higher performance when difficult basins are removed from the training set (Feng et al., 2020; Ouyang et al., 2021).

3. Can satellite-derived land cover and hydroclimate data improve prediction accuracy in diverse watersheds in the CRB?

Auxiliary geospatial data sets are becoming increasingly available and have the potential to improve streamflow forecasts in a diverse set of basins. Basins in the CRB are impacted by anthropogenic processes such as irrigation and climate differences such as aridity. Previous studies have found that auxiliary land cover data improve streamflow estimates in basins altered by human activities (Chen et al., 2020; Vu et al., 2022) and that additional climate attributes improve LSTM predictions for some basins with low flow conditions (Althoff et al., 2021). In our study, the experiments with additional climate and anthropogenic variables yielded marginal performance improvements in some basins and just as many performance decreases in other basins, suggesting that the study basins as a whole do not benefit from the auxiliary information selected for this study. Our SHAP analysis complemented these findings.

Based on the feature importance analysis, the Reservoir basins relied on the widest range of predictors, followed by the Natural basins, and the Ephemeral basins relied on the fewest predictors. For Natural basins, the climate/anthropogenic variable that slightly mattered was water recurrence. Among the meteorological forcings, SWE and evaporation were also important, likely due to the snowmelt-dominated basins in the upper CRB. These variables for Natural basins are expected based on physical knowledge of hydrological processes and previous studies (Solander et al., 2019). In the Reservoir basins, the range of important variables (evaporation, precipitation, and vegetation cover) may reflect the struggle of the LSTM to generalize across a wide range of reservoir impacts (sizes, operation strategies). Ephemeral basins relied the most on ERA5 precipitation, followed by the long-term Hydro precipitation signal, which was not surprising for basins in arid regions (Feng et al., 2020). It also implies that auxiliary data may not be that useful in Ephemeral basins and that the quality of the input precipitation is more important in these basins than in other subgroups. ERA5 precipitation was a common driver behind model outputs in all three subgroups, and a comparison across the subgroups showed a positive relationship between the relative importance of precipitation and NSE. Despite being the most important variable in the Ephemeral subgroup, the relative importance was low compared to Reservoir and Natural basins, which also hints at the ability of an LSTM to detect the relationship between streamflow and precipitation in Ephemeral basins.

In general, the climate and anthropogenic data do not significantly increase model performance compared to a classification approach (Figure 12). A possible explanation is that the climate/anthropogenic variables have uncertainties large enough to limit their usefulness. Alternatively, this could be attributed to the LSTM learning biases from the climate/anthropogenic data, which could prevent basins from improving or could even decrease model performance. For example, land cover data sets have been shown to have reliable forest and water classifications, but uncertain classification in other categories including the pasture category (Valle et al., 2023). In the CRB, this bias could cause an LSTM to wrongly identify signals from a class that is not truly present in a basin. Additionally, different data sets have different land cover definitions, so more consideration could be needed to determine whether the land type definitions are meaningful to hydrological predictions (Congalton et al., 2014).

While our partitioning approach was overall more successful than a single regional model, there were several limitations associated with this technique. The Reservoir/Ephemeral/Natural partitions did not distinguish between a large reservoir with strong downstream alterations and a small reservoir with few alterations to the natural flow. Since experiment D (groupings based on water balance) was successful in determining the level of impact of reservoirs, a recommendation would be to separate the ephemeral basins into the Ephemeral category and to use water balance to separate the remaining basins into Reservoir or Natural categories. This would allow basins containing low-impact reservoirs to be trained alongside basins with no reservoirs that often have high predictability.

Consistent with previous studies, we observed that many ephemeral basins had poor performance. Additional climate/anthropogenic data did little to improve results in these basins. In addition, our SHAP analysis revealed that precipitation was the main predictor used for model outputs in ephemeral basins despite having a low relative model importance signal. A suggestion for future work would be to try to incorporate hourly precipitation to see if that improves model accuracy (Hu et al., 2018) or to consider another model besides the LSTM such as transformers, which have stronger time series forecasting abilities in some settings (Li et al., 2024).

Overall, we determined that partitioning the data into subgroups had a more positive impact on model performance than adding auxiliary data. It is possible that having a globally available resource to simulate large reservoirs would be valuable for future streamflow modeling in the CRB. Additionally, improving the quality of precipitation data would likely improve model predictions in ephemeral basins, which relied primarily on that variable to make predictions. Finally, this approach is applicable to the diverse basins in the CRB; however, additional studies would need to be conducted to determine how well it generalizes to other regions. We found that reservoir presence and the number of nonzero flow days in a basin mattered to modeling streamflow in the CRB, but it is possible that other metrics (e.g., land use type) matter to basins in another study region.

Conclusion

This study investigated the feasibility of using a single regional LSTM model to predict streamflow in the hydrologically diverse CRB. Our results demonstrate that, while a regional LSTM can achieve reasonable performance in some basins, particularly snowmelt-dominated ones, it struggles to generalize across the entire region due to the wide range of hydrological characteristics and human impacts.

The partitioning approach applied in this work identified ephemeral basins and basins altered by reservoirs as particularly challenging to predict using a single model. These basins often exhibit distinct hydrological behaviors that deviate from the general patterns captured by the regional model. Our findings highlight the importance of considering basin-specific characteristics when developing streamflow models, especially in regions with diverse hydrological conditions.

While auxiliary geospatial data sets, such as satellite-derived land cover and hydroclimate data, have the potential to improve model performance in some cases, their effectiveness varies across different basin types. In particular, ephemeral basins may not benefit significantly from these additional variables, suggesting that other factors, such as the quality of precipitation data, may be more critical for accurate predictions in these regions.

This study provides valuable insights into the challenges and opportunities associated with using LSTM models for streamflow prediction in the CRB. Our findings suggest that a more tailored approach, potentially involving multiple models or hybrid methods, may be necessary to achieve accurate and reliable predictions in this hydrologically diverse region. Future research should consider addressing basin-specific characteristics and human impacts within a regional model. Additionally, further investigation into the role of auxiliary data and the potential benefits of combining multiple data sources is warranted.

Appendix

Appendix A - Hyperparameter Grid Search

LSTMs have hyperparameters that can be tuned to achieve optimal model performance. Tuning the hyperparameters helps adjust the model complexity to better fit to the training data while also avoiding overfitting. For this study, two hyperparameters were tuned for each model in all of the experiments with the following range of values.

Hidden Size [32, 121, 256]
Sequence Length [10, 90, 365].

A hyperparameter grid search is performed by training an LSTM with all possible combinations of the parameters listed and using the validation Nash-Sutcliffe Efficiencies (see Section 2.6) to determine the optimal parameters. Having a hidden size that is too large will cause the model to overfit to the training data and perform worse on the test data, while having a hidden size that is too small will prevent the model from having enough complexity to understand the relationship between the input sequences and the target value. Other important hyperparameters to tune would be the learning rate, the dropout rate, the optimizer, the number of epochs, and the batch size. However, performing this hyperparameter grid search for each subgroup within each experiment involves training 27 models for each experiment with 3 subgroups, so this study only explores the two parameters.
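The size of the search can be checked with a short sketch that enumerates the grid. This is a minimal illustration under stated assumptions: the `grid_search` function and the scoring callable are hypothetical stand-ins for training an LSTM and evaluating its validation NSE, and the dummy scorer exists only so the example runs.

```python
from itertools import product

HIDDEN_SIZES = [32, 121, 256]
SEQUENCE_LENGTHS = [10, 90, 365]


def grid_search(validation_nse, hidden_sizes=HIDDEN_SIZES,
                sequence_lengths=SEQUENCE_LENGTHS):
    """Return the (hidden_size, sequence_length) pair with the highest score.

    validation_nse is a callable standing in for "train a model with these
    settings and return its validation NSE".
    """
    grid = list(product(hidden_sizes, sequence_lengths))
    return max(grid, key=lambda combo: validation_nse(*combo))


# Number of models: 3 hidden sizes x 3 sequence lengths per subgroup.
n_combinations = len(list(product(HIDDEN_SIZES, SEQUENCE_LENGTHS)))
models_per_experiment = n_combinations * 3  # three subgroups per experiment


def dummy_score(hidden_size, sequence_length):
    """Made-up score that peaks at hidden size 121 and sequence length 90."""
    return -abs(hidden_size - 121) - abs(sequence_length - 90)


print(n_combinations, models_per_experiment, grid_search(dummy_score))
```

Adding a third tuned hyperparameter with three candidate values would triple the count to 81 models per experiment, which illustrates why the search was limited to two parameters.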

Appendix B - Model Architectures

A set of hyperparameters was tested on each subgroup and the validation set was used to determine the optimal parameters. These parameters are as follows.

Experiment A - Sequence Length: 90; Hidden Size: 32
Experiment B: Random Group 1 - Sequence Length: 365; Hidden Size: 256
Experiment B: Random Group 2 - Sequence Length: 90; Hidden Size: 256
Experiment B: Random Group 3 - Sequence Length: 10; Hidden Size: 32
Experiment C: Low NSE - Sequence Length: 10; Hidden Size: 121
Experiment C: Medium NSE - Sequence Length: 90; Hidden Size: 121
Experiment C: High NSE - Sequence Length: 365; Hidden Size: 121
Experiment D: Low Water Balance - Sequence Length: 10; Hidden Size: 121
Experiment D: Medium Water Balance - Sequence Length: 365; Hidden Size: 32
Experiment D: High Water Balance - Sequence Length: 10; Hidden Size: 121
Experiment E: Reservoir - Sequence Length: 90; Hidden Size: 256
Experiment E: Ephemeral - Sequence Length: 90; Hidden Size: 256
Experiment E: Natural - Sequence Length: 90; Hidden Size: 32

These LSTMs all consisted of a single LSTM layer. All of the final subgroup LSTMs were trained for 50 epochs, with the learning rate set to 0.001, 0.0005, and 0.0001 beginning at epochs 0, 20, and 25, respectively. Additionally, the regression head output layer had a dropout rate of 0.40.
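The stepwise learning rate can be expressed as a small lookup. The sketch below is illustrative and assumes zero-based epoch indices, with each rate taking effect at the start of the listed epoch; the dictionary and function names are hypothetical.

```python
# Epoch at which each learning rate takes effect (assumed zero-based).
LEARNING_RATE_SCHEDULE = {0: 0.001, 20: 0.0005, 25: 0.0001}
N_EPOCHS = 50


def learning_rate_at(epoch, schedule=LEARNING_RATE_SCHEDULE):
    """Return the learning rate in effect at the given epoch."""
    if epoch < min(schedule):
        raise ValueError("epoch precedes the first scheduled learning rate")
    start = max(e for e in schedule if e <= epoch)  # most recent change point
    return schedule[start]


rates = [learning_rate_at(epoch) for epoch in range(N_EPOCHS)]
print(rates[0], rates[20], rates[25], rates[-1])
```

Under these assumptions, 20 epochs are trained at 0.001, 5 epochs at 0.0005, and the remaining 25 epochs at 0.0001.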

Acronyms

Agro
Rangeland Production annual percent of vegetation cover type

C/A
Climate/Anthropogenic

CDF
Empirical Cumulative Distribution Function

CRB
Colorado River Basin

ECMWF
European Centre for Medium-Range Weather Forecasts

Hydro
HydroATLAS monthly data

IQR
Interquartile Range

LSTM
Long Short-Term Memory

LUL
Level 1 Land Cover Classes annual percent of land cover type

NSE
Nash-Sutcliffe Efficiency

SWE
Snow Water Equivalent

WB
Water Balance

Acknowledgments

This research was supported by funding from the Center for Earth and Space Sciences at Los Alamos National Laboratory through the Laboratory Directed Research and Development program under Grant 20240477CR-SES and 20210213ER. We thank Claire Bachand and Lauren Thomas for comments on an early draft, and two anonymous reviewers whose comments were helpful in revising and adjusting our manuscript's messages.

Data Availability Statement

This study used the following data.

USGS Water Data (USGS, 2016)
National Inventory of Dams (US Army Corps of Engineers, 2018)
ERA5 Land (Muñoz-Sabater et al., 2021)
Level 1 Land Cover Classes (Brown et al., 2020)
Monthly Water Recurrence (Pekel et al., 2016)
Rangeland Production (Matthew O. Jones, 2020)
Global Soil-Water Balance (Trabucco & Zomer, 2019b)
Global-PET (Trabucco & Zomer, 2019a)
WorldClim (Hijmans et al., 2005)
EarthStat (Ramankutty & Foley, 1999)
MERIT-Hydro (Yamazaki et al., 2019)
EarthEnv-DEM90 (Robinson et al., 2014)
VotE (Schwenk et al., 2022)
GLC2000 (Bartholomé & Belward, 2005)
SoilGrids1km (Tomislav Hengl & Gonzalez, 2014)

The model input data and Python code used to generate the results and figures of this study are available in a GitHub repository (smaebius, 2024) and use the NeuralHydrology software package (Kratzert et al., 2022).

References

Althoff, D., Rodrigues, L. N., & da Silva, D. D. (2021). Addressing hydrological modeling in watersheds under land cover change with deep learning. Advances in Water Resources, 154, 103965. https://doi.org/10.1016/j.advwatres.2021.103965

Bartholomé, E., & Belward, A. S. (2005). Glc2000: A new approach to global land cover mapping from earth observation data [Dataset]. International Journal of Remote Sensing, 26(9), 1959–1977. https://doi.org/10.1080/01431160412331291297

Brown, J. F., Tollerud, H. J., Barber, C. P., Zhou, Q., Dwyer, J. L., Vogelmann, J. E., et al. (2020). Lessons learned implementing an operational continuous United States national land change monitoring capability: The Land Change Monitoring, Assessment, and Projection (LCMAP) approach [Dataset]. Remote Sensing of Environment, 238, 111356. https://doi.org/10.1016/j.rse.2019.111356

Chen, Y., Niu, J., Sun, Y., Liu, Q., Li, S., Li, P., et al. (2020). Study on streamflow response to land use change over the upper reaches of Zhanghe Reservoir in the Yangtze River basin. Geoscience Letters, 7(1), 6. https://doi.org/10.1186/s40562-020-00155-7

Congalton, R. G., Gu, J., Yadav, K., Thenkabail, P., & Ozdogan, M. (2014). Global land cover mapping: A review and uncertainty analysis. Remote Sensing, 6(12), 12070–12093. https://doi.org/10.3390/rs61212070

Cooley, S. W., Ryan, J. C., & Smith, L. C. (2021). Human alteration of global surface water storage variability. Nature, 591(7848), 78–81. https://doi.org/10.1038/s41586-021-03262-3

Fan, M., Zhang, L., Liu, S., Yang, T., & Lu, D. (2023). Investigation of hydrometeorological influences on reservoir releases using explainable machine learning methods. Frontiers in Water, 5. https://doi.org/10.3389/frwa.2023.1112970

Feng, D., Fang, K., & Shen, C. (2020). Enhancing streamflow forecast and extracting insights using long‐short term memory networks with data integration at continental scales. Water Resources Research, 56(9), e2019WR026793. https://doi.org/10.1029/2019WR026793

García-Feal, O., González-Cao, J., Fernández-Nóvoa, D., Astray Dopazo, G., & Gómez-Gesteira, M. (2022). Comparison of machine learning techniques for reservoir outflow forecasting. Natural Hazards and Earth System Sciences, 22(12), 3859–3874. https://doi.org/10.5194/nhess-22-3859-2022

Ayzel, G., Kurochkina, L., Kazakov, E., & Zhuravlev, S. (2020). Streamflow prediction in ungauged basins: Benchmarking the efficiency of deep learning. E3S Web of Conferences, 163, 01001. https://doi.org/10.1051/e3sconf/202016301001

Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451–2471. https://doi.org/10.1162/089976600300015015

Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18–27. https://doi.org/10.1016/j.rse.2017.06.031

Hijmans, R. J., Cameron, S. E., Parra, J. L., Jones, P. G., & Jarvis, A. (2005). Very high resolution interpolated climate surfaces for global land areas [Dataset]. International Journal of Climatology, 25(15), 1965–1978. https://doi.org/10.1002/joc.1276

Hochreiter, S., & Schmidhuber, J. (1997). Long short‐term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

Hu, C., Wu, Q., Li, H., Jian, S., Li, N., & Lou, Z. (2018). Deep learning with a long short‐term memory networks approach for rainfall‐runoff simulation. Water, 10(11), 1543. https://doi.org/10.3390/w10111543

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With applications in R. Springer. Retrieved from https://faculty.marshall.usc.edu/gareth-james/ISL/

Jiang, S., Zheng, Y., Wang, C., & Babovic, V. (2022). Uncovering flooding mechanisms across the contiguous United States through interpretive deep learning on representative catchments. Water Resources Research, 58(1), e2021WR030185. https://doi.org/10.1029/2021WR030185

Kratzert, F., Gauch, M., Klotz, D., & Nearing, G. (2024). HESS Opinions: Never train an LSTM on a single basin. Hydrology and Earth System Sciences Discussions, 2024, 1–19. https://doi.org/10.5194/hess-2023-275

Kratzert, F., Gauch, M., Nearing, G., & Klotz, D. (2022). Neuralhydrology — A python library for deep learning research in hydrology [Software]. Journal of Open Source Software, 7(71), 4050. https://doi.org/10.21105/joss.04050

Kratzert, F., Herrnegger, M., Klotz, D., Hochreiter, S., & Klambauer, G. (2019). NeuralHydrology – Interpreting LSTMs in hydrology [Software]. In W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, & K.-R. Müller (Eds.), Explainable AI: Interpreting, explaining and visualizing deep learning (pp. 347–362). Springer International Publishing. https://doi.org/10.1007/978-3-030-28954-6_19

Kratzert, F., Klotz, D., Brenner, C., Schulz, K., & Herrnegger, M. (2018a). Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks. Hydrology and Earth System Sciences, 22(11), 6005–6022. https://doi.org/10.5194/hess-22-6005-2018

Kratzert, F., Klotz, D., Brenner, C., Schulz, K., & Herrnegger, M. (2018b). Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks. Hydrology and Earth System Sciences, 22(11), 6005–6022. https://doi.org/10.5194/hess-22-6005-2018

Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., & Nearing, G. (2019). Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. Hydrology and Earth System Sciences, 23(12), 5089–5110. https://doi.org/10.5194/hess-23-5089-2019

Kratzert, F., Nearing, G., Addor, N., Erickson, T., Gauch, M., Gilon, O., et al. (2023). Caravan - A global community dataset for large-sample hydrology. Scientific Data, 10(1), 61. https://doi.org/10.1038/s41597-023-01975-w

Li, W., Liu, C., Xu, Y., Niu, C., Li, R., Li, M., et al. (2024). An interpretable hybrid deep learning model for flood forecasting based on Transformer and LSTM. Journal of Hydrology: Regional Studies, 54, 101873. https://doi.org/10.1016/j.ejrh.2024.101873

Livneh, B., Bohn, T. J., Pierce, D. W., Munoz‐Arriola, F., Nijssen, B., Vose, R., et al. (2015). A spatially comprehensive, hydrometeorological data set for Mexico, the US, and southern Canada 1950–2013. Scientific Data, 2(1), 1–12. https://doi.org/10.1038/sdata.2015.42

Lukas, J. J., & Payton, E. A. (2020). Colorado River basin climate and hydrology: State of the science. In Western water assessment. University of Colorado.

Lundberg, S., & Lee, S.‐I. (2017). A unified approach to interpreting model predictions. arXiv. Retrieved from http://arxiv.org/abs/1705.07874

Jones, M. O., Robinson, N. P., Naugle, D. E., Maestas, J. D., Reeves, M. C., Lankston, R. W., & Allred, B. W. (2020). Annual and 16-day rangeland production estimates for the western United States [Dataset]. bioRxiv. https://doi.org/10.1101/2020.11.06.343038

Muñoz-Sabater, J., Dutra, E., Agustí-Panareda, A., Albergel, C., Arduini, G., Balsamo, G., et al. (2021). ERA5-Land: A state-of-the-art global reanalysis dataset for land applications [Dataset]. Earth System Science Data, 13(9), 4349–4383. https://doi.org/10.5194/essd-13-4349-2021

Nearing, G. S., Kratzert, F., Sampson, A. K., Pelissier, C. S., Klotz, D., Frame, J. M., et al. (2021). What role does hydrological science play in the age of machine learning? Water Resources Research, 57(3), e2020WR028091. https://doi.org/10.1029/2020WR028091

Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., et al. (2015). Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: Data set characteristics and assessment of regional variability in hydrologic model performance [Dataset]. Hydrology and Earth System Sciences, 19(1), 209–223. https://doi.org/10.5194/hess-19-209-2015

Ouyang, W., Lawson, K., Feng, D., Ye, L., Zhang, C., & Shen, C. (2021). Continental‐scale streamflow modeling of basins with reservoirs: Towards a coherent deep‐learning‐based strategy. Journal of Hydrology, 599, 126455. https://doi.org/10.1016/j.jhydrol.2021.126455

Özdogan Sarıkoç, G., Sarıkoç, M., Celik, M., & Dadaser‐Celik, F. (2023). Reservoir volume forecasting using artificial intelligence‐based models: Artificial neural networks, support vector regression, and long short‐term memory. Journal of Hydrology, 616, 128766. https://doi.org/10.1016/j.jhydrol.2022.128766

Pekel, J.‐F., Cottam, A., Gorelick, N., & Belward, A. S. (2016). High‐resolution mapping of global surface water and its long‐term changes [Dataset]. Nature, 540(7633), 418–422. https://doi.org/10.1038/nature20584

Ramankutty, N., & Foley, J. A. (1999). Estimating historical changes in global land cover: Croplands from 1700 to 1992 [Dataset]. Global Biogeochemical Cycles, 13(4), 997–1027. https://doi.org/10.1029/1999GB900046

Richter, B. D., Lamsal, G., Marston, L., Dhakal, S., Sangha, L. S., Rushforth, R. R., et al. (2024). New water accounting reveals why the Colorado River no longer reaches the sea. Communications Earth & Environment, 5(1), 134. https://doi.org/10.1038/s43247-024-01291-0

Robinson, N., Regetz, J., & Guralnick, R. P. (2014). EarthEnv-DEM90: A nearly-global, void-free, multi-scale smoothed, 90m digital elevation model from fused ASTER and SRTM data [Dataset]. ISPRS Journal of Photogrammetry and Remote Sensing, 87, 57–67. https://doi.org/10.1016/j.isprsjprs.2013.11.002

Sabzipour, B., Arsenault, R., Troin, M., Martel, J.‐L., Brissette, F., Brunet, F., & Mai, J. (2023). Comparing a Long Short‐Term Memory (LSTM) neural network with a physically‐based hydrological model for streamflow forecasting over a Canadian catchment. Journal of Hydrology, 627, 130380. https://doi.org/10.1016/j.jhydrol.2023.130380

Salwey, S., Coxon, G., Pianosi, F., Singer, M. B., & Hutton, C. (2023). National‐scale detection of reservoir impacts through hydrological signatures. Water Resources Research, 59(5), e2022WR033893. https://doi.org/10.1029/2022WR033893

Schwenk, J., Stachelek, J., Bennett, K., Prior, E., Zussman, T., & Rowland, J. (2021). Veins of the Earth: A flexible framework for mapping, modeling, and monitoring the Earth's river networks [Software]. ESS Open Archive. https://doi.org/10.1002/essoar.10509913.1

Schwenk, J., Zussman, T., Stachelek, J., & Rowland, J. C. (2022). RABPRO: Global watershed boundaries, river elevation profiles, and catchment statistics. Journal of Open Source Software, 7(73), 4237. https://doi.org/10.21105/joss.04237

smaebius. (2024). smaebius/crb‐human‐impacts: v1.0.0 release [Software]. Zenodo. https://doi.org/10.5281/zenodo.13729982

Solander, K. C., Bennett, K. E., Fleming, S. W., & Middleton, R. S. (2019). Estimating hydrologic vulnerabilities to climate change using simulated historical data: A proof‐of‐concept for a rapid assessment algorithm in the Colorado River basin. Journal of Hydrology: Regional Studies, 26, 100642. https://doi.org/10.1016/j.ejrh.2019.100642

Talsma, C. J., Bennett, K. E., & Vesselinov, V. V. (2022). Characterizing drought behavior in the Colorado River basin using unsupervised machine learning. Earth and Space Science, 9(5), e2021EA002086. https://doi.org/10.1029/2021ea002086

Hengl, T., de Jesus, J. M., MacMillan, R. A., Batjes, N. H., Heuvelink, G. B. M., Ribeiro, E., et al. (2014). SoilGrids1km – Global soil information based on automated mapping [Dataset]. PLoS One, 9(8), e105992. https://doi.org/10.1371/journal.pone.0105992

Trabucco, A., & Zomer, R. (2019a). Global aridity index and potential evapotranspiration (ET0) database: Version 3 [Dataset]. https://doi.org/10.6084/m9.figshare.7504448.v5

Trabucco, A., & Zomer, R. J. (2019b). Global high‐resolution soil‐water balance [Dataset]. https://doi.org/10.6084/m9.figshare.7707605.v3

US Army Corps of Engineers. (2018). National inventory of dams [Dataset]. https://nid.sec.usace.army.mil/#/

U.S. Bureau of Reclamation and Ten Tribes Partnership. (2018). Colorado River basin ten tribes partnership tribal water study. Retrieved from https://www.usbr.gov/lc/region/programs/crbstudy/tws/finalreport.html

USGS. (2016). National water information system data available on the world wide web (USGS water data for the nation) [Dataset]. U.S. Geological Survey. https://doi.org/10.5066/F7P55KJN

Valle, D., Izbicki, R., & Leite, R. V. (2023). Quantifying uncertainty in land‐use land‐cover classification using conformal statistics. Remote Sensing of Environment, 295, 113682. https://doi.org/10.1016/j.rse.2023.113682

Vu, D. T., Dang, T. D., Galelli, S., & Hossain, F. (2022). Satellite observations reveal 13 years of reservoir filling strategies, operating rules, and hydrological alterations in the Upper Mekong River basin. Hydrology and Earth System Sciences, 26(9), 2345–2364. https://doi.org/10.5194/hess-26-2345-2022

Yamazaki, D., Ikeshima, D., Sosa, J., Bates, P. D., Allen, G. H., & Pavelsky, T. M. (2019). MERIT Hydro: A high-resolution global hydrography map based on latest topography dataset [Dataset]. Water Resources Research, 55(6), 5053–5073. https://doi.org/10.1029/2019wr024873

Zhang, D., Peng, Q., Lin, J., Wang, D., Liu, X., & Zhuang, J. (2019). Simulating reservoir operation using a recurrent neural network algorithm. Water, 11(4), 865. https://doi.org/10.3390/w11040865

© 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Streamflow in the Colorado River Basin (CRB) is significantly altered by human activities including land use/cover alterations, reservoir operation, irrigation, and water exports. Climate also varies widely across the CRB, which contains snowpack-dominated watersheds and arid, precipitation-dominated basins. Recently, machine learning methods have improved the generalizability and accuracy of streamflow models. Previous successes with LSTM modeling have primarily focused on unimpacted basins, and few studies have included human-impacted systems in either regional or single-basin modeling. We demonstrate that the diverse hydrological behavior of river basins in the CRB is too difficult to capture with a single, regional model. We propose a method to delineate catchments into categories based on the level of predictability, hydrological characteristics, and the level of human influence. Lastly, we model streamflow in each category with climate and anthropogenic proxy data sets and use feature importance methods to assess whether model performance improves with additional relevant data. Overall, land use/cover data at a low temporal resolution were not sufficient to capture the irregular patterns of reservoir releases, demonstrating the importance of having high-resolution reservoir release data sets at a global scale. On the other hand, the classification approach reduced the complexity of the data and has the potential to improve streamflow forecasts in human-altered regions.

Details

Title

Machine Learning Classification Strategy to Improve Streamflow Estimates in Diverse River Basins in the Colorado River Basin

Author

Maebius, Sarah¹; Bennett, K. E.¹; Schwenk, J.¹

¹ Earth and Environmental Sciences Division, Los Alamos National Laboratory, Los Alamos, NM, USA

Section

Research Article

Publication year

2024

Publication date

Dec 1, 2024

Publisher

John Wiley & Sons, Inc.

e-ISSN

2333-5084

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1029/2024EA003798

ProQuest document ID

3148790507
