Introduction
Observational data and predictive models (or experimentation and theory) are the two pillars of science, although both are significantly impacted by computation [Vardi, ]. Physically based, mathematical models summarize our current best knowledge of how physical systems operate, and they are used to build computational models that predict future states from initial conditions. Data sets and computational models therefore represent two fundamental resource types that geoscientists use to work on problems of societal interest, such as watershed management and the effects of climate change. While modern geoscientists have online access to an abundance of different data sets and models, these resources differ from each other in myriad ways and this heterogeneity works against interoperability. Interoperability is typically necessary, however, because a single data set or model is seldom sufficient to tackle a nontrivial geoscience problem. For example, models typically require several different types of input data which they may obtain from reading input files or from other models. It is therefore often necessary to couple together many different heterogeneous resources into a computational workflow, and geoscientists spend a large fraction of their time just setting up and executing these workflows. An extensive, empirical analysis of scientific workflows [Garijo et al., ] (and references therein) suggests that geoscientists typically spend between 60% and 80% of their time on dealing with such issues, leaving the remainder as the time available for the science. The phrase data friction [Edwards, ; Edwards et al., ] has also been used to describe this problem.
This challenge of interoperability leads to significant complexity and is one of the main barriers to reproducibility, that is, the ability to reproduce a scientific study or experiment. Reproducibility is considered to be one of the cornerstones of science, but complex workflows often cannot be replicated unless they are fully described and documented. In recent years this topic has been receiving increased attention. Hutton et al. [] cite examples from a broad array of scientific disciplines and then focus on this issue in the context of computational hydrology. Since most modern scientific workflows rely on numerous digital resources, such as data sets, data preparation and analysis software, and computational models, it is now recognized that simply describing a workflow is not enough—all of the digital resources used in a study should also be made available with persistent identifiers (i.e., URLs and DOIs).
This paper begins with a description of a spatial, hydrologic model called TopoFlow developed by the first author and colleagues over the last 16 years. This is followed by an example application to a small, Arctic watershed in the Caribou‐Poker Creek Research Watershed (CPCRW). This application starts with a description of the study site and then walks through all of the steps involved in obtaining the required input data sets, preparing them for use by TopoFlow, setting up a model run, executing the model, and then analyzing the results. By focusing on late summer rainfall events, the dominant hydrologic processes are rainfall and infiltration—in fact, a large percentage of the rainfall volume is lost to infiltration and does not contribute to the volume flow rate (discharge) at the basin outlet. A well‐known, simplified, but physically based model for the infiltration process, known as the Green‐Ampt method, is used within the model and also in our interpretation of the results.
Overview of the TopoFlow Model Toolkit
TopoFlow is a spatially distributed, process‐based and open‐source hydrologic model [Peckham, ]. Development of TopoFlow started in 2000 through a collaboration between Peckham (University of Colorado, Boulder) and several colleagues at WERC, the Water and Environmental Research Center (University of Alaska, Fairbanks). Peckham had just developed a spatially distributed rainfall‐runoff model that used the D8 method for determining flow directions over topography given as a digital elevation model (DEM). That model distinguished between DEM grid cells on hillslopes and those containing channels, using overland flow for the former and open‐channel flow hydraulics for the latter [Henderson, ]. Each channel grid cell contained a prismatic channel with its own trapezoidal cross section and roughness parameters. It supported three different methods of channel flow routing, namely kinematic wave, diffusive wave, and dynamic wave. It also provided the option of using either Manning's formula or the logarithmic law of the wall to model flow resistance. Hinzman's group at WERC had just published a paper on their ARHYTHM model [Zhang et al., ], a spatially distributed hydrologic model for use with Arctic watersheds with an advanced treatment of thermal processes. ARHYTHM used a computational grid of triangles and supported multiple treatments of snowmelt and evaporation, based on many years of prior work in the Arctic, e.g., Hinzman et al. [] and Hinzman et al. []. ARHYTHM supported energy balance treatments of both snowmelt and evaporation in cases where shortwave and longwave radiation measurements were available. It also supported simplified treatments of these two processes that required less input data, namely the Degree‐Day method for snowmelt and the Priestley‐Taylor method for evaporation.
Single‐Application, Interactive Data Language (IDL) Version of TopoFlow With GUI
This collaboration led to the merging of the two models into a single model called TopoFlow‐IDL that supported multiple methods of modeling each hydrologic process and that also had a user‐friendly graphical user interface (GUI). While ARHYTHM was written in Fortran and used triangular grid cells, TopoFlow was written in IDL (Interactive Data Language) and used rectangular grid cells with D8‐based flow directions. The process treatments of snowmelt, evaporation, and shallow subsurface flow from ARHYTHM were converted to IDL, and array‐based best programming practices (e.g., avoidance of spatial loops) were used to ensure good runtime performance. A wizard‐style GUI for TopoFlow‐IDL was developed, and from 2000 to 2007 support for many additional hydrologic processes were added. These additions can be summarized as follows:
- Meteorology. The first major addition was a meteorology module based on celestial mechanics with shortwave and longwave radiation calculators, using the approach outlined in Dingman [] (Appendix E in the supporting information). This allowed the energy balance snowmelt and evaporation process modules to be used even when shortwave and longwave radiation measurements were unavailable. This module required measurements of precipitation rate, air temperature, relative humidity, and wind velocity as its primary inputs, to be read from files, but computed many output variables from these.
- Infiltration. Three different methods for modeling infiltration were added, based on the work of Smith et al. [], namely the well‐known Green‐Ampt method, the Smith‐Parlange three‐parameter method, and the Richards equation (1‐D) method.
- Diversions. Support for flow diversions were added (e.g., canals and tunnels) as well as support for sources and sinks that add or remove some or all of the flow that passes through a given grid cell. Flow diversions are relatively common and must sometimes be modeled to properly account for mass balance. (e.g., the 37.3 km long Harold D. Roberts Tunnel under the Continental Divide from Dillon Reservoir to the South Platte River near Denver, Colorado). This capability has also been used to model flow in river delta distributaries [Hannon et al., ].
- D8 Toolkit. The D8 method [Jenson, ] allows flow direction (aspect) to be computed from a DEM. Given a grid of D8 flow direction codes, several additional geometric attributes can be computed including topographic slope, total contributing areas (TCA), and channel lengths. TCA can then be used to compute grid‐based estimates of channel attributes including widths and roughness parameters as explained in Peckham []. This component contains all of the code necessary to compute D8‐based attributes. (See detailed instructions in the supporting information appendices.)
All of the TopoFlow components are described in detail in Peckham [] and also in HTML‐based help files online that are listed in Appendix A in the supporting information. In addition to the hydrologic process components, TopoFlow‐IDL includes a number of tools for preparing the input data required by the process components.
TopoFlow‐IDL has been used in a number of studies, such as Schramm [], Bolton [], Coe et al. [], Liljedahl [], and Pohl et al. [], several of which involved Arctic hydrology.
The New, Component‐Based, Python Version of TopoFlow
The NSF‐funded Community Surface Dynamics Modeling System (CSDMS) project began in 2007, with Peckham serving as its Chief Software Architect until 2013. The CSDMS modeling framework was designed to support flexible, component‐based, “plug‐and‐play” modeling and to promote reuse and coupling of open‐source models written by different authors [Peckham et al., ]. For component‐based modeling, one has to decide on an appropriate level of granularity, that is, the level of functionality that components should encapsulate. In any given geoscience modeling context, there are typically multiple physical processes that contribute to conservation of mass, momentum, and energy, and there are typically multiple methods (from very simple to sophisticated) for modeling each process. Different methods may differ in various ways, such as (1) the set of equations and variables used to represent the process, (2) the numerical scheme for solving the equations, (3) the programming language, (4) the computational grid, or (5) the use of processors (e.g., parallel versus serial). Several examples of hydrologic processes were described in the previous section. It is therefore natural to encapsulate individual physical process treatments in interchangeable components, and this sets an appropriate and useful granularity for plug‐and‐play modeling.
While the original IDL version of TopoFlow already supported multiple methods for modeling each of several hydrologic processes—with users selecting methods from process droplists—all of these modules were dependent on the application to connect the selected modules. A user could only choose from modules that were already included in the application and could not easily replace some of them with modules written by others, nor use a TopoFlow process module outside of the TopoFlow‐IDL app. This type of “all in one” model is typical, where the application basically serves as a self‐contained framework that allows all of the process modules to interoperate.
As an efficient means of exploring different architectural designs to support plug‐and‐play modeling, most of the source code for TopoFlow‐IDL was converted from IDL to Python/NumPy. TopoFlow was also decomposed into a set of independent process‐level model components as shown in Figure . Each component then had its own time loop, its own configuration or CFG file (a text file read by the component at startup to set parameters, etc.), its own HTML help page, and so on. Any variables that needed to be passed between model components were made available through a standardized component interface (i.e., set of functions), and many different interface prototypes were tested against numerous design criteria. TopoFlow therefore became a vehicle for designing the CSDMS modeling framework [Peckham et al., ], including the Basic Model Interface [CSDMS‐BMI, ] and the CSDMS Standard Names [Peckham, ].
A diagram of all TopoFlow components, where arrows between components indicate dynamic coupling and the passing of one or more variables between components. Each component provides a method of modeling a particular hydrologic process, and the dashed boxes contain alternate methods for modeling the same process, ranging from simple to complex. Users choose one process component from each dashed box.
The Basic Model Interface (BMI) enables simplified, plug‐and‐play reusability and coupling of model components, often written by different people, even when those models differ in terms of programming language, computational grid, time‐stepping scheme, and the names, units, and data types of their input and output variables. The standardized set of functions in the BMI provide a modeling framework with (1) fine‐grained control of model execution, (2) descriptive information needed for coupling, and (3) variable getter and setter functions to support exchanging the values of variables. Based on information retrieved from BMI functions, the modeling framework is able to automatically call its built‐in mediators (e.g., regridders, unit converters, and time interpolators) to deal with differences between the coupled models. A unique feature of BMI is that it also addresses the problem of semantic mediation that arises from each model using its own set of names for input and output variables—its own internal vocabulary. A BMI implementation includes a simple mapping from a model's internal variable names to a set of standard variable names called the CSDMS Standard Names (CSNs). This mapping allows the framework to perform automatic semantic mediation. The CSNs are a systematic, unambiguous, cross‐domain set of variable naming conventions which have evolved into a formal ontology called the Geoscience Standard Names [GSN, ].
Prior to 2012, CSDMS had developed a web‐based graphical tool called the Component Modeling Tool (CMT) that CSDMS members could download as a lightweight Java application. CMT allowed users to select components from a palette and drag them into an arena to become part of a composite model. Once in the arena, CMT provided a tabbed‐dialog GUI for each model component to collect the user settings necessary to automatically create a configuration file from a model‐specific template. The GUI also provided standardized, HTML help pages for each component. Users could create new models from components, configure them, then run them on a high‐performance cluster and also monitor their progress through the CMT. CMT has since been superseded by the CSDMS Web Modeling Tool (WMT), which is completely browser‐based as well as more robust and efficient. TopoFlow was available in the CMT and is now available in the WMT, [CSDMS‐WMT, ] by choosing wmt‐hydrology. The WMT allows models to be composed and configured without logging into the high‐performance cluster until they are ready for execution. Note that both CMT and WMT are simply graphical front ends to the CSDMS modeling framework, which itself is a software stack running on a cluster. They collect and store information in a modeling framework configuration file that is passed to the underlying framework. However, these configuration files may also be created with a text editor.
As a result of deconstructing TopoFlow into separate process‐level model components, the Python version of TopoFlow required the presence of a model coupling framework like CSDMS in order to be used. However, some TopoFlow users wanted to be able to run TopoFlow on their own computers, without relying on a cluster with the full CSDMS software stack. This led to the development of a lightweight, experimental modeling framework written entirely in Python called EMELI (Experimental Modeling Environment for Linking and Interoperability) [Peckham, ]. Users simply list the names of the model components they want to use in a provider file, and then EMELI 1.0 automatically connects each component to other components in the set that are able to provide the input variables they require, based on semantic matching with CSDMS Standard Names. EMELI requires each model component to have an implementation of BMI with Python bindings. EMELI also includes two mediators, a time interpolator and a unit converter, that are automatically invoked when model components use different time steps or units.
All of the TopoFlow model components, utilities, documentation, and a few example data sets are now available in a single, stand‐alone Python package that is bundled with EMELI [Peckham, , ]. Source code for EMELI is in the framework folder of this package. Source code for TopoFlow model components is in the components folder, and each has a BMI interface and uses CSDMS Standard Names. This includes a component called d8_global.py with a complete D8 toolkit for computing D8 flow direction grids and associated grids (e.g., total contributing area grids). Also included is a D8‐based, fluvial landscape evolution model called Erode (erode_d8_global.py). Source code for a variety of shared, low‐level TopoFlow utilities is in the utils folder, and these are used by all components. Some complete sets of input files for testing are included in the examples folder of the Python package. Altogether, the TopoFlow 3.5 package consists of over 71,000 lines of Python code, including comments. The TopoFlow 3.5 package has been tested on both the MacOS and Windows platforms, on top of an Anaconda Python distribution. It should also run on Linux platforms.
As part of an NSF EarthCube building block project called GeoSemantics, an alternate version of EMELI, EMELI‐Web [], has been developed that can couple TopoFlow model components running on different servers as web services [Jiang et al., ]. Values of variables that must be passed between components running on different servers are bundled in NetCDF files for transmission. It also provides a browser‐based GUI, similar to WMT, for configuring each model component prior to a model run.
TopoFlow continues to be used in different contexts and in support of different cyber‐infrastructure projects. For example, TopoFlow is used by CSDMS staff for teaching. Individual TopoFlow components can also be run within an iPython notebook, which makes it possible to follow every detail of model execution. TopoFlow is also being used in the NSF EarthCube building block project OntoSoft to help drive the development of standardized metadata and ontologies for describing geoscience models [OntoSoft‐CSDMS, ].
The Typical TopoFlow Workflow
With physically based, spatially distributed hydrologic models like TopoFlow, most of the work in using them has to do with acquiring and preparing the input data for the study site, configuring the model, and then running the model a large number of times with different parameter settings. (Without using a standard component interface like BMI, coupling process models can also be very time consuming.) Digital elevation data, which play a very important role in this type of modeling, are generally easy to find and download for any region of interest, at a variety of resolutions, thanks to the efforts of the U.S. Geological Survey and other agencies and projects. Many of the other types of input data, including meteorological, soil, and snowpack data, may not exist for a basin of interest, may have only been measured at a station some distance away, or may be difficult to find with an online search. However, this information is generally available online for experimental watersheds such as LTER (Long‐Term Ecological Research) sites. Data at the scale of individual river channels, such as their widths, depths, and bed roughness, are generally not available and must be parameterized with known empirical relationships and limited field observations. Figure provides a high‐level illustration of the steps in a typical workflow. Details on how to perform steps 2 through 4 are included in multiple appendices in the supporting information. Note that steps 6 through 8 may be repeated many times with different choices of process components and different parameter settings in an effort to understand the physical processes at work in the basin and to make predictions that compare favorably with observations. For hydrologic modeling, one typically compares the predicted and observed hydrographs at the outlet of the basin of interest, which show the volume of water per unit time as a function of time (i.e., Q(t)) that flows through the outlet.
Preparing Input Files for TopoFlow
One of the unique features of TopoFlow is that, by design, almost any input variable that occurs in any of the process components can be provided as any of the following: (1) a scalar (a single value that does not vary in space or time), (2) a time series (a value that varies in time but not in space), (3) a grid (values that vary over space, but do not change over time), or (4) a grid stack (values that are variable in both space and time).
For example, a spatially uniform, steady rainfall rate would be provided as a scalar (to be used for all grid cells and all times), while a space‐time rainfall field would be provided as a grid stack. A grid stack is essentially a time series of grids. Implementation of this feature is simplified because both IDL and Python are dynamically typed programming languages, and upcasting therefore occurs automatically in expressions that contain a mixture of scalar values and grids. So functions in TopoFlow usually do not need to check whether arguments are scalar values or grids and can handle mixtures. Note that the term scalar is used in TopoFlow to mean a single numerical value; this could be confusing because a grid of values can be called a scalar field.)
For consistency, simplicity, and runtime performance, when TopoFlow reads the values of input variables from files, it expects (and requires) that (1) any time series is stored in a single‐column text file (one value per line), (2) any grid is stored in an IEEE binary grid file, in row‐major order, and (3) any grid stack is stored as a succession of binary grids, stored in a single file.
Binary grid files are a very common and computationally efficient file format. Appendix B in the supporting information describes binary grid files in more detail and explains how to read and write them with Python. Appendix C in the supporting information describes the binary grids that are needed to run TopoFlow. Appendix D in the supporting information explains how to prepare D8‐based, binary grid input files for TopoFlow from a DEM with the D8 Global component. (The DEM is also assumed to be stored in a binary grid file.)
While all of the hydrologic process components in TopoFlow are optional and can be disabled with a setting in their configuration file, there are very few simulations that can be performed without using a channel flow component such as the kinematic wave component. (One example is snow depth accumulation using a snow component and meteorology component.) However, channel flow components require input files that describe both the channel geometry and bed roughness as spatial grids with the same dimensions as the DEM, as well as a D8 channel slope grid. This type of data is not readily available online, so TopoFlow users must prepare these grids. As explained in Peckham [], these can be estimated by applying power law functions of the form V = c(A + b)p to a D8 total contributing area (TCA) grid, because both channel width and bed roughness vary with discharge (and therefore with TCA). However, the power law function parameters should be chosen carefully to obtain reasonable results. The IDL version of TopoFlow (TopoFlow‐IDL) has a Create menu in its GUI that includes dialogs for computing these grids. However, one could also use Python commands similar to those in Appendices C and D in the supporting information to read a TCA grid into a variable, A, apply a power law function, and then write the resulting grid to a binary grid file. The compound filename extensions _chan‐w.rtg, _chan‐a.rtg, and _chan‐n.rtg are commonly used in TopoFlow for grids of channel widths, channel trapezoidal bank angles (i.e., flare angles), and the channel roughness parameter, Manning's n, respectively.
Many of the other TopoFlow input variables represent either initial conditions—such as depth of water or snow, soil water content, and position of the water table—or forcing variables (or drivers), such as precipitation rate, air temperature, relative humidity, shortwave and longwave radiation fluxes, and wind speed [Peckham, ]. These can be acquired from a variety of sources, including federal agencies, but must be either downloaded as or converted to binary grid files (or single‐column text files for time series data). They may also need to be clipped or resampled to have the same dimensions and spatial extent as the DEM, which can be done using GIS software.
For output files, TopoFlow users can choose to save a time series, profile series, grid stack, or cube stack of data values to a NetCDF file or to binary grid files. A cube stack is a succession of 3‐D arrays indexed by time, while a profile series is a succession of 1‐D arrays indexed by time, such as a soil moisture profile that varies over time. These four types of output files include 0D, 1D, 2D, or 3D in their filenames, respectively. The 1‐D and 3‐D options are currently only used for subsurface flow variables. Users set flags in TopoFlow configuration files to choose which computed variables to save, using one of these four types of files. Often, one is interested in saving the values of some output variable (e.g., discharge, flow depth, or flow speed) at one or more specific locations (e.g., basin outlets) as a time series. This option can be turned on with a setting in a component's configuration file but requires the user to first create an outlet file—a simple text file with the following format:
The column and row values are essential and must be obtained with GIS software, such as RiverTools, but the area and relief values are nonessential. Multiple grid cells can be indicated for monitoring by adding rows.
Many applications provide a graphical user interface (GUI) that can make it much easier to prepare, process, edit, analyze, and visualize binary grid input files. For example, Appendix E in the supporting information explains how RiverTools and TopoFlow‐IDL can be used. Additional information on preparing input files for TopoFlow can be found in Peckham [] and in the
Example Application—Caribou‐Poker Creek Research Watershed
Description of the Watershed
To illustrate the workflow of setting up and running TopoFlow, we use the data from the Caribou‐Poker Creek Research Watershed (CPCRW) located 48 km north of Fairbanks 65° ° W Alaska. The CPCRW site is part of the LTER (Long‐Term Ecological Research) network. Parts of this watershed are underlain by permafrost, where the maximum seasonal thaw depth thickness is about 0.52 m at a low elevation point near the confluence of Poker and Caribou Creeks [Bolton et al., , ; Bolton, ]. Black spruce is generally found along poorly drained north facing slopes and valley bottoms. Aspen, birch, alder, and sporadic white spruce are found on the well‐drained, south facing slopes. Tussock tundra, feather moss, and sphagnum mosses are also found along valley bottoms [Bolton, ]. The watershed encompasses an area of 101.5 km2 as shown in Figure . CPCRW is located within the boreal forest area. The watershed has six subwatersheds, where three of them (C2, C3, and C4) have been continuously monitored over the last few decades. We chose to model the C2 subwatershed within the CPCRW because it is south facing and has almost no permafrost. South facing slopes usually correspond to a warmer microclimate and have a thinner organic layer and well‐drained soils. The snowmelt at this site usually happens in a span of 1 or 2 weeks at the end of the spring season. Previous studies (e.g., Bolton et al. []) have compared results for the C2 basin to those of the north facing, C3 basin, which has about the same basin area but has 54% permafrost versus 4% for C2. See Figure .
A map of the present measurement sensors in Caribou‐Poker Creek Research Watershed study site.
Locations of the C2 and C3 subbasins and the four met stations within the Caribou‐Poker Creek Research Watershed. North is at the top of the image, and the circles have a diameter of about 3 km.
Figure is a soil map for the vicinity of the C2 basin, clipped from a larger soil map contained in Rieger et al. [] (p. 14). It shows that the soil in the C2 basin is mostly Gilmore silt loam, with Fairplay silt loam near the drainage divide. Table 2 in Rieger et al. [] (p. 13) provides estimates of Ks (which they call permeability, as explained on p. 12), with values between 0.6 and 2.0 in/h or 4.23 × 10−6 to 1.41 × 10−5 m/s.
Soil map for part of the Caribou‐Poker Creek Research Watershed that includes the C2 basin, from Rieger et al. []. Soil types are all specific types of silt loam and corresponding slope ranges, as indicated with the following symbols: Bradway, Br; Ester: EsD (12 to 20%), EsE (20 to 30%), and EsF (30 to 40%); Fairplay: FpB (3 to 7%), FpC (7 to 12%), FpD (12 to 20%), and FpE (20 to 30%); Gilmore: GmD (12 to 20%), GmE (20 to 30%), and GmF (30 to 45%); Karshner: KaB (3 to 7%) and KaC (7 to 12%); Olnes: OnB (3 to 7%), OnC (7 to 12%), OnD (12 to 20%), OnE (20 to 30%), and OnF (30 to 45%); and Saulich: SuB (3 to 7%), SuC (7 to 12%), and SuD (12 to 20%).
Acquiring a Digital Elevation Model for the Study Site
The TopoFlow model relies on a digital elevation model (DEM) in order to determine the flow directions and slopes that it uses to route water across the landscape. We used the following steps to obtain a DEM for our study site. While we think it is illustrative to describe these steps, we note that they are specific to a browser‐based, graphical user interface (GUI) provided by the U.S. Geological Survey (USGS) as it exists at the time of writing—one that is likely to change in the future. While GUIs like this are typical and fairly easy to use, a hydrologist may need to use many different ones in order to acquire all of the input needed for a modeling study. This illustrates another challenge for reproducibility that is partially alleviated when data providers make their data available for download via a web service with a standardized API.
- Step 1. The USGS provides an online tool called
The National Map (TNM) for downloading data. In the navigation section on the left side of the page, check the box labeled Elevation Products (3DEP) in the Data section. This displays additional check boxes with different horizontal resolutions. We used 1 arcsec DEMs for this study, so we checked the box with this label. In the section labeled File Format, we checked the radio button labeled GridFloat. This is a simple, nonproprietary and efficient format that stores elevation values in row‐major order (IEEE 4 byte, floating‐point binary values), with the filename extension “.flt”. Metadata—such as number of rows and columns, bounding box and grid cell dimensions—is saved in a small, companion text file with filename extension “.hdr”. - Step 2. Using the TNM browser‐based map, we zoomed into the region labeled Yukon Flats National Wildlife Refuge, in the area southwest of (and downstream from) the confluence of the Porcupine River and the Yukon River. This is about 38 km north and slightly east of Fairbanks. The Caribou‐Poker Creek Research Watershed is contained within the 1° by 1° tile with its southwest corner at 66° north latitude and 148° west longitude. We used the plus button to zoom into this region and clicked the radio button at the top labeled Current Extent. We then clicked on the blue button labeled Find Products. This resulted in a list of 1° DEM tiles, and we then selected the one labeled USGS NED 1 arcsec n66w148 1×1 degree GridFloat 2016 by clicking on the shopping cart plus button to the right. The size of this file (one tile) is 42.99 MB. (A higher‐resolution DEM for the same tile, with a grid spacing of 1/3 arcsec and file size of 377.98 MB, is also available.) Note that 1 arcsec of latitude is always roughly 93.6 m, while 1 arcsec of longitude decreases with latitude as (lat) and is significantly less than 93.6 m for these Arctic latitudes.
Acquiring Data From the Bonanza Creek LTER Station
The Institute of Arctic Biology at the University of Alaska, Fairbanks maintains a Long‐Term Ecological Research (LTER) site (funded by NSF) called the
Rainfall rates, measured with a tipping bucket, are stored in text files where the header and first line of data for the CARSNOW station look like this
Similarly, discharge measurements for the C2, C3, and C4 subbasins are available in text files where the header and first line of data look like this
Volume flow rates (discharges) at the C2 basin outlet have been measured every summer since 1979. The measurement frequency was hourly until the summer of 2001, after which it was measured every 15 min. This long record presents an opportunity for investigating the possible effects of climate change. Additional metadata is available in a file called README.rtf.doc that comes with the data from the LTER website.
Preparing LTER Data for Model Use
Discharge and precipitation data were cleaned with the Python preprocessing scripts and notebooks that we have made available on GitHub [Stoica, ]. Version 1.0.0 of this GitHub repository was also archived and assigned a DOI via Zenodo‐GitHub integration [Stoica, ]. The time‐date format for both sets of data was converted to seconds with respect to a common reference date so that data could be properly aligned. Discharge data for the three basins (C2, C3, and C4) were interleaved, so we selected the desired data using the data frame manipulation utilities in the Pandas Python package. A sample of the data was selected based on the desired date‐time interval values. Both data sets contained outliers that were mislabeled with “G” (to indicate that the data are good), so these were detected using histograms and then filtered out. Details and analysis of the data sets can be found at the GitHub link presented above.
Observed Response of the C2 Basin to a Late Summer Rainfall Event
Figure shows rainfall rates that were measured with a tipping bucket at 1 h time intervals at the CPEAK met station for a rainfall event that took place over a 2 day period from 28 to 30 July 2008. Figure shows the corresponding volume flow rates (discharges) measured at the outlet of the C2 basin. The C2 hydrograph shows that there is a base flow discharge of approximately 0.036m3/s prior to the rainfall event. (In our model runs we start with dry channels and then add this amount of base flow to the resulting hydrograph.) The four main rainfall peak values in the measured rainfall time series for the July rainfall event at the CPEAK met station and the three corresponding peak discharges at the C2 basin outlet are given in Table . As shown in Figure , observed rainfall rates at the other met stations display a similar temporal intensity pattern, which supports treating rainfall as spatially uniform over the C2 basin as a first approximation. However, methods such as inverse distance weighting (IDW) could be used to create a grid sequence of rainfall rates to be used as input to TopoFlow. An IDW tool for this purpose is included with TopoFlow‐IDL.
Measured rainfall rates at the CPEAK met station for a summer rainfall event: 28–30 July 2008.
Volume flow rates measured at the outlet of the C2 basin for a summer rainfall event: 28–30 July 2008.
Time (min) | Rainfall Rates (mm/h) ([m/s]) | Time (min) | C2 Discharges (L/s) |
120–240 | P1= 3.51 (9.75 × 10−7) | 255 – 285 | Q1= 45.4 |
600–660 | P2a= 3.24 (9.00 × 10−7) | — | — |
780–840 | P2b= 4.59 (1.28 × 10−6) | 915 – 930 | Q2= 73.6 |
1920–1980 | P3= 3.24 (9.00 × 10−7) | 2130 – 2145 | Q3= 67.9 |
Measured rainfall rates at the CPEAK, CARSNOW, and HR1A met stations for a summer rainfall event: 28–30 July 2008.
Model Component Selection and Setup
While TopoFlow includes numerous hydrologic process components, we deliberately chose to model a relatively simple situation where the dominant processes are surface flow, rainfall, and infiltration. We chose a late summer rainfall event so that contributions from snowmelt could be neglected. We also chose the C2 subbasin, which is south facing and almost free of permafrost. The near absence of permafrost, the relatively uniform soil type (Gilmore silt loam), and the selection of a rainfall event preceded by several days of no rainfall mean that the assumptions of the Green‐Ampt infiltration model should be approximately satisfied. In addition, evaporation is considered to be negligible compared to the rainfall and infiltration rates. Finally, the relatively steep slopes throughout the C2 basin mean that the kinematic wave method of channel flow routing should be appropriate (i.e., no backwater effects, etc.) With these simplifications, we can focus on surface flow, as described by Manning's formula, and the infiltration process physics included in the Green‐Ampt model.
Choosing a Soil Water Retention Model for the Infiltration Process
In infiltration theory, there are four interrelated variables of interest that represent 3‐D scalar fields below the land surface, namely K, the hydraulic conductivity, v, the vertical component of the Darcy velocity, θ, the soil water content, and ψ the pressure head. In order to create a mathematical model that can solve for these four variables, four equations are needed. Two equations are conservation of mass and Darcy's Law, and combining them results in the well‐known Richards equation for modeling the flow of water through a porous medium (e.g., soil), driven by gravity as well as capillary suction. However, two additional equations are needed, and these are empirical relations of the form K(θ) and ψ(θ) known as soil characteristic relations, which allow K and ψ to be computed as functions of θ. Brooks and Corey [] proposed functional forms for K(θ) and ψ(θ) that depend on three parameters, namely ψB<0, the bubbling pressure, and two model parameters η and λ, where η = 2+3λ. The parameters η > 0 and λ > 0 can be set to different values in order to provide good fits to observational data for the flow dynamics of different soil texture types. However, while their model has K = Ks at saturation (θ = θs), it yields ψ = ψB<0 instead of ψ = 0 at saturation. van Genuchten (1980) proposed an alternate pair of relations that also depend on three parameters, namely αg<0, m > 0, and n > 0. While this model has ψ = 0 at saturation, its parameters are less physically meaningful and it has a complicated functional form for K(ψ) [Smith et al., ] (p. 21) that is difficult to integrate. (This makes it difficult to compute the parameter G in .)
Smith [] introduced a third pair of relations based on the Brooks‐Corey model, which he called the transitional Brooks‐Corey (TBC) model, which combines the benefits of the Brooks‐Corey and van Genuchten models. This is the soil water retention model that is used by every infiltration component in TopoFlow. It introduces two new parameters c > 0 and ψA, such that the Brooks‐Corey model is obtained in the limit as
. For Smith's TBC model, the soil characteristic relations are given by
Green‐Ampt Infiltration Model Component—Theory
The Green‐Ampt infiltration model [Green and Ampt, ; Smith et al., ] can be derived as a physically based approximation to Richards Equation, which in turn is considered the best‐available mathematical model for the process of infiltration. This approximation conserves the mass of water and incorporates a soil model (e.g., Brooks‐Corey) but assumes that the initial soil moisture profile is uniform with depth and that lateral flow in the unsaturated zone can be neglected. It also assumes that there is a single, deep soil layer with uniform properties. Unlike the Richards equation model, the Green‐Ampt model treats the variation of soil water content, θ, with depth below the land surface, z, as simple piston flow with a sharp wetting front—that is, with θ = θs above the wetting front and θ = θi below the wetting front.
Both the Green‐Ampt and Smith‐Parlange (three parameter) models of infiltration make use of what Smith et al. [] calls the Infiltrability‐Depth Approximation or IDA. Instead of expressing the infiltration rate at the surface, v0, as a function of time, t, the IDA instead expresses v0 as a function of the cumulative infiltration depth, F, given by
For the TBC soil model used by TopoFlow, G can be computed by inserting into , and for the typical case where ψA=0, Peckham [] found the following closed‐form expression for the resulting integral
Matching Rainfall Peaks to Hydrograph Peaks With Green‐Ampt
Assuming no other gains from snowmelt or base flow, and no losses from evaporation, the runoff available to generate discharge is given by R = (P − v0), the difference between the rainfall rate and the surface infiltration rate. In the special case where the soil is saturated at the start of a rainfall event, we clearly have Ki=Ks, as well as θi=θs, in view of . Equation then reduces to fc=Ks and equation simplifies to
In addition, the total contributing area of the C2 basin is A = 4.84km2 and the longest channel in the C2 basin is 2.93 km. Assuming a mean flow velocity of roughly 1 m/s in the channels, then once water reaches a channel, most water should reach the basin outlet in less than half an hour.
At a given point in the C2 basin, runoff will not occur unless the rainfall rate, P, exceeds the infiltrability, fc=Ks+C/F; and therefore, it must exceed Ks. It follows that in order for the hydrograph peaks to have been generated by the corresponding rainfall peaks (although routed to the basin outlet by the channel network), we must have P1>Ks, P2>Ks, and P3>Ks. To satisfy all three inequalities, we must have Ks<9.0 × 10−7 m/s. Note, however, that the rainfall rates were measured at 1 h intervals while the model time steps for both the infiltration and channel routing components were 2 s and the hydrograph at the outlet was measured at a time interval of 15 min. The actual rainfall peak values were likely higher (and no less than) than the recorded peak values. If the range estimate of 4.23 × 10−6<Ks<1.41 × 10−5 m/s given by Rieger et al. [] is accurate, then the instantaneous rainfall rates must have been higher than the low end of this range. But since these higher values were not measured, it would appear to be necessary to use values of Ks in the model at least 10 times smaller in order to match the observed hydrograph.
Model Results and Analysis
We use equation to guide our search for a good combination of parameters for the watershed. Ks is the offset that determines which pulses in the precipitation data are completely suppressed, since when P < Ks, the model indicates that all water is infiltrated. It is important to note, then, that the sampling rate for the precipitation, which in the case of this data set is 1 h, is critical to being able to utilize a Ks value that reflects the observed value. Since the precipitation sampling rate is low, we need to utilize a much lower Ks value than the reported value to preserve the peaks. Figure shows the discharge at the tracked outlet in C2 for different values of Ks when θi=θs so that fc=0. We notice that it is the second peak that is attenuated the most by an increase in Ks. Noting that there is a shoulder in the observed discharge output in Figure indicates that the second peak is not attenuated and in fact looks to have approximately the same amplitude as the first peak. Therefore, selecting a Ks value for which a 1:1 ratio may be maintained between these two peaks is desirable; a Ks value above 8.0 × 10−7 seems to completely suppress the second peak and is thus not desirable.
Effect of varying Ks with θi=θs on precipitation peak response suppression. Notice that the observed and modeled hydrographs have different y axes.
For any given Ks value, introducing the second term in equation by decreasing θi will further attenuate the peaks. The amount of attenuation and which peaks are attenuated most is determined by the weighting factor, G. Figure illustrates how increasing G preferentially attenuates to first peaks more than latter peaks. The first three peaks all experience attenuation while the fourth peak is relatively unaffected.
Effect of varying G with Ks=7.7 × 10−7 m/s and θi=0.01 on precipitation peak response suppression. Notice that the observed and modeled hydrographs have different y axes.
With this understanding of parameter adjustment on modeled output, it is possible to find an optimum combination of parameters that will yield a reasonable output volume. Attempting to achieve the correct volume output while maintaining all the peaks results in incorrect peak response matching with the third peak being much larger than the other three peaks. Allowing suppression of the first two peaks and attempting to optimize the ratio of the third and fourth peak heights as well as the total volume discharge yields the result in Figure . In order to obtain this simulated output, several adjustments were made to the input data. First, the precipitation data were resampled at 2 s intervals (from the original 1 h), so that instead of a single pulse of precipitation at the beginning of each hour, the precipitation was added to the model in smaller increments. This resulted in a better overall precipitation volume estimate (within 100 m3) in comparison to that modeled with the hour‐interval precipitation (off by more than 6000 m3). Additionally, the default surface drag effect from the Manning n coefficient was increased by a factor of 6. Without this drag effect, each pulse response occurs sooner and decays faster.
Best match result in terms of total volume discharge when Ks=4.5 × 10−7 m/s, θi=0.17, and G = 1.1 m. The precipitation data were resampled for 2 s intervals using a uniform distribution, and the drag factor due to the Manning n coefficient was multiplied by a factor of 6.
Figure illustrates the way the model responds to a given input. It can be seen that the model behaves like a simple integrator upon being fed an impulse. It is the amount of cumulative precipitation during a short period of time that determines whether the model responds to a rain event, not the number of peaks during an event. Effectively, the precipitation event mimics a sequence of two step functions, and the outputs show how the motion of water through the landscape behaves similar to the discharging of a capacitor in an integrator circuit.
Best match result in terms of total volume discharge when Ks=4.5 × 10−7 m/s, θi=0.17, and G = 1.1 m plotted against the cumulative precipitation.
Lastly, in Figure , we show an example of the influence of increased surface drag and four‐peak response on the simulated output. From this example, we can see the importance of these two effects in smoothing out the discharge curve.
Illustrative example of the effect of increased surface drag (factor of 30 versus 6) on the curve shape of the modeled discharge. This model result yielded a volume discharge 10 times higher than observed but shows the features that could be captured for higher precipitation data sampling rates. Notice that the observed and modeled hydrographs have different y axes.
A detailed, reproducible workflow of how the results in this section were obtained is available on GitHub as an
Conclusions and Recommendations
This paper has attempted to document the entire workflow of a spatial hydrologic modeling study to the extent that it can be easily reproduced and extended by other scientists. In support of reproducibility, the entire TopoFlow modeling toolkit, including components, utilities, and framework, is (1) open source and accessible on GitHub (MIT license); (2) version controlled; (3) easily installed as a Python package; (4) citable with a DOI [Peckham, ]; (5) extensible by adding new components; and (6) runnable with the CSDMS, EMELI, and EMELI‐Web frameworks.
In addition, every TopoFlow model component has (1) object‐oriented source code that takes advantage of inheritance; (2) a Basic Model Interface (BMI) to support plug‐and‐play reuse in frameworks; (3) all input and output variables mapped to CSDMS Standard Names; (4) additional, standardized metadata at OntoSoft portal (CSDMS section); (5) its own HTML help page that describes all variables and all equations used; (6) its own easy‐to‐read and edit configuration file (read at startup); (7) a graphical user interface (GUI): i.e., WMT for Python and TopoFlow‐IDL for IDL; (8) the ability to run as a web service with EMELI‐Web; and (9) the ability to save its output to standard format NetCDF files or generic binary grids. (These NetCDF files can be viewed with many visualization software toolkits, such as VisIt [].)
New components can easily be created by copying and then editing the Python source code of existing components. Components of a given process type can also inherit many capabilities from existing base classes. In addition, BMI‐to‐Framework‐X adapters are under development by the EarthCube Earth System Bridge project that will allow components to run in several other modeling frameworks. These adapters are available in the BMI‐Forum [] on GitHub.
All software used in this hydrologic modeling study has been made available in two GitHub repositories. The first one [Peckham, ] includes the complete TopoFlow 3.5 Python Package, which includes all documentation, components, utilities, the EMELI framework, and some example data sets for testing. The version 3.5.0 release has been archived and assigned a DOI with Zenodo‐GitHub integration [Peckham, ]. The second one [Stoica, ] includes the complete set of Python scripts and iPython notebooks used in this specific study for (1) preprocessing the LTER data, (2) preparing CFG files for model runs, (3) plotting model output to create figures, and (4) analyzing model results. While the original LTER data files are too large for GitHub, direct links to these files can be found in the second GitHub repository. The DEM, D8‐derived grids, CFG files, and other TopoFlow input files have been zipped and are also included in the second repository. Version 1.0.0 of this second repository was archived and assigned a DOI with Zenodo‐GitHub integration [Stoica, ].
While used mainly as a vehicle for highlighting the issue of reproducibility and best practices, and while only tapping a small fraction of the capabilities of the TopoFlow model toolkit, the hydrologic modeling study for the C2 watershed led to interesting results that could be pursued further in several different directions. We showed how the features of a hydrograph resulting from a late summer storm could be interpreted using the Green‐Ampt infiltration model. However, this study also clearly illustrated how insufficient temporal resolution in rainfall rate measurements suppresses the magnitudes of rainfall rate peaks, and how this prevents the model from matching observations unless parameters such as Ks are adjusted away from observed values. In this analysis, a robust, but lesser‐known soil water retention model used by TopoFlow—known as transitional Brooks‐Corey—was also highlighted, and a new, closed‐form expression for G was provided.
Acknowledgments
We would like to thank Matt Nolan, who initiated the collaboration that led to TopoFlow, Larry Hinzman, who helped to guide and provide funding for its development, as well as the entire Arctic hydrology team at WERC (Water and Environmental Research Center), at the University of Alaska, Fairbanks. S. Peckham is the primary author of TopoFlow, all versions of which are free and open source, as well as the product RiverTools 4.0, which is sold commercially by Rivix, LLC. To avoid any perceived conflict of interest, this paper has described workflows that do not require RiverTools. This work was partially funded by NSF grants ICER 1440332 (GeoSoft), 1440333 (GeoSemantics), and PLR 1503559.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2017. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Modern geoscientists have online access to an abundance of different data sets and models, but these resources differ from each other in myriad ways and this heterogeneity works against interoperability as well as reproducibility. The purpose of this paper is to illustrate the main issues and some best practices for addressing the challenge of reproducible science in the context of a relatively simple hydrologic modeling study for a small Arctic watershed near Fairbanks, Alaska. This study requires several different types of input data in addition to several, coupled model components. All data sets, model components and processing scripts (e.g., for preparation of data and figures, and for analysis of model output) are fully documented and made available online at persistent URLs. Similarly, all source codes for the models and scripts are open source, version controlled, and made available online via GitHub. Each model component has a Basic Model Interface to simplify coupling and its own HTML help page that includes a list of all equations and variables used. The set of all model components (TopoFlow) has also been made available as a Python package for easy installation. Three different graphical user interfaces for setting up TopoFlow runs are described, including one that allows model components to run and be coupled as web services.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details





1 Institute of Arctic and Alpine Research, University of Colorado Boulder, Boulder, Colorado, USA
2 Los Alamos National Laboratory, Los Alamos, New Mexico, USA
3 International Arctic Research Center, University of Alaska, Fairbanks, Alaska, USA