Introduction
Global water resources are increasingly recognised as a major concern for the sustainable development of societies.
Reproducibility and repeatability of experiments are at the core of the scientific method and essential for ensuring scientific progress. Reproducibility is the ability of different observers to reproduce the results of an experiment conducted under near-identical conditions, in order to test findings independently. Repeatability refers to the degree of agreement of tests or measurements on replicate specimens by the same observer under the same controlled conditions. Thus, merely providing data through open online platforms (or in any other way) is not enough to ensure that reproducibility objectives can be met. In fact, the inferences previously drawn may be ambiguous to different observers if insufficient knowledge of the experimental design is available. Previous studies have highlighted the impact of modellers' decisions on hydrological predictions. Hydrology is therefore likely to be similar to other sciences that have not yet converged to a common approach to modelling their entities of study. In such cases, meaningful interpretations of comparisons are problematic, as illustrated by many catchment or model inter-comparison studies in the past, including global-scale initiatives that encompass social interactions with the natural system, such as ISLSCP.
In this paper we explore the potential of a virtual water-science laboratory to overcome the aforementioned problems. A virtual laboratory provides a platform to share data, tools and experimental protocols. In particular, experimental protocols constitute an essential part of a scientific experiment, as they guarantee quality assurance and good practice. Specifically, we address the following two questions:
What factors control reproducibility in computational scientific experiments in hydrology?
What is the way forward to ensure reproducibility in hydrology?
After presenting the structure of the Virtual Water-Science Laboratory (VWSL), we describe in detail the collaborative experiment carried out by the research groups in the VWSL. We deliberately designed the experiment as a relatively traditional exercise in hydrology, in order to better identify critical issues that may arise in the development and dissemination of virtual laboratories and that are not associated with the complexity of the considered experiment. This experiment therefore supports subsequent research within the VWSL, and provides initial guidance for the broad scientific community on designing protocols and sharing evaluations within virtual laboratories.
The SWITCH-ON Virtual Water-Science Laboratory
The purpose of the SWITCH-ON VWSL is to provide a common workspace for collaborative and meaningful comparative hydrology. The laboratory aims to facilitate, through the development of detailed protocols, the sharing of data, tools, models and any other relevant supporting information, thus allowing experiments on a common basis of open data and well-defined procedures. This will not only enhance the general comparability of different experiments on specific topics carried out by different research groups; the available data and tools will also make it easier for researchers to exploit the advantages of comparative hydrology and collaboration, which is widely regarded as a prerequisite for scientific advance in the discipline. In addition, the VWSL aims to foster cooperative work by actively supporting discussion and collaboration among research groups. Although the VWSL is currently used only by researchers who are part of the EU FP7 project SWITCH-ON, it is also open to external research groups, in order to obtain feedback and to establish a sustainable infrastructure that will remain after the end of the project. Any experiment formulated within the VWSL needs to comply with specific stages, organised as the 8-point workflow described in detail below, which outlines the scientific process and the structure for using the facilitating tools in the VWSL.
STAGE 1: define science questions
This stage allows researchers to discuss and agree on the science questions to be addressed, through a dedicated on-line forum hosted within the VWSL.
STAGE 2: set up experiment protocols
In this step a recommended protocol for collaborative experiments needs to be developed. This protocol formalises the main interactions between project partners and acts as a guideline for the experiment outline, in order to ensure experiment reproducibility and thus control the degrees of freedom of individual modellers.
STAGE 3: collect input data
The VWSL contains a catalogue of relevant external data, available as open data from any source on the Internet, in a format that can be directly used in experiments. Stored data are organised in Level A (pan-European scale, covering the whole of Europe) and Level B (local data, covering limited or regional domains). Currently, Level A includes input data to the E-HYPE model (some 35 000 sub-basins covering Europe), such as precipitation, evaporation, soil and land-use, river discharge and nutrient data, while Level B includes hydrological data (i.e. precipitation, temperature and river discharge) for 15–20 selected catchments across Europe. In addition, a Spatial Information Platform (SIP) has been created. This platform includes a catalogue with a user interface for browsing metadata from many data providers. So far, the data catalogue has been filled with 6990 items, comprising files for download, data viewers and web pages. The SIP also includes functionalities for linking additional metadata and for visualising data sets. Therefore, through the stored data and the SIP, researchers can easily find and explore data deemed to be relevant for a hydrological experiment.
STAGE 4: repurpose data to input files
In this step, raw original data from STAGE 3 can be processed (i.e. transformed, merged, etc.) to create suitable input files for hydrological experiments or models. For example, the World Hydrological Input Set-up Tool (WHIST) can tailor data to specific models or resolutions. Another example, planned for future activities in the VWSL, is provided by land-use data, which can be aggregated to relevant classes and adjusted to specific spatial discretisations (e.g. model grid or sub-basin areas across Europe). Both raw original and repurposed data (STAGES 3 and 4) should be accompanied by detailed metadata (i.e. a protocol), which specify e.g. data origin, spatial and temporal resolution, observation period, description of the observing instrument, information on data collection, measures of data quality, coherency of the measurement method and instrument, and any other relevant information. Data should be provided according to international open data standards.
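As a concrete illustration of such a repurposing step (the helper function below is ours, not part of WHIST), raw discharge records in m³ s⁻¹ can be converted into the specific-runoff units (mm day⁻¹) typically required by a lumped model:

```r
## Hypothetical repurposing step (STAGE 4): convert raw discharge (m^3 s^-1)
## into specific runoff (mm day^-1) using the catchment area. Illustrative only.
q_to_mm_per_day <- function(q_m3s, area_km2) {
  # m^3 s^-1 -> m^3 day^-1 -> m day^-1 over the catchment -> mm day^-1
  q_m3s * 86400 / (area_km2 * 1e6) * 1000
}
q_to_mm_per_day(10, 394)  # e.g. 10 m^3 s^-1 over the 394 km^2 Gadera catchment
```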
STAGE 5: compute model outputs
By employing open source model codes, freely available via the VWSL, or through links to model providers, researchers can perform hydrological model calculations using the same tools. Results can then be compared, evaluated, reused and/or repurposed for new experiments. In addition, templates for protocols are available to ensure the reproducibility and repeatability of model analysis and results. The protocol may include, for instance, a description of the hydrological experiment, and information on the model, input data and metadata, employed algorithms and temporal scales. Protocols for model experiments will thus create a framework for a generally accepted, scientifically valid and identical environment for specific types of numerical experiments within the VWSL, and will promote transparency and data sharing, therefore allowing other researchers to download and reproduce the experiment on their own computer.
STAGE 6: share results
Links to model results are uploaded to the VWSL in order to enable post-audit analyses and to ensure the transparency of the performed experiments, which can then be reproduced by other research groups.
STAGE 7: explore the findings
Here, researchers can extract, evaluate and visualise experiment results gathered at STAGE 5. A separate space for discussion and comparisons of results, through the on-line forum, additionally facilitates direct and open knowledge exchange between researchers and research teams.
STAGE 8: publish and access papers
Links to scientific papers and technical reports on comparative research, resulting from collaboration and experiments based on data in the VWSL, will be made available in the VWSL itself.
Geographical location and runoff seasonality (averages over the observation periods listed in Table ) (mm month⁻¹) for the 15 catchments considered in the first collaborative experiment of the SWITCH-ON Virtual Water-Science Laboratory.
[Figure omitted. See PDF]
The first collaborative experiment in the SWITCH-ON Virtual Water-Science Laboratory
Description and purpose of the experiment
The first pilot experiment of the SWITCH-ON VWSL aims to assess the reproducibility of the calibration and validation of a lumped rainfall–runoff model over 15 European catchments (Fig. ) by different research groups using open software and open data (STAGE 1). Calibration and validation of rainfall–runoff models is a fundamental step for many hydrological analyses, including drought and flood frequency estimation.
Summary of the key geographical and hydrological features for the 15 catchments considered in the first collaborative experiment of the SWITCH-ON Virtual Water-Science Laboratory.
Catchment | Area (km²) | Mean elevation (min, max) (m a.s.l.) | Observation period (start–end) | Mean catchment rainfall (mm year⁻¹) | Mean catchment temperature (°C) | Mean observed streamflow per unit area (mm year⁻¹) |
---|---|---|---|---|---|---|
Gadera at Mantana (Italy) | 394 | 1844 (811, 3053) | 1 Jan 1990–31 Dec 2009 | 842 | 5.2 | 640 |
Tanaro at Piantorre (Italy) | 500 | 1067 (340, 2622) | 1 Jan 2000–31 Dec 2012 | 1022 | 8.6 | 692 |
Arno at Subbiano (Italy) | 751 | 750 (250, 1657) | 1 Jan 1992–31 Dec 2013 | 1213 | 11.5 | 498 |
Vils at Vils (Austria) | 198 | 1287 (811, 2146) | 1 Jan 1976–31 Dec 2010 | 1768 | 5.5 | 1271 |
Großarler Ache at Großarl (Austria) | 145 | 1694 (859, 2660) | 1 Jan 1976–31 Dec 2010 | 1314 | 3.5 | 1113 |
Fritzbach at Kreuzbergmauth (Austria) | 155 | 1169 (615, 2205) | 1 Jan 1976–31 Dec 2010 | 1263 | 5.7 | 799 |
Große Mühl at Furtmühle (Austria) | 253 | 723 (252, 1099) | 1 Jan 1976–31 Dec 2010 | 1075 | 7.2 | 696 |
Gnasbach at Fluttendorf (Austria) | 119 | 311 (211, 450) | 1 Jan 1976–31 Dec 2010 | 746 | 9.8 | 218 |
Kleine Erlauf at Wieselburg (Austria) | 168 | 514 (499, 1391) | 1 Jan 1976–31 Dec 2010 | 973 | 8.6 | 545 |
Broye at Payerne (Switzerland) | 396 | 714 (391, 1494) | 1 Jan 1965–31 Dec 2009 | 899 | 9.1 | 647 |
Loisach at Garmisch (Germany) | 243 | 1383 (716, 2783) | 1 Jan 1976–31 Dec 2001 | 2010 | 5.8 | 957 |
Treene at Treia (Germany) | 481 | 25 (1, 80) | 1 Jan 1974–31 Dec 2004 | 905 | 8.4 | 413 |
Hoan at Saras Fors (Sweden) | 616 | 503 (286, 924) | 27 Apr 1988–31 Dec 2012 | 739 | 2.3 | 428 |
Juktån at Skirknäs (Sweden) | 418 | 756 (483, 1247) | 19 May 1980–31 Dec 2012 | 941 | 1.4 | 739 |
Nossan at Eggvena (Sweden) | 332 | 168 (91, 277) | 10 Oct 1978–31 Dec 2012 | 894 | 6.4 | 344 |
Study catchments and hydrological data
European catchments characterised by a drainage area larger than 100 km² and by at least 10 years of daily hydro-meteorological data (lumped information on rainfall, air temperature, potential evaporation and runoff) are considered (STAGE 3). The 15 selected catchments are located in Sweden, Germany, Austria, Switzerland and Italy (Fig. ). Daily time series of rainfall, temperature and streamflow, gathered from national environmental agencies and public authorities (see Acknowledgements for more details), are pre-processed by the partner who contributed the data set to the experiment (e.g. to homogenise units of measurement) before being employed in the TUWmodel (STAGE 5). Potential evaporation data are derived, as repurposed data (STAGE 4), from hourly temperature and daily potential sunshine duration by a modified Blaney–Criddle equation.
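For orientation, a minimal sketch of the classical Blaney–Criddle relation on which such modified formulations build (the exact modified form used in the experiment follows the cited literature and is not reproduced here):

$$
\mathrm{PET} = p\,(0.46\,T + 8.13) \quad [\mathrm{mm\,day^{-1}}],
$$

where $T$ is the mean daily air temperature (°C) and $p$ is the mean daily percentage of annual daytime hours.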
Experiment protocols
As detailed above, the objective of this experiment is to test the reproducibility of the TUWmodel results on the 15 study catchments when the model is implemented and run independently by different research groups. Consequently, the experiment provides an indication of the experimental implementation uncertainty. Two protocols were defined for this purpose: Protocol 1, which prescribes all modelling choices in detail, and Protocol 2, which leaves selected choices to the individual research groups (see below).
Main settings of Protocol 1 of the first collaborative experiment of the SWITCH-ON Virtual Water-Science Laboratory.
Component | Description and link | ||
---|---|---|---|
Model version | TUWmodel (R package) | | |
Input data | Rainfall, temperature and potential evaporation data; catchment area | ||
Objective function | Mean square error (MSE) | ||
Optimisation algorithm | DEoptim (R package), 10 runs of 600 iterations | | |
Parameter values or ranges | Lower limits | Upper limits | |
SCF [–] | 0.9 | 1.5 |
DDF [mm °C⁻¹ day⁻¹] | 0.0 | 5.0 |
Tr [°C] | 1.0 | 3.0 |
Ts [°C] | -3.0 | 1.0 |
Tm [°C] | -2.0 | 2.0 |
LPrat [–] | 0.0 | 1.0 |
FC [mm] | 0.0 | 600.0 |
BETA [–] | 0.0 | 20.0 |
k0 [day] | 0.0 | 2.0 |
k1 [day] | 2.0 | 30.0 |
k2 [day] | 30.0 | 250.0 |
lsuz [mm] | 1.0 | 100.0 |
cperc [mm day⁻¹] | 0.0 | 8.0 |
bmax [day] | 0.0 | 30.0 |
croute [day² mm⁻¹] | 0.0 | 50.0 |
Calibration and validation periods | Divide the observation period into two consecutive sub-periods of equal length. First calibrate on the first sub-period and validate on the second, then swap the calibration and validation periods |||
Initial warm-up period | 365 days for both calibration and validation periods | ||
Temporal scales of model simulation | Daily | ||
Additional data used for validation (state variables, other response data) | None | ||
Uncertainty analysis (Y/N) | N | | |
Method of uncertainty analysis | None | ||
Post-calibration evaluation metrics (skills) | MSE, RMSE, NSE, log(NSE), bias, MAE, MALE, VE |
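As a minimal sketch (assuming aligned daily simulated and observed series qsim and qobs, in mm day⁻¹, with the warm-up period already removed), the listed skill metrics can be computed as follows; the small constant used before taking logarithms is our own safeguard against zero flows, not a protocol requirement:

```r
## Sketch of the post-calibration skill metrics listed in Protocol 1.
nse  <- function(sim, obs) 1 - sum((sim - obs)^2) / sum((obs - mean(obs))^2)
mse  <- function(sim, obs) mean((sim - obs)^2)
rmse <- function(sim, obs) sqrt(mse(sim, obs))
bias <- function(sim, obs) mean(sim - obs)
mae  <- function(sim, obs) mean(abs(sim - obs))
eps  <- 1e-6  # guards log() against zero flows (our assumption)
lnse <- function(sim, obs) nse(log(sim + eps), log(obs + eps))        # log(NSE)
male <- function(sim, obs) mean(abs(log(sim + eps) - log(obs + eps))) # mean absolute log error
ve   <- function(sim, obs) 1 - sum(abs(sim - obs)) / sum(obs)         # volumetric efficiency
```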
Comparison between Protocol 1 and Protocol 2 settings of the first collaborative experiment of the SWITCH-ON Virtual Water-Science Laboratory.
Component | Protocol 1: all research groups | Protocol 2: BRISTOL | Protocol 2: SMHI | Protocol 2: TUD | Protocol 2: TUW | Protocol 2: UNIBO |
---|---|---|---|---|---|---|
Identification of unreliable data | All data are considered | Runoff coefficient analysis | All data are considered | Visual inspection of unexplained hydrograph peaks | All data are considered | Exclusion of 25 % of calibration years with high MSE |
Parameter ranges | See Table | See Table | See Table | See Table | See Table except for Tr, Ts, bmax, croute (fixed values) | See Table |
Optimisation algorithm | Differential evolution optimisation (DEoptim) – 10 times, 600 iterations | Differential evolution optimisation (DEoptim) – 10 times, 1000 iterations | Latin hypercube approach | Dynamically dimensioned search (DDS) – 10 times, 1000 iterations | Shuffle complex evolution (SCE) | Differential evolution optimisation (DEoptim) – 10 times, 600 iterations |
Objective function | Mean square error (MSE) | Mean absolute error (MAE) | Mean square error (MSE) | Kling–Gupta efficiency (KGE) | Objective function from , Eq. (3) | Mean square error (MSE) |
Warm-up period | 1 year for calibration and validation | 1 year for calibration and validation | 1 year for calibration and validation | 1 year for calibration and validation | 1 year for calibration and validation | 1 year for calibration and validation |
Protocol 1
For Protocol 1, the calibration of the TUWmodel is based on the Differential Evolution optimisation algorithm (DEoptim), run 10 times with 600 iterations each and using the mean square error (MSE) as objective function (see Table ).
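A minimal sketch of this calibration step (assuming daily vectors prec, temp, pet and observed runoff qobs, in mm day⁻¹, are already loaded for one catchment; the variable names are ours) could read:

```r
library(TUWmodel)  # lumped conceptual rainfall-runoff model used in the experiment
library(DEoptim)   # differential evolution optimiser

# parameter bounds in the order listed in the Protocol 1 table
lower <- c(0.9, 0.0, 1.0, -3.0, -2.0, 0.0,   0.0,  0.0, 0.0,  2.0,  30.0,   1.0, 0.0,  0.0,  0.0)
upper <- c(1.5, 5.0, 3.0,  1.0,  2.0, 1.0, 600.0, 20.0, 2.0, 30.0, 250.0, 100.0, 8.0, 30.0, 50.0)

warmup <- 365  # one-year warm-up excluded from the objective function

mse_obj <- function(par) {
  sim  <- TUWmodel(prec = prec, airt = temp, ep = pet, area = 1, param = par)
  qsim <- as.numeric(sim$q)
  mean((qsim[-(1:warmup)] - qobs[-(1:warmup)])^2, na.rm = TRUE)
}

set.seed(1)  # the optimiser is stochastic: different seeds may yield different optima
fit <- DEoptim(mse_obj, lower, upper,
               control = DEoptim.control(itermax = 600, trace = FALSE))
best_par <- fit$optim$bestmem  # calibrated parameter set, reused for validation
```

Under Protocol 1, this optimisation is run 10 times with 600 iterations each (see Table ).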
Protocol 2
In Protocol 2, the different research groups could make individual choices in an attempt to improve model performance. More specifically, during model calibration on the first half of the observation period, users could (i) shorten the calibration period by excluding what they believed to be potentially unreliable pieces of data, providing detailed justifications, (ii) modify the prior parameter distributions, (iii) change the optimisation algorithm and its settings, (iv) select alternative objective functions, and (v) freely choose the model warm-up period (see Table and the Supplement for a detailed description). As in Protocol 1, the calibrated parameter values are used as inputs for the evaluation of the simulated discharge during the validation period, and the same goodness-of-fit statistics as in Protocol 1 are computed.
Results
A web-based discussion (STAGES 6 and 7) was held among the researchers to collectively assess the results, comparing the experiment outcomes and benefiting from their personal knowledge and experience. The results revealed that reproducibility is ensured when:
experiment and modelling purpose are outlined in detail, which requires a preliminary agreement on semantics and definitions,
a standardised format of input data (e.g. file format, data presentation, and units of measurement) and pre-defined variable names are proposed,
the same model tools (i.e. code and software) are used.
Within a collaborative context, this can be achieved only if the involved research groups completely agree on the detailed protocol of the experiment. In what follows we report the experiences gained from the experiment, and we finally suggest a process that enables research groups to improve the set-up of protocols.
Protocol 1
The variability in the optimal calibration performance obtained by all research groups for Protocol 1, ordered by catchment, is shown in Fig. . For some catchments, notably the Gadera (ITA) and Großarler Ache (AUT), optimal calibration performance is very similar between groups, indicating that the Protocol was executed properly by each research group. However, for some other catchments, including the Vils (AUT), Broye (SUI), Hoan (SWE) and Juktån (SWE), more variability in optimal performance between groups was obtained. Given that Protocol 1 is not deterministic, as the optimisation algorithm contains a random component, variability in optimal performance is to be expected even if the protocol were repeated by a given research group. Thus, in order to make a proper comparison between research groups, i.e. to assess the reproducibility of an experiment, an understanding of this within-group variability, or repeatability, is required. The range in optimal performance obtained by one research group (BRISTOL) when the optimisation algorithm was run 100 times, instead of 10 times as per Protocol 1, is also plotted in Fig. to give an indication of the within-group variability.

With the exception of the second calibration period for the Vils (AUT) catchment, where UNIBO found a lower RMSE, the between-group variability in calibration performance falls within the bounds of the within-group variability, which indicates a successful execution of the Protocol across all catchments. Of the 100 optimisation runs conducted for the Vils (AUT) catchment during the second calibration, 99 were at the upper end of the range in Fig. , alongside the results of all groups except UNIBO, with only one result at the lower end of the range. In this case, and in the case of the poorer performance of the BRISTOL calibration for the Broye (SUI), where early stopping of the optimisation algorithm consistently occurred, the results suggest the algorithm became trapped in a local minimum and struggled to converge to a global minimum, or at least to an improved solution such as that identified by other groups or runs.

In addition to convergence issues causing differences between the results of each group, differences in the identified optimal parameter sets suggest that divergence in performance may also result from parameter insensitivity and equifinality (Fig. ). Furthermore, performance is also affected by the presence of more complex catchment processes that are not fully captured by the chosen hydrological model (e.g. snowmelt or soil moisture routines in catchments with a large altitude range or diverse land covers). Thus, from a hydrological viewpoint, the results were not completely satisfactory, and detailed analysis at each location is required. However, given that in the majority of cases the between-group variability in performance (reproducibility) was within the range of within-group variability (repeatability), it can be concluded that Protocol 1 ensured reproducibility between groups for the proposed model calibration.
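The within-group repeatability check described above can be sketched by simply repeating the Protocol 1 optimisation under different random seeds, reusing mse_obj, lower and upper from the earlier calibration sketch:

```r
## Repeat the optimisation 100 times with different seeds and record the optimal
## RMSE; the spread of rmse_runs approximates the within-group variability.
rmse_runs <- sapply(1:100, function(s) {
  set.seed(s)
  fit <- DEoptim(mse_obj, lower, upper,
                 control = DEoptim.control(itermax = 600, trace = FALSE))
  sqrt(fit$optim$bestval)  # RMSE is the square root of the optimal MSE
})
range(rmse_runs)  # compare the between-group spread against this range
```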
Optimal RMSE of runoff (square root of the objective function) obtained for calibration period 1 and calibration period 2 by each research group for the 15 catchments. The black bars show the range in optimal performance obtained by a single research group (BRISTOL) from 100 calibration runs initiated from different random seeds.
[Figure omitted. See PDF]
Parallel coordinate plots of the optimal parameter set estimates derived from each participant group in each of the 15 catchments for Protocol 1. Model parameters are shown on the horizontal axis and catchments on the right-hand axis. The parameters have been scaled to the ranges shown in Table .
[Figure omitted. See PDF]
Protocol 2
To overcome the problems arising from Protocol 1 and possibly improve model performance, the effects of the personal knowledge and experience of the research groups were explored in Protocol 2. Here, researchers were allowed to change model settings more flexibly, which may introduce more pronounced variability in the results of the individual research groups, owing to different decisions in the modelling process. Given that flexibility allows a more proficient use of expert knowledge and experience, one may expect an improvement in model performance. Flexibility indeed enables modellers to introduce new choices to improve the representation of processes and, consequently, to correct artefacts of the automatic calibration used for parameter selection (as in Protocol 1), which could otherwise lead to unexpected model behaviour. The increase in flexibility in Protocol 2 led to a significant divergence in model performance between groups, as exemplified in Fig. for the NSE performance metric. These changes reflect the different approaches taken in an attempt to improve model performance in terms of process representation, and to correct problems arising from Protocol 1. In turn, they delineate the effects of the different personal knowledge and experience of the research groups.

More specifically, BRISTOL and UNIBO both chose to exclude potentially unreliable data from the calibration data set. In the case of BRISTOL, following visual inspection of the data, it was felt that a more thorough data evaluation procedure prior to calibration was required; based on the calculation of event runoff coefficients, a subset of the time series in nine catchments was excluded. Researchers from UNIBO decided to exclude nearly one quarter of the available data for each study watershed: for each year, the MSE was computed with the parameter set that gave the best calibration results in Protocol 1, and the years with the highest MSE were removed. Data removal appeared to lead to improved calibration performance and, to a lesser extent, improved validation performance. As per Protocol 2, data were not removed from the validation period. Conversely, researchers from TUW and TUD decided not to remove any data from the calibration period, but to adopt alternative optimisation procedures to enhance the robustness of the calibration (see Table ).

The discussion among modellers pointed out that changing the objective function from MSE to different formulations did not actually degrade the simulations, but only lowered the NSE values, because lower priority was assigned to the simulation of peak flows, while other features of the hydrograph were better simulated. For instance, the Kling–Gupta efficiency was used by TUD, as it provides a more equally weighted combination of bias, correlation coefficient and relative variability than NSE. This led to reduced bias and volume error compared to the results of the other groups but, as a trade-off, worsened the performance in terms of NSE. Similarly, the use of MAE by BRISTOL (see Table ) led to improvements in log(NSE), MAE and MALE for nearly all catchments in calibration and validation, but increased bias and volume errors in some cases. As there was no uniquely defined objective for Protocol 2, such choices reflected attempts by the groups to achieve an appropriate compromise across performance metrics.
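For reference, a minimal sketch of the Kling–Gupta efficiency used by TUD, showing how it weights the correlation, variability-ratio and bias-ratio components whose trade-offs are discussed above (sim and obs are assumed aligned daily runoff series):

```r
## Kling-Gupta efficiency (Gupta et al., 2009): 1 indicates a perfect fit.
kge <- function(sim, obs) {
  r     <- cor(sim, obs)           # linear correlation
  alpha <- sd(sim) / sd(obs)       # variability ratio
  beta  <- mean(sim) / mean(obs)   # bias ratio
  1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)
}
```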
SMHI adopted a hydrological process-based approach, where the modellers accepted small performance penalties in terms of NSE if the conceptual behaviour of the model variables looked more appropriate during the calibration procedure. This was done to get a good model for the right reasons, and expert knowledge on hydrological processes and model behaviour was then included along with the statistical criteria. The evaluation of the goodness-of-fit by SMHI was performed by visual comparison and an analysis of several (internal) model variables, e.g. soil moisture, evapotranspiration rates and snow water equivalents, instead of simply using a different objective function. These analyses pointed to conceptual model failures in several catchments (e.g. Loisach (GER) catchment, Fig. ), leading to the adoption of a calibration approach which considered the structural limitations of the TUWmodel and their implications for model performance (see also Supplement).
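A sketch of this kind of internal-state inspection (assuming the TUWmodel output list exposes snow water equivalent and soil moisture series, here taken to be the fields swe and moist; these names are assumptions to be checked against the package documentation):

```r
## Inspect internal model states rather than only the runoff fit; the field
## names swe and moist are assumptions about the TUWmodel output list.
sim <- TUWmodel(prec = prec, airt = temp, ep = pet, area = 1, param = best_par)
par(mfrow = c(2, 1))
plot(as.numeric(sim$swe),   type = "l", xlab = "day", ylab = "SWE (mm)")
plot(as.numeric(sim$moist), type = "l", xlab = "day", ylab = "soil moisture (mm)")
```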
Nash–Sutcliffe efficiency (NSE) estimated for model validation, obtained by the five research groups, for the 15 catchments, according to Protocols 1 and 2.
[Figure omitted. See PDF]
Identified issues in a collaborative experiment
Collaboration implies communication between scientists. During this first experiment, researchers engaged in frequent and close communication, both via e-mail and through the VWSL forum, in order to highlight encountered problems, discuss model results and their interpretation, and identify challenges for the future improvement of the VWSL itself. In particular, several incidents during this experiment showed the importance of well-defined terms for achieving reproducibility between the research groups. These problems pointed out that communication between different groups through the web may be problematic. Indeed, the hydrological community is not yet well acquainted with inter-group cooperation. Detailed guidelines, including a preliminary rigorous setting of definitions and terminology, are needed for a virtual laboratory to work properly.
Flowchart of the suggested procedure to establish protocols for collaborative experiments.
[Figure omitted. See PDF]
Suggested procedure to establish protocols for collaborative experiments
Based on the experiment results, we were able to identify a recommended workflow for collaborative experiments, to streamline the work among largely disjoint and independent partners. The workflow covers three distinct phases: Preparation, Execution and Analysis (Fig. ).

The Preparation phase contains the bulk of the processes specific to collaboration between independent partner groups. Starting from an initial experiment idea, partners are brought together and a coordination structure is chosen. A lead partner, who is responsible for coordinating the experiment preparation, needs to be identified. There are two main tasks in the Preparation phase: the establishment and clear communication of the experiment protocol, and the compilation of a project database. The form of the protocol specifications can be chosen by the partners, but they must provide detailed and exhaustive instructions regarding (i) the driving principles of the protocol, which include and reflect the purpose of the experiment; (ii) data requirements and formatting; (iii) experiment execution steps; and (iv) result reporting and formatting. An initial protocol version is prepared and then evaluated by the single partners, and returned for improvement if ambiguities are found. Personal choices, made independently by partner groups during a test execution of the experiment, might be included; such choices need to be well defined, and the comparability of results must be ensured through requirements in the protocol. Once the experiment protocol is agreed, partners collect, compile and publish the data necessary for the experiment using formal version-control criteria, again following a release-and-evaluation cycle.

The Execution phase starts immediately after the completion of these tasks, and the protocol is released to all partners, who perform the experiment independently. The protocol execution can include further interaction between partners, which must be well defined in the protocol. During this phase, there should be a formal mechanism to notify partners of unexpected errors that lead to an experiment abort and a return to the protocol definition; errors can then be corrected in a condensed iteration of the Preparation phase. All partners report their experiment results to the coordinating partner, who then compiles and releases the overall outcomes to all partners.

The Analysis phase requires partners to analyse the experiment results with respect to the proposed goals of the experiment. Partners communicate their analyses, leading to (i) rejection of the experiment results as inconclusive regarding the original hypothesis, or (ii) publication of the experiment to a wider research community. This formalised workflow can then be filled in by the experiment partners with more specific agreements on the infrastructure for a specific experiment. These may include:
technical agreements, such as data documentation standards to adhere to or computational platforms to be used by the partners;
means of communication between partners, which could range from simple solutions, such as the establishment of an e-mail group, to more complex forms, such as an online communication platform with threaded public and private forums as well as online conferencing facilities;
file exchange between partners, including data, metadata, instructions and experiment result content. This could be implemented through informal agreements, such as a deadline-based collection–compilation–release system, or formal solutions, such as version-controlled file servers with well-defined release cycles (a sketch of such agreements in machine-readable form is given below).
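As an illustration only, and not part of the SWITCH-ON specification, such infrastructure agreements could be captured in a small machine-readable record so that all partners work from the same settings; every field name below is hypothetical:

```r
## Hypothetical machine-readable record of the infrastructure agreements listed
## above; all field names and values are illustrative, not prescribed by the VWSL.
protocol_agreements <- list(
  data_standard  = "agreed metadata and documentation standard",
  platform       = "computational platform shared by all partners",
  communication  = c("e-mail group", "threaded on-line forum"),
  file_exchange  = list(
    mechanism     = "version-controlled file server",
    release_cycle = "deadline-based collection-compilation-release"
  )
)
str(protocol_agreements)  # partners can inspect the agreed settings at any time
```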
Discussion and conclusions
Hydrology has always been hindered by the large variability of our environment. This variability makes it difficult for us to derive generalisable knowledge given that no single group can assess many locations in great detail or build up knowledge about a wide range of different systems. Open environmental data and the possibilities of a connected world offer new ways in which we might overcome these problems.
In this paper, we present an approach for collaborative numerical experiments using a virtual laboratory. The first experiment carried out in the SWITCH-ON VWSL suggests that the value of comparative experiments can be improved by specifying detailed protocols. Indeed, in the context of collaborative experiments, we may recognise two alternative experimental conditions: (i) experimenters want to do exactly the same things (i.e. use the same model with the same data), or (ii) researchers decide to adopt different model implementations and assumptions based on their personal experience. In the first case, the protocol agreed upon by the project participants needs to be accurately defined in order to eliminate personal choices from the experiment execution. Under this experimental condition, the reproducibility of experimental results among different research groups should be consistent with the repeatability within a single research group. The experience from using Protocol 1 showed the importance of an accurate definition of the experiment design and a detailed selection of appropriate tools, which helped to overcome several incidents during experimental set-up and execution. Problems related to insensitive parameters, local optima and a model structure that was inappropriate for some of the study catchments led to variability in performance across research groups. Our experience revealed that quantifying the within-group variability (i.e. repeatability) is necessary to adequately assess reproducibility between groups. In turn, residual variability may indicate a lack of reproducibility and aid in the identification of specific issues, as considered above. In the second case, the experiment is similar to traditional model inter-comparison projects.
Multi-basin applications of hydrological models allowed the experimenters to identify links between physical catchment behaviour, efficient model structures and reliable priors for model parameters, all based on expertise with different systems gathered by different groups. Even though we engaged in a relatively simple collaborative hydrological exercise, the results discussed here show that it is important to revisit experiments that are seemingly simpler than existing inter-group model comparisons, in order to understand how small differences affect model performance. What is clear is that it is fundamental to control for the different factors that may affect the outcomes of more complex experiments, such as modeller choice and calibration strategy.

In more complex situations, virtual experiments could be conducted through comparisons at different levels of detail. For example, if models with different structures were to be compared, there would be no one-to-one mapping of the state variables and model parameters, and the comparison would have to be applied at a higher level of conceptualisation. There are a number of examples in the literature where comparisons at different levels of conceptualisation have been demonstrated to provide useful results. One such example is the Chicken Creek model inter-comparison, where the modellers were given an increasing amount of information about the catchment in steps, and in each step the model outputs in terms of water fluxes were compared. The Chicken Creek inter-comparison involved models of vastly different complexities, yet provided interesting insights into the way models made assumptions about the hydrological processes in the catchment and the associated model parameters. Another example is the Predictions in Ungauged Basins (PUB) comparative assessment, where a two-step process was adopted. In a first step (Level 1 assessment), a literature survey was performed and publications in the international refereed literature were scrutinised for results on the predictive performance of runoff, i.e. a meta-analysis of prior studies performed by the hydrological community. In a second step (Level 2 assessment), some of the authors of the publications from Level 1 were approached with a request to provide data on their runoff predictions for individual ungauged basins. At Level 2 the overall number of catchments involved was smaller than in the Level 1 assessment, but much more detailed information on individual catchments was available. Level 1 and Level 2 were therefore complementary steps.

In a similar fashion, virtual experiments could be conducted using the protocol proposed in this paper at different, complementary levels of complexity. The procedure for protocol development (Fig. ), which notably includes checks on independent model choices between partners and feedback to earlier stages of protocol development, will help in developing protocols for more complex collaborative experiments, addressing real science questions on floods, droughts, water quality and changing environments. More elaborate experiments are part of ongoing work in the SWITCH-ON project, and the adequacy of the protocol development procedure itself will be evaluated during these experiments. The modelling study presented in this paper therefore represents a relatively simple, yet no less important, first step towards collaborative research in the Virtual Water-Science Laboratory.
To sum up, in this study we set out to answer the following specific scientific questions related to the reproducibility of experiments in computational hydrology, as outlined in the Introduction.
What factors control reproducibility in computational scientific experiments in hydrology?
Reproducibility is primarily governed by shared data and models, along with experiment protocols that define data requirements (metadata, also indicating the versions of data sets) and format (for example, units of measurement, identification of no-data values, relevant observation period), experiment execution (e.g. selection of a well-documented hydrological model code), and result analysis (e.g. criteria for judging model performance). These protocols aim to provide a common agreement and understanding among the involved research groups about the data and the purpose of the experiment. Human errors (e.g. ambiguity in variable names, small oversights during model execution) and unclear file-exchange procedures can be considered the main causes of reduced reproducibility when researchers want to do the same thing. Conversely, if different model implementations are allowed, reduced reproducibility may result from a lack of means of communication, from an unclear purpose of the modelling exercise, or from too many choices being allowed at once.
What is the way forward to ensure reproducibility in hydrology?
Where different research groups use the same input data and model code, an essential prerequisite for a reliable experiment is to formalise a rigorous protocol, based on an agreed taxonomy, together with a technical environment that helps to avoid human mistakes. If, on the other hand, researchers are allowed to perform different model implementations, the main purpose of the modelling exercise needs to be clearly defined. For instance, in Protocol 2 the added value of the researchers' scientific knowledge enabled an extensive exploration of alternative modelling options, which can be helpful for future hydrological experiments in the VWSL. Furthermore, the experiment should be designed such that the relationship between the experimental choices (the causes) and the experimental results (the effects of these choices) can be clearly determined. This is required to avoid a form of equifinality that results from the experimental set-up, where the relative benefits of the different choices made by the research groups cannot be established. Also in this second case, a controlled technical environment will help to produce reproducible experiments. Therefore, version management of databases, code documentation, metadata, preparation of protocols and feedback mechanisms among the involved partners are all issues that need to be considered in order to establish a virtual laboratory in hydrology. Virtual laboratories provide the opportunity to share data and knowledge and to facilitate scientific reproducibility. They will therefore also open the door to the synthesis of individual results. This perspective is particularly important to create and disseminate knowledge and data in water science, and opens the way to more coherent hydrological research.
Acknowledgements
The SWITCH-ON Virtual Water-Science Laboratory is being developed within the context of the European Commission FP7-funded research project "Sharing Water-related Information to Tackle Changes in the Hydrosphere – for Operational Needs" (grant agreement number 603587). The overall aim of the project is to promote data sharing to exploit open data sources. The study contributes to developing the framework of the "Panta Rhei" Research Initiative of the International Association of Hydrological Sciences (IAHS). The authors acknowledge the Austrian Hydrographic Service (HZB); the Regional Agency for the Protection of the Environment – Piedmont Region, Italy (ARPA Piemonte); the Regional Hydrologic Service – Tuscany Region, Italy (SIR Toscana); the Hydrographic Service of the Autonomous Province of Bolzano, Italy; the Global Runoff Data Centre (GRDC); and the European Climate Assessment & Dataset (ECA&D) for providing hydro-meteorological data that were not yet fully open.

Edited by: F. Laio
© 2015. This work is published under the Creative Commons Attribution 3.0 License (http://creativecommons.org/licenses/by/3.0/).
Abstract
Reproducibility and repeatability of experiments are fundamental prerequisites that allow researchers to validate results and to share hydrological knowledge, experience and expertise in the light of global water management problems. Virtual laboratories offer new opportunities to meet these prerequisites, since they allow experimenters to share data, tools and pre-defined experimental procedures (i.e. protocols). Here we present the outcomes of a first collaborative numerical experiment undertaken by five different international research groups in a virtual laboratory to address the key issues of reproducibility and repeatability. Starting from the definition of accurate and detailed experimental protocols, a rainfall–runoff model was independently applied to 15 European catchments by the research groups, and the model results were collectively examined through a web-based discussion. We found that a detailed modelling protocol was crucial to ensure the comparability and reproducibility of the proposed experiment across groups. Our results suggest that sharing comprehensive and precise protocols, and running the experiments within a controlled environment (e.g. a virtual laboratory), is as fundamental as sharing data and tools for ensuring experiment repeatability and reproducibility across the broad scientific community, and thus for advancing hydrology in a more coherent way.
Author affiliations
1 Department DICAM, University of Bologna, Bologna, Italy
2 Hydrology Research Section, Swedish Meteorological and Hydrological Institute (SMHI), Norrköping, Sweden
3 Institute of Hydraulic Engineering and Water Resources Management, Vienna University of Technology, Vienna, Austria
4 School of Geographical Sciences, University of Bristol, Bristol, UK
5 Department of Civil Engineering, University of Bristol, Bristol, UK
6 Water Resources Section, Faculty of Civil Engineering and Geosciences, Delft University of Technology, Delft, the Netherlands
7 School of Geographical Sciences, University of Bristol, Bristol, UK; Department of Civil Engineering, University of Bristol, Bristol, UK
8 Department of Civil Engineering, University of Bristol, Bristol, UK; Cabot Institute, University of Bristol, Bristol, UK