Introduction
CMIP6, the latest phase of the Coupled Model Intercomparison
Project (CMIP), can trace its genealogy back to the “Charney report”.
This seminal report on the links between carbon dioxide
and climate was an authoritative summary of the state of the
science at the time and produced findings that have stood the test of time.
Beyond its enduring findings on climate sensitivity, the Charney report also
gave rise to a methodology for the treatment of uncertainties and gaps in
understanding, which has been equally influential, and is in fact the basis
of CMIP itself. The report can be seen as one of the first uses of the
“multi-model ensemble”. At the time, there were two models available
representing the equilibrium response of the climate system to a change in
forcing, one from Syukuro Manabe's group at NOAA's Geophysical
Fluid Dynamics Laboratory (NOAA-GFDL) and the other from James Hansen's group
at NASA's Goddard Institute for Space Studies (NASA-GISS). Then as now, these
groups marshalled vast state-of-the-art computing and data resources to run
very challenging simulations of the Earth system. The report's results were
based on an ensemble of three runs from the Manabe group together with results from the Hansen group.
By the first phase of CMIP, the experimental protocols had been standardized: the Atmospheric Model Intercomparison Project (AMIP) experiment, a preindustrial control, a 1 % per year CO2 increase to doubling, and so on. The future “scenario” experiments had emerged as well, for a total of five different experimental protocols. Fast-forwarding to today, CMIP6 expects more than 100 models to participate in a far larger suite of experiments (287, as described below). Alongside the experiments themselves is the “data request”, which specifies the variables to be saved from each experiment.
The simulation output is now a primary scientific resource for researchers the world over, rivaling the volume of observed weather and climate data from the global array of sensors and satellites . Climate science and observed and simulated climate data have now become primary elements in the “vast machine” serving the global climate and weather research enterprise.
Managing and sharing this huge amount of data is an enterprise in its own
right – and the solution established for CMIP5 was the global Earth System
Grid Federation (ESGF).
Sites participating in the Earth System Grid Federation in May 2017. Figure courtesy of the IS-ENES data portal.
[Figure omitted. See PDF]
The sheer size and complexity of this infrastructure emerged as a matter of great concern at the end of CMIP5, when the growth in data volume relative to CMIP3 (from 40 TB to 2 PB, a 50-fold increase in 6 years) suggested the community was on an unsustainable path. These concerns led to the 2014 recommendation of the Working Group on Coupled Modelling (WGCM) to form an infrastructure panel, based upon a proposal to that effect: the WGCM Infrastructure Panel (WIP).
This paper provides a summary of the findings by the WIP in the first 3 years of activity since its formation in 2014, and the consequent recommendations – in the context of existing organizational and funding constraints. In the text below, we refer to “findings”, “requirements”, and “recommendations”. Findings refer to observations about the state of affairs: technologies, resource constraints, and the like, based upon our analysis. Requirements are design goals that have been shared with those building the infrastructure, such as the ESGF software and security stack. Recommendations are our guidance to the community: experiment designers, modelling centres, and the users of climate data.
The intended audience for the paper is primarily the CMIP6 scientific community. In particular, we aim to show how the scientific design of CMIP6 translates into infrastructural requirements. We hope this will be instructive to the MIP chairs and creators of multi-model experiments, by highlighting the resource implications of their experimental design, and to data providers (modelling centres), by explaining the sometimes opaque requirements imposed upon them as a requisite for participation. By describing how the design of this infrastructure is severely constrained by resources, we hope to provide a useful perspective to those who find data acquisition and analysis a technical challenge. Finally, we hope this will be of interest to general readers of the journal from other geoscience fields, illuminating the particular character of a global data infrastructure for climate data, where the community of users far outstrips, in numbers and diversity, the Earth system modelling community itself.
In Sect. , the principles and scientific rationale underlying the requirements for global data infrastructure are articulated. In Sect. the CMIP6 data request is covered: standards and conventions, requirements for modelling centres to process a complex data request, and projections of data volume. In Sect. , the recent evolution in how data are archived is reviewed alongside a licensing strategy consistent with current practice and scientific principle. In Sect. issues surrounding data as a citable resource are discussed, including the technical infrastructure for the creation of citable data, and the documentation and other standards required to make data a first-class scientific entity. In Sect. the implications of data replicas are considered, and in Sect. issues surrounding data versioning, retraction, and errata are addressed. Section provides an outlook for the future of global data infrastructure, looking beyond CMIP6 towards a unified view of the “vast machine” for weather and climate data and computation.
Principles and constraints
This section lays out some of the principles and constraints which have resulted from the evolution of infrastructure requirements since the first CMIP experiment – beginning with a historical context.
Historical context
In the pioneering days of CMIP, the community of participants was small and well-knit, and all the issues involved in generating datasets for common analysis from different modelling groups were settled by mutual agreement (Ron Stouffer, personal communication, 2016). Analysis was performed by the same community that performed the simulations. The Program for Climate Model Diagnosis and Intercomparison (PCMDI), established at Lawrence Livermore National Laboratory (USA) in 1989, had championed the idea of a more systematic analysis of models, and in close cooperation with the climate modelling centres, PCMDI assumed responsibility for much of the day-to-day coordination of CMIP. Until CMIP3, the hosting of datasets from different modelling groups could be managed at a single archiving site; PCMDI alone hosted the entire 40 TB archive.
From its earliest phases, CMIP grew in importance, and its results have provided a major pillar supporting the periodic Intergovernmental Panel on Climate Change (IPCC) assessment activities. However, the explosive growth in the scope of CMIP, especially between CMIP3 and CMIP5, represented a tipping point for the supporting infrastructure. Not only was it clear that no one site could manage all the data, but also that the necessary infrastructure software and operational principles could no longer be delivered and managed by PCMDI alone.
For CMIP5, PCMDI sought help from a number of partners under the auspices of the Global Organisation of Earth System Science Portals (GO-ESSP). Many of the GO-ESSP partners who became the foundation members and developers of the Earth System Grid Federation re-targeted existing research funding to help develop ESGF. The primary heritage derived from the original US Earth System Grid project funded by the US Department of Energy, but increasingly major contributions came from new international partners. This meant that many aspects of the ESGF system began from work which was designed in the context of different requirements, collaborations, and objectives. At the beginning, none of the partners had funds for operational support for the fledgling international federation, and even after the end of CMIP5 proper (circa 2014), the ongoing ESGF has been sustained primarily by small amounts of funding at a handful of the primary ESGF sites. Most ESGF sites have had little or no formal operational support. Many of the known limitations of the CMIP5 ESGF – both in terms of functionality and performance – were a direct consequence of this heritage.
With the advent of CMIP6 (in addition to some sister projects such as obs4MIPs, input4MIPs, and CREATE-IP), it was clear that a fundamental reassessment would be needed to address the evolving scientific and operational requirements. That clarity led to the establishment of the WIP, but it has yet to lead to any formal joint funding arrangement – the ESGF and the data nodes within it remain funded (if at all; many data nodes are marginal activities supported on a best-efforts basis) by national agencies with disparate timescales and objectives. Several critical software elements are also being developed through volunteer effort and on shoestring budgets. This finding has been noted in the US National Academies Report on “A National Strategy for Advancing Climate Modeling”, which warned of the consequences of inadequate infrastructure funding.
Infrastructural principles
- 1.
With greater complexity and a globally distributed data resource, it has become clear that in the design of globally coordinated scientific experiments, the global computational and data infrastructure needs to be formally examined as an integrated element.
The membership of the WIP, drawn as it is from experts in various aspects of the infrastructure, is a direct consequence of this requirement for integration. Representatives of modelling centres, infrastructure developers, and stakeholders in the scientific design of CMIP and its output comprise the panel membership. One of the WIP's first acts was to consider three phases in the process of infrastructure development: requirements, implementation, and operations, all informed by the builders of workflows at the modelling centres.
-
The WIP, in concert with the WCRP's CMIP Panel, takes responsibility for articulating the requirements for the infrastructure.
-
The implementation is in the hands of the infrastructure developers, principally ESGF for the federated archive, but also related projects like Earth System Documentation (ES-DOC, https://www.earthsystemcog.org/projects/es-doc-models/, last access: 17 August 2018).
-
In 2016 at the WIP's request, the “CMIP6 Data Node Operations Team” (CDNOT) was formed. It is charged with ensuring that all the infrastructure elements needed by CMIP6 are properly deployed and actually working as intended at the sites hosting CMIP6 data. It is also responsible for the operational aspects of the federation itself, including specifying what versions of the toolchain are run at every site at any given time, and organizing coordinated version and security upgrades across the federation.
Although there is now a clear separation of concerns into requirements, implementation, and operations, close links are maintained by cross-membership between the key bodies, including the WIP itself, the CMIP Panel, the ESGF Executive Committee, and the CDNOT.
- 2.
With the basic fact of anthropogenic climate change now well established, the communities relying on CMIP data extend far beyond the Earth system modelling community itself, in both numbers and diversity. Accordingly, we note the requirement that infrastructure should ensure maximum transparency and usability for user (consumer) communities at some distance from the modelling (producer) communities.
- 3.
While CMIP and the IPCC are formally independent, the CMIP archive is increasingly a reference in formulating climate policy. Hence the scientific reproducibility and the underlying durability and provenance of data have now become matters of central importance: the ability, long after the creation of the dataset, to trace back from model output to the configuration of models and the procedures and choices made along the way. This led the IPCC to require data distribution centres (DDCs) that attempt to guarantee the archiving and dissemination of these data in perpetuity, and subsequently to a requirement in the CMIP context of achieving reproducibility. Given the use of multi-model ensembles for both consensus estimates and uncertainty bounds on climate projections, it is important to document – as precisely as possible, given the independent genealogy and structure of many models – the details and differences among model configurations and analysis methods, to deliver both the requisite provenance and the routes to reproduction.
- 4.
With the expectation that CMIP DECK experiment results should be routinely contributed to CMIP, opportunities now exist for engaging in a more systematic and routine evaluation of Earth system models (ESMs). This has led to community efforts to develop standard metrics of model “quality”. Typical multi-model analysis has hitherto taken the multi-model average, assigning equal weight to each model, as the most likely estimate of climate response. This “model democracy” has been called into question and there is now a considerable literature exploring the potential of weighting models by quality. The development of standard metrics would aid this kind of research.
To that end, there is now a requirement to enable (through the ESGF) a framework for accommodating quasi-operational evaluation tools that could routinely execute a series of standardized evaluation tasks. This would provide data consumers with an increasingly (over time) systematic characterization of models. It may be some time before a fully operational system of this kind can be implemented, but planning must start now.
In addition, there is an increased interest in climate analytics as a service. This follows the principle of placing analysis close to the data. Some centres plan to add resources that combine archiving and analysis capabilities, for example, NCAR's CMIP Analysis Platform.
- 5.
As the experimental design of CMIP has grown in complexity, costs both in time and money have become a matter of great concern, particularly for those designing, carrying out, and storing simulations. In order to justify the commitment of resources to CMIP, mechanisms to identify costs and benefits in developing new models, performing CMIP simulations, and disseminating the model output need to be developed.
To quantify the scientific impact of CMIP, measures are needed to track the use of model output and its value to consumers. In addition to usage quantification, credit and the tracing of data usage in the literature via citation of data are important. Current practice is, at best, to cite large data collections provided by a CMIP participant, or all of CMIP. Accordingly, we note the need for a mechanism to identify and cite data provided by each modelling centre. Alongside the intellectual contribution to model development, which can be recognized by citation, there is a material cost to centres in computing and data processing, which is both burdensome and poorly understood by those requesting, designing, and using the results from CMIP experiments, who might not be in the business of model development. The criteria for endorsement introduced in CMIP6 (see Table 1 of the CMIP6 description of experimental design) begin to grapple with this issue, but the costs still need to be measured and recorded. To begin documenting these costs for CMIP6, the “Computational Performance” MIP project (CPMIP) has been established, which will measure, among other things, throughput (simulated years per day) and cost (core-hours and joules per simulated year) as a function of model resolution and complexity. New tools for estimating data volumes have also been developed; see Sect. below.
- 6.
Experimental specifications have become ever more complex, making it difficult to verify that experiment configurations conform to those specifications. Several modelling centres have encountered this problem in preparing for CMIP6, noting, for example, the challenging intricacies of dealing with input forcing data.
Therefore, we note a requirement to encode the protocols so that they can be directly ingested by workflows, in other words, “machine-readable experiment design”. The intent is to avoid, as far as possible, errors in conformance to design requirements introduced by the need for humans to transcribe and implement the protocols, for instance, deciding what variables to save from which experiments. This is accomplished by encoding most of the specifications in standard, structured, and machine-readable text formats (XML and JSON) which can be directly read by the scripts running the model and post-processing, as explained further below in Sect. . The requirement spans all of the “controlled vocabularies” maintained for CMIP6 (CMIP6_CVs).
- 7.
The transition from a unitary archive at PCMDI in CMIP3 to a globally federated archive in CMIP5 led to many changes in the way users interact with the archive, which impacts management of information about users and complicates communications with them. In particular, a growing number of data users no longer registered or interacted directly with the ESGF. Rather, they relied on secondary repositories, often copies of some portion of the ESGF archive created by others at a particular time (see, for instance, the IPCC CMIP5 Data Factsheet).
This key finding implies a more distributed design for several features outlined below, which devolve many of these features to the datasets themselves rather than the archives. One may think of this as a “dataset-centric rather than system-centric” design (in software terms, a “pull” rather than “push” design): information is made available upon request at the user/dataset level, relieving the ESGF implementation of an impossible burden.
Based upon the above considerations, the WIP produced a set of position papers (see Appendix ) encapsulating specifications and recommendations for CMIP6 and beyond. These papers, summarized below, are available from the WIP website.
A structured approach to data production
The CMIP6 data framework has evolved considerably from CMIP5, and follows the principles of scientific reproducibility (Item 3 in Sect. 2.2) and the recognition that the complexity of the experimental design (Item 6) required far greater degrees of automation within the production workflow generating simulation results. As a starting point, all elements in the experiment specifications must be recorded in structured text formats (XML and JSON, for example), and any changes must be tracked through careful version control. “Machine-readable” specification of all aspects of the model output configuration is a design goal, as noted earlier.
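To make the idea concrete, the sketch below shows how a production workflow might consult one such machine-readable specification: it loads a controlled vocabulary from a local JSON file laid out in the style of the CMIP6_CVs (a top-level key naming the vocabulary and mapping valid values to their descriptions) and rejects an unregistered experiment identifier. The file name, the assumed layout, and the error handling are illustrative assumptions rather than part of the CMIP6 specifications.

```python
import json

def load_cv(path, cv_name):
    """Load one controlled vocabulary from a CMIP6_CVs-style JSON file.

    Assumes the file's top-level key is the vocabulary name, e.g.
    {"experiment_id": {"historical": {...}, "piControl": {...}, ...}}.
    """
    with open(path) as f:
        return json.load(f)[cv_name]

def check_experiment(experiment_id, cv):
    """Fail loudly if the workflow is configured with an unknown experiment."""
    if experiment_id not in cv:
        raise ValueError(f"'{experiment_id}' is not a registered experiment_id")
    return cv[experiment_id]

if __name__ == "__main__":
    # Local copy of the controlled vocabulary (illustrative file name)
    cv = load_cv("CMIP6_experiment_id.json", "experiment_id")
    entry = check_experiment("historical", cv)
    print(entry.get("experiment", "no description available"))
```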
The data request spans several elements discussed in sub-sections below.
CMIP6 data request
The CMIP6 data request (DREQ) combines definitions of variables and their output format with specifications of the objectives they support and the experiments for which they are required. The entire request is encoded in an XML database with rigorous type constraints. Important elements of the request, such as units, cell methods (expressing the subgrid processing implicit in the variable definition), sampling frequencies, and time “slices” (subsets of the entire simulation period as defined in the experimental design) for required output, are defined using controlled vocabularies that ensure consistency of interpretation. The request is designed to enable flexibility, allowing modelling centres to make informed decisions about the variables they should submit to the CMIP6 archive from each experiment.
In order to facilitate cross-linking between the 2100 variables and the 287 experiments, the request database allows MIPs to aggregate variables and experiments into groups. This allows MIPs to designate variable groups by priority and provides for queries that return the list of variables needed from any given experiment at a specified time slice and frequency.
This formulation takes into account the complexities that arise when a particular MIP requests that variables needed for their own experiments should also be saved from a DECK experiment or from an experiment proposed by a different MIP.
The data request supports a broad range of users who are provided with a range of different access points. These include the entire codification in the form of a structured (XML) document, web pages, or spreadsheets, as well as a Python API and command-line tools, to satisfy a wide variety of usage patterns for accessing the data request information.
The data request's machine-readable database has been an extraordinary resource for the modelling centres. They can, for example, directly integrate the request specifications with their workflows to ensure that the correct set of variables are saved for each experiment they plan to run. In addition, it has given them a new-found ability to estimate the data volume associated with meeting a MIP's requirements, a feature exploited below in Sect. .
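To illustrate the kind of query this enables, the sketch below returns, for a chosen experiment, the requested variables grouped by priority. The real request is an XML database accessed through the Python API and command-line tools mentioned above; the flat in-memory structure used here is a deliberately simplified, hypothetical stand-in for that schema, not the actual DREQ layout.

```python
from collections import defaultdict

# Simplified stand-in for the data request: each entry links a variable
# group (with a priority) to the experiments that require it.
REQUEST = [
    {"group": "baseline_monthly", "priority": 1,
     "variables": ["tas", "pr", "psl"], "experiments": ["historical", "piControl"]},
    {"group": "aerosol_diagnostics", "priority": 2,
     "variables": ["od550aer"], "experiments": ["historical"]},
]

def variables_for(experiment, max_priority=1):
    """Return variables requested from `experiment` at or above a priority cut-off."""
    wanted = defaultdict(set)
    for entry in REQUEST:
        if experiment in entry["experiments"] and entry["priority"] <= max_priority:
            wanted[entry["priority"]].update(entry["variables"])
    return {p: sorted(v) for p, v in wanted.items()}

print(variables_for("historical", max_priority=2))
# {1: ['pr', 'psl', 'tas'], 2: ['od550aer']}
```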
Model inputs
Datasets used by the model for the configuration of model inputs (forcings and boundary conditions) are themselves standardized and distributed through the input4MIPs sister project.
Data reference syntax
The organization of the model output follows the data reference syntax (DRS), which specifies how controlled-vocabulary facets are combined to form file names, directory structures, and unique dataset identifiers (see the CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and CVs position paper in the Appendix).
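As a concrete illustration of the DRS idea, the sketch below assembles a directory path and file name from a dictionary of facet values. The facet ordering shown approximates the published CMIP6 template; the authoritative definition is given in the position paper cited above, and the model and institution names used here are placeholders.

```python
import os

# Approximate DRS templates; the authoritative ordering is defined in the
# CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and CVs paper.
DIR_FACETS = ["mip_era", "activity_id", "institution_id", "source_id",
              "experiment_id", "member_id", "table_id", "variable_id",
              "grid_label", "version"]
FILE_FACETS = ["variable_id", "table_id", "source_id", "experiment_id",
               "member_id", "grid_label"]

def drs_path(meta, time_range=None):
    """Build <directory>/<filename>.nc from a dict of DRS facet values."""
    directory = os.path.join(*(meta[f] for f in DIR_FACETS))
    filename = "_".join(meta[f] for f in FILE_FACETS)
    if time_range:                       # omitted for time-independent ("fixed") fields
        filename += "_" + time_range
    return os.path.join(directory, filename + ".nc")

meta = {"mip_era": "CMIP6", "activity_id": "CMIP", "institution_id": "EXAMPLE-INST",
        "source_id": "EXAMPLE-ESM", "experiment_id": "historical",
        "member_id": "r1i1p1f1", "table_id": "Amon", "variable_id": "tas",
        "grid_label": "gn", "version": "v20180801"}
print(drs_path(meta, "185001-201412"))
```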
CMIP6 data volumes
As noted, extrapolations based on CMIP3 and CMIP5 lead to some alarming trends in data volume.
A rigorous approach is needed to estimate future data volumes, rather than relying on simple extrapolation. Contributions to the increase in data volume include the systematic increase in model resolution and complexity of the experimental protocol and data request. We consider these separately:
Resolution
The median horizontal resolution of a CMIP model tends to grow with time, and is typically expected to be 100 km in CMIP6, compared with 200 km in CMIP5. Generally the temporal resolution of the model (although not the data) is doubled as well, for reasons of numerical stability. Thus, for an n-fold increase in horizontal resolution, we require an n^3-fold increase in computational capacity (n^2 in the horizontal and a further factor of n from the shorter time step). The vertical resolution grows in a more controlled fashion, at least as far as the data are concerned, as often the requested output is reported on a standard set of atmospheric levels that has not changed much over the years. Similarly, the temporal resolution of the data request does not increase at the same rate as the model time step: monthly averages remain monthly averages. Thus, an n^3-fold increase in computational capacity results, ceteris paribus, in only an n^2-fold increase in data volume; data volume (V) and computational capacity (C) are therefore related as V ∝ C^(2/3), purely from the point of view of resolution. Consequently, if centres experience an 8-fold increase in C between successive CMIP phases, we can expect a doubling of model resolution and an approximate quadrupling of the data volume (see the discussion in the CMIP6 Output Grid Guidance document).
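A back-of-the-envelope check of this scaling, under the stated assumptions (computational capacity growing as the cube of the horizontal resolution factor, archived volume only as its square):

```python
def volume_growth(capacity_factor):
    """Expected growth in data volume for a given growth in compute, V ∝ C**(2/3)."""
    resolution_factor = capacity_factor ** (1.0 / 3.0)   # n, from C ∝ n**3
    return resolution_factor ** 2                        # V ∝ n**2

for c in (2, 8, 64):
    print(f"{c:>3}x compute -> {c ** (1 / 3):.2f}x resolution, "
          f"{volume_growth(c):.1f}x data volume")
# 8x compute -> 2.00x resolution, 4.0x data volume, as noted in the text
```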
A similar approximate doubling of model resolution occurred between CMIP3 and CMIP5, but data volume increased 50-fold. What caused that extraordinary increase?
Complexity
The answer lies in the complexity of CMIP: the complexity of the data request and of the experimental protocol. The first component, the data request complexity, is related to that of the science: the number of processes being studied and the physical variables required for the study, along with the large number of satellite MIPs (23) that now comprise the CMIP6 project. In CPMIP, we have attempted a rigorous definition of this complexity, measured by the number of physical variables simulated by the model. This, we argue, grows not smoothly like resolution, but in very distinct generational step transitions, such as the one from atmosphere–ocean models to Earth system models, which involved a substantial jump in complexity with regard to the number of physical, chemical, and biological species being modelled. Many models of the CMIP5 era added atmospheric chemistry and aerosol–cloud feedbacks, sometimes with large numbers of additional species. CMIP5 also marked the first time in CMIP that ESMs were used to simulate changes in the carbon cycle.
The second component of complexity is the experimental protocol itself: the number of experiments, and the years simulated, in successive phases of CMIP. The number of experiments (and years simulated) grew from 12 in CMIP3 to about 50 in CMIP5, greatly inflating the data produced. With the new structure of CMIP6, comprising a DECK and 23 endorsed MIPs, the number of experiments has grown tremendously (from about 50 to 287). We propose, as a measure of experimental complexity, the total number of simulated years (SYs) called for by the experimental protocol. Note that modelling centres must make trade-offs between experimental complexity and resolution in deciding their level of participation in CMIP6.
Two further steps have been proposed toward ensuring sustainable growth in data volumes. The first of these is the consideration of standard horizontal resolutions for saving data, as is already done for vertical and temporal resolution in the data request. Cross-model analyses already cast all data onto a common grid in order to evaluate the models as an ensemble, typically at fairly low resolution. The studies of Knutti and colleagues, for example, are typically performed on relatively coarse grids. Accordingly, for most purposes, atmospheric data on the ERA-40 grid would suffice, with obvious exceptions for experiments like those called for by HighResMIP. A similar conclusion applies for ocean data (the World Ocean Atlas grid); the benefits and losses due to regridding have been discussed extensively elsewhere.
This has not been mandated for CMIP6 for a number of reasons. Firstly, regridding is burdensome on many grounds: it requires considerable expertise to choose appropriate algorithms for particular variables (for instance, algorithms may need to guarantee exact conservation for scalars or to preserve streamlines for vector fields), and it can be expensive in terms of computation and storage. Secondly, regridding is irreversible (it amounts to “lossy” data reduction) and non-commutative with certain basic arithmetic operations such as multiplication (i.e. the product of regridded variables does not in general equal the regridded product computed on the native grid), which can be problematic for budget studies. However, the same issues apply to time-averaging and other operations long used in the field: much analysis of CMIP output is performed on monthly averaged data, which is “lossy” compression along the time axis relative to the model's time resolution.
These issues have contributed to a lack of consensus on the way forward, and the recommendations on regridding remain in flux. The CMIP6 Output Grid Guidance document provides the current guidance, but modelling centres retain discretion over whether to report output on native or standard grids.
There is a similar lack of consensus around whether or not to adopt a common calendar for particular experiments. In cases such as a long-running control simulation, where all years are equivalent and of no historical significance, it is customary in this community to use simplified calendars – such as a Julian, a “no-leap” (365-day), or an “equal-month” (360-day) calendar – rather than the Gregorian calendar. Comparison across datasets using different calendars can, however, be a frustrating burden on the end-user. Nevertheless, there is no consensus at this point to impose a particular calendar.
As outlined below in Sect. , both ESGF data nodes and the creators of secondary repositories are given considerable leeway in choosing data subsets for replication, based on their own interests. The tracking mechanisms outlined in Sect. below will allow us to ascertain, after the fact, how widely used the native-grid data may be vis-à-vis the regridded subset, and allow us to recalibrate the replicas as usage data become available. We also note that the providers of at least one of the standard metrics packages already regrid all model output to a common grid before analysis.
A second method of data reduction for the purposes of storage and transmission is data compression. The netCDF4 software, which is used in writing CMIP6 data, includes an option for lossless compression, or deflation, that relies on the same technique used in standard tools such as gzip. In practice, the reduction in data volume will depend upon the “entropy”, or randomness, in the data, with smoother fields or fields with many missing data points (e.g. over land or ocean) being compressed more.
Dealing with compressed data entails computational costs, not only during its creation but also every time the data are reinflated. There is also a subtle interplay with precision: for instance, temperatures usually seen in climate models appear to deflate better when expressed in Kelvin rather than Celsius, but that is because the leading-order bits are then always the same; the data are, in effect, less precise. Deflation is also enhanced by reorganizing (“shuffling”) the data internally into chunks that have spatial and temporal coherence.
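A minimal sketch of these options using the netCDF4-python bindings is shown below; the variable, grid dimensions, chunking, and compression level are illustrative choices, and actual CMIP6 output would normally be written through CMOR rather than directly.

```python
import numpy as np
from netCDF4 import Dataset

with Dataset("tas_example.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("time", None)
    nc.createDimension("lat", 180)
    nc.createDimension("lon", 360)
    # zlib=True enables lossless deflation; shuffle=True reorders bytes so that
    # spatially coherent fields compress better; chunking keeps each time level
    # in its own compressed block.
    tas = nc.createVariable("tas", "f4", ("time", "lat", "lon"),
                            zlib=True, complevel=4, shuffle=True,
                            chunksizes=(1, 180, 360))
    tas.units = "K"
    tas[0, :, :] = 288.0 + np.random.randn(180, 360).astype("f4")
```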
Some argue for the use of more aggressive lossy compression methods , but for CMIP6 it can be argued that the resulting loss of precision and the consequences for scientific results require considerably more evaluation by the community before such methods can be accepted. However, as noted above, some lossy methods of data reduction (e.g. time-averaging) have long been common practice.
To help inform the discussion about compression, we undertook a systematic study of typical model output files under lossless compression, the results of which are publicly available (see the code and data availability statements below).
The DREQ software provides tools for estimating the data volumes implied by the experimental protocols, the data request, and a given model resolution. For instance, analyses available at the DREQ site suggest a total CMIP6 data volume of roughly 18 PB if output is reported on native model grids.
Prior to CMIP5, similar analyses were undertaken at PCMDI to estimate data volume, and the predicted volume proved reasonably accurate. However, the methods used for CMIP5 could not be applied to CMIP6 because they depended on a much less complex data request. In particular, the cross-MIP data requests (variables requested by one MIP from another MIP, or from the DECK) require a more sophisticated algorithm. The experience in many modelling centres at present is that data volume estimates become available only after the production runs have begun. Reliable estimates made ahead of time, based on nothing more than the experimental protocols and model resolutions, are valuable for preparation and for planning hardware acquisitions.
It should be noted that reporting output on a lower resolution standard grid (rather than the native model grid) could shrink the estimated data volume 10-fold, to 1.8 PB. This is an important number, as will be seen below in Sect. : the managers of Tier 1 nodes (the largest nodes in the federation) have indicated that 2 PB is about the practical limit for replicated storage of data from all CMIP6 models. This target could be achieved by requiring compression and the use of reduced-resolution standard grids, but modelling centres are free to choose whether or not to compress and regrid.
Licensing
The licensing policy established for CMIP6 is based on an examination of data usage patterns in CMIP5. First, while the CMIP5 licensing policy called for registration and acceptance of the terms of use, a large fraction, perhaps a majority, of users actually obtained their data not directly from ESGF but from third-party copies, such as the “snapshots” alluded to in Item 7, Sect. . Those users accessing the data indirectly, as shown in Fig. , relied on user groups or their home institutions to make secondary repositories that could be more conveniently accessed. The WIP CMIP6 Licensing and Access Control position paper (see Appendix) takes this usage pattern as its starting point.
At the same time we wish to retain the ability for users of these “dark” repositories to benefit from the augmented provenance services provided by infrastructure advances, where a user can inform themselves or be notified of data retractions or replacements when contributed datasets are found to be erroneous and replaced (see Sect. and ).
Typical data access pattern in CMIP5 involved users making local copies, and user groups making institutional-scale caches from ESGF. Figure courtesy of Stephan Kindermann, DKRZ, adapted from WIP Licensing White Paper.
[Figure omitted. See PDF]
The proposed licensing policy removes the impossible task of license enforcement from the distribution system, and embraces the “dark” repositories and users. To quote the WIP position paper:
The proposal is that (1) a data license be embedded in the data files, making it impossible for users to avoid having a copy of the license, and (2) the onus on defending the provisions of the license be on the original modeling center …
Licenses will be embedded in all CMIP6 files, and all repositories, whether sanctioned or “dark”, can be data sources, as seen below in the discussion of replication (Sect. ). In the embedded-license approach, modelling centres are offered two choices of Creative Commons licenses: data may be covered either by the Creative Commons Attribution “ShareAlike” 4.0 International License (CC BY-SA 4.0) or by a variant of that license that additionally restricts commercial use (CC BY-NC-SA 4.0).
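Because the license travels inside every file, any downstream user, including one working from a “dark” repository, can recover the terms of use directly from the data. A minimal sketch using the netCDF4-python bindings, assuming a CMIP6-style file whose terms are stored in a global attribute named license (alongside other provenance attributes):

```python
from netCDF4 import Dataset

def read_terms(path):
    """Print the license and other provenance attributes embedded in a CMIP6-style file."""
    with Dataset(path) as nc:
        attrs = {name: getattr(nc, name) for name in nc.ncattrs()}
    for key in ("license", "institution_id", "source_id", "tracking_id"):
        print(f"{key}: {attrs.get(key, '<attribute not present>')}")

# Illustrative file name following the DRS sketch shown earlier
read_terms("tas_Amon_EXAMPLE-ESM_historical_r1i1p1f1_gn_185001-201412.nc")
```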
Citation, provenance, quality assurance, and documentation
As noted in Sect. , citation requirements flow from two underlying considerations: one, to provide proper credit and formal acknowledgment of the authors of datasets; and the other, to enable rigorous tracking of data provenance and data usage. The tracking facilitates scientific reproducibility and traceability, as well as enabling statistical analyses of dataset utility.
In addition to clearly identifying what data have been used in research studies and who deserves credit for providing that data, it is essential that the data be examined for quality and that documentation be made available describing the model and experiment conditions under which it was generated. These subjects are addressed in the four position papers summarized in this section.
The principles outlined above are well-aligned with the Joint Declaration of Data Citation Principles
Given the complexity of the CMIP6 data request, the total dataset count is expected to be extremely large. Because dozens of datasets are typically used in a single scientific study, it is impractical to cite each dataset individually in the same way as individual research publications are acknowledged. Based on this consideration, there needs to be a mechanism to cite data and give credit to data providers that relies on a rather coarse granularity, while at the same time offering another option at a much finer granularity for recording the specific files and datasets used in a study.
In the following, two distinct types of persistent identifiers (PIDs) are discussed: DOIs, which can only be assigned to data that comply with certain standards for citation metadata and curation, and the more generic “Handles”, which can be assigned without such constraints.
Persistent identifiers for acknowledgment and citation
Experience from earlier phases of CMIP shows that some datasets contributed to the CMIP6 archive will be flawed (due, for example, to errors in processing) and therefore will not accurately represent a model's behaviour. When errors are uncovered in the datasets, they may be replaced with corrected versions. Similarly, additional datasets may be added to an initially incomplete collection of datasets. Thus, initially at least, the DOIs assigned for the purposes of citation and acknowledgement will represent an evolving underlying collection of datasets.
The recommendations are detailed in the CMIP6 Data Citation and Long Term Archival position paper (see Appendix).
For evolving dataset aggregations, the data citation infrastructure relies on information collected from the data providers and uses the DataCite infrastructure to register DOIs at two levels of granularity:
-
aggregations that include all the datasets contributed by one model from one institution from all of a single MIP's experiments, and
-
smaller-size aggregations that include all datasets contributed by one model from one institution generated in performing one experiment (which might include one or more simulations).
These aggregations are dynamic as far as the PID infrastructure is concerned: new elements can be added to the aggregation without modifying the PID. As an example, for the coarser of the two aggregations defined above, the same PID will apply to an evolving number of simulations as new experiments are performed with the model. This PID architecture is shown in Fig. . Since these collections are dynamic, citation requires authors to provide a version reference.
Schematic PID architecture, showing layers in the PID hierarchy. In the lower layers of the hierarchy, PIDs are static once generated, and new datasets generate new versions with new PIDs. Each file carries a PID and each collection (dataset, simulation, and so on) is related to a PID. Resolving the PID in the Handle server guides the user to the file or the landing page describing the collection. Each box in the figure will be uniquely addressed by its PID.
[Figure omitted. See PDF]
As an initial dataset matures and becomes stable, it is assigned a new DOI. Before this is done, to meet formal requirements, the data citation infrastructure requires some additional steps. First, we ensure that there has been sufficient community examination of the data (through citations in published literature, for instance) to qualify it as having been peer-reviewed. Second, further steps are undertaken to assure important information exists in ancillary metadata repositories, including, for example, documentation (ES-DOC, errata and citation) and to provide quality assurance of data and metadata consistency and completeness (see Sect. ). Once these criteria have been satisfied, a DOI will be issued by the IPCC DDC hosted by DKRZ. These dataset collections will meet the stringent metadata and documentation requirements of the IPCC DDC. Since these collections are static, no version reference is required in a citation. Should errors be subsequently found, they will be corrected in the data and published under a new DOI. The original DOI and its related data are still available but are labelled as superseded with a link recorded pointing to the corrected data.
For CMIP6, the initially assigned DOIs (associated with evolving collections of data) must be used in research papers to properly give credit to each of the modelling groups providing the data. Once a stable collection of datasets has met the higher standards for long-term curation and quality, the DOI assigned by the IPCC DDC should be used instead. The data citation approach is described in greater detail in .
Persistent identifiers for tracking, provenance, and curation
Although the DOIs assigned to relatively large aggregations of datasets are well suited for citation and acknowledgment purposes, they are not issued at fine enough granularity to meet the scientific imperative that published results should be traceable and verifiable. Furthermore, management of the CMIP6 archive requires that PIDs be assigned at a much finer granularity than the DOIs. For these purposes, PIDs recognized by the Global Handle Registry will be assigned at two different levels of granularity: one per file and one per dataset.
A unique Handle will be generated each time a new CMIP6 data file is created, and the Handle will be recorded in the file's metadata (in the form of a netCDF global attribute, the tracking_id).
As described in the CMIP6 Persistent Identifiers Implementation Plan (see Appendix), the PID services are integrated into the ESGF publication workflow.
PID workflow, showing the generation and registry of PIDs, with checkpoints where compliance is assured.
[Figure omitted. See PDF]
The implementation plan describes methods for generating and registering Handles using an asynchronous messaging system known as RabbitMQ. This system, designed in collaboration with ESGF developers and shown in Fig. , guarantees, for example, that PIDs are correctly generated in accordance with the versioning guidelines. The CMIP6 Handle system builds on the idea of tracking-ids used in CMIP5, but with a more rigorous quality control to ensure that new PIDs are generated when data are modified. The dataset and file Handles are also associated with basic metadata, called PID kernel information , which facilitate the recording of basic provenance information. Datasets and files point to each other to bind the granularities together. In addition, dataset kernel information refers to previous and later versions, errata information, and replicas, as explained in more detail in the position paper.
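As an illustration of the identifiers involved, the sketch below mints a file-level Handle of the general form used for CMIP6 tracking identifiers (a registered prefix followed by a version-4 UUID) and a toy record carrying the kernel-information linkage fields described above. The prefix value and the record's field names are assumptions made for illustration; the authoritative format and registration workflow are those defined in the Persistent Identifiers Implementation Plan.

```python
import uuid

HANDLE_PREFIX = "21.14100"   # assumed CMIP6 handle prefix, for illustration only

def mint_tracking_id():
    """Create a file-level PID of the form 'hdl:<prefix>/<uuid4>'."""
    return f"hdl:{HANDLE_PREFIX}/{uuid.uuid4()}"

def kernel_info(dataset_pid, file_pids, previous_version=None):
    """Toy stand-in for PID kernel information binding files, dataset, and versions."""
    return {
        "dataset_pid": dataset_pid,
        "file_pids": list(file_pids),
        "previous_version": previous_version,   # two-way version links (see versioning section)
        "errata_ids": [],                       # populated if issues are later reported
        "replicas": [],                         # replica locations, if any
    }

file_pid = mint_tracking_id()
dataset_pid = mint_tracking_id()
print(file_pid)
print(kernel_info(dataset_pid, [file_pid]))
```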
Quality assurance
Quality assurance (QA) encompasses the entire data life cycle, as depicted in Fig. . At all stages, a goal is to capture provenance information that will enable scientific reproducibility. Further, as noted in Item 2 in Sect. 2.2, the QA procedures should uncover issues that might undermine trust in the data by those outside the Earth system modelling community if errors were left unreported.
Schematic of the phases of quality assurance, displaying earlier stages in the hands of modelling centres (left), and more formal long-term data curation stages (right). Quality assurance is applied both to the data (D, above) as well as the metadata (M) describing the data. Figure drawn from the WIP's Quality Assurance position paper.
[Figure omitted. See PDF]
QA must ensure that the data and metadata correctly reflect a model's simulation, so that they can be reliably used for scientific purposes. As depicted in Fig. , the first stage of QA is the responsibility of the data producer: in fact, the cycle of model development and diagnosis is the most critical element of QA. The second aspect is ensuring that disseminated data include common metadata based on common CVs, which will enable consistent treatment of data from different groups and institutions. These requirements are directly embedded in the ESGF publishing process and in tools such as CMOR (the Climate Model Output Rewriter), which is widely used to write CMIP-compliant output.
At this point, as noted in Fig. , control is ceded to the ESGF system, where designated QA nodes (ESGF data nodes where additional services are turned on) perform further QA checks to certify data is suitable for citation and long-term archiving. A critical step is the assignment of PIDs (Sect. , the D2 stage of Fig. ), which is more controlled than in CMIP5 and guarantees that across the data life cycle, the PIDs will be reliably useful as unique labels of datasets.
Beyond this, further stages of QA will be handled within the ESGF system following procedures outlined in the CMIP6 Quality Assurance position paper (see Appendix).
Documentation of provenance
As noted earlier in Sect. , for data to become a first-class scientific resource, the methods of their production must be documented to the fullest extent possible. For CMIP6, this includes documenting both the models and the experiments. While traditionally this is done through peer-reviewed literature, which remains essential, we note that to facilitate various aspects of search, discovery, and tracking of datasets, there is an additional need for structured documentation in machine readable form.
Elements of ES-DOC documentation. Rows indicate phases of the modelling process being documented, and box colours indicate the parties responsible for producing the documentation (see legend). Figure courtesy of Guillaume Levavasseur, IPSL.
[Figure omitted. See PDF]
In CMIP6, the documentation of experiments, models, and simulations is done through the Earth System Documentation (ES-DOC) project.
A critical element in the ES-DOC process is the documentation of
conformances: steps undertaken by the modelling centres to ensure
that the simulation was conducted as called for by the experiment design. It
is here that the input datasets used in a simulation are documented.
The method of capturing the conformance documentation is a two-stage process that has been designed to minimize the amount of work required by a modelling centre. The first stage is to capture the many conformances common to all simulations. ES-DOC will then automatically copy these common conformances to multiple simulations, thereby eliminating duplicated effort. This is followed by a second stage in which those conformances that are specific to individual experiments or simulations are collected.
While this method of documentation is unfamiliar to many, such methods are likely to become common and required practice in the maturing digital age as part of best scientific practice; the documentation of software validation is a case in point.
In keeping with the “dataset-centric rather than system-centric” approach (Item 7 in Sect. 2.2), a user will be directly linked to documentation from each dataset. This is done in CMIP6 by adding a required global attribute (further_info_url) to every file, which resolves to a landing page holding the ES-DOC documentation for the corresponding simulation.
Replication
The replication strategy is covered in the CMIP6 Replication and Versioning position paper (see Appendix). The primary goal is as follows:
-
ensuring at least one copy of a dataset is present at a stable ESGF node with a mission of long-term maintenance and curation of data. The total data storage resources planned across the Tier 1 nodes in the CMIP6 era is adequate to support this requirement, although some data will likely be held on accessible tape storage rather than spinning disk.
In addition, we have articulated a number of secondary goals:
-
enhancing data accessibility across the ESGF (e.g. Australian data easily accessible to the European continent despite the long distance);
-
enabling each Tier 1 data node to enact specific policies to support their local objectives;
-
ensuring that the most widely requested data is accessible from multiple ESGF data nodes (of course, any dataset will be available at least on its original publication data node);
-
enabling large-scale data analysis across the federation (see Item 4 in Sect. );
-
ensuring continuity of data access in the event of individual node failures;
-
enabling network load-balancing and enhanced performance;
-
reducing the manual workload related to replication;
-
and building a reliable replication mechanism that can be used not only within the federation, but by the secondary repositories created by user groups (see discussion in Sect. around Fig. ).
In conjunction with the ESGF and the International Climate Network Working Group (ICNWG), these recommendations have been translated into two options for replication.
The basic toolchain for replication is built on updated versions of the software layers used in CMIP5, including the synda replication and download tool.
As one option, these layers can be used for ad hoc replication by sites or user groups. For ad hoc replication, there is no obvious mechanism for triggering updates or replication when new or corrected data are published (or retracted, see Sect. below). As a second option, certain designated nodes (replica nodes) will maintain a protocol for automatic replication, shown in Fig. .
CMIP6 replication from data nodes to replica centres and between replica centres coordinated by a CMIP6 replication team, under the guidance of the CDNOT.
[Figure omitted. See PDF]
Given the nature of some of the secondary goals listed above, it would not be appropriate to prescribe which data should be replicated by each centre. Rather, the plan should be flexible to accommodate changing data use profiles and resource availability. A replication team under the guidance of the CDNOT will coordinate the replication activities of the CMIP6 data nodes such that the primary goal is achieved and an effective compromise for the secondary goals is established.
The ICNWG, formed under the ESGF, helps set up and optimize network infrastructures for ESGF climate data sites located around the world. For example, prioritizing the most widely requested data for replication can best be done based on operational experience and will of course change over time. To ensure that the replication strategy responds to user needs and data node capabilities, the replication team will maintain and run a set of monitoring and notification tools to ensure that replicas are up to date. The CDNOT is tasked with ensuring the deployment and smooth functioning of replica nodes.
A key issue that emerged from discussions with node managers is that the replication target has to be of sustainable size: a replication target of about 2 PB is the practical (technical and financial) limit for CMIP6 online (disk) storage at any single location. Replication beyond this may involve offline storage (tape) for disaster recovery.
Based on experience in CMIP5, it is expected that a number of “special interest” secondary repositories will hold selected subsets of CMIP6 data outside of the ESGF federation. This will have the effect of widening data accessibility geographically, and by user communities, with obvious benefit to the CMIP6 project. These secondary repositories will be encouraged and supported where it does not undermine CMIP6 data management and integrity objectives.
In the new dataset-centric approach, licenses and PIDs remain embedded and will continue to play their roles in the data toolchain even for these secondary repositories.
In CMIP5 a significant issue for users of some third-party archives was that their replicated data were taken as a one-time snapshot (see discussion above in Item 7 in Sect. ) and not updated as new versions of the data were submitted to the source ESGF node. Tools have been developed by a number of organizations to maintain locally synchronized archives of CMIP5 data, and third-party providers should be encouraged to make use of these tools to keep their local archives up to date.
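One practical way for such secondary repositories to stay trustworthy is to verify their holdings against checksums published for the originals. A minimal sketch follows, assuming the maintainer holds a manifest mapping file names to SHA-256 checksums obtained from the publishing data node; the manifest format is illustrative and not an ESGF interface.

```python
import hashlib
from pathlib import Path

def sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(manifest, data_dir):
    """Compare local files against published checksums; return the mismatches."""
    bad = []
    for name, expected in manifest.items():
        local = Path(data_dir) / name
        if not local.exists() or sha256(local) != expected:
            bad.append(name)    # missing, corrupted, or superseded by a new version
    return bad

# manifest = {"tas_Amon_..._185001-201412.nc": "ab12..."}  # from the publishing node
# print(verify(manifest, "/data/replica/CMIP6"))
```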
In summary, the requirements for replication are limited to ensuring
-
that within a reasonably short time period following submission, there is at least one instance of each submitted dataset stored at a Tier 1 node (in addition to its primary residence);
-
that subsequent versions of submitted datasets are also replicated by at least one Tier 1 node (see versioning discussion below in Sect. );
-
that creators of secondary repositories take advantage of the replication toolchain described here, to maintain replicas that can be kept up to date, and inform local users of dataset retractions and corrections;
-
that the CDNOT is the recognized body to manage the operational replication strategy for CMIP6.
We note that the ESGF PID registration service is part of the ESGF data publication implementation and not exclusive to CMIP6; it is now in use by the input4MIPs and obs4MIPs projects. The PID registration service works for all netCDF-CF files that carry a PID as a tracking identifier in their global attributes.
Thus, CMIP6 is the first implementation of the PID service in a larger data project, and ESGF provides, in parallel, the classical data access via the data reference syntax outlined in the CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and CVs position paper (see Appendix).
Versioning
The versioning strategy for CMIP6 datasets is set out in the CMIP6 Replication and Versioning position paper (see Appendix); its objectives are to ensure that every published dataset version remains identifiable and traceable, and that corrections can be issued without ambiguity.
A consistent versioning methodology across all the ESGF data nodes is required to satisfy these objectives. We note that inconsistent or informal versioning practices at individual nodes would likely be invisible to the ESGF infrastructure (e.g. yielding files that look like replicas, but with inconsistent data and checksums), which would inhibit traceability across versions.
Building on the replication strategy and on input from the ESGF implementation teams, versioning will leverage the PID infrastructure of Sect. . PIDs are permanently associated with a dataset, and new versions will get a new PID. When new versions are published, there will be two-way links created within the PID kernel information so that one may query a PID for prior or subsequent versions.
A version number will be assigned to each atomic dataset: a complete time series of one variable from one experiment and one model. The implication is that if an error is found in a single variable, other variables produced from the simulation need not be republished. If an entire experiment is retracted and republished, all variables will get a consistent version number. The CDNOT will ensure consistent versioning practices at all participating data nodes.
Errata
In particular, it is worth highlighting the new recommendations regarding errata. Until CMIP5, we relied on the ESGF system to push notifications to registered users regarding retractions and reported errors. This was found to result in imperfect coverage: as noted in Sect. , a substantial fraction of users are invisible to the ESGF system. Therefore, following the discussion in Sect. (see Item 7), we recommended a design which is dataset-centric rather than system-centric. Notifications are no longer pushed to users; rather, users can query the status of any dataset they are working with (e.g. via the ES-DOC Dataset Errata search service).
The future of the global data infrastructure
The WIP was formed in response to the explosive growth of CMIP between CMIP3 and CMIP5, and it is charged with studying and making recommendations about the global data infrastructure needed to support CMIP6 and subsequent similar WCRP activities as they are established and evolve. Our findings reflect the fact that CMIP is no longer a cottage industry, and a more formal approach is needed. Several of the findings have been translated into requirements on the design of the underlying software infrastructure for data production and distribution. We have separated infrastructure development into requirements, implementation, and operations phases, and we have provided recommendations on the most efficient use of scarce resources. The resulting recommendations stop well short of any sort of global governance of this “vast machine”, but address many areas where, with a relatively light touch, beneficial order, control, and resource efficiencies result.
One key finding that informs everything is that it appears that the
critical importance of such infrastructure is under-appreciated.
Building infrastructure using research funds puts the system in an
untenable position, with a fundamental contradiction at its heart:
infrastructure by its nature should be reliable, robust, based on what
is proven to work, and invisible, whereas scientific research is
hypothesis-driven, risky, and novel, and its results are widely broadcast.
While recommendations have been made at the highest level advocating remedies, sustained and coordinated funding for this infrastructure has yet to materialize.
The central theme of this paper is the inversion of the design of federated data distribution, to make it dataset-centric rather than system-centric. We believe that this one aspect of the design considerably reduces systemic risk, and allows the size of the system to scale up and down as resource constraints allow. Individual scientists or institutions or consortia, will be able to pool resources and share data at will, with relatively light requirements related to licensing (Sect. ) and dataset tracking (Sect. ). This relieves a considerable design burden from the ESGF software stack, and further, recognizes that the data ecosystem extends well beyond the reach of any software system and that data will be used and reused in a myriad of ways outside anyone's control.
A second key element of the design is the insistence on machine-readable experimental protocols. Standards, conventions, and vocabularies are now stored in machine-readable structured text formats like XML and JSON, thereby enabling software to automate aspects of the process. This meets an existing urgent need, with some modelling centres already exploiting this structured information to mitigate the overwhelming complexity of experimental protocols. Moreover, this will also enable and encourage unanticipated future uses of the information, as new software tools for exploiting it are developed and technologies evolve. Our ability to predict (whether correctly or not remains to be seen) the expected CMIP6 data volume is one such unexpected outcome.
Finally, the infrastructure allows user communities to assess the costs of participation as well as the benefits. For example, we believe the new PID-based methods of dataset tracking will allow centres to measure which data has value downstream. The importance of citations and fair credit for data providers is recognized with a design that facilitates and encourages proper citation practices. Tools have been added and made available that allow centres, and the CMIP itself, to estimate the data requirements of each experimental protocol. Ancillary activities such as CPMIP add to this an accounting of the computational burden of CMIP6.
Certainly not all issues are resolved, and the validation of some of our findings will have to await the outcome of CMIP6. There is no community consensus on some proposed design elements, such as standard grids. Some features long promised, such as server-side analytics (“bringing analysis to the data”), are yet to become fully mature, although many exciting efforts are underway, for instance early investigations into using cloud technologies for both data storage and analysis (see discussion above, Item 4 in Sect. ). The ESGF Compute Working Team is also working on a set of requirements and “certification” guidelines for such server-side computational services.
The future brings with it new challenges. First among these is an expansion of the data ecosystem. There is an increasing blurring of the boundary between weather and climate as time and space scales merge. This will increasingly entrain new communities into climate data ecosystems, each with their own modelling and analysis practices, standards and conventions, and other issues. The establishment of the WIP was a crucial step in enhancing the capabilities, standards, protocols, and policies around the CMIP enterprise. Earlier discussions on the scope of the WIP also suggested a broader role for the panel over the longer term, coordinating not only the model intercomparison activities (including, for example, the CORDEX project, which also relies upon ESGF for data dissemination) but also climate prediction (seasonal to decadal) activities and the corresponding observational and reanalysis aspects. We would recommend a closer engagement between these communities in planning the future of a seamless global data infrastructure, to better leverage infrastructure investments and effort.
A further challenge the WIP and the community must grapple with is the evolution of scientific publication in the digital age, beyond the peer-reviewed paper. We have noted above that the nature of publication is changing.
Future development of the WIP's activities beyond the delivery of CMIP6 will include an analysis of how the infrastructure design performed during CMIP6. That analysis, combined with our assessment of technological change and emerging novel applications, will inform the future design of infrastructure software, as well as recommendations to the designers of experiments on how best to fit their protocols within resource limitations. The vision, as always, is for an open infrastructure that is reliable and invisible, and allows Earth system scientists to be nimble in the design of collaborative experiments, creative in their analysis, and rapid in the delivery of results.
The software and data used for the study of data compression are available at the deflation study website
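For readers unfamiliar with the technique, the short sketch below shows the kind of lossless netCDF "deflation" (zlib compression) examined in that study, using the netCDF4-python library. The file name, variable, dimensions, and compression settings are illustrative only and are not taken from the study itself.

# A minimal sketch of writing a zlib-deflated ("compressed") netCDF variable.
# All names and settings are illustrative assumptions.
import numpy as np
from netCDF4 import Dataset

with Dataset("example_tas.nc", "w") as nc:
    nc.createDimension("lat", 180)
    nc.createDimension("lon", 360)
    # shuffle + zlib at a moderate deflation level is a common trade-off
    # between file size and read/write speed.
    tas = nc.createVariable("tas", "f4", ("lat", "lon"),
                            zlib=True, complevel=2, shuffle=True)
    tas.units = "K"
    tas[:] = 288.0 + 10.0 * np.random.randn(180, 360)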
The software and data used for the prediction of data volumes are available at the dreqDataVol page
Most of the software referenced here, for which the WIP provides design guidelines and requirements but not implementation (including the ESGF, ES-DOC, and DREQ software stacks), is open source and freely available. These are autonomous projects and are therefore not listed here.
List of WIP position papers
- CDNOT Terms of Reference: a charter for the CMIP6 Data Node Operations Team. Authorship: the WIP. https://www.earthsystemcog.org/site_media/projects/wip/CDNOT_Terms_of_Reference.pdf (last access: 17 August 2018)
- CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and CVs: conventions and controlled vocabularies for consistent naming of files and variables. Authorship: Karl E. Taylor, Martin Juckes, Venkatramani Balaji, Luca Cinquini, Sébastien Denvil, Paul J. Durack, Mark Elkington, Eric Guilyardi, Slava Kharin, Michael Lautenschlager, Bryan Lawrence, Denis Nadeau, and Martina Stockhause, and the WIP. https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_global_attributes_filenames_CVs_v6.2.6.pdf (last access: 17 August 2018)
- CMIP6 Persistent Identifiers Implementation Plan: a system for identifying and citing datasets used in studies, at a fine grain. Authorship: Tobias Weigel, Michael Lautenschlager, Martin Juckes, and the WIP. https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_PID_Implementation_Plan.pdf (last access: 17 August 2018)
- CMIP6 Replication and Versioning: a system for ensuring reliable and verifiable replication and for tracking dataset versions, retractions, and errata. Authorship: Stephan Kindermann, Sébastien Denvil, and the WIP. https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_Replication_and_Versioning.pdf (last access: 17 August 2018)
- CMIP6 Quality Assurance: systems for ensuring data compliance with the rules and conventions listed above. Authorship: Frank Toussaint, Martina Stockhause, Michael Lautenschlager, and the WIP. https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_Quality_Assurance.pdf (last access: 17 August 2018)
- CMIP6 Data Citation and Long Term Archival: a system for generating Digital Object Identifiers (DOIs) to ensure long-term data curation. Authorship: Martina Stockhause, Frank Toussaint, Michael Lautenschlager, Bryan Lawrence, and the WIP. https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_Data_Citation_LTA.pdf (last access: 17 August 2018)
- CMIP6 Licensing and Access Control: terms of use and licences for data. Authorship: Bryan Lawrence and the WIP. https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_Licensing_and_Access_Control.pdf (last access: 17 August 2018)
- CMIP6 ESGF Publication Requirements: links WIP specifications to the ESGF software stack; conventions that software developers can build against. Authorship: Martin Juckes and the WIP. https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_ESGF_Publication_Requirements.pdf (last access: 17 August 2018)
- Errata System for CMIP6: a system for tracking and discovery of reported errata in the CMIP6 system. Authorship: Guillaume Levavasseur, Sébastien Denvil, Atef Ben Nasser, and the WIP. https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_Errata_System.pdf (last access: 17 August 2018)
- ESDOC Documentation: an overview of the process for providing structured documentation of the models, experiments, and simulations that produce the CMIP6 output datasets. Authorship: the ES-DOC team. https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_ESDOC_documentation.pdf (last access: 17 August 2018)
All of the authors participated in the development of the paper's findings and recommendations.
The authors declare that they have no conflict of interest.
Acknowledgements
We thank Michel Rixen, Stephen Griffies, John Krasting, and three anonymous reviewers for their close reading and comments on early drafts of this paper. Colleen McHugh aided with the analysis of data volumes.
The research leading to these results received funding from the European Union Seventh Framework program under the IS-ENES2 project (grant agreement no. 312979).
Venkatramani Balaji is supported by the Cooperative Institute for Climate Science, Princeton University, award NA08OAR4320752 from the National Oceanic and Atmospheric Administration, U.S. Department of Commerce. The statements, findings, conclusions, and recommendations are those of the authors and do not necessarily reflect the views of Princeton University, the National Oceanic and Atmospheric Administration, or the U.S. Department of Commerce.
Bryan N. Lawrence acknowledges additional support from the UK Natural Environment Research Council.
Karl E. Taylor and Paul J. Durack are supported by the Regional and Global Model Analysis Program of the United States Department of Energy's Office of Science, and their work was performed under the auspices of Lawrence Livermore National Laboratory's contract DE-AC52-07NA27344.
Edited by: Steve Easterbrook
Reviewed by: three anonymous referees
© 2018. This work is published under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).
Abstract
The World Climate Research Programme (WCRP)'s Working Group on Climate Modelling (WGCM) Infrastructure Panel (WIP) was formed in 2014 in response to the explosive growth in size and complexity of Coupled Model Intercomparison Projects (CMIPs) between CMIP3 (2005–2006) and CMIP5 (2011–2012). This article presents the WIP recommendations for the global data infrastructure needed to support CMIP design, future growth, and evolution. Developed in close coordination with those who build and run the existing infrastructure (the Earth System Grid Federation; ESGF), the recommendations are based on several principles beginning with the need to separate requirements, implementation, and operations. Other important principles include the consideration of the diversity of community needs around data – a data ecosystem – the importance of provenance, the need for automation, and the obligation to measure costs and benefits.
This paper concentrates on requirements, recognizing the diversity of communities involved (modelers, analysts, software developers, and downstream users). Such requirements include the need for scientific reproducibility and accountability alongside the need to record and track data usage. One key element is the adoption of a dataset-centric rather than system-centric focus, with the aim of making the infrastructure less prone to systemic failure.
With these overarching principles and requirements, the WIP has produced a set of position papers, which are summarized in the latter pages of this document. They provide specifications for managing and delivering model output, including strategies for replication and versioning, licensing, data quality assurance, citation, long-term archiving, and dataset tracking. They also describe a new and more formal approach for specifying what data, and associated metadata, should be saved, which enables future data volumes to be estimated, particularly for well-defined projects such as CMIP6.
The paper concludes with a future-facing consideration of the evolution of the global data infrastructure that follows from the blurring of boundaries between climate and weather, and the changing nature of published scientific results in the digital age.
Author affiliations
1 Princeton University, Cooperative Institute of Climate Science, Princeton, NJ 08540, USA; NOAA/Geophysical Fluid Dynamics Laboratory, Princeton, NJ 08540, USA
2 PCMDI, Lawrence Livermore National Laboratory, Livermore, CA 94550, USA
3 Science and Technology Facilities Council, Abingdon, UK
4 National Centre for Atmospheric Science, University of Reading, Reading, UK; Science and Technology Facilities Council, Abingdon, UK
5 Deutsches KlimaRechenZentrum GmbH, Hamburg, Germany
6 Engility Corporation, NJ 08540, USA; NOAA/Geophysical Fluid Dynamics Laboratory, Princeton, NJ 08540, USA
7 Jet Propulsion Laboratory (JPL), 4800 Oak Grove Drive, Pasadena, CA 91109, USA
8 Institut Pierre Simon Laplace, CNRS/UPMC, Paris, France
9 Met Office, FitzRoy Road, Exeter, EX1 3PB, UK
10 Institut Pierre Simon Laplace, CNRS/UPMC, Paris, France; Science and Technology Facilities Council, Abingdon, UK
11 Canadian Centre for Climate Modelling and Analysis, Atmospheric Environment Service, University of Victoria, Victoria, BC, Canada