Introduction
As of today, there are several public repositories for genetic and genomic variation data. However, most of these repositories are exclusive to humans and do not include other organisms (
Cezard
When a data submission is made to the EVA, samples are automatically registered in the associated BioSamples database (
Courtot
The description of these samples must take into account plant specificities to fully enable findability and interoperability with reference databases such as EURISCO (
Weise
One hurdle in enforcing a uniform format specification for variation data in plant science is that some databases offer their own plant-specific variation data. These databases usually belong to one of two classes: project-specific or aggregation databases (which are often organism-specific). The former do not change after the project lifetime, while the latter summarise the results of several studies in a uniform way. Probably the best known project-specific database is the GMI-MPI of the 1001 Genomes Consortium (
Alonso-Blanco
Another useful resource for the analysis of plant variation data is Ensembl Plants (
Howe
Lessons learned from studies on plant phenotyping and its application to metadata information in genotyping
The standardisation of plant variation data is still in its infancy. Therefore, it is beneficial to look to other data types for guidance and improvement. One particular data type where a lot of standardisation work has been done in recent years is plant phenotyping. Plant phenotyping has developed rapidly with the introduction of high-throughput technologies such as fully automated greenhouses, full-time sensor recording and aerial observation drones. The need to record data points and the method of observation has led the community to implement a standard for describing such experiments: MIAPPE (
Papoutsoglou
In contrast, genotyping data is often published and shared without sufficient metadata to ensure interoperability and reuse, as seen with other data formats (
Bernstein
Data and metadata formatting
The
Figure 1.
Example Variant Call Format (VCF) file structure, including meta-information lines and data lines (from https://samtools.github.io/hts-specs/VCFv4.3.pdf).
A critical aspect of VCF specifications is that sample naming within the VCF file does not follow any standard specifications, i.e. users can name their samples without reference to any real biological material. Even worse, phenotyping and genotyping data from the same experimental setup often use different sample identifiers even when the same biological material has been used, which makes it difficult to reconstruct later which datasets were derived from a common sample. To be able to represent such relationships, descriptive metadata is required that relates these different sample identifiers to each other.
In response to the points discussed previously, we propose a minimal list of metadata fields and recommend an identifier schema and guidelines for vocabulary and data format within a VCF file. Our suggestions are divided into recommended and optional changes. Although we are primarily addressing data submissions to the EMBL-EBI repositories BioSamples and EVA (and implicitly ENA through the submission of sequence information), the formatting guidelines that follow should be applied regardless of the specific deposition repository and should also be considered when designing databases and APIs.
In our view, these additional fields should be required for a valid VCF:
One meta-information line, ##fileformat, is obligatory in VCF. We also recommend using the additional lines ##fileDate, ##bioinformatics_source, ##reference_ac, ##reference_url, ##contig and ##SAMPLE. To ensure permanent, unique and stable IDs for samples and genotypes, we recommend registering the genotypes and samples used in the BioSamples database. This enables the publishing of biological material used in variation studies, and we explicitly recommend the use of long-term stable BioSamples identifiers as primary IDs for material description in VCF files (Table 1).
Table 1.
Summary of recommendations for metadata formatting.
Metadata field | Definition | Format | Example | Cardinality |
---|---|---|---|---|
##fileDate | Creation date of the VCF file | Date (ISO 8601, YYYYMMDD) | ##fileDate=20120921 | 1 |
##bioinformatics_source | Chains of bioinformatics tools for creating the VCF file | URL, DOI | ##bioinformatics_source=“doi.org/10.1038/s41588-018-0266-x” | 1 |
##reference_ac | Accession number of reference genome assembly used in the VCF file | /GC[AF]_\d{9}\.\d+/ | ##reference_ac=GCA_902498975.1 | 1 |
##reference_url | URL of the reference genome assembly used in the VCF file | URL, DOI | ##reference_url=“ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/498/975/GCA_902498975.1_Morex_v2.0/GCA_902498975.1_Morex_v2.0_genomic.fna.gz” | 1 |
##contig | Metadata about a single sequence in the reference genome assembly | Composite (see below) | ##contig=<ID=chr1H,length=522466905,assembly=GCA_902498975.1,md5=8d21a35cc68340ecf40e2a8dec9428fa,species=NCBITaxon:4513> | 1:N |
ID | The primary identifier of the sequence | String | ID=chr1H | 1 |
length | The length in base pairs (bp) of the sequence | Integer | length=522466905 | 1 |
assembly | The assembly accession number this sequence belongs to | /GC[AF]_\d{9}\.\d+/ | assembly=GCA_902498975.1 | 1 |
md5 | The md5 checksum of the sequence | MD5 | md5=8d21a35cc68340ecf40e2a8dec9428fa | 1 |
species | The species of the sequence (NCBI Taxon ID) | /NCBITaxon:\d+/ | species=NCBITaxon:4513 | 1 |
##SAMPLE | Metadata about a single sample genotype that is part of the genotyping experiment in the VCF file | Composite (see below) | ##SAMPLE=<ID=SAMEA104646767,DOI=“doi.org/10.25642/IPK/GBIS/7811152”> | 1:N |
ID | The primary identifier (BioSamples Database identifier) of the genotyping sample | /SAM[END][AG]?\d+/ | ID=SAMEA104646767 | 1 |
DOI | The DOI of the genotyping sample (if available) | URL, DOI | DOI=“doi.org/10.25642/IPK/GBIS/7811152” | 0:1 |
ext_ID | The external identifiers under which this genotyping sample is registered in other databases (either ‘FAO-WIEWS_instcode:genus:accession_number’ or ‘DNS:database_identifier:identifier_scheme:identifier’) | See Definition | ext_ID=“DEU146:Hordeum:HOR 1361 BRG” or ext_ID=“ipk-gatersleben.de:GBIS:akzessionId:7811152” | 0:N |
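As a rough illustration of how the cardinalities in Table 1 could be checked automatically, the following Python sketch counts the recommended meta-information keys in a list of header lines. The function names `count_meta_keys` and `check_header` are hypothetical and not part of any VCF tool, and the check covers only the presence of the fields, not their content.

```python
import re

# Keys and cardinalities as summarised in Table 1 (plus the obligatory ##fileformat).
EXACTLY_ONCE = ["fileformat", "fileDate", "bioinformatics_source",
                "reference_ac", "reference_url"]
AT_LEAST_ONCE = ["contig", "SAMPLE"]


def count_meta_keys(header_lines):
    """Count how often each ##key appears in the meta-information lines."""
    counts = {}
    for line in header_lines:
        match = re.match(r"##([^=]+)=", line)
        if match:
            key = match.group(1)
            counts[key] = counts.get(key, 0) + 1
    return counts


def check_header(header_lines):
    """Return a list of readable problems (empty if none were found)."""
    counts = count_meta_keys(header_lines)
    problems = []
    for key in EXACTLY_ONCE:
        found = counts.get(key, 0)
        if found != 1:
            problems.append(f"##{key}: expected exactly 1, found {found}")
    for key in AT_LEAST_ONCE:
        if counts.get(key, 0) < 1:
            problems.append(f"##{key}: expected at least 1, found 0")
    return problems
```

An empty result from `check_header` only means that every recommended field is present with a plausible count; validating the values themselves requires field-specific checks.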
File date field format
The creation date of the VCF should be specified in the metadata via the field ##fileDate. The notation follows ISO 8601 (Kuhn, 1995) in the basic form without separators: YYYYMMDD.
##fileDate=date
Example:
##fileDate=20120921
Description of a VCF file that was created on 21 September 2012.
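The following Python sketch shows one way to write and read the ##fileDate line in the basic ISO 8601 form. The helper names `file_date_line` and `parse_file_date` are hypothetical, and the strict eight-digit check is an assumption made here to reject abbreviated dates.

```python
from datetime import date, datetime


def file_date_line(day):
    """Format a date as a ##fileDate meta-information line (YYYYMMDD)."""
    return "##fileDate=" + day.strftime("%Y%m%d")


def parse_file_date(line):
    """Parse a ##fileDate line and return a datetime.date, or raise ValueError."""
    prefix = "##fileDate="
    if not line.startswith(prefix):
        raise ValueError("not a ##fileDate line")
    value = line[len(prefix):].strip()
    # Assumption: exactly eight digits, so that e.g. '2012921' is rejected.
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {value!r}")
    return datetime.strptime(value, "%Y%m%d").date()
```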
Bioinformatics source field format
The analytic approach (usually consisting of chains of bioinformatics tools) for creating the VCF file is specified in the ##bioinformatics_source field. Such approaches often involve several steps, like read mapping, variant calling and imputation, each carried out using a different program. Every component of this process should be clearly described, including all the parameter values.
##bioinformatics_source=url
This is ideally specified as the DOI of a publication, or more generally as URL/URI (like a public repository for the scripts and parameters used).
Examples:
1) Description of a GBS experiment in barley and subsequent read alignment and variant calling using a bioinformatics analysis pipeline consisting of cutadapt, BWA-MEM, SAMtools, NovoSort, Picard, BCFtools and seqArray.
2) Modified version of Tassel4 (v.4.3.7) for running the Tassel-GBS pipeline modified for polyploid species with high read depths used in (
Pereira
Reference_ac field format
This field contains the accession number (including the version) of the reference sequence on which the variation data of the present VCF is based.
##reference_ac=assembly_accession
The NCBI page on the Genome Assembly Model states (NCBI, 2002): “The assembly accession starts with a three letter prefix, GCA for GenBank assemblies […]. This is followed by an underscore and 9 digits. A version is then added to the accession. For example, the assembly accession for the GenBank version of the public human reference assembly (GRCh38.p11) is GCA_000001405.26”. Note these accessions are shared by all INSDC archives.
Example:
Reference genome assembly for barley (
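A minimal sketch of how an assembly accession could be checked against the structure quoted above (three-letter prefix, underscore, nine digits, version). The pattern is our reading of that description, with GCA and GCF as the accepted prefixes, and the function name `is_assembly_accession` is hypothetical.

```python
import re

# Prefix GCA or GCF, underscore, nine digits, a dot and a numeric version.
ASSEMBLY_ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")


def is_assembly_accession(value):
    """Return True if the whole string looks like a versioned assembly accession."""
    return ASSEMBLY_ACCESSION.fullmatch(value) is not None
```

A syntactic match does not prove that the accession exists in an INSDC archive; that would require a lookup against the archive itself.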
Reference_url field format
While the ##reference_ac field contains the accession number of the reference genome assembly, the ##reference_url field contains a URL (or URI/DOI) for downloading of this reference genome assembly, preferably from one INSDC archive.
##reference_url=url
The reference genome assembly should be in FASTA format; the user is free to provide a packed or unpacked publicly available version of the genome assembly.
Example:
Reference genome assembly for barley (
Contig field format
The individual sequence(s) of the reference genome assembly are described in more detail in the ##contig field(s).
##contig=<ID=ctg1, length=sequence_length, assembly=gca_accession, md5=md5_hash, species=NCBI Taxon ID>
Each contig entry contains at least the attribute ID, and typically also includes length, assembly, md5 and species. The ID is the identifier of the sequence contig used in the reference genome assembly. Length contains the base pair length of the sequence contig in the reference genome assembly. The assembly is the accession number of the reference genome. If the md5 parameter is given, please note that the MD5 checksum of the individual sequence contig is expected, not the MD5 sum of the complete reference genome assembly. The species is the NCBI Taxon ID of the species of the reference genome assembly.
Examples:
1) Chromosome 1H of barley (
2) Chromosome 1 of maize (
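To illustrate the per-contig checksum, the following Python sketch derives ##contig lines from a FASTA file read line by line. It assumes the convention used for the M5 tag in the SAM specification (MD5 of the upper-cased sequence without line breaks); whether a given reference was hashed that way should be verified. The names `read_fasta` and `contig_line` are hypothetical.

```python
import hashlib


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the contig identifier.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_line(name, sequence, assembly, taxon_id):
    """Build one ##contig line; the MD5 covers this contig only."""
    # Assumption: checksum over the upper-cased sequence (SAM M5 convention).
    digest = hashlib.md5(sequence.upper().encode("ascii")).hexdigest()
    return (f"##contig=<ID={name},length={len(sequence)},assembly={assembly},"
            f"md5={digest},species=NCBITaxon:{taxon_id}>")
```

For a real assembly the lines could be streamed from an open file handle, for example `for name, seq in read_fasta(open(path)): print(contig_line(name, seq, accession, taxon))`, where `path`, `accession` and `taxon` are supplied by the user.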
Sample field format
The ##SAMPLE fields describe in greater detail the material whose variants are given in the genotype call columns, and can be extended using the specifications of the VCF format.
##SAMPLE=<ID=BioSample_accession, DOI=doi, ext_ID=registry:identifier>
Genotyped samples are indicated in the VCF by the BioSample accession, which is formed as follows (based on information from the BioSamples documentation): “BioSample accessions always begin with SAM. The next letter is either E or N or D depending if the sample information was originally submitted to EMBL-EBI or NCBI or DDBJ, respectively. After that, there may be an A or a G to denote an Assay sample or a Group of samples. Finally, there is a numeric component that may or may not be zero-padded.” Additional information (like complete Multi-Crop Passport Descriptor (
Alercia
Examples (Please note that all examples here represent the same genotype. To avoid misunderstandings, if available, the preferred method of describing the data is by DOI.):
1) One genotype from the barley (
2) One genotype from the barley (
3) One genotype from the barley (
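The sketch below shows how a ##SAMPLE line could be assembled and its BioSample accession checked against the structure quoted from the BioSamples documentation (SAM, then E, N or D, an optional A or G, then digits). The function name `sample_line` is hypothetical, and for simplicity it accepts at most one ext_ID although the recommendation allows several.

```python
import re

# SAM + E/N/D + optional A/G + numeric component.
BIOSAMPLE_ACCESSION = re.compile(r"SAM[END][AG]?\d+")


def sample_line(accession, doi=None, ext_id=None):
    """Build a ##SAMPLE line; doi and ext_id are optional and quoted when given."""
    if BIOSAMPLE_ACCESSION.fullmatch(accession) is None:
        raise ValueError(f"not a BioSample accession: {accession!r}")
    fields = [f"ID={accession}"]
    if doi is not None:
        fields.append(f'DOI="{doi}"')
    if ext_id is not None:
        fields.append(f'ext_ID="{ext_id}"')
    return "##SAMPLE=<" + ",".join(fields) + ">"
```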
Recommendations for data fields
In order to allow the highest degree of interoperability, we suggest using BioSamples IDs as the column headers for each sample. In the header line, they should be provided after the 9 mandatory column headings (#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT).
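One way to verify this recommendation is to compare the sample columns of the header line with the IDs declared in the ##SAMPLE lines. The following Python sketch does this for tab-separated header text; the function names are hypothetical.

```python
import re

FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT",
                 "QUAL", "FILTER", "INFO", "FORMAT"]


def sample_columns(header_line):
    """Return the sample names that follow the 9 mandatory column headings."""
    columns = header_line.rstrip("\n").split("\t")
    if columns[:9] != FIXED_COLUMNS:
        raise ValueError("header line does not start with the 9 mandatory columns")
    return columns[9:]


def undeclared_samples(meta_lines, header_line):
    """List sample columns that have no matching ##SAMPLE=<ID=...> line."""
    declared = set()
    for line in meta_lines:
        match = re.match(r"##SAMPLE=<ID=([^,>]+)", line)
        if match:
            declared.add(match.group(1))
    return [name for name in sample_columns(header_line) if name not in declared]
```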
In addition, ensure that the genomic positions in the data lines (consisting of the #CHROM and POS tuple) use the same nomenclature as in the reference genome assembly FASTA file and that the positions of the variations are within the start and end positions of the respective chromosome or contig. Watch out for programmes that change these values automatically (especially during imputation).
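The position check described above can be sketched in Python as follows. The code reads contig lengths from the ##contig lines and reports data lines whose #CHROM is unknown or whose POS lies outside 1..length. It expects the VCF as a list of lines, ignores special cases such as telomeric positions, and the function names are hypothetical.

```python
import re


def contig_lengths(header_lines):
    """Map contig ID to length for every ##contig line that carries both."""
    lengths = {}
    for line in header_lines:
        if line.startswith("##contig=<"):
            contig_id = re.search(r"[<,]\s*ID=([^,>]+)", line)
            length = re.search(r"[<,]\s*length=(\d+)", line)
            if contig_id and length:
                lengths[contig_id.group(1)] = int(length.group(1))
    return lengths


def out_of_range_records(vcf_lines):
    """Return (line_number, message) pairs for records outside their contig."""
    lengths = contig_lengths([l for l in vcf_lines if l.startswith("##")])
    problems = []
    for number, line in enumerate(vcf_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and empty lines
        chrom, pos = line.split("\t")[:2]
        if chrom not in lengths:
            problems.append((number, f"unknown contig {chrom}"))
        elif not 1 <= int(pos) <= lengths[chrom]:
            problems.append((number, f"position {pos} outside 1..{lengths[chrom]}"))
    return problems
```

Running such a check before and after imputation makes it easier to notice tools that silently rename contigs or shift coordinates.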
Additional meta-information fields
In addition to the preceding recommendations for improving findability, interoperability and reusability, we encourage everyone to describe their data in as much detail as possible in the meta-information lines. Before introducing new fields, please check the official format specifications (in VCFv4.3, section 1.4 Meta-information lines) to avoid redundancy and possible incompatibilities.
Conclusion
With the data and metadata recommendations for VCF files presented here, we hope to contribute to linking genotypic and other data for plants (e.g. phenotypic, transcriptomic and metabolomic data sets that can be linked through precise sample identifiers provided by BioSamples). In our view, the minimum requirement for achieving this is traceable material and sample management. Analytical results should be linked to the respective sample(s) and defined in the context of the study being reported by using the persistent BioSamples identifiers throughout all analytical results. One way to ensure this is to generate long-term stable identifiers at an early stage, ideally when the sample is taken, and to document all work steps accurately. Reproducibility is also an important aspect, and its absence has been criticised increasingly in recent studies (
Baker, 2016;
Miyakawa, 2020). Technologies such as containers or the provision of the entire data set and the analytical computing pipeline in a cloud environment could be a further step towards overcoming such problems (
Grüning
The BioSamples database at EMBL-EBI stores sample metadata and allows samples to be pre-registered; it provides unique, stable identifiers for each sample. BioSamples connects to other archives, enabling consistent tracking of the samples and derived data through time and across assays. It supports validation of plant metadata according to the MIAPPE standard, ensuring data FAIRness (
Wilkinson
There are several ways in which the recommendations published here can find acceptance in the larger plant science community, each with its advantages and disadvantages (
Sielemann
One approach to enforcing these recommendations on a larger scale would be for a critical mass of supporters to contact the major publishers and ask them to consider these recommendations as a prerequisite for submitting new manuscripts. Alternatively, one could approach various communities of plant scientists who would commit to following these recommendations without further outside influence, similar to the adoption of the FAIR principles. Both options are very time-consuming and labour-intensive, although the second option has tended to prevail in the past. It also offers the greatest incentive for broad uptake, especially with regard to the reusability of the data and the possibility of combining it to address new questions.
The responsibilities of the people involved may vary from research institution to research institution, but the general tasks for the generation of plant genotyping data and the subsequent publication of these data follow a common pattern. To highlight how the complete data management of a genotyping project could be structured, we have designed an exemplary Unified Modeling Language (UML) diagram (
Figure 2) as a best practice proposal. We assume that the research institution has a LIMS and that sample collection, sample preparation, sequencing and all bioinformatic analyses are carried out in house. Even if one or more of these activities are outsourced, most data management activities (indicated in the figure by the actor “Data Steward”) and thus also the primary communication with public repositories remain the scientific responsibility of the research institution (
Mayer
Figure 2.
The recommended workflow for the submission of genotypic data to public EMBL-EBI databases.
DNA samples are collected by an Experimentalist and their metadata are stored in a Laboratory Information Management System (LIMS). The Data Steward then registers these samples with BioSamples and receives unique BioSamples IDs in return, which the Data Steward adds to the created samples in the LIMS. The sequencing and quality control of these samples are then carried out by the Sequencing Staff, and the primary sequence data are fed into the LIMS and linked to the sample data by the Data Steward. The sequencing results are then registered and submitted to the European Nucleotide Archive (ENA) using the BioSamples IDs to link the initially submitted samples to the generated sequencing reads. The study identifiers (ENA IDs) are assigned by ENA and added to the samples by the Data Steward in the LIMS. The Bioinformatician then analyses the data and produces the genotyping results. Afterwards, the Data Steward prepares these data for transmission by linking them to the already created sample data from the LIMS, extracting the required metadata and adding it to the header of the Variant Call Format (VCF) file. If the reference genome used for genotyping is not yet available in public repositories, the Data Steward first transfers it to one of the International Nucleotide Sequence Database Collaboration (INSDC) databases. Once the reference genome is publicly available, the metadata-enriched VCF file can be registered and submitted to the European Variation Archive (EVA). The identifiers assigned by EVA are then transmitted back and the Principal Investigator can approve the publication of the data.
This approach to data management facilitates the submission of data for publication or at the end of the research project. At that stage, the data steward is often under time pressure and fails to submit the necessary (meta-)information to the public repositories. The submitted dataset then contains only very generic, uninformative metadata (
Toczydlowski
The FAIRness of datasets improves when the metadata fields defined here are collected and submitted. Findability is increased by the recommended persistent identifiers, which allow search engines to find linked datasets. For example, plant material and sample identifications, as recommended here, are used as germplasm filters in the FAIDARE search portal, allowing discovery of genotyping and phenotyping data containing the same plant material. The accessibility of datasets remains unchanged, but interoperability is significantly improved: the consistent identification of plant material through the BioSamples infrastructure makes it possible to link and integrate distributed and heterogeneous datasets. Thanks to this facilitated interoperability, reusability through new analyses is also improved. For the stepwise FAIRification of a plant variant dataset, a recipe has been provided in the FAIR Cookbook that implements the recommendations presented here (https://w3id.org/faircookbook/FCB061). Adoption of these guidelines and best practices will help make plant genotyping data FAIR and provide new opportunities to advance our understanding of relationships between genotypic and phenotypic data.
Data availability
An example VCF conforming to the metadata recommendations presented here and comprising a barley genotyping experiment with 22626 accessions has been deposited at EVA under PRJEB51851 and is accessible via the study browser: https://www.ebi.ac.uk/eva/?eva-study=PRJEB51851.
See the FAIR Cookbook under the recipe https://w3id.org/faircookbook/FCB061 for step-by-step instructions on how to submit data according to the recommendations in this manuscript.
Copyright: © 2022 Beier S et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
In this opinion article, we discuss the formatting of files from (plant) genotyping studies, in particular the formatting of metadata in Variant Call Format (VCF) files. The flexibility of the VCF format specification facilitates its use as a generic interchange format across domains but can lead to inconsistency between files in the presentation of metadata. To enable fully autonomous machine actionable data flow, generic elements need to be further specified.
We strongly support the merits of the FAIR principles and see the need to facilitate them also through technical implementation specifications. They form a basis for the proposed VCF extensions here. We have learned from the existing application of VCF that the definition of relevant metadata using controlled standards, vocabulary and the consistent use of cross-references via resolvable identifiers (machine-readable) are particularly necessary and propose their encoding.
VCF is an established standard for the exchange and publication of genotyping data. Other data formats are also used to capture variant data (for example, the HapMap and the gVCF formats), but none currently have the reach of VCF. For the sake of simplicity, we will only discuss VCF and our recommendations for its use, but these recommendations could also be applied to gVCF. However, the part of the VCF standard relating to metadata (as opposed to the actual variant calls) defines a syntactic format but no vocabulary, unique identifier or recommended content. In practice, often only sparse descriptive metadata is included. When descriptive metadata is provided, proprietary metadata fields that have not been agreed upon within the community are frequently added, which may limit long-term and comprehensive interoperability. To address this, we propose recommendations for supplying and encoding metadata, focusing on use cases from plant sciences. We expect there to be overlap, but also divergence, with the needs of other domains.