1. Introduction
Since the introduction of DNA sequencing technologies and genomic diagnostics, increasing insight has been gained into the evolution and variability of genomes. Over the past few decades, changes in nucleotide composition and genome structure have been observed and extensively studied in a variety of organisms. A common observation when comparing genomes over time, across individuals, or between cell types is the alteration of single or multiple nucleotides. In many studies, these alterations, termed genomic variants, have been associated with diseases, evolutionary processes, and biodiversity [1,2]. The perceived benefits of a comprehensive understanding of genomic variants have driven immense efforts to collect and analyze genomic data, and a range of bioinformatics software programs have been released to accurately quantify and describe the characteristics of genomic variants in one or multiple genomes [3]. For such a software product to prevail over time, it is typically subject to long-term update cycles, compared against novel competing software in the field, and scrutinized for technical errors by the community. These challenges of continuous software development require robust testing and evaluation of results [4]. In the context of genomic variant detection, bioinformatics software needs to perform well under a variety of technical setups, and the set of reported genomic variants should comply with a known ground truth.
Several existing applications cover individual steps towards the end-to-end testing of genomic variant detection. Mason [5] and ART [6] generate synthetic sequencing data for different sequencing technologies. PICARD [7] and BCFtools [8] are established tool suites for the manipulation of high-throughput sequencing data and variant call format (VCF) files. Krusche et al. [9] proposed a framework describing best practices for the benchmarking of small germline variants, which led to the highly recognized precisionFDA challenge [10]. The same authors released the widely accepted hap.py tool suite for variant set comparison, which implements elaborate methods to address complex cases where representational differences between sets of variants cannot be trivially resolved. A commonly used alternative to hap.py is RTG Tools (
To this end, we introduce CIEVaD (an abbreviation for Continuous Integration and Evaluation of Variant Detection), a lightweight workflow collection designed for the rapid generation of synthetic test data from haploid reference genomes and the validation of software for genomic variant detection. CIEVaD is implemented in Nextflow [13], a workflow management framework for streamlined and reproducible data processing. The use of Nextflow additionally spares the user manual installation effort by deploying all required software packages. CIEVaD is open-source and freely available on Github at
2. Materials and Methods
The CIEVaD workflow collection contains two main workflows. The haplotype workflow generates synthetic data, whereas the evaluation workflow examines the accordance between sets of genomic variants. In the following, the term synthetic is used synonymously with in silico-generated data.
2.1. Haplotype Workflow
The haplotype workflow provides a framework for data generation. The synthetic data can be used for genomic variant calling, and data generation scales computationally to many individuals. Moreover, different types of sequencing data can be specified. The only input parameter strictly required by the user is a reference genome. Given just a reference genome, the haplotype workflow generates a new haplotype sequence of the entire reference genome, a set of synthetic genomic variants (henceforth referred to as a truthset) and synthetic genomic reads. The truthset comprises single-nucleotide variants (SNVs) and short insertions and deletions (indels) of, at most, 20 nucleotides. The maximum length of indels, as well as the frequency of both variant types, can be adjusted via the workflow parameters. Since the variants of the truthset are homozygous, the alternative-allele ratio for each variant is defined as 1 − ε, where ε is drawn from the read-specific error distribution.
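The structure of such a truthset can be illustrated with a minimal Python sketch. This is not CIEVaD's actual implementation (the workflow delegates variant simulation to the Mason variator); it merely shows what "homozygous SNVs plus indels of at most 20 nucleotides at pseudo-random positions" means in practice. All function and parameter names here are hypothetical.

```python
import random

def simulate_truthset(reference, n_variants=5, max_indel_len=20, seed=42):
    """Draw homozygous SNVs and short indels at pseudo-random positions.

    Illustrative stand-in for the Mason variator; overlap handling between
    neighboring variants is omitted for brevity. Returns a list of
    (pos, ref_allele, alt_allele) tuples with 0-based positions.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    # Keep positions away from the sequence end so indels always fit.
    positions = sorted(rng.sample(range(len(reference) - max_indel_len - 1), n_variants))
    truthset = []
    for pos in positions:
        ref_base = reference[pos]
        kind = rng.choice(["snv", "ins", "del"])
        if kind == "snv":
            alt = rng.choice([b for b in bases if b != ref_base])
            truthset.append((pos, ref_base, alt))
        elif kind == "ins":
            length = rng.randint(1, max_indel_len)
            ins = "".join(rng.choice(bases) for _ in range(length))
            truthset.append((pos, ref_base, ref_base + ins))
        else:  # deletion: REF spans the deleted bases, ALT keeps the anchor base (VCF convention)
            length = rng.randint(1, max_indel_len)
            truthset.append((pos, reference[pos:pos + length + 1], ref_base))
    return truthset

ref = "ACGTACGTACGTACGTACGTACGTACGTACGT" * 4
for pos, ref_allele, alt_allele in simulate_truthset(ref):
    print(pos, ref_allele, alt_allele)
```

In a real truthset these records would be serialized as VCF, the format consumed by the evaluation workflow.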
In the initial step, the workflow indexes the given reference genome. Next, CIEVaD runs the Mason variator [5] for a given number n of individuals (the default is ). The result of this step is one haplotype sequence and one truthset per individual. Here, a haplotype sequence is a copy of the reference genome with the small genomic variants of the truthset inserted at pseudo-random locations. In other words, the reference genome and a haplotype sequence differ exactly in the variants of the corresponding truthset. The final step is the generation of synthetic genomic reads. Depending on the specified type of sequencing data (the default is NGS), either the Mason simulator generates pairs of genomic reads or PBSIM3 [14] generates a set of synthetic long reads. The read simulation parameters can be tuned via CIEVaD’s command-line interface or via a configuration file. In both cases, the workflow also returns the alignment of the reads to the reference genome, e.g., for the manual inspection of the genomic variants and sequencing artifacts in the reads. Note that the data of each individual are computed in asynchronous parallel Nextflow processes, which scale effortlessly with additional CPU threads or compute nodes.
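The relationship between reference, truthset, and haplotype sequence can be sketched in a few lines of Python. This is a simplification of what the Mason variator produces, shown only to make the "reference plus spliced-in variants" definition concrete; the tuple format and function name are hypothetical.

```python
def apply_variants(reference, truthset):
    """Build a haplotype sequence by splicing variant alleles into the reference.

    `truthset` is a list of (pos, ref_allele, alt_allele) tuples with 0-based,
    non-overlapping coordinates -- an illustrative stand-in for a VCF truthset.
    """
    haplotype = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(truthset):
        # Sanity check: the truthset must match the reference at each position.
        assert reference[pos:pos + len(ref_allele)] == ref_allele, "truthset/reference mismatch"
        haplotype.append(reference[cursor:pos])   # unchanged stretch
        haplotype.append(alt_allele)              # spliced-in variant allele
        cursor = pos + len(ref_allele)
    haplotype.append(reference[cursor:])          # trailing unchanged stretch
    return "".join(haplotype)

# A 1-bp SNV at position 2 and a 2-bp deletion anchored at position 6:
print(apply_variants("ACGTACGTAC", [(2, "G", "T"), (6, "GTA", "G")]))  # prints ACTTACGC
```

The synthetic reads are then sampled from such a haplotype sequence, so every read carries the truthset variants plus simulated sequencing errors.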
2.2. Evaluation Workflow
The objective of the evaluation workflow is to assess how successfully a third-party tool or workflow detects genomic variants. To this end, CIEVaD’s evaluation workflow compares the set of genomic variants generated by the third-party tool or workflow (such a set is further referred to as a callset) with the corresponding truthset. The only strictly required input for the evaluation workflow is a set of callsets in the variant call format (VCF), given either as a folder path or as a sample sheet. The outputs are reports on the accordance between corresponding truth- and callsets.
The evaluation workflow consists of only two consecutive steps. First, the open-source tool suite hap.py by Illumina Inc. is used to compare all truthsets with their corresponding callsets. Specifically, the included submodule som.py identifies the number of correctly detected (TP), missed (FN), and erroneously detected (FP) variants. Because som.py is used, the comparison neglects genotype information and only checks the variants’ genomic positions and nucleotide composition. This default behavior was chosen since CIEVaD was initially developed for haploid pathogens, where the presence of a variant itself reflects the genotype. Using the TP, FP, and FN counts, som.py reports the precision, recall, and F1-score of the variant detection process that yielded the callset. The second step of the evaluation workflow is the computation of the average scores across the statistics of all individuals.
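The statistics reported in these two steps follow the standard definitions. The short sketch below is independent of som.py's actual implementation and only restates the arithmetic: precision and recall from the TP/FP/FN counts, their harmonic mean as the F1-score, and a plain average across individuals.

```python
def detection_stats(tp, fp, fn):
    """Precision, recall, and F1-score from TP/FP/FN counts, as reported per callset."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_scores(per_individual_counts):
    """Second step of the evaluation workflow: mean precision/recall/F1 across individuals."""
    stats = [detection_stats(*counts) for counts in per_individual_counts]
    n = len(stats)
    return tuple(sum(s[i] for s in stats) / n for i in range(3))

# 95 detected variants are correct, 5 calls are spurious, 5 true variants are missed:
print(detection_stats(tp=95, fp=5, fn=5))  # prints (0.95, 0.95, 0.95)
```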
3. Results
To demonstrate the utility of CIEVaD, we implemented end-to-end tests (Appendix A Figure A1) for different variant detection software programs as part of their continuous integration frameworks.
3.1. Assessing Variant Detection from NGS Data as Part of SARS-CoV-2 Genome Reconstruction
We deployed CIEVaD to benchmark the variant detection routine within CoVpipe2 [15]. CoVpipe2 is a computational workflow for the reconstruction of SARS-CoV-2 genomes. One sub-process of CoVpipe2 applies the FreeBayes [16] variant detection method to a set of reference-aligned NGS reads. We implemented a test as part of CoVpipe2’s Github Actions (
Install Conda and Nextflow;
Download a reference genome;
Run CIEVaD hap.nf;
Run CoVpipe2;
Prepare input for CIEVaD eval.nf;
Run CIEVaD eval.nf;
Check results.
In brief, the test runs the CIEVaD haplotype workflow with the default parameters, runs CoVpipe2 on the generated synthetic NGS data, runs the CIEVaD evaluation workflow with the filtered callsets from CoVpipe2, and finally checks whether the F1-scores of the SNV and indel detection have decreased compared to the previous test scores. This rapid deployment of CIEVaD (v0.4.1) benchmarks the variant callsets of CoVpipe2 (v0.5.2) with F1-scores of and for SNVs and indels, respectively. The full table of results for this evaluation is given in Appendix B Table A1.
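The final "check results" step can be sketched as a simple regression guard. This is a hypothetical stand-in for the check performed in CoVpipe2's Github Actions test, not its actual script: it compares the current F1-scores against the previously recorded ones and fails the CI step on any decrease.

```python
def check_no_regression(current, previous, tolerance=0.0):
    """Fail the CI step if any F1-score decreased relative to the last recorded run.

    `current` and `previous` map variant type -> F1-score. Hypothetical sketch;
    the real CI test may store and compare its scores differently.
    """
    regressions = {
        variant_type: (previous[variant_type], f1)
        for variant_type, f1 in current.items()
        if f1 < previous[variant_type] - tolerance
    }
    if regressions:
        # Non-zero exit status marks the CI job as failed.
        raise SystemExit(f"F1-score regression detected: {regressions}")
    return True

check_no_regression(current={"SNV": 0.98, "indel": 0.95},
                    previous={"SNV": 0.97, "indel": 0.95})
```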
3.2. Assessing Variant Detection from Long-Read Data as Part of Nanopore Sequencing Data Analysis
In order to deploy CIEVaD for a different genomic data type, we generated long reads with hap.nf using the --read_type parameter (see Supplementary Material A). The read type parameter invokes additional default parameters of the haplotype workflow that are tailored to a high-coverage long-read WGS experiment on the SARS-CoV-2 genome. With this setup, the haplotype workflow and its internal long-read module [14] generate a dataset with an average of 500-fold read coverage, per-base accuracy of approximately (Appendix C Figure A2), and an error distribution model trained on genomic data from Oxford Nanopore Technologies. We used the synthetic long-read dataset and the ground truth variants to test the variant detection of the poreCov [17] data analysis pipeline. Supplementary Material A shows how we adjusted poreCov (v1.9.4) to process the synthetic long-read dataset and, subsequently, how the evaluation workflow of CIEVaD verified poreCov’s results. Our evaluation of poreCov (Appendix D Table A2) shows F1-scores of and for SNVs and indels, respectively.
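Per-base accuracy of such a synthetic dataset can be estimated directly from the reads' Phred quality scores; a Phred score Q encodes an error probability of 10^(-Q/10). The minimal sketch below illustrates this relationship only; it is not the QC tooling used to produce the figure in the appendix, and the function name is hypothetical.

```python
def mean_accuracy_from_phred(quality_string, offset=33):
    """Estimate mean per-base accuracy from a FASTQ quality string (Phred+33 by default).

    Each character encodes a Phred score Q = ord(char) - offset; the per-base
    error probability is 10^(-Q/10), so the accuracy per base is 1 - 10^(-Q/10).
    """
    errors = [10 ** (-(ord(char) - offset) / 10) for char in quality_string]
    return 1 - sum(errors) / len(errors)

# Q20 (character '5' in Phred+33) corresponds to 99% per-base accuracy:
print(round(mean_accuracy_from_phred("5555"), 3))  # prints 0.99
```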
4. Conclusions
We introduce CIEVaD, an easy-to-apply tool suite used to assess small variant detection from short- and long-read datasets. CIEVaD is modular, extensible, and scalable; requires no manual installation of internal software; and operates entirely on standard bioinformatics file formats. We show that the workflows of CIEVaD enable the rapid deployment of end-to-end tests, including the generation of synthetic genomic data and the evaluation of results from third-party variant detection software. Our tool greatly benefits bioinformaticians and data scientists in genomics by reducing the time spent on routine tasks for the evaluation and reporting of variant detection.
With the workflow design, open-source policy, file formats, software packaging, and documentation, we aim to comply with the PHA4GE best practices for public health pipelines (
T.K. implemented CIEVaD and the tests of the third-party software. D.T. contributed to the implementation of additional features in CIEVaD. S.P. and S.F. supervised the project. T.K. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.
Not applicable.
Not applicable.
The genomic read sequences used in this study were artificially (in silico) generated from the presented tools. The corresponding reference genome can be accessed from the NIH NCBI website via accession number
The authors would like to thank Namuun Battur for testing early versions of the workflow collection and Marie Lataretu for her feedback on CoVpipe2 and for proofreading the manuscript.
The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
1. Shastry, B.S. SNP alleles in human disease and evolution. J. Hum. Genet.; 2002; 47, pp. 561-566. [DOI: https://dx.doi.org/10.1007/s100380200086] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/12436191]
2. Gao, Y.; Jiang, G.; Yang, W.; Jin, W.; Gong, J.; Xu, X.; Niu, X. Animal-SNPAtlas: A comprehensive SNP database for multiple animals. Nucleic Acids Res.; 2023; 51, pp. D816-D826. [DOI: https://dx.doi.org/10.1093/nar/gkac954] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36300636]
3. Poplin, R.; Ruano-Rubio, V.; DePristo, M.A.; Fennell, T.J.; Carneiro, M.O.; Van der Auwera, G.A.; Kling, D.E.; Gauthier, L.D.; Levy-Moonshine, A.; Roazen, D. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv; 2017; [DOI: https://dx.doi.org/10.1101/201178]
4. Majidian, S.; Agustinho, D.P.; Chin, C.S.; Sedlazeck, F.J.; Mahmoud, M. Genomic variant benchmark: If you cannot measure it, you cannot improve it. Genome Biol.; 2023; 24, 221. [DOI: https://dx.doi.org/10.1186/s13059-023-03061-1] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37798733]
5. Holtgrewe, M. Mason—A Read Simulator for Second Generation Sequencing Data; Technical Report; Freie Universität Berlin: Berlin, Germany, 2010.
6. Huang, W.; Li, L.; Myers, J.R.; Marth, G.T. ART: A next-generation sequencing read simulator. Bioinformatics; 2012; 28, pp. 593-594. [DOI: https://dx.doi.org/10.1093/bioinformatics/btr708] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22199392]
7. Broad Institute. Picard Toolkit; Broad Institute: Cambridge, MA, USA, 2019; GitHub Repository Available online: http://broadinstitute.github.io/picard (accessed on 3 September 2024).
8. Danecek, P.; Bonfield, J.K.; Liddle, J.; Marshall, J.; Ohan, V.; Pollard, M.O.; Whitwham, A.; Keane, T.; McCarthy, S.A.; Davies, R.M. et al. Twelve years of SAMtools and BCFtools. GigaScience; 2021; 10, giab008. [DOI: https://dx.doi.org/10.1093/gigascience/giab008] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33590861]
9. Krusche, P.; Trigg, L.; Boutros, P.C.; Mason, C.E.; De La Vega, F.M.; Moore, B.L.; Gonzalez-Porta, M.; Eberle, M.A.; Tezak, Z.; Lababidi, S. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol.; 2019; 37, pp. 555-560. [DOI: https://dx.doi.org/10.1038/s41587-019-0054-x]
10. Olson, N.D.; Wagner, J.; McDaniel, J.; Stephens, S.H.; Westreich, S.T.; Prasanna, A.G.; Johanson, E.; Boja, E.; Maier, E.J.; Serang, O. et al. PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions. Cell Genom.; 2022; 2, 100129. [DOI: https://dx.doi.org/10.1016/j.xgen.2022.100129]
11. Dunn, T.; Narayanasamy, S. vcfdist: Accurately benchmarking phased small variant calls in human genomes. Nat. Commun.; 2023; 14, 8149. [DOI: https://dx.doi.org/10.1038/s41467-023-43876-x] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38071244]
12. Hanssen, F.; Gabernet, G.; Smith, N.H.; Mertes, C.; Neogi, A.G.; Brandhoff, L.; Ossowski, A.; Altmueller, J.; Becker, K.; Petzold, A. et al. NCBench: Providing an open, reproducible, transparent, adaptable, and continuous benchmark approach for DNA-sequencing-based variant calling [version 1; peer review: 1 approved with reservations]. F1000Research; 2023; 12, 1125. [DOI: https://dx.doi.org/10.12688/f1000research.140344.1]
13. Di Tommaso, P.; Chatzou, M.; Floden, E.W.; Barja, P.P.; Palumbo, E.; Notredame, C. Nextflow enables reproducible computational workflows. Nat. Biotechnol.; 2017; 35, pp. 316-319. [DOI: https://dx.doi.org/10.1038/nbt.3820] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28398311]
14. Ono, Y.; Hamada, M.; Asai, K. PBSIM3: A simulator for all types of PacBio and ONT long reads. NAR Genom. Bioinform.; 2022; 4, lqac092. [DOI: https://dx.doi.org/10.1093/nargab/lqac092] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36465498]
15. Lataretu, M.; Drechsel, O.; Kmiecinski, R.; Trappe, K.; Hölzer, M.; Fuchs, S. Lessons learned: Overcoming common challenges in reconstructing the SARS-CoV-2 genome from short-read sequencing data via CoVpipe2 [version 2; peer review: 2 approved]. F1000Research; 2024; 12, 1091. [DOI: https://dx.doi.org/10.12688/f1000research.136683.2] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38716230]
16. Garrison, E.; Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv; 2012; arXiv:1207.3907. [DOI: https://dx.doi.org/10.48550/arXiv.1207.3907]
17. Brandt, C.; Krautwurst, S.; Spott, R.; Lohde, M.; Jundzill, M.; Marquet, M.; Hölzer, M. Corrigendum: PoreCov—An Easy to Use, Fast, and Robust Workflow for SARS CoV-2 Genome Reconstruction via Nanopore Sequencing. Front. Genet.; 2022; 13, 875644. [DOI: https://dx.doi.org/10.3389/fgene.2022.875644] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35368706]
18. Köndgen, S.; Oh, D.Y.; Thürmer, A.; Sedaghatjoo, S.; Patrono, L.V.; Calvignac-Spencer, S.; Biere, B.; Wolff, T.; Dürrwald, R.; Fuchs, S. et al. A Robust, Scalable, and Cost-Efficient Approach to Whole Genome Sequencing of RSV Directly from Clinical Samples. J. Clin. Microbiol.; 2024; 62, e0111123. Erratum in J. Clin. Microbiol. 2024, 62, e0078424. [DOI: https://dx.doi.org/10.1128/jcm.01111-23] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38407068]
19. Ewels, P.A.; Peltzer, A.; Fillinger, S.; Patel, H.; Alneberg, J.; Wilm, A.; Garcia, M.U.; Di Tommaso, P.; Nahnsen, S. The nf-core Framework for Community-Curated Bioinformatics Pipelines. Nat. Biotechnol.; 2020; 38, pp. 276-278. [DOI: https://dx.doi.org/10.1038/s41587-020-0439-x] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32055031]
20. De Coster, W.; Rademakers, R. Nanopack2: Population-scale evaluation of long-read sequencing data. Bioinformatics; 2023; 39, btad311. [DOI: https://dx.doi.org/10.1093/bioinformatics/btad311]
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
The identification of genomic variants has become a routine task in the age of genome sequencing. In particular, small genomic variants of a single or few nucleotides are routinely investigated for their impact on an organism’s phenotype. Hence, the precise and robust detection of the variants’ exact genomic locations and changes in nucleotide composition is vital in many biological applications. Although a plethora of methods exist for the many key steps of variant detection, thoroughly testing the detection process and evaluating its results is still a cumbersome procedure. In this work, we present a collection of easy-to-apply and highly modifiable workflows to facilitate the generation of synthetic test data, as well as to evaluate the accordance of a user-provided set of variants with the test data. The workflows are implemented in Nextflow and are open-source and freely available on Github under the GPL-3.0 license.