Abstract

Existing somatic benchmark datasets for human sequencing data use germline variants, synthetic methods, or expensive validations, none of which are satisfactory for providing a large collection of true somatic variation across a whole genome. Here we propose a dataset of short somatic mutations that are validated using a known cell lineage. The dataset contains 56,974 (2,687 unique) Single Nucleotide Variations (SNVs), 6,370 (316 unique) small Insertions and Deletions (Indels), and 144 (8 unique) Copy Number Variants (CNVs) across 98 in silico mixed truth sets, with a high-confidence region covering 2.7 gigabases per mixture. The data is publicly available for use as a benchmarking dataset for somatic short mutation discovery pipelines.

Footnotes

* https://app.terra.bio/#workspaces/broad-dsp-spec-ops-fc/somatic_truth_data_from_cell_lineage

* https://github.com/meganshand/gatk

Details

Title
Somatic Truth Data from Cell Lineage
Author
Shand, Megan; Soto, Jose; Lichtenstein, Lee; Benjamin, David; Farjoun, Yossi; Brody, Yehuda; Maruvka, Yosef E; Blainey, Paul C; Banks, Eric
University/institution
Cold Spring Harbor Laboratory Press
Section
New Results
Publication year
2019
Publication date
Oct 31, 2019
Publisher
Cold Spring Harbor Laboratory Press
Source type
Working Paper
Language of publication
English
ProQuest document ID
2310814576
Copyright
© 2019. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.