Existing methods for identifying structural variants (SVs) from short read datasets are inaccurate. This complicates disease-gene identification and efforts to understand the consequences of genetic variation. In response, we have created Wham (Whole-genome Alignment Metrics) to provide a single, integrated framework for both structural variant calling and association testing, thereby bypassing many of the difficulties that currently frustrate attempts to employ SVs in association testing. Here we describe Wham, benchmark it against three other widely used SV identification tools-Lumpy, Delly and SoftSearch-and demonstrate Wham's ability to identify and associate SVs with phenotypes using data from humans, domestic pigeons, and vaccinia virus. Wham and all associated software are covered under the MIT License and can be freely
Fig 1. Sensitivity and false discovery rates (FDR) for simulated data.
The sensitivity and FDR of Delly, Lumpy, SoftSearch and Wham for simulated deletions, duplications, insertions and inversions. The sensitivity is measured for each category at depths of 10x and 50x. SVs ranging from 50 bp to 1 Mb are grouped into four left-closed size intervals. A) The sensitivity of the three tools is faceted on size, depth and SV type. At 10x Wham has noticeably better sensitivity for deletions and duplications in the smallest size class. Wham's sensitivity is higher than Delly and Lumpy for insertions at 10x and gains sensitivity at 50x. B) The FDR for each type of SV faceted by depth and the amount of slop added to each confidence interval. In the 25 bp slop category, each confidence interval was extended in both directions by 25 bp. At 10x depth Wham has the highest FDR across all SV classes and Lumpy has the lowest. At 50x Delly has heightened FDR for deletions and Lumpy has a much higher FDR for insertions. Shrinking the confidence intervals increases the FDR for Delly and Lumpy, but not Wham. C) Breakpoint sensitivity for deletions. The confidence intervals, provided by the three tools are ignored and slop is incrementally added to the predicted breakpoints. Wham has the highest sensitivity when 1-10 bp of slop is added. D) Genotype sensitivity for the homozygous non-reference simulated SVs. Delly and Wham have similar sensitivity for deletions and duplications while both tools fail to correctly genotype duplications.
Next we assayed the false discovery rate (FDR) of the four tools on the same simulated data. Wham has an FDR of 0.05 at 50x with 50 bp of slop (see Supporting Information), which is higher than Delly (0.02), but lower than Lumpy (0.11) and SoftSearch (0.41) (Fig 1B). Lumpy has the lowest overall FDR if insertions are excluded. Reducing the amount of slop added to the confidence intervals slightly increases the FDR for Delly and Lumpy, but not for Wham or SoftSearch. Wham's FDR can be attributed to misclassification of SV type and failures when identifying both breakpoints. All three tools exhibited a positive correlation between depth and FDR when comparing the 10x and 50x datasets. For example, Delly's FDR for deletions nearly doubles in the 50x relative to the 10x data. All three tools had elevated FDRs for insertion events. This is because our simulated insertions create inter-chromosomal duplications which increase mapping errors leading to false positive SV calls. As expected, the FDRs for the simulated data are much lower than the human benchmarks, as discussed below. The change in FDR between simulated and experimental data can be attributed to the simplicity of the simulations. For example, we did not model errors in the reference genome or mobile element insertions, which would have increased the baseline FDR for all the tools.
To assess the breakpoint accuracy of the tools, we removed the confidence intervals for deletions and then incrementally added 1-500 bp of bi-directional slop to both breakpoints (Fig 1C). Wham has the highest positional accuracy for deletions of the four tools, as it has the highest sensitivity (0.75) with only 1 bp of slop. Lumpy exhibits a marked gain in sensitivity from 10 bp to 25 bp of slop as it is designed to detect "soft" breakpoint boundaries [14], whereas Delly's sensitivity exhibits an increase from 1 bp to 5 bp of slop. In contrast, Wham maintains a near constant sensitivity down to 5 bp of slop, after which Wham's sensitivity drops, but remains greater than 0.75. Wham maintains sensitivity at small intervals by relying on highly accurate mapping and soft clipping. Wham and SoftSearch, unlike the other tools, use soft-clipping information to call small SVs. However, Wham loses less sensitivity from 5 bp to 1 bp than Softsearch. Wham's breakpoint sensitivity is important for maintaining power during association testing. The power to detect an association between a SV and a phenotype is diminished when breakpoints are miscalled within a cohort of affected individuals. All four tools showed improved breakpoint detection at higher depth (Fig 1C). The high positional sensitivity shows that mapping-based methods can reliably localize SV breakpoints down to a 3-bp interval. This small interval provides sufficient accuracy for association testing.
Collectively, these simulations show that Wham provides a robust means for SV identification. Compared to the other three tools, Wham excels at finding smaller structural variants across all simulated SV classes and has the highest breakpoint sensitivity. This is important as SVs are distributed geometrically with respect to length, and thus shorter SVs comprise the vast majority of real events [42,43]. As we show below, Wham also maintains high sensitivity when using real human short-read data, but at the cost of a much higher false discovery rate.
Validation using human WGS data
For our human benchmarks, we started with NA12878, the best characterized human genome. For the truth set we used the 2,597 NA12878 non-reference genotype calls in the 1000 Genomes SV dataset (phase III submitted calls) [42],
Fig 2. Benchmarking Delly, Lumpy, SoftSearch and Wham against NA12878 and CHM1 datasets.
A) The sensitivity and FDR for filtered NA12878 Phase III deletion calls across four size intervals. The number of true positives and the number NA12878 calls are listed above sensitivity, while the total number of false positives and total calls for each tool is listed above FDR. Most true positives and false positives are within the 150-1,000 bp interval. B) The sensitivity and FDR for CHM1 deletions. C) The size distribution of the true positive calls that overlap the CHM1 deletions. One thousand true positives were randomly sampled from each tool and the truth set (CHM1-DEL).
For a second, independent, human benchmarking experiment we used the recently published single-molecule, real-time (SMRT) sequencing dataset of a hydatidiform mole cell line (CHM1) [45]. The hydatidiform genome comprises a duplicated male haploid (double haploid). The PacBio SMRT SV calls are a good standard for validating Wham's performance on the related Illumina datasets because PacBio SMRT sequencing does not require DNA cloning or amplification, two common sources of sequencing artifacts. Moreover, the absence of allelic heterogeneity in the haploid CHM1 genome facilitates accurate assembly [45,46]. Additionally, PacBio reads can capture small and moderately sized SVs internally within a read, providing a more accurate source for detecting SVs. Both PacBio and Illumina sequence data were generated from DNA recovered from CHM1 cells. The 101-bp Illumina reads and PacBio (~8 kb average length) reads cover the haploid genome to 40.7x and 36.6x depth, respectively. The structural variant calls from the PacBio single molecule sequencing were generated by first identifying putative SV breakpoints followed by local assembly (see supporting information of [45]). We analyzed the Illumina data with Wham, Delly, Lumpy and SoftSearch, and compared their SV calls to the 11,311 SMRT deletions (http://eichlerlab.gs.washington.edu/publications/chm1-structural-variation).
Wham and Lumpy have similar sensitivities for CHM1 deletions, while Delly and SoftSearch lag behind (Fig 2B). For SVs larger than 1 kb, all tools have similar sensitivity for CHM1 deletions. Overall, the sensitivity and FDR for the CHM1 dataset was lower than for the Phase III NA12878 1kg dataset. There are several possible explanations for this difference. First, the sensitivity might be lower because many of CHM1 PacBio calls are in repetitive regions that cannot be detected with Illumina short-read mappings. The FDR may be lower because the CHM1 dataset contains ~ 10 times more calls than the 1kg NA12878 dataset. Lastly, the CHM1 SV size distribution contains smaller calls than the 1kg N12878 dataset. We examined the size distribution of true positive calls by Delly, Lumpy, SoftSearch and Wham (Fig 2C). Wham's size distribution for deletions closely tracks the CHM1 dataset. Both datasets are enriched for deletions less than 100 bp, which is concordant with previous studies [42,47]. The peaks in the size distributions at 300 bp and 6000 bp correspond to ALUs, STRs, and LINE-1 elements. The variability between the size distributions of the tools suggests that each tools is well suited for a slightly different size class.
Genotyping accuracy
We began to assay the genotype accuracy of the Delly, Lumpy (svtyper) and Wham by comparing their calls to Phase III NA12878 1kg deletions (genotyped by Genome STRiP [48]). SoftSearch was excluded from all genotyping assays because it only provides a hard-coded heterozygous genotype call. We measured how many genotypes were misclassified (Fig 3A). Delly has the highest fraction of concordant calls followed by Lumpy and Wham, however all tools differ by only a few percent. Wham and Delly tend to call homozygous non-reference genotypes as heterozygous. Lumpy, unlike the other two tools, calls a small percentage of heterozygous genotypes as homozygous reference. Wham calls a small fraction of heterozygous genotypes as homozygous non-reference.
Fig 3. Genotyping assays.
A) Comparison of Genome STRiP (GS) genotypes vs. Delly, Lumpy and Wham. The x-axis lists the GS genotype. Different colors denote the zygosity of the Delly, Lumpy, and Wham genotypes. B) The fraction of Chromosome 1 deletions for the NA12878, NA1277, and NA12882 trio that conform to Mendelian inheritance patterns. C) The CEPH/Utah Pedigree 1463 allele frequency (AF) spectrum represented as an empirical cumulative distribution function (ECDF). This curve is derived from Chromosome 1 deletions. FB, Freebayes; UG, Unified Genotyper [49].
Next we used the Platinum trio of human genomes (NA12878, NA12877 and NA12882) to measure Mendelian violations for deletions. Wham had the lowest number of Mendelian violations followed by Lumpy and Delly (Fig 3B).
To ensure that Wham provides robust multi-sample genotype calls, we calculated the allele frequency spectrum for Chromosome 1 deletions in the CEPH 1463 pedigree (Fig 3C). The CEPH 1463 pedigree has 17 family members, across three generations, with the final generation consisting of 11 siblings. Based on the structure of the pedigree, we expect to see the allele frequency spectrum skew toward common variants (Fig 3C). Both Wham and Lumpy have spectra that are similar to FreeBayes and Unified Genotyper small deletions. Delly stood out as it overcalled rare variants, even after quality filtering.
The results from benchmarking on real data have several important implications. First, Wham achieves comparable performance relative to other commonly used structural variant callers for deletions. Importantly, Wham provides robust means for discovering small structural variants. Second, the low overlap between SV classes among the tools tested here supports the power of integrated SV call sets. Frameworks, like the approaches of bcbio, which acts by combining SV calls from a variety of callers (including Wham), can capture a greater swath of genetic diversity while also providing higher confidence for concordant allele calls across varying heuristic methods [17-19].
Identifying candidate SVs with Wham's association test
Although Wham, Delly and Lumpy have similar sensitivities for NA12878 deletions, any critical appraisal of their performance must take into account the very high false discovery rates of all three tools. Using the NA12878 deletion data, for instance, Wham's FDR is 0.78, Delly's is 0.81 and Lumpy's is 0.66. These values illustrate just how difficult SV discovery is using short read data. However, for purposes of genotype-phenotype association, high false discovery rates are tolerable so long as false positives are either randomly distributed across cases and controls (non-differential error), or are systematic (e.g. called in every individual). In both scenarios false positives will cancel out in an association test. Thus, given a reasonable true positive rate, robust association signals will be obtained even in the face of very high FDR.
We first sought to test if Wham's non-differential false discovery rate creates spurious signals of association. To examine this, we used a cohort of individuals with a high degree of genetic relatedness such that if they were assigned randomly into two groups for association testing, there should be little to no differentiation. We chose the CEPH/Utah Pedigree 1463, comprised of seventeen individuals across three generations [42,50]. This pedigree should not harbor appreciable levels of population stratification, thus removing a potential confounding source of false positive associations in our sampling. Wham was run in default mode three times, randomly dividing the pedigree into two groups of eight individuals for assignment to either target or background groups. One genome was excluded each round so that the target and background had the same number of individuals. True and false SV calls were assigned according to their proximity to the phase III 1000 Genomes Project SV calls using a 50-bp truth interval. In total, 16,470 association tests were run for the true positive SV calls, while 380,005 were run for the false positives. Comparing the distributions of Wham's LRT p-values (Chi-squared one degree of freedom) between the groups showed a significant difference between the true and false SVs, as shown in S1 Fig (Two-sample Kolmogorov-Smirnov [KS] test D = 0.0948, p-value = 2.2e-16). The median p-value for the true positive group was 0.66 and 0.69 for the false positive group (1.03 times higher). To see if the significant difference is robust to the number of variants assayed, we subsampled 100 Wham p-values for both groups and the KS test was re-run. Over 1000 iterations, only 362 of the 1000 KS tests achieved significance. This suggests that there is a small, albeit significant, difference between true positives and false positives. Together, this demonstrates that Wham's false SV calls are only expected to slightly inflate the number of spurious associations.
While Wham has a high false discovery rate for SV detection, Wham's association-testing framework is robust to many of these errors as they are non-differential between the cases and controls. To demonstrate that Wham's high FDR for SVs does not hinder association studies we used Wham on both genotypic (pigeon) and pooled (viral) datasets. In the pigeon dataset, Wham was used to re-map a causative SV for the recessive red pigmentation trait and in the viral dataset we show that Wham reliably identifies the breakpoints of a duplication involved in viral adaptation.
Identifying the genetic basis of recessive red coloration in domestic pigeons
Pigeon fanciers have selected for a wide range of phenotypic variation in domestic pigeons over thousands of years. These traits include plumage patterns, behavior, body size and pigmentation [51]. Several alleles in three genes-Tryp1, Sox10, and Slc45a2 -were recently identified that produce variation in melanin synthesis [52]. For example, birds homozygous for a deletion spanning a melanocyte-specific enhancer of Sox10 have reduced expression of Sox10 and its target Tyrp1, resulting in the 'recessive red' color phenotype (classical e locus). Using a previously generated WGS dataset, we examined the power of Wham's association test to identify the e1 allele of recessive red, a 7.5-kb deletion on scaffold974 of the pigeon genome assembly (C_liv1.0 [53]). In conjunction with the Wham analyses, we also ran the same association test (implemented in pFst [38]) for SNPs and Delly SV genotype calls.
Wham identified the e1 allele as the best genome-wide candidate for recessive red using a likelihood ratio test (LRT; Fig 4A, Fig 4C). The LRT implemented in Wham measures the differences in allele frequencies based on the genotype calls at every SV position in the genome. Five recessive red and six wild type birds were processed with Wham to identify SVs and conduct association testing. The highest Wham LRT scores localized at the two PCR-confirmed breakpoints of the e1 allele on scaffold974 (Fig 4C). Because the pigeon reference genome was assembled from a recessive red bird that harbored the e1 deletion allele, Wham indirectly identified the location of the deletion by identifying an "insertion" in the wild-type birds, relative to the reference genome. Delly was unable to identify this allele, because it was not designed to identify novel insertions (Fig 4C). Wham also detected several small inversions near the deletion breakpoints of the e1 allele, whereas Delly additionally failed to detect these inversions. The increased LRT scores (converted to p-values) around the e1 allele are attributable to linkage disequilibrium since e1 is on a haplotype shared by all of the recessive red birds we tested. This linkage is even more pronounced in the SNP data, which have a much higher density of variants. The p-values from Wham's association test fit a uniform distribution, suggesting little to no population stratification between the cases and controls in domestic pigeon (Fig 4B). Importantly, Wham's high false discovery rate did not affect our ability to find the e1 allele. This analysis demonstrates the utility of Wham for rapidly and confidently identifying a structural variant associated with a Mendelian trait in a non-model system.
Fig 4. Identification of the e1 allele using Wham's LRT.
A) Wham's LRT interrogates allele frequency differences between recessive red and wild type birds. Genomic scaffolds are denoted by different colors and are sorted by size in increasing order. The highest LRT score (dashed vertical line) is a 7.5-kb deletion upstream of the Sox10 gene, which encodes a transcription factor that is critical to the melanin synthesis pathway. Only LRT values above 1.5 are shown in A. B) The quantile-quantile plot after converting Wham's likelihood ratio values to p-values. C) Scaffold974 association tests from SNPs, Delly SV calls and WHAM SV calls.
Identifying adaptive structural variation in vaccinia virus populations
Structural variants in the form of gene copy number variation (CNV) in DNA virus genomes provide a mechanism for rapid virus adaptation to host immune defenses [54-57]. For example, frequent recombination events creating tandem gene duplications have been observed in vaccinia virus (VACV) as a means of adaptation to the human antiviral host factor protein kinase R (PKR) during experimental evolution [55]. In this system, selective pressure was placed on the virus by deleting the E3L gene encoding a strong PKR inhibitor, leaving only a weak PKR inhibitor encoded by the K3L gene [58]. Experimental evolution of this [delta]E3L virus in HeLa cells revealed that copy number expansion of the K3L gene provides gains in viral fitness. To test whether CNV is a common mechanism of adaptation and determine whether Wham is an effective tool to detect and characterize such events, we passaged the [delta]E3L VACV strain ten times in a different cell line derived from primary human fibroblasts.
We analyzed short-read sequencing data from viral genomes obtained from a virus population after ten passages and the parental [delta]E3L strain for comparison. This analysis revealed areas of structural variation in the adapted viral population. Plotting read depth across genomic positions revealed a spike in read depth corresponding to the K3L locus that is only present in the adapted strain (Fig 5A). This is consistent with previous work in which a similarly large increase in depth corresponded to increased K3L copy number as a means of adaptation [55]. To determine the exact position of the recombination event generating the CNV, and to find any novel structural variants, we performed SV calling using either Wham or Lumpy. Using similar filtering schemes, Wham identified 6 SV calls in the adapted viral population, compared to 20 SV calls identified by Lumpy. Overlaying SV breakpoint calls on the read depth plots shows the increased specificity of using Wham to identify SVs (Fig 5A).
Fig 5. Wham detects structural variation in vaccinia virus populations.
A) Read depth normalized within each sample is plotted across the ~200 kb vaccinia genome (excluding inverted terminal repeats) for either the parental strain (top panel) or an adapted strain (middle and bottom panels, called by Wham or Lumpy, respectively). Arrows highlight the positions of K3L CNV and E3L deletion. The black lines represent the breakpoints of every SV call after filtering (see Supporting Information). B) Wham calls in the adapted strain near the K3L duplication breakpoint are shown as black triangles above the viral genes in colored boxes. The height of the triangle represents split-read (SR) count supporting the call. Sanger sequencing positions relative to the reference sequence are listed below. Asterisks (*) indicate Wham calls that match the exact breakpoint determined by Sanger sequencing (see S3 Table for Wham and Lumpy breakpoints). C) Wham calls in the adapted strain near the E3L deletion are shown above the genes, and Sanger sequence confirmed positions below, as in B. The arrow indicates the position of the 11K promoter driving [Beta]-gal expression. For breakpoints in grey, the height of the triangle indicates the relative mate-pair count from Wham, as these positions do not have SR support.
Wham analysis identified four SVs in the parental strain, and an additional two in the adapted strain. Notably, all six SVs map near the K3L locus or the E3L deletion (Fig 5B, S3 Table). The two breakpoints near the K3L locus were only identified in the adapted population, suggesting that the SV was not present in the parental strain. These two breakpoints have very high read support, indicative of the same recombination event dominating throughout the adapted viral population (S3 Table). Indeed, when we specifically amplified and sequenced the region around the K3L breakpoint, we identified a single breakpoint in the adapted strain, but could not detect any SV in the parental strain at this location. Importantly, the Wham-identified breakpoints match the exact positions of the breakpoint identified by PCR and Sanger sequencing (Fig 5B). Thus Wham is able to identify SVs in viral populations, down to single nucleotide accuracy. This analysis also suggests that K3L CNV is a common mechanism for VACV to overcome the antiviral PKR defense pathway.
Surprisingly, the other four breakpoints Wham identified with high read support map near the E3L deletion (Fig 5C and S3 Table). This is unexpected because E3L was originally replaced with a [Beta]-galactosidase ([Beta]-gal) selective marker, creating an insertion much larger than the reads from deep sequencing. However, the [delta]E3L virus was originally engineered to express [Beta]-gal under the control of the VACV 11K promoter. This promoter naturally drives expression of the F17R gene, and is thus present in the reference genome at the F17R locus approximately 5 kb upstream of E3L. Therefore, one end of the E3L deletion is supported by split reads mapping to the natural viral promoter. The other end of the deletion does not have SR support as the [Beta]-gal gene itself is not in the reference genome, but Wham did pick up a breakpoint with mate-pair support at this end. Importantly, the genomic positions identified by direct sequencing of both the parental and adapted strains for each end of the E3L deletion were correctly called by Wham (Fig 5C). These results show that Wham can identify both a genetic rearrangement (K3L) and a novel insertion ([Beta]-gal) with respect to a reference sequence. In comparison, while Lumpy successfully identified the K3L duplications (also down to the exact base pair on one end), it failed to identify two of the E3L breakpoints detected by Wham. Overall, three of the five breakpoint positions identified by direct sequencing were called using Lumpy, although the remaining two are in close proximity to one of the called positions. Also, only one of these positions exactly matches the sequenced position, consistent with Lumpy providing a region rather than a specific position. Thus, in this experiment, Wham shows greater specificity as demonstrated by fewer total SV calls, as well as improved accuracy when compared to Lumpy in analyzing this data set. While Wham's low call rate in this example is not consistent with the human data, there are two possible explanations for this trend: the truth sets for the human data are under-called resulting in Wham's high FDR, or Wham under calls pooled datasets. The second possibility is unlikely since Wham correctly identified the breakpoints of all SVs independently verified in the viral dataset [55].
Taking a closer look at Wham calls with mate-pair (MP) support in addition to SR calls, we discovered a complex set of breakpoints around one end of the E3L deletion. For the K3L breakpoint, Wham only called the two positions of the single breakpoint, whereas it called one additional position with high read support on one end of the E3L deletion (Fig 5B and 5C). To determine whether these calls represent true variants, we performed PCR and Sanger sequencing across the region spanning from E2L to E4L, which includes the entire [Beta]-gal cassette. This analysis revealed SVs in both the parental and adapted strains that contain partial deletions of the [Beta]-gal and the 11K promoter. Thus Wham correctly identified a previously unknown variable deletion (Fig 5C). We have two hypotheses to explain the appearance of variable deletions in this region. First, in the absence of selection on the [Beta]-gal marker gene, there is a fitness cost to carrying the engineered marker, so viruses losing this region have a fitness advantage compared to ones retaining it. Alternatively, using a VACV promoter for [Beta]-gal expression present at a second location only ~5 kb away in the genome might promote localized recombination in this region of the VACV genome. These hypotheses are not mutually exclusive and highlight how genetically engineered virus strains may not always be homogenous.
In addition to identification of SVs from sequencing individual genomes, this analysis demonstrates that Wham is able to detect variable structural changes within polymorphic populations. This provides an example of Wham's utility as a tool for accurate detection of SVs in rapidly changing microbial populations. Gene amplification can play a major adaptive role in response to selective pressure in both viral [54-57] and bacterial populations (reviewed in [59-62]), so it is important to accurately define the adaptive potential of structural variants. Recent advances in whole genome sequencing provide a wealth of genetic information about microbial population dynamics, and Wham provides a tool to rapidly identify potentially adaptive SVs.
CPU usage, memory and runtime
Of the four tools used in the benchmarks, Lumpy has the fastest runtime and lowest memory requirements, but with two important caveats. First, the preprocessing steps require reading through each BAM file at least once. Second, Lumpy's genotyper, SVtyper, took three days to run with one CPU for NA12878 deletions compared to less than a day for Wham or Delly. Filtering out high coverage regions drastically improved SVtypers performance. SoftSearch was prohibitively slow in the human benchmarks. To successfully run SoftSearch we split the BAM files into smaller genomic regions. Delly and Wham do not have filtering steps and joint call samples, thereby allowing a more direct speed comparison (S2 Fig). Wham's runtimes increase linearly with the number of samples. Wham's memory requirements depend on the number of CPUs (threading), the number of individuals, and the read depth since Wham maintains a pileup for each individual. Wham's performance and memory usage is similar to other widely used SNP calling tools.
Conclusions
Wham is a flexible SV caller that works on a broad range of data including pooled and diploid individuals. Wham's SV detection compares favorably with other popular mapping based SV calling methods and performs well across a number of SV types in both simulated and real datasets. Like other SV calling tools, Wham suffers from high false positive rates, but we show that this is unlikely to affect the results of Wham's association testing. Wham's ease of use also makes it an ideal package for inclusion with integrated SV callers. By simply running Wham in its default association-testing mode, we were able to identify the causal SV allele of a recessive trait in pigeons. Similarly, Wham's accurate breakpoint predictions were able to locate a copy number variant in viral populations relative to a parental strain with very high precision.
Availability and Future Directions
Wham and all associated software can be found on github (https://github.com/zeeev/wham), documentation is on the wiki (http://zeeev.github.io/wham/). For community support please post questions to https://Biostars.org [63].
Supporting Information
S1 File. Additional information regarding simulations, benchmarks, biological datasets, validations, and software versions.
(DOCX)
S1 Table. A description of the factors used in structural variant classification.
These factors are found in the VCF info field under the "AT" key.
(DOCX)
S2 Table. The number of structural variant calls for NA12878 before and after filtering.
(DOCX)
S3 Table. Breakpoint support and accuracy in the vaccinia virus dataset.
(DOCX)
S1 Fig. Wham false positives and true positives share similar p-value distributions.
Quantile-quantile plots for Wham's LRT statistic after conversion to p-values (y-axis). Left panel: The p-values for the structural variants that intersect with The 1000 Genomes Project Phase 3 dataset (within +/- 25 bp). Right panel: The p-values for structural variants that do not intersect with the phase III 1000 Genomes Project dataset. Both the true and false positive SV calls have very similar distributions.
(PDF)
S2 Fig. Wham and Delly runtimes for a one Mb region across differing samples sizes.
Wham has a linear relation between runtime and the number of samples run. The black line has a y-intercept of zero and slope of one.
(PDF)
Acknowledgments
We thank Gabor Marth, Aaron Quinlan, Ryan Layer, Ryan Abo, and Erik Garrison for their helpful suggestions and critiques. Brad Chapman provided extensive feedback for Wham's classifier and benchmarked Wham on the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge. We also thank Carson Holt for processing the Illumina Platinum Genomes data.
Author Contributions
Conceived and designed the experiments: ZNK EJO KRC ETD MDS NCE MY BJK. Performed the experiments: ZNK EJO KRC ETD. Analyzed the data: ZNK EJO. Contributed reagents/materials/analysis tools: KRC ETD MDS NCE MY. Wrote the paper: ZNK EJO MY.
Gemayel R, Vinces MD, Legendre M, Verstrepen KJ (2010) Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annu Rev Genet 44: 445-477. doi: 10.1146/annurev-genet-072610-155046. pmid:20809801
Chan YF, Marks ME, Jones FC, Villarreal G, Shapiro MD, et al. (2010) Adaptive evolution of pelvic reduction in sticklebacks by recurrent deletion of a Pitx1 enhancer. Science 327: 302-305. doi: 10.1126/science.1182213. pmid:20007865
Perry G, Yang F, Marques-Bonet T (2008) Copy number variation and evolution in humans and chimpanzees. Genome Res 18: 1698-1710. doi: 10.1101/gr.082016.108. pmid:18775914
Axelsson E, Ratnakumar A, Arendt M-L, Maqbool K, Webster MT, et al. (2013) The genomic signature of dog domestication reveals adaptation to a starch-rich diet. Nature 495: 360-364. doi: 10.1038/nature11837. pmid:23354050
McCarroll S, Altshuler DM (2007) Copy-number variation and association studies of human disease. Nat Genet 39: S37-S42. pmid:17597780 doi: 10.1038/ng2080
Weischenfeldt J, Symmons O, Spitz F, Korbel JO (2013) Phenotypic impact of genomic structural variation: insights from and for human disease. Nat Rev Genet 14: 125-138. doi: 10.1038/nrg3373. pmid:23329113
Stankiewicz P, Lupski JR (2010) Structural variation in the human genome and its role in disease. Annu Rev Med 61: 437-455. doi: 10.1146/annurev-med-100708-204735. pmid:20059347
Onishi-Seebacher M, Korbel JO (2011) Challenges in studying genomic structural variant formation mechanisms: the short-read dilemma and beyond. Bioessays 33: 840-850. doi: 10.1002/bies.201100075. pmid:21959584
Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, et al. (2008) Mapping and sequencing of structural variation from eight human genomes. Nature 453: 56-64. doi: 10.1038/nature06862. pmid:18451855
Sindi SS, Onal S, Peng LC, Wu H-T, Raphael BJ (2012) An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol 13: R22. doi: 10.1186/gb-2012-13-3-r22. pmid:22452995
Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, et al. (2012) DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28: i333-i339. doi: 10.1093/bioinformatics/bts378. pmid:22962449
Marschall T, Hajirasouliha I, Schönhuth A (2013) MATE-CLEVER: Mendelian-inheritance-aware discovery and genotyping of midsize and long indels. Bioinformatics 29: 3143-3150. doi: 10.1093/bioinformatics/btt556. pmid:24072733
Marschall T, Costa IG, Canzar S, Bauer M, Klau GW, et al. (2012) CLEVER: clique-enumerating variant finder. Bioinformatics 28: 2875-2882. doi: 10.1093/bioinformatics/bts566. pmid:23060616
Layer RM, Chiang C, Quinlan AR, Hall IM (2014) LUMPY: A probabilistic framework for structural variant discovery. Genome Biol 15: R84. doi: 10.1186/gb-2014-15-6-r84. pmid:24970577
Hart SN, Sarangi V, Moore R, Baheti S, Bhavsar JD, et al. (2013) SoftSearch: integration of multiple sequence features to identify breakpoints of structural variations. PLoS One 8: e83356. doi: 10.1371/journal.pone.0083356. pmid:24358278
Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, et al. (2009) BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6: 677-681. doi: 10.1038/nmeth.1363. pmid:19668202
Mimori T, Nariai N, Kojima K, Takahashi M, Ono A, et al. (2013) iSVP: an integrated structural variant calling pipeline from high-throughput sequencing data. BMC Syst Biol 7 Suppl 6: S8. doi: 10.1186/1752-0509-7-s6-s8
Wong K, Keane TM, Stalker J, Adams DJ (2010) Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biol 11: R128. doi: 10.1186/gb-2010-11-12-r128. pmid:21194472
Chapman B. bcbio-nextgen. github. https://github.com/chapmanb/bcbio-nextgen. Accessed 27 April 2015
Li Y, Zheng H, Luo R, Wu H, Zhu H, et al. (2011) Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly. Nat Biotechnol 29: 725-732. doi: 10.1038/nbt.1904
Kemena C, Notredame C (2009) Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 25: 2455-2465. doi: 10.1093/bioinformatics/btp452. pmid:19648142
Chen K, Chen L, Fan X, Wallis J, Ding L, et al. (2014) TIGRA: A targeted iterative graph routing assembler for breakpoint assembly. Genome Res 24: 310-317. doi: 10.1101/gr.162883.113. pmid:24307552
Quinlan A, Clark R (2010) Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research: 623-635. doi: 10.1101/gr.102970.109. pmid:20308636
Narzisi G, O'Rawe J a, Iossifov I, Fang H, Lee Y, et al. (2014) Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat Methods 11: 1-7. doi: 10.1038/nmeth.3069
Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP (2014) Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15: 121-132. doi: 10.1038/nrg3642. pmid:24434847
Kim SY, Li Y, Guo Y, Li R, Holmkvist J, et al. (2010) Design of association studies with pooled or un-pooled next-generation sequencing data. Genet Epidemiol 34: 479-491. doi: 10.1002/gepi.20501. pmid:20552648
Döring A, Weese D, Rausch T, Reinert K (2008) SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics 9: 11. doi: 10.1186/1471-2105-9-11. pmid:18184432
Zhao M, Lee WP, Garrison EP, Marth GT (2013) SSW library: An SIMD Smith-Waterman C/C++ library for use in genomic applications. PLoS One 8: 1-7. doi: 10.1371/journal.pone.0082138
Li H (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27: 2987-2993. doi: 10.1093/bioinformatics/btr509. pmid:21903627
Nielsen R, Paul JS, Albrechtsen A, Song YS (2011) Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 12: 443-451. doi: 10.1038/nrg2986. pmid:21587300
Li H (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv Prepr arXiv13033997 00: 1-3
Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9: 357-359. doi: 10.1038/nmeth.1923. pmid:22388286
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, et al. (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12: 2825-2830. doi: 10.3389/fninf.2014.00014
Michaelson J, Sebat J (2012) forestSV: structural variant discovery through statisical learning. Nat Methods 9: 819-821. doi: 10.1038/nmeth.2085. pmid:22751202
Yandell M, Huff C, Hu H, Singleton M, Moore B, et al. (2011) A probabilistic disease-gene finder for personal genomes. Genome Res 21: 1529-1542. doi: 10.1101/gr.123158.111. pmid:21700766
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559-575. pmid:17701901 doi: 10.1086/519795
Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, et al. (2007) TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23: 2633-2635. pmid:17586829 doi: 10.1093/bioinformatics/btm308
Kronenberg Z. GPAT++. github. https://github.com/jewmanchue/vcflib/wiki. Accessed 27 April 2015
Ye K, Schulz MH, Long Q, Apweiler R, Ning Z (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25: 2865-2871. doi: 10.1093/bioinformatics/btp394. pmid:19561018
Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061-1073. doi: 10.1038/nature09534. pmid:20981092
Pang AW, MacDonald JR, Pinto D, Wei J, Rafiq M a, et al. (2010) Towards a comprehensive structural variation map of an individual human genome. Genome Biol 11: R52. doi: 10.1186/gb-2010-11-5-r52. pmid:20482838
Abecasis GR, Auton A, Brooks LD, DePristo M a, Durbin RM, et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56-65. doi: 10.1038/nature11632. pmid:23128226
Lappalainen I, Lopez J, Skipper L, Hefferon T, Spalding JD, et al. (2013) DbVar and DGVa: Public archives for genomic structural variation. Nucleic Acids Res 41: D936-D941. doi: 10.1093/nar/gks1213. pmid:23193291
Bickhart DM, Hutchison JL, Xu L, Schnabel RD, Taylor JF, et al. (2015) RAPTR-SV: a hybrid method for the detection of structural variants. Bioinformatics 31: 2084-2090. doi: 10.1093/bioinformatics/btv086. pmid:25686638
Chaisson MJP, Huddleston J, Dennis MY, Sudmant PH, Malig M, et al. (2014) Resolving the complexity of the human genome using single-molecule sequencing. Nature 517: 608-611. doi: 10.1038/nature13907. pmid:25383537
Steinberg KM, Schneider VA, Graves-lindsay TA, Fulton RS, Agarwala R, et al. (2014) Single haplotype assembly of the human genome from a hydatidiform mole: 2066-2076
Mills RE (2006) An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res 16: 1182-1190. pmid:16902084 doi: 10.1101/gr.4565806
Handsaker RE, Van Doren V, Berman JR, Genovese G, Kashin S, et al. (2015) Large multiallelic copy number variations in humans. Nat Genet 47: 296-303. doi: 10.1038/ng.3200. pmid:25621458
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20: 1297-1303. doi: 10.1101/gr.107524.110. pmid:20644199
Illumina. Whole-genome sequencing performed on Illumina HiSeq. http://www.illumina.com/platinumgenomes/. Accessed 27 April 2015
Shapiro MD, Domyan ET (2013) Domestic pigeons. Curr Biol 23: R302-R303. doi: 10.1016/j.cub.2013.01.063. pmid:23618660
Domyan ET, Guernsey MW, Kronenberg Z, Krishnan S, Boissy RE, et al. (2014) Epistatic and combinatorial effects of pigmentary gene mutations in the domestic pigeon. Curr Biol 24: 459-464. doi: 10.1016/j.cub.2014.01.020. pmid:24508169
Shapiro MD, Kronenberg Z, Li C, Domyan ET, Pan H, et al. (2013) Genomic diversity and evolution of the head crest in the rock pigeon. Science 339: 1063-1067. doi: 10.1126/science.1230422. pmid:23371554
Slabaugh MB, Roseman NA, Mathews CK (1989) Amplification of the ribonucleotide reductase small subunit gene: analysis of novel joints and the mechanism of gene duplication in vaccinia virus. Nucleic Acids Res 17: 7073-7088. pmid:2674905 doi: 10.1093/nar/17.17.7073
Elde NC, Child SJ, Eickbush MT, Kitzman JO, Rogers KS, et al. (2012) Poxviruses deploy genomic accordions to adapt rapidly against host antiviral defenses. Cell 150: 831-841. doi: 10.1016/j.cell.2012.05.049. pmid:22901812
Brennan G, Kitzman JO, Rothenburg S, Shendure J, Geballe AP (2014) Adaptive Gene Amplification As an Intermediate Step in the Expansion of Virus Host Range. PLoS Pathog 10: e1004002. doi: 10.1371/journal.ppat.1004002. pmid:24626510
Erlandson KJ, Cotter CA, Charity JC, Martens C, Fischer ER, et al. (2014) Duplication of the A17L Locus of Vaccinia Virus Provides an Alternate Route to Rifampin Resistance. J Virol 88: 11576-11585. doi: 10.1128/JVI.00618-14. pmid:25078687
Beattie E, Denzler KL, Tartaglia J, Perkus ME, Paoletti E, et al. (1995) Reversal of the interferon-sensitive phenotype of a vaccinia virus lacking E3L by expression of the reovirus S4 gene. J Virol 69: 499-505. pmid:7527085
Romero D, Palacios R (1997) Gene amplification and genomic plasticity in prokaryotes. Annu Rev Genet 31: 91-111. pmid:9442891 doi: 10.1146/annurev.genet.31.1.91
Andersson DI, Hughes D (2009) Gene Amplification and Adaptive Evolution in Bacteria. Annu Rev Genet 43: 167-195. doi: 10.1146/annurev-genet-102108-134805. pmid:19686082
Sandegren L, Andersson DI (2009) Bacterial gene amplification: implications for the evolution of antibiotic resistance. Nat Rev Microbiol 7: 578-588. doi: 10.1038/nrmicro2174. pmid:19609259
Elliott KT, Cuff LE, Neidle EL (2013) Copy number change: evolving views on gene amplification. Future Microbiol 8: 887-899. doi: 10.2217/fmb.13.53. pmid:23841635
Parnell LD, Lindenbaum P, Shameer K, Dall'Olio GM, Swan DC, et al. (2011) BioStar: an online question & answer resource for the bioinformatics community. PLoS Comput Biol 7: e1002216. doi: 10.1371/journal.pcbi.1002216. pmid:22046109
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2015 Public Library of Science. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited: Kronenberg ZN, Osborne EJ, Cone KR, Kennedy BJ, Domyan ET, Shapiro MD, et al. (2015) Wham: Identifying Structural Variants of Biological Consequence. PLoS Comput Biol 11(12): e1004572. doi:10.1371/journal.pcbi.1004572
Abstract
Existing methods for identifying structural variants (SVs) from short read datasets are inaccurate. This complicates disease-gene identification and efforts to understand the consequences of genetic variation. In response, we have created Wham (Whole-genome Alignment Metrics) to provide a single, integrated framework for both structural variant calling and association testing, thereby bypassing many of the difficulties that currently frustrate attempts to employ SVs in association testing. Here we describe Wham, benchmark it against three other widely used SV identification tools-Lumpy, Delly and SoftSearch-and demonstrate Wham's ability to identify and associate SVs with phenotypes using data from humans, domestic pigeons, and vaccinia virus. Wham and all associated software are covered under the MIT License and can be freely downloaded from github (https://github.com/zeeev/wham), with documentation on a wiki (http://zeeev.github.io/wham/). For community support please post questions to https://www.biostars.org/.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer