ARTICLE
Received 23 Dec 2011 | Accepted 30 Mar 2012 | Published 1 May 2012 DOI: 10.1038/ncomms1814
Moritz Gerstung1,2, Christian Beisel1, Markus Rechsteiner3, Peter Wild3, Peter Schraml3, Holger Moch3 & Niko Beerenwinkel1,2
According to the clonal evolution model, tumour growth is driven by competing subclones in somatically evolving cancer cell populations, which gives rise to genetically heterogeneous tumours. Here we present a comparative targeted deep-sequencing approach combined with a customised statistical algorithm, called deepSNV, for detecting and quantifying subclonal single-nucleotide variants in mixed populations. We show in a rigorous experimental assessment that our approach is capable of detecting variants with frequencies as low as 1/10,000 alleles. In selected genomic loci of the TP53 and VHL genes isolated from matched tumour and normal samples of four renal cell carcinoma patients, we detect 24 variants at allele frequencies ranging from 0.0002 to 0.34. Moreover, we demonstrate how the allele frequencies of known single-nucleotide polymorphisms can be exploited to detect loss of heterozygosity. Our ndings demonstrate that genomic diversity is common in renal cell carcinomas and provide quantitative evidence for the clonal evolution model.
Reliable detection of subclonal single-nucleotide variants in tumour cell populations
1 Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland. 2 SIB Swiss Institute of Bioinformatics, 4056 Basel, Switzerland. 3 Institute for Surgical Pathology, University Hospital Zurich, Schmelzbergstrasse 12, 8091 Zurich, Switzerland. Correspondence and requests for materials should be addressed to N.B. (email: [email protected]).
NATURE COMMUNICATIONS | 3:811 | DOI: 10.1038/ncomms1814 | www.nature.com/naturecommunications
2012 Macmillan Publishers Limited. All rights reserved.
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms1814
Cancer is a somatic evolutionary process in which mutations render cells non-cooperative and overly proliferative13. Selectively advantageous driver mutations accumulate in
multiple rounds of clonal expansions together with hitch-hiking, selectively neutral passenger mutations1,4. The driving forces of evolution include mutations in single cells and selection of the most proliferative clones. Mutation diversies an evolving population by generating novel variants, whereas selection has a purifying eect. Genomic diversity resulting from the interplay of mutation and selection is thus a key signature of evolution.
Studying genomic diversity in heterogeneous cell populations became possible with second-generation sequencing technologies that process millions of DNA molecules in a single run5. They enable direct sequencing of mixed samples, such as virus populations6,7, bacterial communities8, tumours911 and pooled samples12,13, and the reconstruction of their genomic composition. However, single-nucleotide errors resulting from target enrichment, library preparation and base calling are frequent on all current sequencing platforms5, and they are difficult to separate from true low- frequency single-nucleotide variants (SNVs). Sequencing error rates vary across genomic sites, oen reaching up to 1%, and they challenge accurate calling of SNVs present at frequencies below this rate.
To overcome these limitations, we employ a comparative sequencing strategy, where the same genomic region is compared between a heterogeneous test sample and a homogeneous control sample, using a customised statistical algorithm (Fig. 1a). The control sample allows for estimating the local error rate, which increases the power for calling true variants at a given false-positive rate. Unlike true variants, sequencing errors depend on the directionality of sequencing and tend to occur more oen on one DNA strand than the other, which can be used to further increase the specicity of variant calling14,15. Batch-library preparation and sequencing in the same run ensure identical noise characteristics of test and control, an important prerequisite for reliable variant detection.
ResultsdeepSNV algorithm. Comparing test and control experiment requires estimation of inter-experimental variation. For each genomic position, we model the number of observed nucleotide counts on the two strands in both experiments with a hierarchical binomial model and derive a likelihood ratio test for each base to quantify the excess of the SNV in the test over the control sample (Fig. 1bd; Methods). We aggregate the test results from both strands into a single P-value that quanties how likely it is that an observed nucleotide is a sequencing error, rather than a true variant (Fig. 1ei). P-values are corrected for the number of tests performed, controlling either the family-wise error rate (FWER; Bonferroni method) or the false discovery rate (FDR; BenjaminiHochberg)16. We have implemented the testing procedure in the R package deepSNV, which is freely available at http://www.bioconductor.org.
Experimental analysis of specicity and sensitivity. An initial analysis of two Illumina GAIIx sequenced replicates of the phiX genome conrmed the accuracy of the P-values computed by deep-
SNV as a measure of type-1 errors (Fig. 1i). Accurate P-values are critical, because the algorithm assesses all four minor alleles on each position in the genome, resulting in thousands or even millions of tests, and multiple testing schemes fail if P-values are biased. Specicity is lost if sequencing is performed in dierent runs because of dissimilar error distributions, but can partially be recovered by data normalisation (Supplementary Fig. S1).
To assess the power of comparative sequencing followed by variant calling by deepSNV, we generated synthetic test samples by mixing six plasmids containing known clones of a 1.5 kb fragment of the HIV pol gene at relative frequencies 10 5, 10 4, 10 3, 10 2
and 10 1, respectively, together with a majority clone at frequency 0.89999 (Supplementary Table S1). The majority clone also served as a control sample. The ve low-frequency clones contained approximately 100 SNVs relative to the control clone. As some variants are present on multiple clones and can be masked by clones with higher frequencies, the number of unique variants is between 36 and 101 (Table 1). PCR target enrichment was simulated by amplifying the inserts from the two samples and resulted in elevated noise levels, but only minimally altered variant frequencies (Supplementary Fig. S2). Both PCR-amplied and non-amplied mixture and control samples were sequenced at 69,203 to 117,180 coverage on an Illumina GAIIx sequencer in the same lane using barcodes and
36 nucleotide reads (Supplementary Table S2). Reads were aligned to the HXB2 HIV reference genome to avoid bias towards any of the clones. At each position, nucleotides with Phred quality larger than 25 were counted, insertions and alignment artifacts were ignored, and 23 variants of a conrmed subpopulation in the control sample were masked (Supplementary Fig. S3).
For SNV frequencies larger than or equal to 10 4, the measured nucleotide frequencies accurately agree with the true values, whereas SNVs with frequencies below 10 4 are additively biased by sequencing errors that occurred at a median rate of 2.210 5 (Fig. 2a). The long tail of sequencing errors confounds SNV calling, but this limitation can partially be overcome by testing against the control (Fig. 2b).
DeepSNV calls variants with frequencies higher than 10 4 with high sensitivity and specicity (Fig. 2c). At an FDR of 0.05, it recovered all SNVs of frequency 10 1 and 10 2, 53/57 variants of frequency 10 3, and 3/44 variants of frequency 10 4, whereas the false-positive rate was 2/5,740 (Table 1). With a more conservative FWER control, no false positives were called. At a xed FWER, deepSNV outperformed all related soware packages1719 in terms of both specicity and sensitivity. Although the power of deepSNV is comparable to that of vipR for variant frequencies of 0.1 and 0.01, its performance is considerably better for variant frequencies of 0.001 and 0.0001 (Supplementary Fig. S4). Most importantly, deepSNV achieves a high sensitivity for small false-positive rates, but also a high overall power as measured by the area under the receiver-operating characteristic (ROC) curve (Supplementary Table S3). With the exception of VarScan17, however, deep-SNV is the only method specically designed for detecting SNVs in mixed populations with an unknown number of clones. For low frequencies of 10 3, deepSNV achieves a power of 86%, compared with the second-best method with 53%. Our algorithm was also the fastest because of a direct C interface to the condensed bam alignments that present a bottleneck for nucleotide-wise analysis.
The deepSNV algorithm uses a Phred quality cuto to avoid false positives caused by ambiguous nucleotide calls. The choice of the cuto has a negligible eect on performance as long as it is greater than 10 (Supplementary Fig. S5A). For higher cutos, there is a small decrease in power because of the reduced coverage. A default Phred score cuto of 25 resulted in a good compromise between specicity and sensitivity. The performance of deepSNV was also found not to depend strongly neither on the chosen method of P-value combination, nor on PCR amplication (Supplementary Table S4). Power calculations show that additional sensitivity for calling low-frequency variants can be gained by increased sequencing depth (Supplementary Fig. S5B). Roughly, the required coverage for calling a variant needs to be at least ten times higher than its inverse frequency. For large genomes, the power of SNV calling is diminished by multiple testing corrections, but it remains high for variants present in 1/1,000 alleles (Supplementary Fig. S5C).
Subclonal diversity in renal cell carcinomas. We extracted 10,374 bp of the VHL, PTEN, TP53 and CDKN1B genes by PCR from matched normal and tumour samples of four clear cell renal cell carcinoma
NATURE COMMUNICATIONS | 3:811 | DOI: 10.1038/ncomms1814 | www.nature.com/naturecommunications
2012 Macmillan Publishers Limited. All rights reserved.
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms1814
ARTICLE
a
b
Position i
Reference
Strand
Mean rate (errors + SNVs)
ps,i,b s,i,b
ns,i Xs,i,b
s = 0,1
i = 1,...,Nb = A,T,C,G,
Beta nucleotide rate
Binomial nucleotide counts
deepSNV algorithm
2. Combine P-values from each strand
3. Adjust P-values for multiple testing
Coverage
Dispersion
Mean rate Beta
error rate
Binomial nucleotide counts
1. Test for each strand s, position i, and nucleotide b:
Test: mixed population
Control: Clonal
Ultra-deep sequencing
s = 1 s = 0
Null-hypothesis: no SNV
Alternative: SNV b, frequencyf
ps,i,b =qs,i,b
ps,i,b =qs,i,b + f
>qs,i,b
qs,i,b
i = 1,...,Nb = A,T,C,G,
s,i,b
ms,i Ys,i,b
s = 0,1
Coverage
Dispersion
Aligned reads
SNVs Error
c
d
e
1.0
1 = 1000
Nucleotide frequency p 0,i,b phiX 2
0.1
0.1
Nucleotide frequency p 0,i,b tumour
0.01
P 1,i,b
0.01
50
0.001
50
1
20
20
50
6 8
0.001
10
8
10
1e04
4
6
0.0
2
2
4
1e04
1
1 = 137
0.0 1.0
P0,i,b
1e05
0
1e05
i
1e05 1e04 0.001 0.01 0.1 1
1e05 1e04 0.001 0.01 0.1 1
1
Nucleotide frequency q0,i,b phiX 1
log10 P0,i,b
Nucleotide frequency q0,i,b normal
f h
g
0.01
1.0
1.0
1.0
0.9 0.8
0.5
0.7 0.6
0.7
CDF
P 1,i,b
0.3 0.4
P 1,i,b
0.2
0.3
0.4 0.5
0.6
0.8
0.9
P 1,i,b
0.9
1e04
0.2
0.7
0.8
0.1
0.5
0.6
0.1
1e06
0.1
0.2
0.3
0.4
0.0
0.0
0.0
1e06 1e04 0.01 1
0.0 1.0
0.0 1.0
0.0 1.0
Combined P-value
P0,i,b P0,i,b P0,i,b
Figure 1 | Testing for low-frequency SNVs with deepSNV. (a) For a mixed sample, the distribution of reads with SNVs (grey arrows with coloured dots) resembles the population structure, but sequencing errors (black dots) confound calling of SNVs. A homogeneous control sample allows for precise estimation of the local error rate, which is often biased to one strand, and enables accurate SNV calling against the background noise frequency. (b) Generative probabilistic model underlying deepSNV and summary of the algorithm. At each position i in the test experiment and for each strand s, the frequency of nucleotide b, s,i,b, is drawn from a beta distribution with mean ps,i,b. The dispersion quanties the variability of the nucleotide frequencies across experiments. In the absence of an SNV, ps,i,b resembles the error rates. The nucleotide counts Xs,i,b are modelled by a binomial distribution with coverage ns,i. The same model is used for generating the nucleotide counts Ys,i,b. in the control experiment with mean error rate qs,i,b.
Testing for an SNV b at position i amounts to test the null hypothesis that the mean relative frequencies of nucleotide b are identical in test and control experiment ps,i,b = qs,i,b. (c) Scatter plot of mean nucleotide frequencies from strand 0 for two phiX replicates (black dots). Colours and contour lines denote the strand-specic P-value of deepSNV for median coverage of 166,000. (d) Scatter plot of nucleotide frequencies for a cancer sample and control (black dots). Contour lines are shown for the median control coverage of 42,438. (e) Scatter plot of P-values from both strands of the two phiX replicates. (fh) Combining P-values. Contours of the joint P-value for the max (f), average (g) and product (h) statistic. (i) Empirical distribution of joint P-values for the two phiX experiments, combined with the max method.
(RCC) patients and sequenced the fragmented amplica at ultra-deep coverage (Fig. 3a and Supplementary Tables S5S7). For one patient, additional samples from an opposing side of the primary
tumour and from a metastatic lesion were taken, and an additional 4,378 bp of the PTEN gene were isolated by PCR. We detected a total of 24 (range 113 per sample) dierent SNVs in the tumours with
NATURE COMMUNICATIONS | 3:811 | DOI: 10.1038/ncomms1814 | www.nature.com/naturecommunications
2012 Macmillan Publishers Limited. All rights reserved.
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms1814
frequencies ranging from 0.0002 to 0.34, as opposed to only two variants with higher frequencies in the controls (FWER < 0.05, beta-binomial test; Table 2 and Fig. 3b). Eight selected subclonal variants were resequenced and conrmed on a Roche GS Junior sequencer using 300 bp reads (Supplementary Table S8). The validation experiment also showed an accurate agreement of the nucleotide frequencies with the original discovery experiment (Fig. 3c). The nucleotide substitution spectrum is similar to previous reports in RCC20 (Fig. 3d), with a characteristic overrepresentation of {C,G}>{T,A} deaminations at CpG dinucleotides and more prevalent G>A substitutions on the transcribed strand.
In three out of four cases, the VHL gene was hit by a high-frequency truncating mutation, namely a stop codon at p.E189* at frequency 0.34 in tumour 1, and two single-nucleotide deletions, c.565delG and c.349delT, observed at frequencies 0.17 and 0.24 in tumour 2 and in the multiple lesions samples, respectively. The remaining 21 subclonal variants had low frequencies. Four subclonal SNVs were found in coding regions, of which one SNV at frequency 0.01 in tumour 2 introduces a stop codon in TP53 at p.E198. Another four SNVs occur in 3- and 5- untranslated regions. The remaining 13 variants are located in intronic regions. The co-occurrence of two intronic SNVs at 20-bp distance in tumour 1 (chr17: 7577407A>C and chr17: 7577427G>A) was detected both in the
discovery experiment using Illumina and in the validation experiment using 454/Roche. All other SNVs sequenced on the same amplica were detected on separate alleles. The number of SNVs was much greater in tumour 1 (n = 13) and tumour 2 (n = 8) than in the other samples that contained only one or two SNVs.
The estimated nucleotide frequencies may be utilised to infer regions of lost heterozygosity. For this purpose, the frequencies of germline single-nucleotide polymorphisms (SNPs) were assessed. The dierence of the SNP allele frequencies in the normal versus tumour samples measures the excess of an allele that indicates lost heterozygosity (Fig. 4ac). With this approach, we detected loss of parts of chromosome 3 in ve out of six samples, including the multiple-lesions cases. The copy-number losses were conrmed by standard copy-number analysis using 250-kb SNP arrays in three matched tumour-normal samples (Fig. 4df).
The SNP allele counts also allow for estimating the fraction of cells with a lost allele, which can indicate a mixture of normal and tumour cells (Fig. 4g). We estimated a tumour content of 42 to 50%. In the case of multiple lesions per patient, the tumour content was conserved across the three samples, which suggests a constant, stable equilibrium between tumour and normal cells (Fig. 4h). The frequency of hemizygous SNPs in all three cases with loss-of-heterozygosity (LOH) agrees well with the mutation frequencies of truncating VHL mutations, suggesting that both alleles of this tumour suppressor gene are impaired in tumour cells. Taken together, the clonal VHL point mutation and loss of chromosome arm 3p as well as the 7 subclonal mutations found at the time of diagnosis suggest, for tumour 1, the evolutionary history summarised in Fig. 5.
Discussion
We have presented a comparative targeted deep-sequencing approach and a powerful statistical algorithm for detecting subclonal SNVs in heterogeneous cell populations. The specicity and sensitivity of the method have been rigorously assessed on multiple control experiments. Its reliability results from an overdispersed statistical model of nucleotide counts and from integrating the signals from both DNA strands. The current limit of detection is around 1/10,000 alleles, but it may be further improved by increased coverage and higher sequencing delity with improved biochemistry or barcoded reads21.
The method can be applied to any tissue sample of a heterogeneous cell population for which a control sample is available. It may be utilised for the analysis of pathogen populations, such as viruses or bacteria, for the assessment of T-cell diversity22, or for detecting rare somatic mutations associated with diseases, such as the
Table 1 | Comparison of SNV calling methods.
SNV frequency Errors CPU time
10 1 10 2 10 3 10 4 10 5
Truth 101 46 57 44 36 5,740* deepSNVFDR < 0.05
101 46 53 3 0 2 141 s
deepSNV FWER < 0.05
99 46 49 0 0 0 141 s
VarScan17 pileup2snp
96 42 26 32 8 472 361 s
VarScan somatic
50 29 34 1 0 33 439 s
CRISP18 91 43 46 0 0 16 44 h vipR19 98 43 30 0 0 1 279 s
Abbreviations: CPU central processing unit; FDR, false discovery rate; FWER, family-wise error rate; SNVs, single-nucleotide variants*Number of negatives=4 tests1,512 positions 284 SNVs 24 masked positions (23 validated subpopulation and 1 alignment artefact).
Including 260 s for the samtools pileup command.
Known mix and control
1.0
1e+00
0.8
0.1
Observed SNV
frequency
Frequency in mix
True-positive rate
0.01
1e02
0.6
0.001
Significance
0.4
SNV frequency0.10.010.001 1e04 1e05
1e04
1e04
0.2
1e05
Expected frequency
SNVs Error
1e06
0.0
Errors
1e05
1e04
0.001
0.01
0.1
1e06
1e04
1e02
1e+00
0.00
0.01
0.02
0.03
0.04
0.05
True SNV frequency
Frequency in control
False-positive rate
Figure 2 | Experimental assessment of deepSNV. (a) Observed frequency distributions for SNVs and sequencing errors in a test data set of known composition. The dashed line shows the expected frequency based on an additive error model. (b) Scatter plot of nucleotide frequencies in control and mixture sample, and indicated levels of signicance of the deepSNV test. The null hypothesis of the test implies that the observed relative frequencies lie close to the diagonal, whereas the alternative allows the frequency in the test experiment to be greater than in the control. (c) Receiver-operating characteristic curves demonstrate high power at low false-positive rates.
NATURE COMMUNICATIONS | 3:811 | DOI: 10.1038/ncomms1814 | www.nature.com/naturecommunications
2012 Macmillan Publishers Limited. All rights reserved.
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms1814
ARTICLE
TP53
VHL
1
T1 T2 T3 P1 P2 M
SNV frequency
0.1
p.E198*
0.01
3 UTR
0.001
1e04
7572000 7574000 7576000 7578000 Position
1
3x Tumour-normal: Tumour 1Tumour 2Tumour 3
1x Multiple lesions Primary 1 Primary 2 Metastasis
c.349delT p.E189*
c.565delG
p.R210Wp.R205C
10184000 10188000 10192000
Position
SNV frequency
0.1
0.01
3 UTR
5 UTR
0.001
1e02
3 UTR
Frequency above control,
GS junior
1e04
1e03
Non CpG
CpG
15 TS NTS
0 Del
Unique mutations
1e04
10
1e04 1e03 1e02
Frequency above control, Illumina
5
A>C
A>G
A>T
G>A
G>C
G>T
Figure 3 | Detecting intra-tumour heterogeneity in renal cell carcinomas. (a) Three matched tumour-normal samples and one case with biopsies from multiple lesions were analysed. (b) Distribution of SNV frequencies on the TP53 (top) and VHL (bottom) genes. The colour represents the tumour sample. Black boxes indicate the PCR amplicons used, grey bars are exons. Only non-synonymous exonic SNVs are annotated. UTR, untranslated region. (c) A total of 8 of 23 subclonal SNVs detected by Illumina sequencing were validated by additional sequencing on a Roche GS Junior sequencer. Shown are the differences in the nucleotide frequencies, averaged from both strands, between the tumour and the control samples in the two experiments. Filled circles denote statistical signicance (P < 0.05; deepSNV test). (d) Distribution of base substitutions. NTS, non-transcribed strand, TS, transcribed strand.
Proteus syndrome23. Another application is the cost-eective pooled sequencing of multiple individuals. In cases where a pure sample of the majority clone is not available, a closely related reference sample could be used as a control, for example, a stock plasmid of the genomic regions of interest. The deepSNV algorithm has primarily been designed for targeted sequencing of selected loci at ultra-deep coverage, but power calculations indicate that the algorithm can also detect heterozygous mutations at 100 coverage in comparative exome-sequencing studies, and simulations show that this application is computationally feasible.
We have demonstrated the utility of the sequencing approach for RCC tissue samples, revealing multiple subclonal variants and intratumour heterogeneity on the chromosomal and single-nucleotide level. In addition, the imbalances of SNP allele frequencies were used to correctly predict an LOH on chromosome 3 in only a subset of the tumour samples. Recent studies found genomic heterogeneity in breast cancer10,24, pancreatic cancer25,26, and B-cell chronic lymphocytic leukemia9, as well as mosaic amplications of tyro-sine kinase receptor genes in glioblastoma27. Together, these ndings provide compelling evidence for clonal evolution as a general mechanism in cancer development. Quantifying subclonal diversity in tumours is important for understanding the driving forces
of their evolution, and sensitive methods are required for detecting low-frequency drug-resistant mutations before treatment28.
Most tumour variants were found at frequencies below 1/1,000 alleles. This observation agrees with the notion that mutations occur initially in single cells and selection amplies few alterations to high frequencies, which causes the number of dierent variants to decrease with increasing frequencies. A total of 13 out of 21 subclonal variants occurred in introns, and they are most likely neutral- passenger mutations. All SNVs were found in the VHL and TP53 genes, which show a similar dinucleotide composition as the PTEN and CDKN1B amplicons, and made up 8,753 of the 10,375 bp sequenced in each sample, suggesting an overrepresentation of subclonal SNVs in VHL and TP53 (P = 0.06, Fishers exact test) that requires further investigation. As the majority of variants is intronic and appears to be selectively neutral, a possible explanation might be an increased mutation rate at these loci, but additional experiments comprising more genes in a larger cohort are necessary to test this hypothesis. An overall elevated mutation rate may also explain that two RCC cases showed a substantially larger number of low-frequency SNVs than the other samples.
An extrapolation of our ndings from the selected loci to the entire genome suggests that there are more than 100,000 subclonal
NATURE COMMUNICATIONS | 3:811 | DOI: 10.1038/ncomms1814 | www.nature.com/naturecommunications
2012 Macmillan Publishers Limited. All rights reserved.
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms1814
Table 2 | SNVs in tumor samples.
Sample Gene Chr. Pos.* Ref. Var. Discovery Validation dbSNP Type
Freq. P value Freq. P value
T1 VHL chr3 10188193 G A 0.0004 2e-03 Intronic
10191572 G T 0.3500 < 2e-300 p.E189* 10191635 C T 0.0002 9e-03 p.R210W TP53 chr17 7572600 G A 0.0008 2e-05 0.00070.00070.0001
1e-07 3e-03
6e-01
3 UTR
7577407 A C 0.0010 9e-14 Yes
Yes
Intronic
7577427 G A 0.0008 1e-09 Intronic 7577653 G A 0.0004 7e-05 Intronic 7578183 C T 0.0007 2e-06 Yes p.P222P
T2 VHL chr3 10183359 T C 0.0005 1e-04 5 UTR
10188161 G A 0.0003 8e-050.00020.00040.0002
2e-02
3e-02
3e-01
Yes Intronic
10188329 G A 0.0003 4e-08 Intronic 10188427 G A 0.0004 2e-05 Intronic 10188549 G A 0.0003 9e-06 Intronic 10191572 G 0.1686 5e-297 c.565delG 10191620 C T 0.0005 2e-03 p.R205C 10192372 G A 0.0004 5e-06 3 UTR
TP53 chr17 7573512 G A 0.0006 4e-03 Intronic
7573681 C T 0.0006 3e-04 Yes Intronic 7577368 G A 0.0007 4e-03 Intronic 7577999 T C 0.0003 9e-03
0.0177 2e-21
3 UTR (variable)
7578257 C A 0.0113 7e-89 p.E198*
T3 TP53 chr17 7573682 G A 0.0003 9e-03 Intronic P1 VHL chr3 10188206 T 0.2396 2e-278 c.349delT P2 VHL chr3 10188206 T 0.2412 3e-288 c.349delT M VHL chr3 10188206 T 0.2440 1e-3030.0051 1e-38
c.349delT
VHL 10192220 C G 0.0046 1e-29 3 UTR
Abbreviations: Chr., chromosome; Freq., frequency; Pos., position; Ref., reference; SNP, single-nucleotide polymorphism; SNVs, single-nucleotide variants; UTR, untranslated region; Var. variant; * UCSC hg19 coordinates. Major allele in control sample. Variant detected by deepSNV. Average frequency of both strands in tumouraverage frequency in control. Coordinates refer to the Ensembl transcripts ENST00000256474 (VHL) and ENST00000269305 (TP53).
SNVs present in a tumour cell population of comparable size. This substantial intra-tumour genomic diversity could have important consequences for cancer diagnosis and it may directly impact treatment strategies29.
Methods
deepSNV algorithm. The nucleotide counts in the test experiment Xs,i,b, b{A,T,C,G, }, at genomic position i on strand s = 0,1 (forward, reverse), are
modelled by a hierarchical binomial model with coverage ns,i and substitution rates drawn from a beta distribution with mean ps,i,b and parameter :
p a
s,i,b s,i,b
Beta
~ ( , )
p
X n
s,i,b s,i,b s,i s,i,b
Bin
| ~ ( , ).
p p
Here, the gap symbol ( ) is treated as a h nucleotide character (see Fig. 1b for a graphical depiction). The marginal counts of nucleotide b follow a beta-binomial distribution,
X n p
s,i,b s,i s,i,b
BetaBin
~ ( , , ).
a
In the absence of an SNV, the substitution rates of non-consensus bases are identical, ps,i,b = qs,i,b, and reect sequencing errors only, whereas in the presence of an
SNV b with frequency f in the test experiment, the rate ps,i,b = qs,i,b + f is greater than the error rate qs,i,b. The deepSNV algorithm detects SNVs by testing the alternative hypothesis H1: ps,i,b > qs,i,b against the null-hypothesis H0: ps,i,b = qs,i,b for each locus, nucleotide, and strand by means of a likelihood ratio test statistic
D g X n p g Y m q
s,i,b
s,i,b s,i s,i,b s,i,b s,i s,i,b
Here, g denotes the probability mass function of the beta-binomial distribution, p X n
s,i,b s,i s,i
( ) /
1 =
and q Y m
s,i,b s,i,b s,i
= 2
0
log ( ; , , ) ( ; ,
( ) a (( )
( ) ( )
, )
( ; , , ) ( ; , ,
0
1 1
a
a a
g X n p g Y m q
s,i,b s,i s,i,b s,i,b s,i s,i,b ))
(5)(5)
(1)(1)
(2)(2)
(3)(3)
(4)(4)
are the method-of-moments estimates of the mean nucleotide rates under H1, and p X Y n m q
s,i,b s,i,b s,i,b s,i s,i s,i,b
( ) ( )
/( )
( ) /
1 =
0 0
= +
( ) + =
0 0
= +
( ) + = the estimated mean rate under H0. The estimate of the dispersion is computed by numerical maximisation of the log-likelihood under H0,a a
a
= argmax log ( ; , , ) ( ; ,
p X Y n m q
s,i,b s,i,b s,i,b s,i s,i s,i,b
( ) ( )
/( )
s,i,b
( )
g X n p g Y m q
s,i,b s,i s,i,b s,i,b s,i s,i,b
0 (( ) , )
0 a
Here, the beta-binomial distribution is parameterised by the mean ps,i,b, and dispersion . For small ps,i,b, the variance of the nucleotide count is
Var s,i,b s,i s,i,b s,i s,i,b
[ ] /
X n P n P
.
Under the null-hypothesis and for large coverages, Ds,i,b is c12-distributed with one degree of freedom, as the models are nested, H0H1. A P-value is computed as Ps,i,b = 1 G(Ds,i,b), where G is the cumulative distribution function of the c12
distribution.
The resulting two P-values for each strand P0,i,b and P1,i,b can be combined in dierent ways into a single P-value, depending on which violation of the joint null-hypothesis is characteristic of true SNVs. The joint P-value Pi,b denotes the probability that the observed combination of nucleotide counts on both strands resulted from sequencing errors. It is dened as the tail probability of a given combination of P-values Qi,b(P0,i,b, P1,i,b) under the null-hypothesis that the
P-values of both strands are independently uniformly distributed. The maximum statistic Qi,b=max{P0,i,b, P1,i,b} generates a joint P-value of Pi,b=max{P0,i,b, P1,i,b}2
as a joint P-value (Fig. 1f). The average statistic Qi,b=(P0,i,b + P1,i,b)/2 yieldsPi,b = (P0,i,b + P1,i,b)2 if P0,i,b + P1,i,b < 1 and (1 P0,i,b P1,i,b)2 else (Fig. 1g). A third
= + 2 a. The overdispersion adds a quadratic term to the variance, which vanishes for large values of (compare Fig. 1c and d). In this limit, one recovers a binomial model with variance proportional to the mean.
Similarly, we dene Ys,i,b as the count of nucleotide b at position i and strand s in the control experiment with coverage ms,i,
BetaBin
s,i,b s,i s,i,b
Y m q
~ ( , , ).
a
NATURE COMMUNICATIONS | 3:811 | DOI: 10.1038/ncomms1814 | www.nature.com/naturecommunications
2012 Macmillan Publishers Limited. All rights reserved.
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms1814
ARTICLE
Tumour 1 Tumour 2 Tumour 3
1.0
1.0
1.0
0.7
LOH = 35%
Normal
tumour
Allele frequency
Allele frequency
LOH = 35%
T1 T3 P1 P2 M
chr3 chr10 chr12 chr17
Allele frequency
Fraction of cells with LOH
0.5
0.0
0.0
0.0
0.3
0 10000
0 10000
0 10000
10184000 10190000
SNP index
SNP index
SNP index
VHL VHL VHL
Log 2intensity ratio
1
2
0 50chr3 Position (Mb)
100 150 200
2
1
0
2
1
0
Log 2intensity ratio
1
2
0 50chr3 Position (Mb)
Log 2intensity ratio
1
2
0 50chr3 Position (Mb)
100 150 200
chr3 Genomic position
100 150 200
2
1
0
Figure 4 | Detecting copy-number alterations from SNP imbalances. (ac) Allele frequencies of known SNPs in matched tumour-normal samples. (a) Tumour 1, (b) tumour 2, (c) tumour 3. Light colours denote the frequency in the normal control, dark colours denote the frequency in the tumour. A deviation from heterozygous SNP frequencies of 0.5 indicates loss of heterozygosity (LOH). (df) Copy number proles and logarithmic probe intensities of 1 M SNP arrays for the samples presented in panel (ac). (g) Fraction of cells with LOH. The difference of heterozygous SNP frequencies on chromosome 3 (chr3) allows for computing the number of cells carrying only one copy. The resulting fraction of 43% is conserved across the three tumour samples of the same patient. (h) Histology of RCC. CD34-positive, non-cancerous cells (brown) in the primary RCC tissue sample. Nuclei are stained blue. The scale bar denotes 100 m.
Frequency
VHL p.E189*
3p
Seven subclonal mutations
Normal
Time Diagnosis
Figure 5 | Possible evolutionary history of tumour 1. A point mutation in VHL occurred at early stages with a subsequent loss of chromosome arm 3p, each followed by a selective sweep. At the time of diagnosis, seven subclonal mutations are observable in the VHL and TP53 genes.
alternative is Fishers method30, which is based on the product of the two P-values Qi,b = P0,i,bP1,i,b, the negative logarithm of which then follows c22-distribution (Fig. 1h).
The algorithm tests in total N4 genomic sites, where N denotes the length of the sequence and 4 equals the size of the alphabet minus 1, as the consensus base is excluded from the test. The combined P-values are thus corrected for multiple testing by either the method of Bonferroni or BenjaminiHochberg31 for a control of the FWER or the FDR, respectively. To avoid false positives arising from bad nucleotides, the algorithm can be adjusted to only consider calls above a Phred threshold, which was set to 25.
Detection of LOH and tumour content from SNP frequencies. LOH skews the allele-frequency ratios of heterozygous SNPs in tumour samples, which are typically a mixture of tumour and normal cells. Suppose there exists a heterozygous SNP with alleles A and a in the sample. Ideally, the ratio of A and a alleles would be r=fA/fa = 1. If the tumour population has lost allele a, then the frequency of A to a
alleles changes to r=(1 ) 1, where is the fraction of tumour cells. In the case of aneuploidy of degree n in allele a, the fraction of cells with LOH, that is, tumour cells, can be estimated as =[r 1]/[r + (n 1)].
In the presence of sequencing bias, the observed ratio of allele A over a is altered. If, for a heterozygous SNP, the true ratio is known to be one, then the bias can be estimated from a control experiment as the inverse allele ratio 1/r0.
Thus, the corrected tumour fraction is =[r/r0 1]/[r/r0 + (n 1)]. For a simple LOH (n=1), the corrected tumour fraction is =1 r0/r.
Experimental test data. Six 1.5 kb variants of the HIV pol gene were cloned, sequenced with Sanger sequencing, and mixed at frequencies 10 5, 10 4, 10 3, 10 2, 10 1 and 0.89999, respectively. A pure sample of the majority clone served as the control. Both samples were additionally amplied by 25 cycles of PCR. The resulting four samples were fragmented, adaptor-ligated and sequenced with barcodes in a single lane of an Illumina GAIIx sequencer. The resulting reads were aligned to the HXB2 reference with novoalign version 2.07.10 (www.novocra.com).
Comparison of methods. The performance of deepSNV on the test data was compared with VarScan 2.2.5 (ref. 17), CRISP,18 v5 and vipR19 0.0.11. For each algorithm, the minimal base quality was set to 25, and only variants from both strands were accepted. The minimal variant frequency was set to 1/10,000 (VarScan) and the poolsize was set to 10,000 (CRISP, vipR). ROC curves and the area under the ROC curve were computed for each variant frequency with the R package ROCR32. See Supplementary Methods for a detailed description of the chosen options.
Power calculations. The power of the deepSNV algorithm was assessed as a function of sequencing depth, genome size and minimal Phred nucleotide quality. For coverage smaller than the observed, the power of deepSNV was computed by sub-sampling without replacement from the actual nucleotide counts. For higher coverage, error rates s,i,b were drawn independently for each genomic locus from a Dirichlet distribution trained across all observed sites. The nucleotide counts
Xs,i,b and Ys,i,b were sampled from multinomial distributions with mean coverage of the test and control experiments, respectively.
To quantify the loss of power introduced by BenjaminiHochberg multiple-testing correction, we sampled the distribution of P-values 20 times, corrected each sample for a given number of tests, and averaged the results. For the Bonferroni method, no sampling was performed; instead the P-values were directly adjusted to the number of tests imposed by a given genome size. The eect of the Phred quality cuto was measured by varying the threshold at increments of 5 from 0 to 35 on the actual data and computing the power for each threshold.
RCC samples. This study was approved by the local commission of ethics (reference number StV 38-2005). Four fresh-frozen samples, including normal tissue, from a single metastatic RCC patient and matched tumour-normal samples from three other RCC patients were analysed. Approximately 50 g of genomic DNA
NATURE COMMUNICATIONS | 3:811 | DOI: 10.1038/ncomms1814 | www.nature.com/naturecommunications
2012 Macmillan Publishers Limited. All rights reserved.
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms1814
was isolated from each sample and selected loci were amplied with 33 cycles of PCR using a total of at least 100 ng genomic DNA as template. The amplica were pooled according to their length, fragmented, adaptor-ligated and sequenced on separate lanes of an Illumina GAIIx sequencer (multiple lesions case) with 76 bp single-end reads or on a single lane of an HiSeq2000 sequencer with barcoded adaptors and 36 bp single-end reads. Reads were aligned to the UCSC hg18 human reference (multiple lesions) or the UCSC hg19 reference with novoalign 2.07.10 (Supplementary Methods).
Subclonal variant validation. Eight subclonal SNVs were selected for validation on a Roche GS Junior sequencer. A total of four PCR amplicons approximately 300 bp long were extracted from 100 ng template DNA with primers containing sequencing adaptors. For TP53 exon 7 and VHL exon 2, the corresponding amplicon used for Illumina sequencing served as PCR template, whereas in the other two cases primary tumour DNA was used. Reads were aligned using Mosaik (http://bioinformatics.bc.edu/marthlab/Mosaik) to the hg19 human genome.
References
1. Nowell, P. C. The clonal evolution of tumor cell populations. Science 194, 2328 (1976).
2. Merlo, L. M. F., Pepper, J. W., Reid, B. J. & Maley, C. C. Cancer as an evolutionary and ecological process. Nat.Rev.Cancer 6, 924935 (2006).
3. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719724 (2009).
4. Greenman, C. etal. Patterns of somatic mutation in human cancer genomes. Nature 446, 153158 (2007).
5. Metzker, M. L. Sequencing technologies - the next generation. Nat.Rev.Genet. 11, 3146 (2010).
6. Zagordi, O., Klein, R., Dumer, M. & Beerenwinkel, N. Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies. NucleicAcidsRes. 38, 74007409 (2010).
7. Flaherty, P. etal. Ultrasensitive detection of rare mutations using next-generation targeted resequencing. NucleicAcidsRes 40, e2 (2012).
8. Barrick, J. E. & Lenski, R. E. Genome-wide mutational diversity in an evolving population of Escherichiacoli. ColdSpringHarb.Symp.Quant.Biol. 74, 119129 (2009).
9. Campbell, P. J. etal. Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing. Proc.NatlAcad.Sci.USA 105, 1308113086 (2008).
10. Shah, S. P. etal. Mutational evolution in a lobular breast tumour proled at single nucleotide resolution. Nature 461, 809813 (2009).
11. Ding, L. etal. Genome remodelling in a basal-like breast cancer metastasis and xenogra. Nature 464, 9991005 (2010).
12. Druley, T. E. etal. Quantication of rare allelic variants from pooled genomic DNA. Nat.Methods 6, 263265 (2009).
13. Bansal, V. etal. Accurate detection and genotyping of SNPs utilizing population sequencing data. GenomeRes. 20, 537545 (2010).
14. Chapman, M. A. etal. Initial genome sequencing and analysis of multiple myeloma. Nature 471, 467472 (2011).
15. Varela, I. etal. Exome sequencing identies frequent mutation of the SWI/SNF complex gene PBRM1 in renal carcinoma. Nature 469, 539542 (2011).
16. Hastie, T., Tibshirani, R. & Friedman, J. TheElementsofStatisticalLearning, SecondEdition:DataMining,Inference,andPrediction 2nd edn (Springer, 2009).
17. Koboldt, D. C. etal. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25, 22832285 (2009).
18. Bansal, V. A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics 26, i318324 (2010).
19. Altmann, A. etal. vipR: variant identication in pooled DNA using R. Bioinformatics 27, i77i84 (2011).
20. Dalgliesh, G. L. etal. Systematic sequencing of renal carcinoma reveals inactivation of histone modifying genes. Nature 463, 360363 (2010).
21. Kinde, I., Wu, J., Papadopoulos, N., Kinzler, K. W. & Vogelstein, B. Detection and quantication of rare mutations with massively parallel sequencing. Proc. NatlAcad.Sci.USA 108, 95309535 (2011).
22. Freeman, J. D., Warren, R. L., Webb, J. R., Nelson, B. H. & Holt, R. A. Proling the T-cell receptor beta-chain repertoire by massively parallel sequencing. GenomeRes. 19, 18171824 (2009).
23. Lindhurst, M. J. etal. A mosaic activating mutation in AKT1 associated with the Proteus syndrome. N.Engl.J.Med. 365, 611619 (2011).
24. Navin, N. etal. Tumour evolution inferred by single-cell sequencing. Nature 472, 9094 (2011).
25. Campbell, P. J. etal. The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature 467, 11091113 (2010).
26. Yachida, S. etal. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature 467, 11141117 (2010).
27. Snuderl, M. etal. Mosaic amplication of multiple receptor tyrosine kinase genes in glioblastoma. CancerCell 20, 810817 (2011).
28. Nazarian, R. etal. Melanomas acquire resistance to B-RAF(V600E) inhibition by RTK or N-RAS upregulation. Nature 468, 973977 (2010).
29. Ene, C. I. & Fine, H. A. Many tumors in one: a daunting therapeutic prospect. CancerCell 20, 695697 (2011).
30. Elston, R. C. On Fishers method of combining P-values. BiometricalJ. 33, 339345 (1991).
31. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J.R.Stat.Soc.B(Methodological) 57, 289300 (1995).
32. Sing, T., Sander, O., Beerenwinkel, N. & Lengauer, T. ROCR: visualizing classier performance in R. Bioinformatics 21, 39403941 (2005).
Acknowledgements
We thank I. Nissen and M. Kohler (Quantitative Genomics Facility, D-BSSE, ETH, Zurich) for support in Illumina sequencing, S. Dietz for GS Junior sequencing, M. Dumer for providing HIV DNA clones, M. Storz and S. Dettwiler for isolating tumour DNA, and M. Baudis for providing SNP array data. This work was funded by SystemsX. ch under Grant No. 2009/024, evaluated by the Swiss National Science Foundation (SNF), and SNF Grant 31-135792 to H.M.
Author contributions
M.G., C.B., P.S., H.M. and N.B. designed the study. M.G., C.B. and N.B. wrote the manuscript. P.S. and H.M. reviewed and provided the tumour samples, P.W. isolated tumour material. M.G. and C.B. prepared all sequencing libraries. M.G. and N.B. developed algorithms and analysed the data. M.R. validated variants with Roche GS Junior sequencing.
Additional information
Accession codes: The sequencing data have been deposited in the European Nucleotide Archive under accession number ERP001312.
Supplementary Information accompanies this paper at http://www.nature.com/ naturecommunications
Competing nancial interests: The authors declare no competing nancial interests.
Reprints and permission information is available online at http://npg.nature.com/ reprintsandpermissions/
How to cite this article: Gerstung, M. etal. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat.Commun. 3:811 doi: 10.1038/ncomms1814 (2012).
NATURE COMMUNICATIONS | 3:811 | DOI: 10.1038/ncomms1814 | www.nature.com/naturecommunications
2012 Macmillan Publishers Limited. All rights reserved.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Copyright Nature Publishing Group May 2012
Abstract
According to the clonal evolution model, tumour growth is driven by competing subclones in somatically evolving cancer cell populations, which gives rise to genetically heterogeneous tumours. Here we present a comparative targeted deep-sequencing approach combined with a customised statistical algorithm, called deepSNV, for detecting and quantifying subclonal single-nucleotide variants in mixed populations. We show in a rigorous experimental assessment that our approach is capable of detecting variants with frequencies as low as 1/10,000 alleles. In selected genomic loci of the TP53 and VHL genes isolated from matched tumour and normal samples of four renal cell carcinoma patients, we detect 24 variants at allele frequencies ranging from 0.0002 to 0.34. Moreover, we demonstrate how the allele frequencies of known single-nucleotide polymorphisms can be exploited to detect loss of heterozygosity. Our findings demonstrate that genomic diversity is common in renal cell carcinomas and provide quantitative evidence for the clonal evolution model.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer