ARTICLE
Received 11 Jul 2014 | Accepted 15 Sep 2014 | Published 20 Nov 2014
Zheng Xia1,2, Lawrence A. Donehower3,4, Thomas A. Cooper2,5,6, Joel R. Neilson6, David A. Wheeler4,7, Eric J. Wagner8 & Wei Li1,2
Alternative polyadenylation (APA) is a pervasive mechanism in the regulation of most human genes, and its implication in diseases including cancer is only beginning to be appreciated. Since conventional APA proling has not been widely adopted, global cancer APA studies are very limited. Here we develop a novel bioinformatics algorithm (DaPars) for the de novo identication of dynamic APAs from standard RNA-seq. When applied to 358 TCGA Pan-Cancer tumour/normal pairs across seven tumour types, DaPars reveals 1,346 genes with recurrent and tumour-specic APAs. Most APA genes (91%) have shorter 30-untranslated regions (30 UTRs) in tumours that can avoid microRNA-mediated repression, including glutaminase (GLS), a key metabolic enzyme for tumour proliferation. Interestingly, selected
APA events add strong prognostic power beyond common clinical and molecular variables, suggesting their potential as novel prognostic biomarkers. Finally, our results implicate CstF64, an essential polyadenylation factor, as a master regulator of 30-UTR shortening across multiple tumour types.
1 Division of Biostatistics, Dan L Duncan Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA. 2 Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, Texas 77030, USA. 3 Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, Texas 77030, USA. 4 Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA. 5 Department of Pathology and Immunology, Baylor College of Medicine, Houston, Texas 77030, USA. 6 Department of Molecular Physiology and Biophysics, Baylor College of Medicine, Houston, Texas 77030, USA. 7 Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA. 8 Department of Biochemistry and Molecular Biology, The University of Texas Medical School at Houston, Houston, Texas 77030, USA. Correspondence and requests for materials should be addressed to W.L. (email: mailto:[email protected]
Web End [email protected] ).
NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications 1
& 2014 Macmillan Publishers Limited. All rights reserved.
DOI: 10.1038/ncomms6274
Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 30-UTR landscape across seven tumour types
ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274
The dynamic usage of messenger RNA (mRNA) 30-untranslated region (30 UTR), mediated through alternative polyadenylation (APA), plays an important role in
post-transcriptional regulation under diverse physiological and pathological conditions1,2. Approximately 70% of human genes3 are characterized by multiple polyA sites that produce distinct transcript isoforms with variable 30-UTR length and content, thereby signicantly contributing to transcriptome diversity4. The majority of APA examples utilize alternative polyA sites located within the terminal exon proximally downstream of the stop codon (tandem APA). As a result, while the protein-coding sequence is unaltered, the regulatory elements in the distal 30 UTR that might reduce mRNA stability or impair translation efciency can be removed, including AU-rich elements5 and microRNA (miRNA)-binding sites6. A small percentage of APA sites can be located within internal introns/exons (splicing APA) and are coupled with alternative splicing to produce mRNA isoforms encoding distinct proteins. A well-documented example occurs during B-cell differentiation, where IgM switches from a membrane-bound form to a secreted form using a proximal polyA site instead of a distal one7. More recent studies8 have shed light on the importance of APA in human diseases such as cancer, but its clinical signicance to tumorigenesis is only beginning to be appreciated. Both proliferating cells2,9 and transformed cells10 have been shown to favour expression of shortened 30 UTRs through APA, leading to activation of several proto-oncogenes, such as cyclin D1 (ref. 8). Collectively, these observations imply that truncation of the 30 UTRs may serve as prognostic biomarkers10,11. While compelling, these studies were highly limited to either a limited number of genes or a small sample size. It remains to be determined to what extent APA occurs in cancer patients, what level of clinical utility APA may have and the molecular mechanisms and functional consequences of APA during tumorigenesis across multiple tumour types.
RNA-seq has become a routine protocol for gene expression analysis; however, methods to quantify relative APA usage are still under development. Previous global APA studies use microarrays2,12, which are limited by the dependence on annotated polyA databases as well as inherent technical biases such as cross-hybridization. Recent APA protocols use polyA junction sites enrichment followed by high-throughput sequencing (PolyA-seq)13,14. These PolyA-seq protocols, although powerful in providing the precise locations of polyA sites, are hampered by technical issues, such as internal priming artefacts, and thus have not been widely adopted by the cancer community. In contrast, RNA-seq has been widely employed in almost every large-scale genomics project, including The Cancer Genome Atlas (TCGA). However, very few RNA-seq reads contain polyA tails, challenging our ability to identify APA events. For example, an ultra-deep sequencing study15 only identied B40 thousand putative polyA reads (B0.003%) from1.2 billion total RNA-seq reads. Moreover, although the popular RNA-seq tool MISO16 can detect annotated alternative tandem 30
UTRs, it cannot identify any novel APA events beyond polyA databases. Finally, the short 30 UTRs are often embedded within the long ones, and thus the isoforms with short 30 UTRs are commonly overlooked by transcript assembly tools, such as Cufinks17. Despite these inherent limitations, we hypothesize that any major changes in APA usage between different conditions will result in localized changes in RNA-seq density near the 30-end of mRNA. And this localized RNA-density change can be readily detected through single-nucleotide resolution RNA-seq analysis. We therefore developed a novel bioinformatics algorithm, Dynamic analyses of Alternative PolyAdenylation from RNA-Seq (DaPars), to directly infer
dynamic APA events through the comparison of standard RNA-seq data between different conditions.
TCGA has characterized a comprehensive list of genomic, epigenomic and transcriptomic features in thousands of tumour samples; however, it lacks a PolyA-seq platform for APA analysis. To ll this knowledge gap, we used DaPars to retrospectively analyse the existing RNA-seq data of tumours and matched normal tissues derived from 358 patients across 7 tumour types. We discover 1,346 genes with highly recurrent tumour-specic dynamic APA events, demonstrate the additional prognostic power of APA beyond common clinical and molecular variables and expand our knowledge of the mechanisms and consequences of APA regulation during tumorigenesis.
ResultsDaPars identies dynamic APA events. DaPars performs de novo identication and quantication of dynamic APA events between tumour and matched normal tissues, regardless of any prior APA annotation. For a given transcript, DaPars rst identies the de novo distal polyA site based on a continuous RNA-seq signal independent of the gene model (Fig. 1a; Supplementary Fig. 1a,b). Assuming there is an alternative de novo proximal polyA site, DaPars models the normalized single-nucleotide resolution RNA-seq-read densities of both tumour and normal as a linear combination of both proximal and distal polyA sites. DaPars then uses a linear regression model to identify the location of the de novo proximal polyA site as an optimal tting point (vertical arrow in Fig. 1a) that can best explain the localized read-density change. Furthermore, this regression model is extended towards internal exons, so that splicing-coupled APA events can also be detected. Finally, the degree of difference in APA usage between tumour and normal can be quantied as a change in Percentage of Distal polyA site Usage Index (DPDUI), which is capable of identifying lengthening (positive index) or shortening (negative index) of 30 UTRs. The dynamic APA events with statistically signicant DPDUI between tumour and normal will be reported. The DaPars algorithm is described in further detail in the Methods. One example of an identied dynamic APA event is given for the TMEM237 gene (Fig. 1b), where the shorter 30
UTR predominates in both breast (breast invasive carcinoma (BRCA)) and lung (lung squamous cell carcinoma (LUSC)) tumours compared with matched normal tissues. Another example is LRRFIP1 (Fig. 1c), where the distal 30 UTR is nearly absent in both breast and lung tumours.
DaPars evaluation using simulated and experimental APA data. To assess the performance of DaPars, we conducted a series of proof-of-principle experiments. First, we used simulated RNA-seq data with predened APA events to evaluate DaPars as a function of sequencing coverage. We simulated 1,000 genes in tumour and normal at different levels of sequencing coverage (reads per base gene model). For each gene, we simulated two isoforms with long and short 30 UTRs (3,000 and 1,500 bp), respectively. The relative proportion of these two isoforms is randomly generated, so that the DPDUI between tumour and normal for each gene is a random number ranging from 1 to 1.
According to these gene models and expression levels, we used Flux Simulator18 to generate 50-bp paired-end RNA-seq reads with a 150-bp fragment length, taking into account typical technical biases observed in RNA-seq. The simulated RNA-seq reads were used as the input for DaPars analysis, while the short/ long isoforms and the DPDUI values were hidden variables to be determined by DaPars. As a criterion for accuracy, the DaPars dynamic APA prediction is considered to be correct if the predicted de novo APA is within 50-bp distance of the bona de
2 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications
& 2014 Macmillan Publishers Limited. All rights reserved.
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE
2 kb
hg19
780
0
BRCA LUSC
Matched
normal
Tumour
Matched normal
Tumour
285
1 501
1 140
1 185
1
Total 10kb
Normal
Tumor
999
0
Identified from RNA-seq
TMEM237
Proximal APA Distal APA
Possible
3 UTRs
10 kb
hg19
410
1 890
1 615
1 894
1
Matched
Normal
Tumour
Regression model
Fitted value
Concatenated exons + 10 kb 3 UTR extension
4,000
1,000
BRCA LUSC
Matched
Normal
Tumour
0 500 1,000 1,500
LRRFIP1
Proximal Distal
100%
Prediction accuracy under differenct mean coverage
Percentage of recovered APAs
20%
90%
1 kb hg19
152
0
146
0 550
0
541
0
16.24
010.25
0
27.21
0
31.74
MAQC brain1
MAQC brain2
UBE2I
80%
70%
60%
RNA-seq PolyA-seq
50%
40%
30%
MAQC
UHR1
MAQC
UHR2
MAQC UHR1
DaPars Prediction
10
30
50
70
90
110
130
150
170
190
Mean coverage (reads per base)
PolyA-seq
DaPars analysis of RNA-seq
MAQC brain1
MAQC brain2
MAQC UHR2
RefSeq genes
222
127 150
Cufflinks
Figure 1 | Overview of the DaPars algorithm and its performance evaluation. (a) Diagram depicts the DaPars algorithm for the identication of dynamic APA between tumour and normal samples. The top panel shows RNA-seq coverage on exons with 10 kb extension without any prior knowledge of APA sites. The distal APA site is inferred directly from the combined RNA-seq data of tumour and normal tissues (middle panels). The y axis of the bottom panel is the tted value of our regression model and the locus with the minimum tted value (red point below vertical arrow) corresponds to the predicted proximal APA site (red horizontal bar). (b) An example of DaPars identied dynamic APA from the TCGA RNA-seq data. The shorter 30 UTR of
TMEM237 is preferred in BRCA and LUSC tumours. (c) Another example of dynamic APA, here the distal APA of LRRFIP1 is nearly absent in both BRCA and LUSC tumours while the proximal APA is unchanged. (d) A simulation study to demonstrate DaPars performance. The percentage of recovered APA events is plotted against different sequencing coverage. The quantile box shows the variation of DaPars prediction based on 1,000 simulated events. The black line in each box is the median recovery rate. (e) An example of dynamic APA between MAQC UHR and brain detected by both DaPars analysis of RNA-seq and PolyA-seq. The three bottom tracks are the RefSeq gene structure, Cufinks prediction and DaPars prediction. (f) Venn diagram comparison between PolyA-seq and DaPars analysis of RNA-seq based on the same MAQC UHR and brain samples.
NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications 3
& 2014 Macmillan Publishers Limited. All rights reserved.
ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274
polyA site, and the predicted DPDUI is within 0.05 from the predetermined DPDUI. The nal prediction accuracy (percentage of recovered APAs) is plotted as a function of the different coverage levels (Fig. 1d). Using genes with a single isoform as negative controls, we also reported receiver-operating characteristic curves at different coverage levels with areas under receiver-operating characteristic curves (AUCs) ranging from 0.762 to 0.985 (Supplementary Fig. 2). Our results indicate that dynamic APA events can be readily identied across a very broad range of coverage levels. Importantly, we determined that a sequencing coverage of 30-fold can achieve 470% accuracy and close to 0.9
AUC in dynamic APA detection. Therefore, we ltered out genes with o30-fold coverage for all further analysis.
As an additional proof-of-principle, we directly compared APA events detected by DaPars with that of PolyA-seq. To achieve this, we used the RNA-seq data19 and PolyA-seq data3 based on the same Human Brain Reference and the Universal Human Reference (UHR) MAQC samples20. For PolyA-seq, the differentially altered 30-UTR usage was identied as described in Methods. From the comparison between brain and UHR, we found that B60% (P valueo2.2e 16; Fishers exact test) of 372
DaPars predicted APA events could be strongly supported by PolyA-seq (Fig. 1e,f). Both PolyA-seq and DaPars reported longer 30 UTRs in brain than in UHR in 494% dynamic APA events, which is consistent with recent reports that brain tissues normally have the longest 30 UTRs21,22. Close inspection of the raw data indicates that the non-overlapping dynamic APA events can be partially explained by the individual assay limitations. For example, PolyA-seq is designed to amplify polyA tags; therefore, some dynamic APA events reported by PolyA-seq may have a small magnitude of changes that are not readily detectable by RNA-seq (Supplementary Fig. 1c). Meanwhile, probably due to additional steps such as fractionation, PolyA-seq may also fail to detect dynamic APAs that are clearly observed by RNA-seq (Supplementary Fig. 1d). Together, we conclude that DaPars can reliably detect dynamic APA events between different conditions using standard RNA-seq.
Broad and recurrent shortening of 30 UTRs across tumour types. Since TCGA lacks a PolyA-seq platform for APA analysis, we sought to ll this knowledge gap through DaPars retrospective analysis of existing TCGA RNA-seq data, which were originally sequenced for gene expression. We focused our analysis on seven tumour types that have 410 tumour/normal pairs, including bladder urothelial carcinoma (BLCA), head and neck squamous cell carcinoma, LUSC, lung adenocarcinoma (LUAD), BRCA, kidney renal clear cell carcinoma (KIRC) and uterine corpus endometrioid carcinoma (UCEC) (Supplementary Table 1). TCGA RNA-seq data are of high quality with a mean coverage of around 50-fold, which corresponds to 80% accuracy for DaPars APA analysis based on our simulation study (Fig. 1d). For each tumour type, we identied 224744 genes with statistically signicant and recurrent (occurrence rate 420%) dynamic APA events during tumorigenesis, leading to a total of 1,346 non-redundant events across 7 tumour types (Fig. 2a; Supplementary Fig. 3a; Supplementary Data 1). As a negative control, we did not observe any recurrent APA events between different batches of normal tissues of the same tumour type, indicating that the 1,346 DaPars reported tumour-specic APA events are not likely due to technical artefacts, such as sequencing bias or batch effect. Overall, lung (LUSC and LUAD), uterine (UCEC), breast (BRCA) and bladder (BLCA) cancers possess the highest amount dynamic APA events than the other tumour types (Fig. 2a; Supplementary Fig. 3a,b). Furthermore, 55% of the 1,346 dynamic APA events occur in at least 2 tumour types (Supplementary Fig. 3c),
indicating potential concerted mechanisms in APA regulation across tumour types. Strikingly, the majority (6198%) of APA events have shorter 30 UTRs in tumours (Fig. 2a; Supplementary
Fig. 3a), which is consistent with previous reports that transformed cells preferentially express mRNAs with shortened 30
UTRs8.
Multiple lines of evidence indicate that DaPars reported de novo APA events are indeed regulated through APA. First, 51% of DaPars predictions are within 50 bp of the annotated APAs compiled from Refseq, ENSEMBL, UCSC gene models and polyA database23. There is an approximately sixfold enrichment of annotated APAs in our DaPars predictions compared with random controls (Fig. 2b). Second, in the upstream ( 50 nt) of our de nov APA sites, canonical polyA
signal AATAAA can be successfully identied by MEME motif enrichment analysis24 (Fig. 2c). In addition, AATAAA and ATTAAA are the most prevalent motifs among variants25 of polyA signals (Supplementary Fig. 4)4. By comparing 50 bp anking sequences of the distal and proximal polyA sites of the 30-UTR shortening events, DREME26 discriminative motif discovery algorithm reported that AATAAA motif is signicantly stronger in distal polyA sites (Supplementary Fig. 5), suggesting the molecular basis for differential polyA site selection27. Furthermore, the canonical polyA signal can also be identied (Supplementary Figs 6 and 7) on those de novo APA sites that do not coincide with previous annotation. As expected, the de novo DaPars analysis enables us to detect novel APAs that are not annotated in any database. For example, we found a potential novel proximal APA site in AGPS that is signicantly upregulated in LUSC tumour (Fig. 2d). Together, we conclude that DaPars reliably identied a comprehensive list of novel and existing APA target genes across seven TCGA tumour types, and the preferential shortening of 30 UTR is a major layer of transcriptomic dynamics during tumorigenesis.
APA events remain far from complete. To explore to what extent the discovered 1,346 APA events have reached saturation, we performed down-sampling saturation analysis. We repeated DaPars analysis (occurrence rate 420%) on random subsets of samples of various smaller sizes. Saturation is expected to occur when increasing sample size fails to discover additional APA events. The results indicate that the number of APA events increases steadily with increasing sample size in total (Fig. 2e), sample size per tumour type (Supplementary Fig. 3d) and the number of tumour types studied (Fig. 2f). This suggests that APA events derived from 358 samples across 7 tumour types remain far from complete. DaPars analysis on a larger sample size or more tumour types is likely to reveal many more novel APA events. This prediction is consistent with a recent report demonstrating that cancer genome sequencing normally requires thousands of samples per tumour type to approach saturation28. This observation also highlights the need for de novo discovery of APA, since any prior annotation-based detection methods are likely to miss a signicant portion of novel APA events from tumour samples.
Genes with shorter 30 UTRs are prone to be upregulated. The current model predicts that 30-UTR shortening through APA during tumorigenesis may upregulate its parental gene by escaping miRNA repression. To test this hypothesis, we calculated the numbers of miRNA-binding sites lost due to 30-UTR shortening in tumours (Fig. 3a). Using this approach, we determined that B67% genes with shorter 30 UTRs in tumours have lost at least 1 predicted miRNA-binding site (Fig. 3a). Furthermore, when compared with all the genes of sufcient sequencing
4 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications
& 2014 Macmillan Publishers Limited. All rights reserved.
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE
571
0 Percentage of patients with lengthening (%)
# Shortening# Lengthening
# Events
P < 2.2e16
60%
30%
Overlapping with annotated PASs
51.3%
Percentage of patients with shortening (%)
UCEC
LUSC
BLCA
LUAD
BRCA
HNSC
KIRC
50 0 40 60
30
0
0 DaPars prediction
8.7%
Random
T
AE-value: 1.8e135
2
A
A
Bits
1
T
T
0
1
2
3
4
5
6
2 kb hg19
LUSC normal
LUSC tumour
Percentage of total events
100%
80%
60%
40%
20%
0
Percentage of total events
100%
80%
60%
40%
20%
DaPars prediction
AGPS
AGPS
AGPS
Hs.516543.1.25 Hs.516543.1.26
DaPars predicted polyA site
UCSC
RefSeq genes
Reported polyA sites from polyA_DB
Ensembl gene
0 50 100 150 200 250 300 350 # Samples
1 2 3 4 5 6 7 # Tumour types
Figure 2 | Broad shortening of 30 UTRs across seven TCGA tumour types. (a) The central heatmap shows genes (rows) undergoing 30-UTR shortening (blue) or lengthening (red) in each of the 358 tumours (columns) compared with matched normal tissues across seven tumour types. The upper histogram shows the number of APA events per tumour. The side histograms show the percentage of tumours with 30-UTR shortening (left) or lengthening (right) for each APA gene. (b) Bar plots show the percentages of DaPars-predicted APAs and randomly selected APAs from 30-UTR regions overlapping with annotated APAs from four databases (Refseq, UCSC, ENSEMBL and PolyA_DB). The P value was calculated by t-test using 50 bootstrapping
of data. (c) MEME identies the canonical polyA motif AATAAA with a very signicant E-value (1.8e 135) from the upstream ( 50 bp) of the
proximal polyA sites predicted by DaPars. (d) An example of DaPars-predicted novel polyA site (red bar) in a LUSC tumour that is far away from any annotated polyA sites. (e) Saturation analysis of APA events by adding more samples. Each point is a random subset of samples of various smaller sizes. All the points were tted by a smoothed read line. (f) Saturation analysis by adding more tumour types. Each grey line represents a random orderingof seven tumour types and the red curve is the tting line. The percentage of dynamic APA events increased with the number of tumour types.
coverage, those genes with shorter 30 UTRs in tumours have overall greater miRNA-binding site density in their gene models (P value 1.8e 11, t-test; Fig. 3b). These data imply that APA
regulation tends to maximize the miRNA-binding loss through preferentially shortening those 30 UTRs already heavily regulated by miRNA. To examine the consequences of 30 UTR and miRNA-binding loss, we compared the gene expression between tumours and matched normal tissues. As expected, those genes with shorter 30 UTRs in tumours tend to be more upregulated in tumours (Fig. 3c). In conclusion, our data strongly support the hypothesis that many genes are upregulated during tumorigenesis
by shortening their 30 UTRs to escape post-transcriptional miRNA repression.
APA events add prognostic power beyond common covariates. Very little is known of the clinical implications of the dynamic 30
UTRs in cancer patients. To address this issue, we used a standard Cox proportional hazards model29 for the correlation between patient overall survival and multiple clinical and molecular covariates. Here we only used BRCA, LUSC and KIRC due to high mortality rate and large sample size. We rst used common
NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications 5
& 2014 Macmillan Publishers Limited. All rights reserved.
ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274
400
8
300
P-value 1.8e11
67%
# Events
# miRNA binding per kb
6
200
100
4
0
0 5 10 15 20 25 30 35 40
# Lost miRNA targets
2
0
Gene expression:log2 (tumour/normal)
All genes Genes with shorter3 UTRs in 7 tumour types
4
2
0
2
868
Up
1,000
1,000
0
4
122
Dn
Shortening
Lengthening
15 Up
44 Dn
0
0.6 0.4 0.2 0.2 0.4 0.6
0
Percentage of distal 3 UTR usage difference (PDUI) Tumour normal
Figure 3 | Genes with shorter 30 UTRs in tumours are prone to be upregulated. (a) Number of genes losing miRNA-binding sites due to the shortening of their 30 UTRs. Here we selected miRNA-bindings sites predicted by both TargetScanHuman V6.2 (refs 52,53) and miRanda54, as a more conservative list of miRNA targets. Number in the bracket represents the percentage of genes losing at least 1 miRNA-binding site. (b) Genes with shorter 30 UTRs in tumours have greater miRNA-binding-site density in the 30-UTR region than all RefSeq genes. We used RefSeq gene models for all the calculations regardless of the APA detection. The y axis is the number of miRNA-binding sites normalized by 30-UTR length (per kb). The P value was calculated by a t-test. (c) For genes with shorter 30 UTRs in tumours, their fold-change expression between tumours and normal tissues are plotted against their DPDUI values. All isoforms of the same gene were combined for the expression measurement. The genes signicantly up- or downregulated in tumours are shown in red and blue, respectively, which were identied by paired t-test with BenjaminiHochberg (BH) false-discovery rate at 5%. Accordingly, the red and blue bar plots indicate the number of up- and downregulated genes, respectively.
clinical covariates including only tumour stage, age, gender (excluding breast cancer) and smoking status (lung cancer only) to generate low- and high-risk groups, which are visualized by KaplanMeier plots and compared by the log-rank test (Fig. 4a). We next used the same Cox regression model integrated with LASSO to select the APA (DPDUI) events besides clinical covariates that can best separate risk groups. With clinical covariates always included, we used leave-one-out cross-validation (CV) to select the optimal 13 APA events (Supplementary Table 2) to constitute new APA-clinical Cox regression models (Fig. 4d), which have much more signicant P values in the risk group comparison. To quantify the added prognostic power of APA events, we used a likelihood-ratio test (LRT) to compare the new APA-clinical models with the clinical only models. The LRT results (Fig. 4e) clearly demonstrate a strong additional prognostic power of APA events beyond clinical covariates. Among these six APA covariates, signicant worse survival is associated with 30-UTR shortening of three genes (SYNCRIP in BRCA; TMCO7 and PLXDC2 in KIRC) and 30-UTR lengthening of two genes (ATP5S in BRCA; RAB23 in LUSC) (Supplementary Table 2). This result strongly suggests that, depending on the tumour types or genes studied, either lengthening or shortening of 30 UTRs may be associated with poor clinical outcome. Since our CV procedure only selects the optimal APA events, it is highly likely that even more APA events
can be associated with patient survival. Furthermore, we combined clinical covariates with tumour mRNA expression (mRNA-clinical) and tumour-vs-normal gene expression fold-change (mRNA-FC-clinical model) of the same APA genes (Supplementary Table 2) as two additional Cox regression models and repeated the same analyses. Compared with the APA-clinical model, both mRNA-clinical and mRNA-FC-clinical models provide much less additional prognostic power (Fig. 4e), less-signicant log-rank P values in risk group comparison (Fig. 4bd). Finally, we show that the separated high- and low-risk groups by APA-clinical models have no correlation with the TCGA Pancan12 signicantly mutated gene (http://dx.doi.org/doi:10.7303/syn1750331
Web End =doi:10.7303/ http://dx.doi.org/doi:10.7303/syn1750331
Web End =syn1750331 ) (Fig. 4f). Together, APA events provide additional power in survival prediction beyond clinical covariates, and independent of commonly used molecular data such as gene expression and somatic mutations.
Cancer metabolism gene GLS is regulated through APA. Ingenuity IPA and literature searches were used to characterize the pathways enriched in 1,346 dynamic APA events (Fig. 5a; Supplementary Data 2). The vast majority of enriched pathways are cancer related, such as ERK/MAP signalling and glutamine metabolism. The metabolism gene glutaminase (GLS) is of particular interest. It is well known that tumours are considerably
6 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications
& 2014 Macmillan Publishers Limited. All rights reserved.
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE
Clinical only
Clinical + mRNA
Clinical + mRNA FC Clinical + APA
BRCA BRCA BRCA BRCA
1.0
P = 0.2
n=49
n=48
1.0
0.0
P = 0.013
n=49
n=48
1.0
0.0
P = 0.022
n=49
n=48
1.0
0.0
P = 2.9e05
n=49
n=48
0.8
0.8
0.8
0.8
Survival probability
Survival probability
Survival probability
Survival probability
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.0
0 1,000 2,000 3,000 4,000Days
0 1,000 2,000 3,000 4,000Days
0 1,000 2,000 3,000 4,000
0 1,000 2,000 3,000 4,000Days
Days
LUSC LUSC
LUSC
KIRC KIRC KIRC
LUSC
1.0
Survival probability
P = 0.07
n=17
n=16
1.0
0.0
P = 0.018
n=17
n=16
1.0
0.0
P = 0.17
n=17
n=16
1.0
0.0
P = 0.00019
n=17
n=16
0.8
0.8
0.8
0.8
Survival probability
Survival probability
Survival probability
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.0
0 1,000 2,000 3,000 4,000Days
0 1,000 2,000 3,000 4,000Days
0 1,000 2,000 3,000 4,000Days
0 1,000 2,000 3,000 4,000Days
KIRC
1.0
P = 2.6e05
n=34
n=34
1.0
0.0
P = 7e05
n=34
n=34
1.0
0.0
P = 0.00012
n=34
n=34
1.0
0.0
0.8
0.8
0.8
0.8
P = 3.8e06
n=34
n=34
Survival probability
Survival probability
Survival probability
Survival probability
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.0
0 1,000 2,000 3,000Days
0 1,000 2,000 3,000 4,000Days
0 1,000 2,000 3,000 4,000Days
0 1,000 2,000 3,000 4,000Days
4,000
106
Clinical + APA
102
100
Clinical + mRNA
Clinical + mRNA fold change
BRCA LUSC KIRC
105
Additional predictive power
(P-value)
104
0.05
101
103
P-value
102
101
100
BRCA KIRC LUSC
NOTCH1
TBX3
BAP1
PHF6
ATRX
VHL
ERCC2
KRAS
RB1
SPOP
SMAD2
APC
MLL3
MLL2
HLA.A
PIK3R1
BRAF
STAG2
FLT3
MAP3K1
SETD2
TP53
NUMA1
KDM5C
SMAD4
CDK12
CDKN1B
CDKN1A
BRCA1
BRCA2
FBXW7
FGFR2
FGFR3
CTNNB1
EP300
ATM
CDKN2A
KDM6A
NRAS
NF1
CASP8
IDH1
EGFR
CDH1
KEAP1
NFE2L2
ELF3
PTEN
U2AF1
KIT
PIK3CA
TBL1XR1
PPP2R1A
CBFB
RUNX1
CHEK2
PBRM1
PDGFRA
GATA3
CCND1
CTCF
SF3B1
ARID1A
Figure 4 | Prognostic power of dynamic APA events. (ad) KaplanMeier survival plots for high- (red line) and low (blue line)-risk groups separated by clinical only (a), clinical with mRNA expression (b), clinical with tumour-vs-normal mRNA expressions fold-change (c) and clinical with dynamic APAs (d). P value was calculated using the log-rank test. (e) Additional prognostic power of APA, mRNA expression and mRNA tumour-vs-normal expression fold-change beyond clinical variables. The P value is calculated by the likelihood-ratio test. (f) No correlation between risk groups separated by APA-clinical models and mutation proles of signicantly mutated genes (SMGs). The dotted vertical line represents the P value (MannWhitneytest) cutoff of 0.05. All SMG P values are below this cutoff and thus are not signicant.
more dependent on the glycolytic pathway, regardless of oxygen availability, to supply a great deal of their energetic and biosynthetic demand for cell division. This phenomenon, termed the
Warburg effect, is a hallmark of cancer30. GLS is a key enzyme in glutaminolysis and its high expression is essential to support the cancer metabolic phenotype31. There are two major GLS isoforms
NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications 7
& 2014 Macmillan Publishers Limited. All rights reserved.
ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274
10 kb
Normal
Tumour
Normal
Tumour
Normal
Tumour
Log10(P-value)
hg19
0 1 2 3 4
Breast cancer regulation by stathmin1
LUAD
LUSC
KIRC
CCR3 signalling in eosinophils
Role of p14/p19ARF in tumor suppression
ERK/MAPK signalling
Protein ubiquitination pathway
Rac signalling
PAK signalling
Molecular mechanisms of cancer
PDGF signalling
Role of CHK in cell cycle checkpoint control
PI3K signalling in B lymphocytes
Myc-mediated apoptosis signalling
FAK signalling
Chemokine signalling
SAPK/JNK signalling
IGF1 signalling
Small-cell lung cancer signalling
IL3 signalling
GLS
GAC
RefSeq genes
EIF2 signalling
miRNA sites
KGA
Distal
HGF signalling
Proximal
Telomerase signalling
Signalling by Rho family GTPases
EGF signalling
Role of tissue factor in cancer
KIRC survival based on GAC percentage
PTEN signalling
Glutamine degradation I
1.0
Log-rank P 1.5e06
Low GAC ratio (248)
High GAC ratio (248)
Glycine biosynthesis I
Alanine biosynthesis II
Actin cytoskeleton signalling
0.8
LUAD
Normal Tumour
LUSC
Normal Tumour
KIRC
Normal Tumour
0.6
100%
80%
60%
40%
20%
0
1.3e09
3.7e18
2.1e06
GAC percentage
GAC percentage
100%
80%
60%
40%
20%
0
GAC percentage
100%
80%
60%
40%
20%
0
Survival probability
0.4
0.2
0.0
0 20 40 60 80 100 120
Months
Figure 5 | Pathway analysis. (a) Signicantly enriched (P valueo0.05; Fishers exact test) Ingenuity canonical pathways in the 1,346 dynamic APA events. (b) GLS has a signicant 30-UTR shift from KGA long isoform in normal to GAC short isoform in tumour. (c) GAC percentages are signicantlyhigher in LUAD, LUSC and KIRC tumours. The P value in each box was calculated using a paired t-test. (d) KaplanMeier survival plot of two KIRC tumour groups stratied by the GAC ratios with equal patient number in each group. P value was calculated using the log-rank test.
termed distal Kidney-type (KGA) and proximal Glutaminase C (GAC), which have distinct 30 UTRs and slightly different C termini3234 (Fig. 5b). KGA has a number of miRNA-binding sites within its 30 UTR, whereas GAC surprisingly is not predicted to have any (Fig. 5b). Furthermore, it has been shown that miR-23 represses KGA in most cells. However, in myc-transformed cells, MYC overexpression de-represses GLS through down-regulation of miR-23, resulting in glutamine-dependent growth characteristics35. Interestingly, we found a strong alternative-splicing-coupled 30-UTR shift from KGA in normal to GAC in tumour, leading to a signicantly increased percentage of GAC in LUAD, LUSC and KIRC (Fig. 5b,c). This is consistent with previous report that GAC is key to the mitochondrial glutaminase metabolism of cancer cells31. The implication of the 30 UTR switch to GAC is that the expression of GLS is no longer regulated by miR-23 or MYC. Consistently, we did not observe any signicant expression changes of miR-23 between tumours and normal tissues, although MYC is upregulated in LUSC and KIRC tumours (Supplementary Fig. 8a), suggesting that GLS potentially utilizes 30-UTR switch, rather than MYC to escape miR-23-mediated repression.
To investigate the potential clinical utility of the APA-mediated GLS isoform switch, we examined the correlation between GAC percentage and clinical survival information for KIRC tumours, using the Cox proportional hazards model with age and gender as covariates. We found that higher GAC percentage is highly correlated with worse survival (P 3.2e 13, hazard ratio 50,
95% condence interval: 17141; Fig. 5d), which is consistent with previous studies indicating that GAC is essential for cancer cell growth36. Overall, patients with high GAC ratios in KIRC have a median survival of B55 months, compared with 492 months for those with low GAC ratios. We did not nd a statistically signicant correlation between GAC percentage and survival outcome in LUSC and LUAD possibly because the GAC percentages ((0.5, 0.97) and (0.59,0.98), respectively) (Supplementary Fig. 8b) have very limited dynamic range in these two tumour types, and thus may not have enough power to stratify patients. In contrast, GAC percentage ranges from 0.05 to0.96 in KIRC (Supplementary Fig. 8b), allowing patient stratication based on a full range of GAC levels. Together, the GLS APA regulation suggests a novel and potentially MYC-independent and miRNA-independent mechanism of
8 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications
& 2014 Macmillan Publishers Limited. All rights reserved.
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE
glutaminase regulation in tumours, and GLS APA can be used to predict patient survival in KIRC.
Potential mechanisms for APA regulation during tumorigenesis. We sought to investigate the potential mechanisms governing APA dynamics in tumorigenesis. Although many details remain poorly understood, APA is thought to be regulated in cis through genetic aberrations37,38 of the underlying nascent mRNA (derived from DNA), and in trans by regulatory proteins in responding to dynamic environmental changes39. These cis-elements include canonical polyA signal AAUAAA and other auxiliary sequences such as U/GU-rich downstream elements40. The core polyadenylation trans-factors involve four multi-subunit protein complexes, CPSF (cleavage and polyadenylation specicity factor), CstF (cleavage stimulation factor), CFI and CFII (cleavage factors I and II). The chemical cleavage of premRNA process mainly employs CPSF to recognize the canonical polyA signal upstream of the cleavage site, and utilizes CstF to
bind downstream U/GU-rich elements40 mainly through the CstF64 subunit39.
To examine the role of genetic aberrations in the regulation of APA, we compared our 1,346 recurrent APA events with 64 Pancan12 Signicantly Mutated Genes (doi:10.7303/syn1750331). Surprisingly, there are only ve genes in common (Fig. 6a; P value0.48 by Fishers exact test). This result indicates that most of the dynamic APA events are probably not due to aberrations of underlying cis-elements but may be the result of aberrant expression of polyA trans-factors. To address this possibility, we investigated the gene expression of 22 important polyA trans-factors41 based on the TCGA RNA-seq data. The signicantly upand downregulated factors between tumours and matched normal tissues are indicated by yellow and blue, respectively (Fig. 6b). In general, we observed global upregulation of most polyA factors in ve tumour types (LUSC, LUAD, UCEC, BLCA and BRCA), which also have more 30-UTR-shortening events.
Therefore, we conclude that most core polyadenylation factors are expressed at higher levels in tumour types where proximal APAs
20
r 0.54; P-value:2.8e28
Dynamic APA genes
600
# 3 UTR shortening events per sample
100
# iClLIP reads per Kb in 3 UTR
P-value 3.3e21
Pancan12-mutated genes
500
LUSC UCEC BLCA LUAD BRCA HNSC KIRC
80
59 5
1,341
400
60
300
40
200
100
0
0.5 0.5
Log2(tumour/normal)
1.0
0
CPSF2
genes Other genes
2 0 2 4 6
3 UTR shortening
CPSF1
CPSF
CstF
CF Im
CF IIm
Others
homologues
CPSF3
CPSF4
CstF64 gene expression change log2(tumour/normal)
FIP1L1
CSTF3
CstF64
CSTF1
CPSF6
P-value 3.9e19
CPSF7
NUDT21
CstF64
PCF11
Genes preferring distal APAin Hela CstF64 and CstF64T KD
Genes preferring proximal APA in tumours
SYMPK
PAPOLG
PABPC1
PABPC4
CstF64 46
125
1,069
PABPN1
CSTF2T
WDR33
RBBP6
PPP1CA
KIRC
HNSC
BRCA
LUAD
BLCA
UCEC
LUSC
PPP1CB
Figure 6 | Potential mechanisms for APA regulation during tumorigenesis. (a) Only ve genes are in common between genes undergoing dynamic APA and genes signicantly mutated in Pancan12 tumour types. (b) Heatmap of gene expression fold-change of known polyadenylation factors. Each rectangle represents the mean log2 fold-change between tumour and matched normal tissues of one factor in one tumour type. A factor is considered differentially expressed if the false-discovery rate from edgeR55 is o0.05 and the mean absolute fold-change is 41.5. Yellow and blue boxes indicate the signicantly upregulated and downregulated genes, respectively. White boxes are non-signicant genes. (c) Correlation between CstF64 expression fold-change and number of 30-UTR-shortening events per sample. Each point represents a patient sample across seven tumour types. The x axis is the
CstF64 log2 fold-change between tumours and matched normal tissues. The y axis is the number of shortening events per sample. Spearmans correlation coefcient (0.54) and P value (2.8e 28) are indicated on the top. (d) Venn diagram comparison between genes preferring proximal APAs in tumour
with higher expression of CstF64 and genes preferring distal APA in Hela cells with knockdown of CstF64 and CstF64T. (e) Genes with 30-UTR shortening in tumours have more CstF64 iCLIP data derived from HeLa cells than background (P value 3.3e 21 by t-test).
NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications 9
& 2014 Macmillan Publishers Limited. All rights reserved.
ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274
are favoured. Our results are consistent with previous studies showing that 30-UTR shortening in proliferating cells is also accompanied by an increased expression of polyadenylation factors9,12,27.
We further investigated the correlation between gene expression and 30-UTR shortening for four polyadenylation factors (CPSF1, CPSF3, CstF64 and PABPC1), which are differentially expressed between tumour and normal in at least three cancer types (Fig. 6b). Among them, CstF64 has the greatest correlation between gene expression fold-change and the number of shortening events per patient in tumours (Spearmans correlation 0.54 with P value 2.8e 28, Fig. 6c), followed by CPSF3. In contrast,
CPSF1 and PABPC1 have weak correlations (Supplementary Fig. 9). This result is consistent with a recent iClip-seq study, suggesting that CstF64 is one of the top three most important factors for polyA site selection42. Also, a recent study indicated that CPSF plays an important role in recruiting CstF64 to RNAs43. Furthermore, a recent global study in HeLa cells suggests that CstF64 induces the usage of proximal APAs43. They reported 171 genes with lengthening in 30 UTRs upon knockdown of
CstF64, among which 46 genes from our analysis have shortened 30 UTRs in tumours where CstF64 is upregulated (Fig. 6d;
P value 3.9e 19 using Fishers exact test; Supplementary
Fig. 10). This signicant overlap indicates that a subset of 30-UTR-shortening events we observed in tumours can indeed be explained by the expression level of CstF64. Finally, using CstF64 iCLIP-seq in HeLa cells43, we showed that those 1,346 genes have more CstF64 bindings in their 30 UTRs than other genes (Fig. 6e).
Together, our study provides strong evidence that key polyA trans-factors, such as CstF64, are upregulated in tumorigenesis, leading to preferential 30-UTR shortening in tumours.
DiscussionWe have developed a novel bioinformatics algorithm, termed DaPars, dedicated to the de novo identication and quantication of dynamic APA events using standard RNA-seq. The accuracy of DaPars is evidenced by the fact that our de novo predicted APAs are enriched for the canonical polyA signal AATAAA and have a high degree of overlap with annotated polyA sites (Fig. 2b,c). Our extensive DaPars analysis of TCGA data sets convincingly demonstrate that any investigator(s) conducting standard RNASeq is now capable of identifying the majority of functionally important APA events in most biological systems. DaPars is not just yet another APA assay; instead, its key methodology innovation is the inference of de novo APA events from existing RNA-seq data without relying on any additional wet-bench experiments. For example, our current APA analysis was based on RNA-seq of 358 tumour/normal pairs across 7 cancer types. An analysis of this scale would be prohibitively cumbersome using any previous method, such as microarrays, EST and PolyA-seq, but was made possible now with our DaPars method.
While our paper was under review, Wang et al.44 reported a change-point model to detect 30-UTR switching using RNA-seq.
The model by Wang et al. relies on the annotated distal polyA sites to infer the proximal ones, only supports genes with two polyA sites and only supports pair-wise comparison. In contrast, our DaPars method is fully de novo, can handle multiple (42)
polyA sites and multiple (42) samples and thus is much more powerful and exible than the model by Wang et al.. Most of our analyses based on hundreds of TCGA patient samples would not be possible using the model by Wang et al.
It has been reported that shorter 30 UTRs are preferentially used by several oncogenes in cancer cell lines8, but what was not clear from this work is how pervasive and recurrent APA is in clinical samples. Lin et al.45 reported 126 30-UTR-shortening
genes in 5 tumour/normal pairs but unfortunately did not provide a supplementary table for those genes. To directly compare our results with Lin et al.45, we repeated the same analysis as described in their paper and detected a total of 120 genes with 30-UTR shortening and upregulation of the short isoform. Among them, 53% were also found in our 1,201 shortening APA genes (Supplementary Fig. 11; P value 2e 43 by Fishers exact test),
including POLR2K, the main APA gene reported by Lin et al. Two examples of consistence and inconsistence between TCGA RNA-seq and PolyA-seq from Lin et al.45 are shown in Supplementary Figs 12 and 13, respectively. In this study, we have substantially increased the number of dynamic APA events based on 358 tumour/normal pairs. To our knowledge, this is the largest global APA analysis to date, leading to a 71-fold increase in sample size compared with Lin et al.45
Several novel and signicant biological and clinical insights are noticeable from our large-scale APA analysis. First, dynamic APA events are highly tumour type and patient specic. We observe that lung, uterus, breast and bladder cancers have signicantly more APAs than head/neck and kidney cancers. Moreover, similar to other caner genomic data, there is considerable APA heterogeneity among patients within the same tumour type. Second, our saturation analysis indicates that APA events derived from 358 samples across 7 tumour types remain far from complete, highlighting the need for de novo discovery of APA, and the need for expanding DaPars analysis to more tumour types and samples when they become available. Third, selected APA events provide a surprisingly strong additional prognostic power beyond common clinical covariates and conventional molecular data, such as somatic mutation and gene expression. A recent study46 also indicated that conventional molecular data had poor prognostic power beyond clinical data. Although the exact cause is unknown, we speculate that it may reect a role for APA as a driver of tumour progression. Fourth, our study reveals a novel link between altered 30-UTR usage and cancer metabolism. We observed that the GAC isoform of the glutaminase gene (GLS), which lacks any predicted miRNA binding, is predominantly expressed in LUSC, LUAD and KIRC tumours. Therefore, this APA event would abrogate the need to attenuate miR-23 expression through MYC upregulation and result in increased Glutaminase expression and altered glutamine metabolism. Fifth, our observation of correlating CstF64 levels with increased 30-UTR shortening suggests that this factor is a potential master activator of proximal APA usage in tumor-igenesis. This hypothesis predicts that tumour cells increase CstF64 levels to promote the 30-end processing at the proximal and weaker polyA sites thereby preventing the usage of the distal polyA sites39,43. Finally, APA is likely to be regulated by many factors in a tissue-specic manner. For example, we recently reported CFIm25 (ref. 47) as a global repressor of proximal APA in brain tumour. CFIm25 has opposite function of CstF64, since its decreased expression correlates with increased 30-UTR shortening. However, CFIm25 is not a master APA regulator in the cancer types we studied here, because it is not differentially expressed between tumour and normal (NUDT21 in Fig. 6b).
Our DaPars analysis of RNA-seq reveals a comprehensive list of previously unobserved, highly recurrent and functionally important 30-UTR somatic RNA aberrations. These RNA aberrations represent an illustrative case of genomic dark matter beyond coding regions, and thus may also provide new directions for tumour gene discovery48. Although there is a lack in observed genetic aberrations within 30 UTRs of most genes undergoing
APA, caution should be taken as current TCGA mutation analyses utilize primarily exome sequencing, which excludes 30
UTR. We will revisit this issue in the future when more whole-genome sequencing data are available. Finally, although focused
10 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications
& 2014 Macmillan Publishers Limited. All rights reserved.
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE
towards the internal exons, so that splicing-coupled APA events can also be detected.
Non-uniformity correction. It has been reported that RNA-seq reads are not uniformly distributed along the gene body. DaPars provides an option to address the issue of non-uniformity by statistical modelling50. Since it is technically difcult to distinguish non-uniform distribution from dynamic APA, we decide to train our statistical model based on a subset of genes with no APA change, that is, with only one 30 UTR. We rst run DaPars to select those genes with no APA change and divide their RNA-seq gene body coverage into 100 bins, yielding an observed gene body-sequencing prole (Supplementary Fig. 1e). In the conventional DaPars, the elements of IL and IP in equation (1) are unweighted and all 1s on 30-UTR regions.
We will infer the weighted IL and IP based on the observed gene body-sequencing prole, then re-run DaPars with the weighted IL and IP to correct the nonuniformity in RNA-seq (Supplementary Fig. 1e).
Differential percentage of distal APA usage index. We used the following three criteria to detect the most signicant APA events:rst, given long 30-UTR expression level wiL and short 30-UTR expression level wiS estimated from (equation (1)), we used Fishers exact test to determine the P value of PDUI difference between tumour and matched normal tissue of the same patient, which is further adjusted by the BenjaminiHochberg procedure to control the false-discovery rate (FDR) at 5%. Second, the absolute mean difference of PDUIs of all the patients in the same tumour type must be no less than 0.2. Third, the mean fold-change of PDUIs of all the patients in the same tumour type must be no less than 1.5.
FDR 0:05
DPDUI
j j PDUI
tumor
on cancer genomics in this study, our novel DaPars framework will open the door for APA analysis in numerous biological and pathologic systems. It also underscores the power of innovative bioinformatics analyses that can derive novel biological insights from existing sequence data.
Methods
Data sets. All the RNA-seq BAM les were downloaded from the UCSC Cancer Genomics Hub (CGHub, https://cghub.ucsc.edu/
Web End =https://cghub.ucsc.edu/ ). Here we only processed BRCA, BLCA, LUSC, LUAD, head and neck squamous cell carcinoma, UCEC and KIRC cancers that have 410 tumournormal pairs (Supplementary Table 1). Gene expression and miRNA expression were downloaded from The Cancer Genome Atlas data portal (https://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm
Web End =https://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm). MAQC brain and UHR RNA-seq reads were obtained from Sequence Read Archive with accessions ERP000016 and ERP000400, respectively. For MAQC PolyA-seq, the ltered polyA sites with normalized read counts were downloaded from the UCSC browser3.
DaPars algorithm. DaPars performs de novo identication and quantication of dynamic APA events between two conditions, regardless of any prior APA annotation. DaPars identies a distal polyA site based on RNA-seq data, uses a regression model to infer the exact location of the proximal APA site after correcting the potential RNA-seq non-uniformity bias along gene body, detects statistically signicant dynamic APAs and has the potential to detect 42 dynamic
APA events.
Distal polyA site identication from RNA-seq. Given two or more RNA-seq samples, distal polyA site refers to the end point of the longest 30 UTR among all the samples, which will be used in the next step to identify the proximal polyA within this longest 30-UTR region. To identify possible distal polyA site that may locate outside of gene annotation, we extend the annotated gene 30 end by up to10 kb before reaching a neighbouring gene. RNA-seq data from all input samples will be merged to have a combined coverage along the extended gene model. To address possible uneven and discontinuous issues, we applied a 50-bp window to smooth this combined coverage. We then scan the extended 30 UTR from 50 to 30 to nd the distal polyA site whose coverage is signicantly lower (that is,opredened cutoff at 5%) than the coverage at the start of the preceding exon. A similar strategy has been successfully used to detect lengthening of 30 UTRs in the mammalian brain21. The de novo distal APA estimated directly from RNA-seq, which may not be included in gene model, will benet the downstream proximal APA identication (Supplementary Fig. 1a).
Since most current RNA-seq data sets are not strand-specic, potential overlapping of 30 UTRs from two neighbouring tail-to-tail genes from different strands may give false-positive distal polyA. So after previous distal APA analysis, if 30 UTRs of two neighbouring genes overlap, we will gradually increase the cutoffs until the two 30 UTRs are separated. In this way, we can recover the proper distal polyA, which may be overlooked by other methods such as Cufinks (Supplementary Fig. 1b). The distal polyA site identication method implemented in DaPars has very good performance. For all the predicted distal polyA sites from TCGA RNA-seq, on average 81% are within 50 bp of the annotated polyA sites.
Regression model in DaPars. For each RefSeq transcript with a distal APA estimated from previous step, we use a regression model to infer the exact location of a de novo proximal polyA site at single-nucleotide resolution, by minimizing the deviation between the observed read density and the expected read density based on the two-polyA-site model, in both tumour and matched normal samples simultaneously. This regression model solves the following optimization problem:
w1 L; w2 L; w1 S; w2 S; P arg min
w ;w ;w ;w 0;1oPoL
8 >
<
>
:
: 3
To avoid false-positive estimation on lowly expressed genes, we only included genes with 430-fold mean coverage (reads per base gene model).
More than 2 dynamic APAs. Our DaPars framework can be easily extended to address 42 dynamic APAs. We formulated the multiple APA analysis in the following matrix format,
c11 c21
c12 c22
... ...c1m 1 c2m 1 c1m c2m
2
666664
PDUI PDUI
PDUInormal
j j 0:2
log2
0:59
3
777775
2
66664
... ...0 0 1 1
0 0 0 1
1 1 1 1
0 1 1 1
... ...
3
77775m m
w11 w21
w12 w22
... ...w1m 1 w2m 1 w1m w2m
2
666664
3
777775m 2
4
where m is the length of the longest 30 UTR of a transcript. wij is the expression level of one possible 30 UTR j on sample i. The number of non-zero wij determines how many polyA sites will be derived from RNA-seq. In most cases, there are only a few wij will be non-zero. So we can solve this equation using a positive Lasso optimization method as reformulated in the following form:
arg min
W
1
2 C MW
k k22 l W
2 Ci wiLIL wiSIP jj 22 1
where wiL and wiS are the abundances of transcripts with distal and proximal polyA sites for sample i, respectively, Ci [Ci1, ,Cij, ,CiL]T is the read coverage of
sample i at single-nucleotide resolution normalized by total sequencing depth, L is the length of the longest 30 UTR from previous step, P is the length of alternative proximal 30 UTR to be estimated, IL and IP are indicator functions such that
IL 1; ; 1 |{z}
L
k k1 5
where C, M and W are corresponding to the left, middle and right matrix in equation (4), respectively. In practice, we only consider no more than 4 APAsin a real data set to reduce the complexity of model selection and avoid over-tting issues. In Supplementary Fig. 1f, we showed that our DaPars can also identify 42 APAs from RNA-seq and the predictions are highly consistent with the annotation. Although many genes have 42 annotated APAs, the majorityof dynamic APAs only involve 2 polyA sites1. Therefore in the current large-scale TCGA RNA-seq analysis, we only focus on 2 APAs in the dynamic APA detection.
PolyA-seq processing. We downloaded the processed polyA sites with normalized read counts of MAQC brain and UHR PolyA-seq data sets (two replicates for each tissue) from the UCSC Genome Browser3. We calculated the signal intensity of a given polyA site based on all the same-strand PolyA-seq reads within 50 bases of the polyA site. We then used Fishers exact test to detect the statistically signicant differential APAs between brain and UHR with a BenjaminiHochberg-adjusted FDR cutoff of 0.1 and read count difference of 410%. For a fair comparison, we also used FDR of 0.1 and 10% DPDUI for DaPars analysis of MAQC RNA-seq data derived from the same brain and UHR samples.
Survival analysis using Cox proportional hazards model. A standard Cox proportional hazards model29 implemented in the R package survival was used for patient survival and KaplanMeier plotting. Hazard ratios exceeding 1 indicate poor prognosis for patients possessing shorter 30 UTR, whereas those below 1 are associated with better outcome. The high-risk group and low-risk group were generated based on the prognostic index (PI). The PI is the linear component of the Cox model, PI Pmi1 bixi where xi is the value of covariate i and its risk
coefcient, bi was estimated from the Cox tting. The high-risk and low-risk groups were generated for survival plot by splitting the ordered PI (higher values for higher risk) with equal number of samples in each group.
.
For each given 1oPoL, the expression levels of two transcripts with distal and proximal polyA sites in both tumour and normal tissues can be estimated by optimizing this linear regression model using quadratic programming49. The optimal de novo proximal polyA site P* is the one with the minimal objective function value, as demonstrated by the vertical arrow in Fig. 1a. To quantify the relative polyA site usage, we dene the PDUI for sample i as the following:
PDUI
wi L
wi L wi S
2
and IP 1; ; 1 |{z}
P
; 0; ; 0
|{z}
L P
where wi L and wi S are the estimated expression levels of transcripts with distal and proximal polyA sites for sample i. The greater the PDUI is, the more distal polyA site of a transcript is used and vice versa. Finally, the regression model is extended
NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications 11
& 2014 Macmillan Publishers Limited. All rights reserved.
ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274
Survival analysis using Cox model and LASSO feature selection. We combined tumour-vs-normal shortening/lengthening events of APA genes (DPDUI values) with clinical covariates, such as age, gender, stage and smoking status (lung cancer), in survival analysis. We used a Cox regression model with LASSO feature selection to determine the contributions of APAs in survival prediction using the R package glmnet51. We chose the optimal APA genes based on the leave-one-out CV. Here the clinical covariates are not penalized and always selected. Finally, we used a LRT to estimate the additional prediction power of the new APA-clinical models over the clinical-only models.
Software availability. The open source DaPars program is freely available at https://code.google.com/p/dapars/
Web End =https://code.google.com/p/dapars/ . We will update this website periodically with new versions.
References
1. Elkon, R., Ugalde, A. P. & Agami, R. Alternative cleavage and polyadenylation: extent, regulation and function. Nat. Rev. Genet. 14, 496506 (2013).
2. Sandberg, R., Neilson, J. R., Sarma, A., Sharp, P. A. & Burge, C. B. Proliferating cells express mRNAs with shortened 30 untranslated regions and fewer microRNA target sites. Science 320, 16431647 (2008).
3. Derti, A. et al. A quantitative atlas of polyadenylation in ve mammals. Genome Res. 22, 11731183 (2012).
4. Tian, B., Hu, J., Zhang, H. & Lutz, C. S. A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 33, 201212 (2005).
5. Barreau, C., Paillard, L. & Osborne, H. B. AU-rich elements and associated factors: are there unifying principles? Nucleic Acids Res. 33, 71387150 (2005).
6. Jacobsen, A. et al. Analysis of microRNA-target interactions across diverse cancer types. Nat. Struct. Mol. Biol. 20, 13251332 (2013).
7. Takagaki, Y. & Manley, J. L. Levels of polyadenylation factor CstF-64 control IgM heavy chain mRNA accumulation and other events associated with B cell differentiation. Mol. Cell 2, 761771 (1998).
8. Mayr, C. & Bartel, D. P. Widespread shortening of 30 UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 138, 673684 (2009).
9. Elkon, R. et al. E2F mediates enhanced alternative polyadenylation in proliferation. Genome. Biol. 13, R59 (2012).
10. Singh, P. et al. Global changes in processing of mRNA 30 untranslated regions characterize clinically distinct cancer subtypes. Cancer Res. 69, 94229430 (2009).
11. Morris, A. R. et al. Alternative cleavage and polyadenylation during colorectal cancer development. Clin. Cancer Res. 18, 52565266 (2012).
12. Ji, Z., Lee, J. Y., Pan, Z., Jiang, B. & Tian, B. Progressive lengthening of30 untranslated regions of mRNAs by alternative polyadenylation during mouse embryonic development. Proc. Natl Acad. Sci. USA 106, 70287033 (2009).
13. Hoque, M. et al. Analysis of alternative cleavage and polyadenylation by 30 region extraction and deep sequencing. Nat. Methods 10, 133139 (2013).
14. Sun, Y., Fu, Y., Li, Y. & Xu, A. Genome-wide alternative polyadenylation in animals: insights from high-throughput technologies. J. Mol. Cell Biol. 4, 352361 (2012).
15. Pickrell, J. K. et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768772 (2010).
16. Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 10091015 (2010).
17. Trapnell, C. et al. Transcript assembly and quantication by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511515 (2010).
18. Griebel, T. et al. Modelling and simulating generic RNA-Seq experiments with the ux simulator. Nucleic Acids Res. 40, 1007310083 (2012).
19. Bullard, J. H., Purdom, E., Hansen, K. D. & Dudoit, S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11, 94 (2010).
20. Consortium, M. et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 24, 11511161 (2006).
21. Miura, P., Shenker, S., Andreu-Agullo, C., Westholm, J. O. & Lai, E. C. Widespread and extensive lengthening of 30 UTRs in the mammalian brain.
Genome Res. 23, 812825 (2013).22. Ulitsky, I. et al. Extensive alternative polyadenylation during zebrash development. Genome Res. 22, 20542066 (2012).
23. Zhang, H., Hu, J., Recce, M. & Tian, B. PolyA_DB: a database for mammalian mRNA polyadenylation. Nucleic Acids Res. 33, D116D120 (2005).
24. Bailey, T. L. et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37, W202W208 (2009).
25. Beaudoing, E., Freier, S., Wyatt, J. R., Claverie, J. M. & Gautheret, D. Patterns of variant polyadenylation signal usage in human genes. Genome Res. 10, 10011010 (2000).
26. Bailey, T. L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 27, 16531659 (2011).
27. Shepard, P. J. et al. Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq. RNA 17, 761772 (2011).
28. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495501 (2014).
29. Andersen, P. K. & Gill, R. D. Coxs regression model for counting processes: a large sample study. Ann. Statist. 11001120 (1982).
30. Hsu, P. P. & Sabatini, D. M. Cancer cell metabolism: Warburg and beyond. Cell 134, 703707 (2008).
31. Cassago, A. et al. Mitochondrial localization and structure-based phosphate activation mechanism of Glutaminase C with implications for cancer metabolism. Proc. Natl Acad. Sci. USA 109, 10921097 (2012).
32. Aledo, J. C., Gomez-Fabre, P. M., Olalla, L. & Marquez, J. Identication of two human glutaminase loci and tissue-specic expression of the two related genes. Mamm. Genome 11, 11071110 (2000).
33. de la Rosa, V. et al. A novel glutaminase isoform in mammalian tissues. Neurochem. Int. 55, 7684 (2009).
34. Elgadi, K. M., Meguid, R. A., Qian, M., Souba, W. W. & Abcouwer, S. F. Cloning and analysis of unique human glutaminase isoforms generated by tissue-specic alternative splicing. Physiol. Genomics 1, 5162 (1999).
35. Gao, P. et al. c-Myc suppression of miR-23a/b enhances mitochondrial glutaminase expression and glutamine metabolism. Nature 458, 762765 (2009).
36. van den Heuvel, A. P., Jing, J., Wooster, R. F. & Bachman, K. E. Analysis of glutamine dependency in non-small cell lung cancer: GLS1 splice variant GAC is essential for cancer cell growth. Cancer Biol. Ther. 13, 11851194 (2012).
37. Stacey, S. N. et al. A germline variant in the TP53 polyadenylation signal confers cancer susceptibility. Nat. Genet. 43, 10981103 (2011).
38. Wiestner, A. et al. Point mutations and genomic deletions in CCND1 create stable truncated cyclin D1 mRNAs that are associated with increased proliferation rate and shorter survival. Blood 109, 45994606 (2007).
39. Di Giammartino, D. C., Nishida, K. & Manley, J. L. Mechanisms and consequences of alternative polyadenylation. Mol. Cell 43, 853866 (2011).
40. Shi, Y. Alternative polyadenylation: new insights from global analyses. RNA 18, 21052117 (2012).
41. Shi, Y. et al. Molecular architecture of the human pre-mRNA 30 processing complex. Mol. Cell 33, 365376 (2009).
42. Martin, G., Gruber, A. R., Keller, W. & Zavolan, M. Genome-wide analysis of pre-mRNA 30 end processing reveals a decisive role of human cleavage factor I in the regulation of 30 UTR length. Cell Rep. 1, 753763 (2012).
43. Yao, C. et al. Transcriptome-wide analyses of CstF64-RNA interactions in global regulation of mRNA alternative polyadenylation. Proc. Natl Acad. Sci. USA 109, 1877318778 (2012).
44. Wang, W., Wei, Z. & Li, H. A change-point model for identifying 30 UTR switching by next-generation RNA sequencing. Bioinformatics 30, 21622170 (2014).
45. Lin, Y. et al. An in-depth map of polyadenylation sites in cancer. Nucleic Acids Res. 40, 84608471 (2012).
46. Hofree, M., Shen, J. P., Carter, H., Gross, A. & Ideker, T. Network-based stratication of tumor mutations. Nat. Methods 10, 11081115 (2013).
47. Masamha, C. P. et al. CFIm25 links alternative polyadenylation to glioblastoma tumour suppression. Nature 509, 412416 (2014).
48. Vogelstein, B. et al. Cancer genome landscapes. Science 339, 15461558 (2013).49. Bohnert, R. & Ratsch, G. rQuant.web: a tool for RNA-Seq-based transcript quantitation. Nucleic Acids Res. 38, W348W351 (2010).
50. Li, J., Jiang, H. & Wong, W. H. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 11, R50 (2010).
51. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 122 (2010).
52. Grimson, A. et al. MicroRNA targeting specicity in mammals: determinants beyond seed pairing. Mol. Cell 27, 91105 (2007).
53. Lewis, B. P., Burge, C. B. & Bartel, D. P. Conserved seed pairing, often anked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 1520 (2005).
54. John, B. et al. Human MicroRNA targets. PLoS Biol. 2, e363 (2004).55. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139140 (2010).
Acknowledgements
We would like to thank Benjamin Rodriguez for critical reading of this manuscript and members of the Li laboratory for helpful discussions. This work was partially supported by NIH R01HG007538, CPRIT RP110471 and DOD W81XWH-10-1-0501 (to W.L.), NIH grants CA167752 and CA166274 (to E.J.W.). We gratefully acknowledge the contributions from TCGA Research Network and its TCGA Pan-Cancer Analysis Working Group.
12 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications
& 2014 Macmillan Publishers Limited. All rights reserved.
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE
Author contributions
Z.X. and W.L. conceived the project, performed the analysis and wrote the manuscript. Z.X. developed the DaPars algorithms. Z.X., L.A.D., T.A.C., J.R.N., D.A.W., E.J.W. and W.L. interpreted the results and edited the manuscript.
Additional information
Supplementary Information accompanies this paper at http://www.nature.com/naturecommunications
Web End =http://www.nature.com/ http://www.nature.com/naturecommunications
Web End =naturecommunications
Competing nancial interests: The authors declare no competing nancial interests.
Reprints and permission information is available online at http://npg.nature.com/reprintsandpermissions/
Web End =http://npg.nature.com/ http://npg.nature.com/reprintsandpermissions/
Web End =reprintsandpermissions/
How to cite this article: Xia, Z. et al. Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 30-UTR landscape across seven tumour types. Nat. Commun.
5:5274 doi: 10.1038/ncomms6274 (2014).
NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications
Web End =www.nature.com/naturecommunications 13
& 2014 Macmillan Publishers Limited. All rights reserved.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Copyright Nature Publishing Group Nov 2014
Abstract
Alternative polyadenylation (APA) is a pervasive mechanism in the regulation of most human genes, and its implication in diseases including cancer is only beginning to be appreciated. Since conventional APA profiling has not been widely adopted, global cancer APA studies are very limited. Here we develop a novel bioinformatics algorithm (DaPars) for the de novo identification of dynamic APAs from standard RNA-seq. When applied to 358 TCGA Pan-Cancer tumour/normal pairs across seven tumour types, DaPars reveals 1,346 genes with recurrent and tumour-specific APAs. Most APA genes (91%) have shorter 3'-untranslated regions (3' UTRs) in tumours that can avoid microRNA-mediated repression, including glutaminase (GLS), a key metabolic enzyme for tumour proliferation. Interestingly, selected APA events add strong prognostic power beyond common clinical and molecular variables, suggesting their potential as novel prognostic biomarkers. Finally, our results implicate CstF64, an essential polyadenylation factor, as a master regulator of 3'-UTR shortening across multiple tumour types.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer