Full Text

Turn on search term navigation

ARTICLE

Received 11 Jul 2014 | Accepted 15 Sep 2014 | Published 20 Nov 2014

Zheng Xia1,2, Lawrence A. Donehower3,4, Thomas A. Cooper2,5,6, Joel R. Neilson6, David A. Wheeler4,7, Eric J. Wagner8 & Wei Li1,2

Alternative polyadenylation (APA) is a pervasive mechanism in the regulation of most human genes, and its implication in diseases including cancer is only beginning to be appreciated. Since conventional APA proling has not been widely adopted, global cancer APA studies are very limited. Here we develop a novel bioinformatics algorithm (DaPars) for the de novo identication of dynamic APAs from standard RNA-seq. When applied to 358 TCGA Pan-Cancer tumour/normal pairs across seven tumour types, DaPars reveals 1,346 genes with recurrent and tumour-specic APAs. Most APA genes (91%) have shorter 30-untranslated regions (30 UTRs) in tumours that can avoid microRNA-mediated repression, including glutaminase (GLS), a key metabolic enzyme for tumour proliferation. Interestingly, selected

APA events add strong prognostic power beyond common clinical and molecular variables, suggesting their potential as novel prognostic biomarkers. Finally, our results implicate CstF64, an essential polyadenylation factor, as a master regulator of 30-UTR shortening across multiple tumour types.

1 Division of Biostatistics, Dan L Duncan Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA. 2 Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, Texas 77030, USA. 3 Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, Texas 77030, USA. 4 Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA. 5 Department of Pathology and Immunology, Baylor College of Medicine, Houston, Texas 77030, USA. 6 Department of Molecular Physiology and Biophysics, Baylor College of Medicine, Houston, Texas 77030, USA. 7 Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA. 8 Department of Biochemistry and Molecular Biology, The University of Texas Medical School at Houston, Houston, Texas 77030, USA. Correspondence and requests for materials should be addressed to W.L. (email: mailto:[email protected]

Web End [email protected] ).

NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications 1

DOI: 10.1038/ncomms6274

Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 30-UTR landscape across seven tumour types

ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274

The dynamic usage of messenger RNA (mRNA) 30-untranslated region (30 UTR), mediated through alternative polyadenylation (APA), plays an important role in

post-transcriptional regulation under diverse physiological and pathological conditions1,2. Approximately 70% of human genes3 are characterized by multiple polyA sites that produce distinct transcript isoforms with variable 30-UTR length and content, thereby signicantly contributing to transcriptome diversity4. The majority of APA examples utilize alternative polyA sites located within the terminal exon proximally downstream of the stop codon (tandem APA). As a result, while the protein-coding sequence is unaltered, the regulatory elements in the distal 30 UTR that might reduce mRNA stability or impair translation efciency can be removed, including AU-rich elements5 and microRNA (miRNA)-binding sites6. A small percentage of APA sites can be located within internal introns/exons (splicing APA) and are coupled with alternative splicing to produce mRNA isoforms encoding distinct proteins. A well-documented example occurs during B-cell differentiation, where IgM switches from a membrane-bound form to a secreted form using a proximal polyA site instead of a distal one7. More recent studies8 have shed light on the importance of APA in human diseases such as cancer, but its clinical signicance to tumorigenesis is only beginning to be appreciated. Both proliferating cells2,9 and transformed cells10 have been shown to favour expression of shortened 30 UTRs through APA, leading to activation of several proto-oncogenes, such as cyclin D1 (ref. 8). Collectively, these observations imply that truncation of the 30 UTRs may serve as prognostic biomarkers10,11. While compelling, these studies were highly limited to either a limited number of genes or a small sample size. It remains to be determined to what extent APA occurs in cancer patients, what level of clinical utility APA may have and the molecular mechanisms and functional consequences of APA during tumorigenesis across multiple tumour types.

RNA-seq has become a routine protocol for gene expression analysis; however, methods to quantify relative APA usage are still under development. Previous global APA studies use microarrays2,12, which are limited by the dependence on annotated polyA databases as well as inherent technical biases such as cross-hybridization. Recent APA protocols use polyA junction sites enrichment followed by high-throughput sequencing (PolyA-seq)13,14. These PolyA-seq protocols, although powerful in providing the precise locations of polyA sites, are hampered by technical issues, such as internal priming artefacts, and thus have not been widely adopted by the cancer community. In contrast, RNA-seq has been widely employed in almost every large-scale genomics project, including The Cancer Genome Atlas (TCGA). However, very few RNA-seq reads contain polyA tails, challenging our ability to identify APA events. For example, an ultra-deep sequencing study15 only identied B40 thousand putative polyA reads (B0.003%) from1.2 billion total RNA-seq reads. Moreover, although the popular RNA-seq tool MISO16 can detect annotated alternative tandem 30

UTRs, it cannot identify any novel APA events beyond polyA databases. Finally, the short 30 UTRs are often embedded within the long ones, and thus the isoforms with short 30 UTRs are commonly overlooked by transcript assembly tools, such as Cufinks17. Despite these inherent limitations, we hypothesize that any major changes in APA usage between different conditions will result in localized changes in RNA-seq density near the 30-end of mRNA. And this localized RNA-density change can be readily detected through single-nucleotide resolution RNA-seq analysis. We therefore developed a novel bioinformatics algorithm, Dynamic analyses of Alternative PolyAdenylation from RNA-Seq (DaPars), to directly infer

dynamic APA events through the comparison of standard RNA-seq data between different conditions.

TCGA has characterized a comprehensive list of genomic, epigenomic and transcriptomic features in thousands of tumour samples; however, it lacks a PolyA-seq platform for APA analysis. To ll this knowledge gap, we used DaPars to retrospectively analyse the existing RNA-seq data of tumours and matched normal tissues derived from 358 patients across 7 tumour types. We discover 1,346 genes with highly recurrent tumour-specic dynamic APA events, demonstrate the additional prognostic power of APA beyond common clinical and molecular variables and expand our knowledge of the mechanisms and consequences of APA regulation during tumorigenesis.

ResultsDaPars identies dynamic APA events. DaPars performs de novo identication and quantication of dynamic APA events between tumour and matched normal tissues, regardless of any prior APA annotation. For a given transcript, DaPars rst identies the de novo distal polyA site based on a continuous RNA-seq signal independent of the gene model (Fig. 1a; Supplementary Fig. 1a,b). Assuming there is an alternative de novo proximal polyA site, DaPars models the normalized single-nucleotide resolution RNA-seq-read densities of both tumour and normal as a linear combination of both proximal and distal polyA sites. DaPars then uses a linear regression model to identify the location of the de novo proximal polyA site as an optimal tting point (vertical arrow in Fig. 1a) that can best explain the localized read-density change. Furthermore, this regression model is extended towards internal exons, so that splicing-coupled APA events can also be detected. Finally, the degree of difference in APA usage between tumour and normal can be quantied as a change in Percentage of Distal polyA site Usage Index (DPDUI), which is capable of identifying lengthening (positive index) or shortening (negative index) of 30 UTRs. The dynamic APA events with statistically signicant DPDUI between tumour and normal will be reported. The DaPars algorithm is described in further detail in the Methods. One example of an identied dynamic APA event is given for the TMEM237 gene (Fig. 1b), where the shorter 30

UTR predominates in both breast (breast invasive carcinoma (BRCA)) and lung (lung squamous cell carcinoma (LUSC)) tumours compared with matched normal tissues. Another example is LRRFIP1 (Fig. 1c), where the distal 30 UTR is nearly absent in both breast and lung tumours.

DaPars evaluation using simulated and experimental APA data. To assess the performance of DaPars, we conducted a series of proof-of-principle experiments. First, we used simulated RNA-seq data with predened APA events to evaluate DaPars as a function of sequencing coverage. We simulated 1,000 genes in tumour and normal at different levels of sequencing coverage (reads per base gene model). For each gene, we simulated two isoforms with long and short 30 UTRs (3,000 and 1,500 bp), respectively. The relative proportion of these two isoforms is randomly generated, so that the DPDUI between tumour and normal for each gene is a random number ranging from 1 to 1.

According to these gene models and expression levels, we used Flux Simulator18 to generate 50-bp paired-end RNA-seq reads with a 150-bp fragment length, taking into account typical technical biases observed in RNA-seq. The simulated RNA-seq reads were used as the input for DaPars analysis, while the short/ long isoforms and the DPDUI values were hidden variables to be determined by DaPars. As a criterion for accuracy, the DaPars dynamic APA prediction is considered to be correct if the predicted de novo APA is within 50-bp distance of the bona de

2 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications

NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE

2 kb

hg19

780

BRCA LUSC

Matched

normal

Tumour

Matched normal

Tumour

285

1 501

1 140

1 185

Total 10kb

Normal

Tumor

999

Identified from RNA-seq

TMEM237

Proximal APA Distal APA

Possible

3 UTRs

10 kb

hg19

410

1 890

1 615

1 894

Matched

Normal

Tumour

Regression model

Fitted value

Concatenated exons + 10 kb 3 UTR extension

4,000

1,000

BRCA LUSC

Matched

Normal

Tumour

0 500 1,000 1,500

LRRFIP1

Proximal Distal

100%

Prediction accuracy under differenct mean coverage

Percentage of recovered APAs

20%

90%

1 kb hg19

152

146

0 550

541

16.24

010.25

27.21

31.74

MAQC brain1

MAQC brain2

UBE2I

80%

70%

60%

RNA-seq PolyA-seq

50%

40%

30%

MAQC

UHR1

MAQC

UHR2

MAQC UHR1

DaPars Prediction

110

130

150

170

190

Mean coverage (reads per base)

PolyA-seq

DaPars analysis of RNA-seq

MAQC brain1

MAQC brain2

MAQC UHR2

RefSeq genes

222

127 150

Cufflinks

Figure 1 | Overview of the DaPars algorithm and its performance evaluation. (a) Diagram depicts the DaPars algorithm for the identication of dynamic APA between tumour and normal samples. The top panel shows RNA-seq coverage on exons with 10 kb extension without any prior knowledge of APA sites. The distal APA site is inferred directly from the combined RNA-seq data of tumour and normal tissues (middle panels). The y axis of the bottom panel is the tted value of our regression model and the locus with the minimum tted value (red point below vertical arrow) corresponds to the predicted proximal APA site (red horizontal bar). (b) An example of DaPars identied dynamic APA from the TCGA RNA-seq data. The shorter 30 UTR of

TMEM237 is preferred in BRCA and LUSC tumours. (c) Another example of dynamic APA, here the distal APA of LRRFIP1 is nearly absent in both BRCA and LUSC tumours while the proximal APA is unchanged. (d) A simulation study to demonstrate DaPars performance. The percentage of recovered APA events is plotted against different sequencing coverage. The quantile box shows the variation of DaPars prediction based on 1,000 simulated events. The black line in each box is the median recovery rate. (e) An example of dynamic APA between MAQC UHR and brain detected by both DaPars analysis of RNA-seq and PolyA-seq. The three bottom tracks are the RefSeq gene structure, Cufinks prediction and DaPars prediction. (f) Venn diagram comparison between PolyA-seq and DaPars analysis of RNA-seq based on the same MAQC UHR and brain samples.

NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications 3

ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274

polyA site, and the predicted DPDUI is within 0.05 from the predetermined DPDUI. The nal prediction accuracy (percentage of recovered APAs) is plotted as a function of the different coverage levels (Fig. 1d). Using genes with a single isoform as negative controls, we also reported receiver-operating characteristic curves at different coverage levels with areas under receiver-operating characteristic curves (AUCs) ranging from 0.762 to 0.985 (Supplementary Fig. 2). Our results indicate that dynamic APA events can be readily identied across a very broad range of coverage levels. Importantly, we determined that a sequencing coverage of 30-fold can achieve 470% accuracy and close to 0.9

AUC in dynamic APA detection. Therefore, we ltered out genes with o30-fold coverage for all further analysis.

As an additional proof-of-principle, we directly compared APA events detected by DaPars with that of PolyA-seq. To achieve this, we used the RNA-seq data19 and PolyA-seq data3 based on the same Human Brain Reference and the Universal Human Reference (UHR) MAQC samples20. For PolyA-seq, the differentially altered 30-UTR usage was identied as described in Methods. From the comparison between brain and UHR, we found that B60% (P valueo2.2e 16; Fishers exact test) of 372

DaPars predicted APA events could be strongly supported by PolyA-seq (Fig. 1e,f). Both PolyA-seq and DaPars reported longer 30 UTRs in brain than in UHR in 494% dynamic APA events, which is consistent with recent reports that brain tissues normally have the longest 30 UTRs21,22. Close inspection of the raw data indicates that the non-overlapping dynamic APA events can be partially explained by the individual assay limitations. For example, PolyA-seq is designed to amplify polyA tags; therefore, some dynamic APA events reported by PolyA-seq may have a small magnitude of changes that are not readily detectable by RNA-seq (Supplementary Fig. 1c). Meanwhile, probably due to additional steps such as fractionation, PolyA-seq may also fail to detect dynamic APAs that are clearly observed by RNA-seq (Supplementary Fig. 1d). Together, we conclude that DaPars can reliably detect dynamic APA events between different conditions using standard RNA-seq.

Broad and recurrent shortening of 30 UTRs across tumour types. Since TCGA lacks a PolyA-seq platform for APA analysis, we sought to ll this knowledge gap through DaPars retrospective analysis of existing TCGA RNA-seq data, which were originally sequenced for gene expression. We focused our analysis on seven tumour types that have 410 tumour/normal pairs, including bladder urothelial carcinoma (BLCA), head and neck squamous cell carcinoma, LUSC, lung adenocarcinoma (LUAD), BRCA, kidney renal clear cell carcinoma (KIRC) and uterine corpus endometrioid carcinoma (UCEC) (Supplementary Table 1). TCGA RNA-seq data are of high quality with a mean coverage of around 50-fold, which corresponds to 80% accuracy for DaPars APA analysis based on our simulation study (Fig. 1d). For each tumour type, we identied 224744 genes with statistically signicant and recurrent (occurrence rate 420%) dynamic APA events during tumorigenesis, leading to a total of 1,346 non-redundant events across 7 tumour types (Fig. 2a; Supplementary Fig. 3a; Supplementary Data 1). As a negative control, we did not observe any recurrent APA events between different batches of normal tissues of the same tumour type, indicating that the 1,346 DaPars reported tumour-specic APA events are not likely due to technical artefacts, such as sequencing bias or batch effect. Overall, lung (LUSC and LUAD), uterine (UCEC), breast (BRCA) and bladder (BLCA) cancers possess the highest amount dynamic APA events than the other tumour types (Fig. 2a; Supplementary Fig. 3a,b). Furthermore, 55% of the 1,346 dynamic APA events occur in at least 2 tumour types (Supplementary Fig. 3c),

indicating potential concerted mechanisms in APA regulation across tumour types. Strikingly, the majority (6198%) of APA events have shorter 30 UTRs in tumours (Fig. 2a; Supplementary

Fig. 3a), which is consistent with previous reports that transformed cells preferentially express mRNAs with shortened 30

UTRs8.

Multiple lines of evidence indicate that DaPars reported de novo APA events are indeed regulated through APA. First, 51% of DaPars predictions are within 50 bp of the annotated APAs compiled from Refseq, ENSEMBL, UCSC gene models and polyA database23. There is an approximately sixfold enrichment of annotated APAs in our DaPars predictions compared with random controls (Fig. 2b). Second, in the upstream ( 50 nt) of our de nov APA sites, canonical polyA

signal AATAAA can be successfully identied by MEME motif enrichment analysis24 (Fig. 2c). In addition, AATAAA and ATTAAA are the most prevalent motifs among variants25 of polyA signals (Supplementary Fig. 4)4. By comparing 50 bp anking sequences of the distal and proximal polyA sites of the 30-UTR shortening events, DREME26 discriminative motif discovery algorithm reported that AATAAA motif is signicantly stronger in distal polyA sites (Supplementary Fig. 5), suggesting the molecular basis for differential polyA site selection27. Furthermore, the canonical polyA signal can also be identied (Supplementary Figs 6 and 7) on those de novo APA sites that do not coincide with previous annotation. As expected, the de novo DaPars analysis enables us to detect novel APAs that are not annotated in any database. For example, we found a potential novel proximal APA site in AGPS that is signicantly upregulated in LUSC tumour (Fig. 2d). Together, we conclude that DaPars reliably identied a comprehensive list of novel and existing APA target genes across seven TCGA tumour types, and the preferential shortening of 30 UTR is a major layer of transcriptomic dynamics during tumorigenesis.

APA events remain far from complete. To explore to what extent the discovered 1,346 APA events have reached saturation, we performed down-sampling saturation analysis. We repeated DaPars analysis (occurrence rate 420%) on random subsets of samples of various smaller sizes. Saturation is expected to occur when increasing sample size fails to discover additional APA events. The results indicate that the number of APA events increases steadily with increasing sample size in total (Fig. 2e), sample size per tumour type (Supplementary Fig. 3d) and the number of tumour types studied (Fig. 2f). This suggests that APA events derived from 358 samples across 7 tumour types remain far from complete. DaPars analysis on a larger sample size or more tumour types is likely to reveal many more novel APA events. This prediction is consistent with a recent report demonstrating that cancer genome sequencing normally requires thousands of samples per tumour type to approach saturation28. This observation also highlights the need for de novo discovery of APA, since any prior annotation-based detection methods are likely to miss a signicant portion of novel APA events from tumour samples.

Genes with shorter 30 UTRs are prone to be upregulated. The current model predicts that 30-UTR shortening through APA during tumorigenesis may upregulate its parental gene by escaping miRNA repression. To test this hypothesis, we calculated the numbers of miRNA-binding sites lost due to 30-UTR shortening in tumours (Fig. 3a). Using this approach, we determined that B67% genes with shorter 30 UTRs in tumours have lost at least 1 predicted miRNA-binding site (Fig. 3a). Furthermore, when compared with all the genes of sufcient sequencing

4 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications

NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE

571

0 Percentage of patients with lengthening (%)

# Shortening# Lengthening

# Events

P < 2.2e16

60%

30%

Overlapping with annotated PASs

51.3%

Percentage of patients with shortening (%)

UCEC

LUSC

BLCA

LUAD

BRCA

HNSC

KIRC

50 0 40 60

0 DaPars prediction

8.7%

Random

AE-value: 1.8e135

Bits

2 kb hg19

LUSC normal

LUSC tumour

Percentage of total events

100%

80%

60%

40%

20%

Percentage of total events

100%

80%

60%

40%

20%

DaPars prediction

AGPS

Hs.516543.1.25 Hs.516543.1.26

DaPars predicted polyA site

UCSC

RefSeq genes

Reported polyA sites from polyA_DB

Ensembl gene

0 50 100 150 200 250 300 350 # Samples

1 2 3 4 5 6 7 # Tumour types

Figure 2 | Broad shortening of 30 UTRs across seven TCGA tumour types. (a) The central heatmap shows genes (rows) undergoing 30-UTR shortening (blue) or lengthening (red) in each of the 358 tumours (columns) compared with matched normal tissues across seven tumour types. The upper histogram shows the number of APA events per tumour. The side histograms show the percentage of tumours with 30-UTR shortening (left) or lengthening (right) for each APA gene. (b) Bar plots show the percentages of DaPars-predicted APAs and randomly selected APAs from 30-UTR regions overlapping with annotated APAs from four databases (Refseq, UCSC, ENSEMBL and PolyA_DB). The P value was calculated by t-test using 50 bootstrapping

of data. (c) MEME identies the canonical polyA motif AATAAA with a very signicant E-value (1.8e 135) from the upstream ( 50 bp) of the

proximal polyA sites predicted by DaPars. (d) An example of DaPars-predicted novel polyA site (red bar) in a LUSC tumour that is far away from any annotated polyA sites. (e) Saturation analysis of APA events by adding more samples. Each point is a random subset of samples of various smaller sizes. All the points were tted by a smoothed read line. (f) Saturation analysis by adding more tumour types. Each grey line represents a random orderingof seven tumour types and the red curve is the tting line. The percentage of dynamic APA events increased with the number of tumour types.

coverage, those genes with shorter 30 UTRs in tumours have overall greater miRNA-binding site density in their gene models (P value 1.8e 11, t-test; Fig. 3b). These data imply that APA

regulation tends to maximize the miRNA-binding loss through preferentially shortening those 30 UTRs already heavily regulated by miRNA. To examine the consequences of 30 UTR and miRNA-binding loss, we compared the gene expression between tumours and matched normal tissues. As expected, those genes with shorter 30 UTRs in tumours tend to be more upregulated in tumours (Fig. 3c). In conclusion, our data strongly support the hypothesis that many genes are upregulated during tumorigenesis

by shortening their 30 UTRs to escape post-transcriptional miRNA repression.

APA events add prognostic power beyond common covariates. Very little is known of the clinical implications of the dynamic 30

UTRs in cancer patients. To address this issue, we used a standard Cox proportional hazards model29 for the correlation between patient overall survival and multiple clinical and molecular covariates. Here we only used BRCA, LUSC and KIRC due to high mortality rate and large sample size. We rst used common

NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications 5

ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274

400

300

P-value 1.8e11

67%

# Events

# miRNA binding per kb

200

100

0 5 10 15 20 25 30 35 40

# Lost miRNA targets

Gene expression:log2 (tumour/normal)

All genes Genes with shorter3 UTRs in 7 tumour types

868

1,000

122

Shortening

Lengthening

15 Up

44 Dn

0.6 0.4 0.2 0.2 0.4 0.6

Percentage of distal 3 UTR usage difference (PDUI) Tumour normal

Figure 3 | Genes with shorter 30 UTRs in tumours are prone to be upregulated. (a) Number of genes losing miRNA-binding sites due to the shortening of their 30 UTRs. Here we selected miRNA-bindings sites predicted by both TargetScanHuman V6.2 (refs 52,53) and miRanda54, as a more conservative list of miRNA targets. Number in the bracket represents the percentage of genes losing at least 1 miRNA-binding site. (b) Genes with shorter 30 UTRs in tumours have greater miRNA-binding-site density in the 30-UTR region than all RefSeq genes. We used RefSeq gene models for all the calculations regardless of the APA detection. The y axis is the number of miRNA-binding sites normalized by 30-UTR length (per kb). The P value was calculated by a t-test. (c) For genes with shorter 30 UTRs in tumours, their fold-change expression between tumours and normal tissues are plotted against their DPDUI values. All isoforms of the same gene were combined for the expression measurement. The genes signicantly up- or downregulated in tumours are shown in red and blue, respectively, which were identied by paired t-test with BenjaminiHochberg (BH) false-discovery rate at 5%. Accordingly, the red and blue bar plots indicate the number of up- and downregulated genes, respectively.

clinical covariates including only tumour stage, age, gender (excluding breast cancer) and smoking status (lung cancer only) to generate low- and high-risk groups, which are visualized by KaplanMeier plots and compared by the log-rank test (Fig. 4a). We next used the same Cox regression model integrated with LASSO to select the APA (DPDUI) events besides clinical covariates that can best separate risk groups. With clinical covariates always included, we used leave-one-out cross-validation (CV) to select the optimal 13 APA events (Supplementary Table 2) to constitute new APA-clinical Cox regression models (Fig. 4d), which have much more signicant P values in the risk group comparison. To quantify the added prognostic power of APA events, we used a likelihood-ratio test (LRT) to compare the new APA-clinical models with the clinical only models. The LRT results (Fig. 4e) clearly demonstrate a strong additional prognostic power of APA events beyond clinical covariates. Among these six APA covariates, signicant worse survival is associated with 30-UTR shortening of three genes (SYNCRIP in BRCA; TMCO7 and PLXDC2 in KIRC) and 30-UTR lengthening of two genes (ATP5S in BRCA; RAB23 in LUSC) (Supplementary Table 2). This result strongly suggests that, depending on the tumour types or genes studied, either lengthening or shortening of 30 UTRs may be associated with poor clinical outcome. Since our CV procedure only selects the optimal APA events, it is highly likely that even more APA events

can be associated with patient survival. Furthermore, we combined clinical covariates with tumour mRNA expression (mRNA-clinical) and tumour-vs-normal gene expression fold-change (mRNA-FC-clinical model) of the same APA genes (Supplementary Table 2) as two additional Cox regression models and repeated the same analyses. Compared with the APA-clinical model, both mRNA-clinical and mRNA-FC-clinical models provide much less additional prognostic power (Fig. 4e), less-signicant log-rank P values in risk group comparison (Fig. 4bd). Finally, we show that the separated high- and low-risk groups by APA-clinical models have no correlation with the TCGA Pancan12 signicantly mutated gene (http://dx.doi.org/doi:10.7303/syn1750331

Web End =doi:10.7303/ http://dx.doi.org/doi:10.7303/syn1750331

Web End =syn1750331 ) (Fig. 4f). Together, APA events provide additional power in survival prediction beyond clinical covariates, and independent of commonly used molecular data such as gene expression and somatic mutations.

Cancer metabolism gene GLS is regulated through APA. Ingenuity IPA and literature searches were used to characterize the pathways enriched in 1,346 dynamic APA events (Fig. 5a; Supplementary Data 2). The vast majority of enriched pathways are cancer related, such as ERK/MAP signalling and glutamine metabolism. The metabolism gene glutaminase (GLS) is of particular interest. It is well known that tumours are considerably

6 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications

NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE

Clinical only

Clinical + mRNA

Clinical + mRNA FC Clinical + APA

BRCA BRCA BRCA BRCA

1.0

P = 0.2

n=49

n=48

1.0

0.0

P = 0.013

n=49

n=48

1.0

0.0

P = 0.022

n=49

n=48

1.0

0.0

P = 2.9e05

n=49

n=48

0.8

Survival probability

0.6

0.4

0.2

0.0

0 1,000 2,000 3,000 4,000Days

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000Days

Days

LUSC LUSC

LUSC

KIRC KIRC KIRC

LUSC

1.0

Survival probability

P = 0.07

n=17

n=16

1.0

0.0

P = 0.018

n=17

n=16

1.0

0.0

P = 0.17

n=17

n=16

1.0

0.0

P = 0.00019

n=17

n=16

0.8

Survival probability

0.6

0.4

0.2

0.0

0 1,000 2,000 3,000 4,000Days

KIRC

1.0

P = 2.6e05

n=34

1.0

0.0

P = 7e05

n=34

1.0

0.0

P = 0.00012

n=34

1.0

0.0

0.8

P = 3.8e06

n=34

Survival probability

0.6

0.4

0.2

0.0

0 1,000 2,000 3,000Days

0 1,000 2,000 3,000 4,000Days

4,000

106

Clinical + APA

102

100

Clinical + mRNA

Clinical + mRNA fold change

BRCA LUSC KIRC

105

Additional predictive power

(P-value)

104

0.05

101

103

P-value

102

101

100

BRCA KIRC LUSC

NOTCH1

TBX3

BAP1

PHF6

ATRX

VHL

ERCC2

KRAS

RB1

SPOP

SMAD2

APC

MLL3

MLL2

HLA.A

PIK3R1

BRAF

STAG2

FLT3

MAP3K1

SETD2

TP53

NUMA1

KDM5C

SMAD4

CDK12

CDKN1B

CDKN1A

BRCA1

BRCA2

FBXW7

FGFR2

FGFR3

CTNNB1

EP300

ATM

CDKN2A

KDM6A

NRAS

NF1

CASP8

IDH1

EGFR

CDH1

KEAP1

NFE2L2

ELF3

PTEN

U2AF1

KIT

PIK3CA

TBL1XR1

PPP2R1A

CBFB

RUNX1

CHEK2

PBRM1

PDGFRA

GATA3

CCND1

CTCF

SF3B1

ARID1A

Figure 4 | Prognostic power of dynamic APA events. (ad) KaplanMeier survival plots for high- (red line) and low (blue line)-risk groups separated by clinical only (a), clinical with mRNA expression (b), clinical with tumour-vs-normal mRNA expressions fold-change (c) and clinical with dynamic APAs (d). P value was calculated using the log-rank test. (e) Additional prognostic power of APA, mRNA expression and mRNA tumour-vs-normal expression fold-change beyond clinical variables. The P value is calculated by the likelihood-ratio test. (f) No correlation between risk groups separated by APA-clinical models and mutation proles of signicantly mutated genes (SMGs). The dotted vertical line represents the P value (MannWhitneytest) cutoff of 0.05. All SMG P values are below this cutoff and thus are not signicant.

more dependent on the glycolytic pathway, regardless of oxygen availability, to supply a great deal of their energetic and biosynthetic demand for cell division. This phenomenon, termed the

Warburg effect, is a hallmark of cancer30. GLS is a key enzyme in glutaminolysis and its high expression is essential to support the cancer metabolic phenotype31. There are two major GLS isoforms

NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications 7

ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274

10 kb

Normal

Tumour

Normal

Tumour

Normal

Tumour

Log10(P-value)

hg19

0 1 2 3 4

Breast cancer regulation by stathmin1

LUAD

LUSC

KIRC

CCR3 signalling in eosinophils

Role of p14/p19ARF in tumor suppression

ERK/MAPK signalling

Protein ubiquitination pathway

Rac signalling

PAK signalling

Molecular mechanisms of cancer

PDGF signalling

Role of CHK in cell cycle checkpoint control

PI3K signalling in B lymphocytes

Myc-mediated apoptosis signalling

FAK signalling

Chemokine signalling

SAPK/JNK signalling

IGF1 signalling

Small-cell lung cancer signalling

IL3 signalling

GLS

GAC

RefSeq genes

EIF2 signalling

miRNA sites

KGA

Distal

HGF signalling

Proximal

Telomerase signalling

Signalling by Rho family GTPases

EGF signalling

Role of tissue factor in cancer

KIRC survival based on GAC percentage

PTEN signalling

Glutamine degradation I

1.0

Log-rank P 1.5e06

Low GAC ratio (248)

High GAC ratio (248)

Glycine biosynthesis I

Alanine biosynthesis II

Actin cytoskeleton signalling

0.8

LUAD

Normal Tumour

LUSC

Normal Tumour

KIRC

Normal Tumour

0.6

100%

80%

60%

40%

20%

1.3e09

3.7e18

2.1e06

GAC percentage

100%

80%

60%

40%

20%

GAC percentage

100%

80%

60%

40%

20%

Survival probability

0.4

0.2

0.0

0 20 40 60 80 100 120

Months

Figure 5 | Pathway analysis. (a) Signicantly enriched (P valueo0.05; Fishers exact test) Ingenuity canonical pathways in the 1,346 dynamic APA events. (b) GLS has a signicant 30-UTR shift from KGA long isoform in normal to GAC short isoform in tumour. (c) GAC percentages are signicantlyhigher in LUAD, LUSC and KIRC tumours. The P value in each box was calculated using a paired t-test. (d) KaplanMeier survival plot of two KIRC tumour groups stratied by the GAC ratios with equal patient number in each group. P value was calculated using the log-rank test.

termed distal Kidney-type (KGA) and proximal Glutaminase C (GAC), which have distinct 30 UTRs and slightly different C termini3234 (Fig. 5b). KGA has a number of miRNA-binding sites within its 30 UTR, whereas GAC surprisingly is not predicted to have any (Fig. 5b). Furthermore, it has been shown that miR-23 represses KGA in most cells. However, in myc-transformed cells, MYC overexpression de-represses GLS through down-regulation of miR-23, resulting in glutamine-dependent growth characteristics35. Interestingly, we found a strong alternative-splicing-coupled 30-UTR shift from KGA in normal to GAC in tumour, leading to a signicantly increased percentage of GAC in LUAD, LUSC and KIRC (Fig. 5b,c). This is consistent with previous report that GAC is key to the mitochondrial glutaminase metabolism of cancer cells31. The implication of the 30 UTR switch to GAC is that the expression of GLS is no longer regulated by miR-23 or MYC. Consistently, we did not observe any signicant expression changes of miR-23 between tumours and normal tissues, although MYC is upregulated in LUSC and KIRC tumours (Supplementary Fig. 8a), suggesting that GLS potentially utilizes 30-UTR switch, rather than MYC to escape miR-23-mediated repression.

To investigate the potential clinical utility of the APA-mediated GLS isoform switch, we examined the correlation between GAC percentage and clinical survival information for KIRC tumours, using the Cox proportional hazards model with age and gender as covariates. We found that higher GAC percentage is highly correlated with worse survival (P 3.2e 13, hazard ratio 50,

95% condence interval: 17141; Fig. 5d), which is consistent with previous studies indicating that GAC is essential for cancer cell growth36. Overall, patients with high GAC ratios in KIRC have a median survival of B55 months, compared with 492 months for those with low GAC ratios. We did not nd a statistically signicant correlation between GAC percentage and survival outcome in LUSC and LUAD possibly because the GAC percentages ((0.5, 0.97) and (0.59,0.98), respectively) (Supplementary Fig. 8b) have very limited dynamic range in these two tumour types, and thus may not have enough power to stratify patients. In contrast, GAC percentage ranges from 0.05 to0.96 in KIRC (Supplementary Fig. 8b), allowing patient stratication based on a full range of GAC levels. Together, the GLS APA regulation suggests a novel and potentially MYC-independent and miRNA-independent mechanism of

8 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications

NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE

glutaminase regulation in tumours, and GLS APA can be used to predict patient survival in KIRC.

Potential mechanisms for APA regulation during tumorigenesis. We sought to investigate the potential mechanisms governing APA dynamics in tumorigenesis. Although many details remain poorly understood, APA is thought to be regulated in cis through genetic aberrations37,38 of the underlying nascent mRNA (derived from DNA), and in trans by regulatory proteins in responding to dynamic environmental changes39. These cis-elements include canonical polyA signal AAUAAA and other auxiliary sequences such as U/GU-rich downstream elements40. The core polyadenylation trans-factors involve four multi-subunit protein complexes, CPSF (cleavage and polyadenylation specicity factor), CstF (cleavage stimulation factor), CFI and CFII (cleavage factors I and II). The chemical cleavage of premRNA process mainly employs CPSF to recognize the canonical polyA signal upstream of the cleavage site, and utilizes CstF to

bind downstream U/GU-rich elements40 mainly through the CstF64 subunit39.

To examine the role of genetic aberrations in the regulation of APA, we compared our 1,346 recurrent APA events with 64 Pancan12 Signicantly Mutated Genes (doi:10.7303/syn1750331). Surprisingly, there are only ve genes in common (Fig. 6a; P value0.48 by Fishers exact test). This result indicates that most of the dynamic APA events are probably not due to aberrations of underlying cis-elements but may be the result of aberrant expression of polyA trans-factors. To address this possibility, we investigated the gene expression of 22 important polyA trans-factors41 based on the TCGA RNA-seq data. The signicantly upand downregulated factors between tumours and matched normal tissues are indicated by yellow and blue, respectively (Fig. 6b). In general, we observed global upregulation of most polyA factors in ve tumour types (LUSC, LUAD, UCEC, BLCA and BRCA), which also have more 30-UTR-shortening events.

Therefore, we conclude that most core polyadenylation factors are expressed at higher levels in tumour types where proximal APAs

r 0.54; P-value:2.8e28

Dynamic APA genes

600

# 3 UTR shortening events per sample

100

# iClLIP reads per Kb in 3 UTR

P-value 3.3e21

Pancan12-mutated genes

500

LUSC UCEC BLCA LUAD BRCA HNSC KIRC

59 5

1,341

400

300

200

100

0.5 0.5

Log2(tumour/normal)

1.0

CPSF2

genes Other genes

2 0 2 4 6

3 UTR shortening

CPSF1

CPSF

CstF

CF Im

CF IIm

Others

homologues

CPSF3

CPSF4

CstF64 gene expression change log2(tumour/normal)

FIP1L1

CSTF3

CstF64

CSTF1

CPSF6

P-value 3.9e19

CPSF7

NUDT21

CstF64

PCF11

Genes preferring distal APAin Hela CstF64 and CstF64T KD

Genes preferring proximal APA in tumours

SYMPK

PAPOLG

PABPC1

PABPC4

CstF64 46

125

1,069

PABPN1

CSTF2T

WDR33

RBBP6

PPP1CA

KIRC

HNSC

BRCA

LUAD

BLCA

UCEC

LUSC

PPP1CB

Figure 6 | Potential mechanisms for APA regulation during tumorigenesis. (a) Only ve genes are in common between genes undergoing dynamic APA and genes signicantly mutated in Pancan12 tumour types. (b) Heatmap of gene expression fold-change of known polyadenylation factors. Each rectangle represents the mean log2 fold-change between tumour and matched normal tissues of one factor in one tumour type. A factor is considered differentially expressed if the false-discovery rate from edgeR55 is o0.05 and the mean absolute fold-change is 41.5. Yellow and blue boxes indicate the signicantly upregulated and downregulated genes, respectively. White boxes are non-signicant genes. (c) Correlation between CstF64 expression fold-change and number of 30-UTR-shortening events per sample. Each point represents a patient sample across seven tumour types. The x axis is the

CstF64 log2 fold-change between tumours and matched normal tissues. The y axis is the number of shortening events per sample. Spearmans correlation coefcient (0.54) and P value (2.8e 28) are indicated on the top. (d) Venn diagram comparison between genes preferring proximal APAs in tumour

with higher expression of CstF64 and genes preferring distal APA in Hela cells with knockdown of CstF64 and CstF64T. (e) Genes with 30-UTR shortening in tumours have more CstF64 iCLIP data derived from HeLa cells than background (P value 3.3e 21 by t-test).

NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications 9

ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274

are favoured. Our results are consistent with previous studies showing that 30-UTR shortening in proliferating cells is also accompanied by an increased expression of polyadenylation factors9,12,27.

We further investigated the correlation between gene expression and 30-UTR shortening for four polyadenylation factors (CPSF1, CPSF3, CstF64 and PABPC1), which are differentially expressed between tumour and normal in at least three cancer types (Fig. 6b). Among them, CstF64 has the greatest correlation between gene expression fold-change and the number of shortening events per patient in tumours (Spearmans correlation 0.54 with P value 2.8e 28, Fig. 6c), followed by CPSF3. In contrast,

CPSF1 and PABPC1 have weak correlations (Supplementary Fig. 9). This result is consistent with a recent iClip-seq study, suggesting that CstF64 is one of the top three most important factors for polyA site selection42. Also, a recent study indicated that CPSF plays an important role in recruiting CstF64 to RNAs43. Furthermore, a recent global study in HeLa cells suggests that CstF64 induces the usage of proximal APAs43. They reported 171 genes with lengthening in 30 UTRs upon knockdown of

CstF64, among which 46 genes from our analysis have shortened 30 UTRs in tumours where CstF64 is upregulated (Fig. 6d;

P value 3.9e 19 using Fishers exact test; Supplementary

Fig. 10). This signicant overlap indicates that a subset of 30-UTR-shortening events we observed in tumours can indeed be explained by the expression level of CstF64. Finally, using CstF64 iCLIP-seq in HeLa cells43, we showed that those 1,346 genes have more CstF64 bindings in their 30 UTRs than other genes (Fig. 6e).

Together, our study provides strong evidence that key polyA trans-factors, such as CstF64, are upregulated in tumorigenesis, leading to preferential 30-UTR shortening in tumours.

DiscussionWe have developed a novel bioinformatics algorithm, termed DaPars, dedicated to the de novo identication and quantication of dynamic APA events using standard RNA-seq. The accuracy of DaPars is evidenced by the fact that our de novo predicted APAs are enriched for the canonical polyA signal AATAAA and have a high degree of overlap with annotated polyA sites (Fig. 2b,c). Our extensive DaPars analysis of TCGA data sets convincingly demonstrate that any investigator(s) conducting standard RNASeq is now capable of identifying the majority of functionally important APA events in most biological systems. DaPars is not just yet another APA assay; instead, its key methodology innovation is the inference of de novo APA events from existing RNA-seq data without relying on any additional wet-bench experiments. For example, our current APA analysis was based on RNA-seq of 358 tumour/normal pairs across 7 cancer types. An analysis of this scale would be prohibitively cumbersome using any previous method, such as microarrays, EST and PolyA-seq, but was made possible now with our DaPars method.

While our paper was under review, Wang et al.44 reported a change-point model to detect 30-UTR switching using RNA-seq.

The model by Wang et al. relies on the annotated distal polyA sites to infer the proximal ones, only supports genes with two polyA sites and only supports pair-wise comparison. In contrast, our DaPars method is fully de novo, can handle multiple (42)

polyA sites and multiple (42) samples and thus is much more powerful and exible than the model by Wang et al.. Most of our analyses based on hundreds of TCGA patient samples would not be possible using the model by Wang et al.

It has been reported that shorter 30 UTRs are preferentially used by several oncogenes in cancer cell lines8, but what was not clear from this work is how pervasive and recurrent APA is in clinical samples. Lin et al.45 reported 126 30-UTR-shortening

genes in 5 tumour/normal pairs but unfortunately did not provide a supplementary table for those genes. To directly compare our results with Lin et al.45, we repeated the same analysis as described in their paper and detected a total of 120 genes with 30-UTR shortening and upregulation of the short isoform. Among them, 53% were also found in our 1,201 shortening APA genes (Supplementary Fig. 11; P value 2e 43 by Fishers exact test),

including POLR2K, the main APA gene reported by Lin et al. Two examples of consistence and inconsistence between TCGA RNA-seq and PolyA-seq from Lin et al.45 are shown in Supplementary Figs 12 and 13, respectively. In this study, we have substantially increased the number of dynamic APA events based on 358 tumour/normal pairs. To our knowledge, this is the largest global APA analysis to date, leading to a 71-fold increase in sample size compared with Lin et al.45

Several novel and signicant biological and clinical insights are noticeable from our large-scale APA analysis. First, dynamic APA events are highly tumour type and patient specic. We observe that lung, uterus, breast and bladder cancers have signicantly more APAs than head/neck and kidney cancers. Moreover, similar to other caner genomic data, there is considerable APA heterogeneity among patients within the same tumour type. Second, our saturation analysis indicates that APA events derived from 358 samples across 7 tumour types remain far from complete, highlighting the need for de novo discovery of APA, and the need for expanding DaPars analysis to more tumour types and samples when they become available. Third, selected APA events provide a surprisingly strong additional prognostic power beyond common clinical covariates and conventional molecular data, such as somatic mutation and gene expression. A recent study46 also indicated that conventional molecular data had poor prognostic power beyond clinical data. Although the exact cause is unknown, we speculate that it may reect a role for APA as a driver of tumour progression. Fourth, our study reveals a novel link between altered 30-UTR usage and cancer metabolism. We observed that the GAC isoform of the glutaminase gene (GLS), which lacks any predicted miRNA binding, is predominantly expressed in LUSC, LUAD and KIRC tumours. Therefore, this APA event would abrogate the need to attenuate miR-23 expression through MYC upregulation and result in increased Glutaminase expression and altered glutamine metabolism. Fifth, our observation of correlating CstF64 levels with increased 30-UTR shortening suggests that this factor is a potential master activator of proximal APA usage in tumor-igenesis. This hypothesis predicts that tumour cells increase CstF64 levels to promote the 30-end processing at the proximal and weaker polyA sites thereby preventing the usage of the distal polyA sites39,43. Finally, APA is likely to be regulated by many factors in a tissue-specic manner. For example, we recently reported CFIm25 (ref. 47) as a global repressor of proximal APA in brain tumour. CFIm25 has opposite function of CstF64, since its decreased expression correlates with increased 30-UTR shortening. However, CFIm25 is not a master APA regulator in the cancer types we studied here, because it is not differentially expressed between tumour and normal (NUDT21 in Fig. 6b).

Our DaPars analysis of RNA-seq reveals a comprehensive list of previously unobserved, highly recurrent and functionally important 30-UTR somatic RNA aberrations. These RNA aberrations represent an illustrative case of genomic dark matter beyond coding regions, and thus may also provide new directions for tumour gene discovery48. Although there is a lack in observed genetic aberrations within 30 UTRs of most genes undergoing

APA, caution should be taken as current TCGA mutation analyses utilize primarily exome sequencing, which excludes 30

UTR. We will revisit this issue in the future when more whole-genome sequencing data are available. Finally, although focused

10 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications

NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE

towards the internal exons, so that splicing-coupled APA events can also be detected.

Non-uniformity correction. It has been reported that RNA-seq reads are not uniformly distributed along the gene body. DaPars provides an option to address the issue of non-uniformity by statistical modelling50. Since it is technically difcult to distinguish non-uniform distribution from dynamic APA, we decide to train our statistical model based on a subset of genes with no APA change, that is, with only one 30 UTR. We rst run DaPars to select those genes with no APA change and divide their RNA-seq gene body coverage into 100 bins, yielding an observed gene body-sequencing prole (Supplementary Fig. 1e). In the conventional DaPars, the elements of IL and IP in equation (1) are unweighted and all 1s on 30-UTR regions.

We will infer the weighted IL and IP based on the observed gene body-sequencing prole, then re-run DaPars with the weighted IL and IP to correct the nonuniformity in RNA-seq (Supplementary Fig. 1e).

Differential percentage of distal APA usage index. We used the following three criteria to detect the most signicant APA events:rst, given long 30-UTR expression level wiL and short 30-UTR expression level wiS estimated from (equation (1)), we used Fishers exact test to determine the P value of PDUI difference between tumour and matched normal tissue of the same patient, which is further adjusted by the BenjaminiHochberg procedure to control the false-discovery rate (FDR) at 5%. Second, the absolute mean difference of PDUIs of all the patients in the same tumour type must be no less than 0.2. Third, the mean fold-change of PDUIs of all the patients in the same tumour type must be no less than 1.5.

FDR 0:05

DPDUI

j j PDUI

tumor

on cancer genomics in this study, our novel DaPars framework will open the door for APA analysis in numerous biological and pathologic systems. It also underscores the power of innovative bioinformatics analyses that can derive novel biological insights from existing sequence data.

Methods

Data sets. All the RNA-seq BAM les were downloaded from the UCSC Cancer Genomics Hub (CGHub, https://cghub.ucsc.edu/

Web End =https://cghub.ucsc.edu/ ). Here we only processed BRCA, BLCA, LUSC, LUAD, head and neck squamous cell carcinoma, UCEC and KIRC cancers that have 410 tumournormal pairs (Supplementary Table 1). Gene expression and miRNA expression were downloaded from The Cancer Genome Atlas data portal (https://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm

Web End =https://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm). MAQC brain and UHR RNA-seq reads were obtained from Sequence Read Archive with accessions ERP000016 and ERP000400, respectively. For MAQC PolyA-seq, the ltered polyA sites with normalized read counts were downloaded from the UCSC browser3.

DaPars algorithm. DaPars performs de novo identication and quantication of dynamic APA events between two conditions, regardless of any prior APA annotation. DaPars identies a distal polyA site based on RNA-seq data, uses a regression model to infer the exact location of the proximal APA site after correcting the potential RNA-seq non-uniformity bias along gene body, detects statistically signicant dynamic APAs and has the potential to detect 42 dynamic

APA events.

Distal polyA site identication from RNA-seq. Given two or more RNA-seq samples, distal polyA site refers to the end point of the longest 30 UTR among all the samples, which will be used in the next step to identify the proximal polyA within this longest 30-UTR region. To identify possible distal polyA site that may locate outside of gene annotation, we extend the annotated gene 30 end by up to10 kb before reaching a neighbouring gene. RNA-seq data from all input samples will be merged to have a combined coverage along the extended gene model. To address possible uneven and discontinuous issues, we applied a 50-bp window to smooth this combined coverage. We then scan the extended 30 UTR from 50 to 30 to nd the distal polyA site whose coverage is signicantly lower (that is,opredened cutoff at 5%) than the coverage at the start of the preceding exon. A similar strategy has been successfully used to detect lengthening of 30 UTRs in the mammalian brain21. The de novo distal APA estimated directly from RNA-seq, which may not be included in gene model, will benet the downstream proximal APA identication (Supplementary Fig. 1a).

Since most current RNA-seq data sets are not strand-specic, potential overlapping of 30 UTRs from two neighbouring tail-to-tail genes from different strands may give false-positive distal polyA. So after previous distal APA analysis, if 30 UTRs of two neighbouring genes overlap, we will gradually increase the cutoffs until the two 30 UTRs are separated. In this way, we can recover the proper distal polyA, which may be overlooked by other methods such as Cufinks (Supplementary Fig. 1b). The distal polyA site identication method implemented in DaPars has very good performance. For all the predicted distal polyA sites from TCGA RNA-seq, on average 81% are within 50 bp of the annotated polyA sites.

Regression model in DaPars. For each RefSeq transcript with a distal APA estimated from previous step, we use a regression model to infer the exact location of a de novo proximal polyA site at single-nucleotide resolution, by minimizing the deviation between the observed read density and the expected read density based on the two-polyA-site model, in both tumour and matched normal samples simultaneously. This regression model solves the following optimization problem:

w1 L; w2 L; w1 S; w2 S; P arg min

w ;w ;w ;w 0;1oPoL

8 >

: 3

To avoid false-positive estimation on lowly expressed genes, we only included genes with 430-fold mean coverage (reads per base gene model).

More than 2 dynamic APAs. Our DaPars framework can be easily extended to address 42 dynamic APAs. We formulated the multiple APA analysis in the following matrix format,

c11 c21

c12 c22

... ...c1m 1 c2m 1 c1m c2m

666664

PDUI PDUI

PDUInormal

j j 0:2

log2

0:59

777775

66664

... ...0 0 1 1

0 0 0 1

1 1 1 1

0 1 1 1

... ...

77775m m

w11 w21

w12 w22

... ...w1m 1 w2m 1 w1m w2m

666664

777775m 2

where m is the length of the longest 30 UTR of a transcript. wij is the expression level of one possible 30 UTR j on sample i. The number of non-zero wij determines how many polyA sites will be derived from RNA-seq. In most cases, there are only a few wij will be non-zero. So we can solve this equation using a positive Lasso optimization method as reformulated in the following form:

arg min

2 C MW

k k22 l W

2 Ci wiLIL wiSIP jj 22 1

where wiL and wiS are the abundances of transcripts with distal and proximal polyA sites for sample i, respectively, Ci [Ci1, ,Cij, ,CiL]T is the read coverage of

sample i at single-nucleotide resolution normalized by total sequencing depth, L is the length of the longest 30 UTR from previous step, P is the length of alternative proximal 30 UTR to be estimated, IL and IP are indicator functions such that

IL 1; ; 1 |{z}

k k1 5

where C, M and W are corresponding to the left, middle and right matrix in equation (4), respectively. In practice, we only consider no more than 4 APAsin a real data set to reduce the complexity of model selection and avoid over-tting issues. In Supplementary Fig. 1f, we showed that our DaPars can also identify 42 APAs from RNA-seq and the predictions are highly consistent with the annotation. Although many genes have 42 annotated APAs, the majorityof dynamic APAs only involve 2 polyA sites1. Therefore in the current large-scale TCGA RNA-seq analysis, we only focus on 2 APAs in the dynamic APA detection.

PolyA-seq processing. We downloaded the processed polyA sites with normalized read counts of MAQC brain and UHR PolyA-seq data sets (two replicates for each tissue) from the UCSC Genome Browser3. We calculated the signal intensity of a given polyA site based on all the same-strand PolyA-seq reads within 50 bases of the polyA site. We then used Fishers exact test to detect the statistically signicant differential APAs between brain and UHR with a BenjaminiHochberg-adjusted FDR cutoff of 0.1 and read count difference of 410%. For a fair comparison, we also used FDR of 0.1 and 10% DPDUI for DaPars analysis of MAQC RNA-seq data derived from the same brain and UHR samples.

Survival analysis using Cox proportional hazards model. A standard Cox proportional hazards model29 implemented in the R package survival was used for patient survival and KaplanMeier plotting. Hazard ratios exceeding 1 indicate poor prognosis for patients possessing shorter 30 UTR, whereas those below 1 are associated with better outcome. The high-risk group and low-risk group were generated based on the prognostic index (PI). The PI is the linear component of the Cox model, PI Pmi1 bixi where xi is the value of covariate i and its risk

coefcient, bi was estimated from the Cox tting. The high-risk and low-risk groups were generated for survival plot by splitting the ordered PI (higher values for higher risk) with equal number of samples in each group.

For each given 1oPoL, the expression levels of two transcripts with distal and proximal polyA sites in both tumour and normal tissues can be estimated by optimizing this linear regression model using quadratic programming49. The optimal de novo proximal polyA site P* is the one with the minimal objective function value, as demonstrated by the vertical arrow in Fig. 1a. To quantify the relative polyA site usage, we dene the PDUI for sample i as the following:

PDUI

wi L

wi L wi S

and IP 1; ; 1 |{z}

; 0; ; 0

|{z}

L P

where wi L and wi S are the estimated expression levels of transcripts with distal and proximal polyA sites for sample i. The greater the PDUI is, the more distal polyA site of a transcript is used and vice versa. Finally, the regression model is extended

NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications 11

ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274

Survival analysis using Cox model and LASSO feature selection. We combined tumour-vs-normal shortening/lengthening events of APA genes (DPDUI values) with clinical covariates, such as age, gender, stage and smoking status (lung cancer), in survival analysis. We used a Cox regression model with LASSO feature selection to determine the contributions of APAs in survival prediction using the R package glmnet51. We chose the optimal APA genes based on the leave-one-out CV. Here the clinical covariates are not penalized and always selected. Finally, we used a LRT to estimate the additional prediction power of the new APA-clinical models over the clinical-only models.

Software availability. The open source DaPars program is freely available at https://code.google.com/p/dapars/

Web End =https://code.google.com/p/dapars/ . We will update this website periodically with new versions.

References

1. Elkon, R., Ugalde, A. P. & Agami, R. Alternative cleavage and polyadenylation: extent, regulation and function. Nat. Rev. Genet. 14, 496506 (2013).

2. Sandberg, R., Neilson, J. R., Sarma, A., Sharp, P. A. & Burge, C. B. Proliferating cells express mRNAs with shortened 30 untranslated regions and fewer microRNA target sites. Science 320, 16431647 (2008).

3. Derti, A. et al. A quantitative atlas of polyadenylation in ve mammals. Genome Res. 22, 11731183 (2012).

4. Tian, B., Hu, J., Zhang, H. & Lutz, C. S. A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 33, 201212 (2005).

5. Barreau, C., Paillard, L. & Osborne, H. B. AU-rich elements and associated factors: are there unifying principles? Nucleic Acids Res. 33, 71387150 (2005).

6. Jacobsen, A. et al. Analysis of microRNA-target interactions across diverse cancer types. Nat. Struct. Mol. Biol. 20, 13251332 (2013).

7. Takagaki, Y. & Manley, J. L. Levels of polyadenylation factor CstF-64 control IgM heavy chain mRNA accumulation and other events associated with B cell differentiation. Mol. Cell 2, 761771 (1998).

8. Mayr, C. & Bartel, D. P. Widespread shortening of 30 UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 138, 673684 (2009).

9. Elkon, R. et al. E2F mediates enhanced alternative polyadenylation in proliferation. Genome. Biol. 13, R59 (2012).

10. Singh, P. et al. Global changes in processing of mRNA 30 untranslated regions characterize clinically distinct cancer subtypes. Cancer Res. 69, 94229430 (2009).

11. Morris, A. R. et al. Alternative cleavage and polyadenylation during colorectal cancer development. Clin. Cancer Res. 18, 52565266 (2012).

12. Ji, Z., Lee, J. Y., Pan, Z., Jiang, B. & Tian, B. Progressive lengthening of30 untranslated regions of mRNAs by alternative polyadenylation during mouse embryonic development. Proc. Natl Acad. Sci. USA 106, 70287033 (2009).

13. Hoque, M. et al. Analysis of alternative cleavage and polyadenylation by 30 region extraction and deep sequencing. Nat. Methods 10, 133139 (2013).

14. Sun, Y., Fu, Y., Li, Y. & Xu, A. Genome-wide alternative polyadenylation in animals: insights from high-throughput technologies. J. Mol. Cell Biol. 4, 352361 (2012).

15. Pickrell, J. K. et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768772 (2010).

16. Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 10091015 (2010).

17. Trapnell, C. et al. Transcript assembly and quantication by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511515 (2010).

18. Griebel, T. et al. Modelling and simulating generic RNA-Seq experiments with the ux simulator. Nucleic Acids Res. 40, 1007310083 (2012).

19. Bullard, J. H., Purdom, E., Hansen, K. D. & Dudoit, S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11, 94 (2010).

20. Consortium, M. et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 24, 11511161 (2006).

21. Miura, P., Shenker, S., Andreu-Agullo, C., Westholm, J. O. & Lai, E. C. Widespread and extensive lengthening of 30 UTRs in the mammalian brain.

Genome Res. 23, 812825 (2013).22. Ulitsky, I. et al. Extensive alternative polyadenylation during zebrash development. Genome Res. 22, 20542066 (2012).

23. Zhang, H., Hu, J., Recce, M. & Tian, B. PolyA_DB: a database for mammalian mRNA polyadenylation. Nucleic Acids Res. 33, D116D120 (2005).

24. Bailey, T. L. et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37, W202W208 (2009).

25. Beaudoing, E., Freier, S., Wyatt, J. R., Claverie, J. M. & Gautheret, D. Patterns of variant polyadenylation signal usage in human genes. Genome Res. 10, 10011010 (2000).

26. Bailey, T. L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 27, 16531659 (2011).

27. Shepard, P. J. et al. Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq. RNA 17, 761772 (2011).

28. Lawrence, M. S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495501 (2014).

29. Andersen, P. K. & Gill, R. D. Coxs regression model for counting processes: a large sample study. Ann. Statist. 11001120 (1982).

30. Hsu, P. P. & Sabatini, D. M. Cancer cell metabolism: Warburg and beyond. Cell 134, 703707 (2008).

31. Cassago, A. et al. Mitochondrial localization and structure-based phosphate activation mechanism of Glutaminase C with implications for cancer metabolism. Proc. Natl Acad. Sci. USA 109, 10921097 (2012).

32. Aledo, J. C., Gomez-Fabre, P. M., Olalla, L. & Marquez, J. Identication of two human glutaminase loci and tissue-specic expression of the two related genes. Mamm. Genome 11, 11071110 (2000).

33. de la Rosa, V. et al. A novel glutaminase isoform in mammalian tissues. Neurochem. Int. 55, 7684 (2009).

34. Elgadi, K. M., Meguid, R. A., Qian, M., Souba, W. W. & Abcouwer, S. F. Cloning and analysis of unique human glutaminase isoforms generated by tissue-specic alternative splicing. Physiol. Genomics 1, 5162 (1999).

35. Gao, P. et al. c-Myc suppression of miR-23a/b enhances mitochondrial glutaminase expression and glutamine metabolism. Nature 458, 762765 (2009).

36. van den Heuvel, A. P., Jing, J., Wooster, R. F. & Bachman, K. E. Analysis of glutamine dependency in non-small cell lung cancer: GLS1 splice variant GAC is essential for cancer cell growth. Cancer Biol. Ther. 13, 11851194 (2012).

37. Stacey, S. N. et al. A germline variant in the TP53 polyadenylation signal confers cancer susceptibility. Nat. Genet. 43, 10981103 (2011).

38. Wiestner, A. et al. Point mutations and genomic deletions in CCND1 create stable truncated cyclin D1 mRNAs that are associated with increased proliferation rate and shorter survival. Blood 109, 45994606 (2007).

39. Di Giammartino, D. C., Nishida, K. & Manley, J. L. Mechanisms and consequences of alternative polyadenylation. Mol. Cell 43, 853866 (2011).

40. Shi, Y. Alternative polyadenylation: new insights from global analyses. RNA 18, 21052117 (2012).

41. Shi, Y. et al. Molecular architecture of the human pre-mRNA 30 processing complex. Mol. Cell 33, 365376 (2009).

42. Martin, G., Gruber, A. R., Keller, W. & Zavolan, M. Genome-wide analysis of pre-mRNA 30 end processing reveals a decisive role of human cleavage factor I in the regulation of 30 UTR length. Cell Rep. 1, 753763 (2012).

43. Yao, C. et al. Transcriptome-wide analyses of CstF64-RNA interactions in global regulation of mRNA alternative polyadenylation. Proc. Natl Acad. Sci. USA 109, 1877318778 (2012).

44. Wang, W., Wei, Z. & Li, H. A change-point model for identifying 30 UTR switching by next-generation RNA sequencing. Bioinformatics 30, 21622170 (2014).

45. Lin, Y. et al. An in-depth map of polyadenylation sites in cancer. Nucleic Acids Res. 40, 84608471 (2012).

46. Hofree, M., Shen, J. P., Carter, H., Gross, A. & Ideker, T. Network-based stratication of tumor mutations. Nat. Methods 10, 11081115 (2013).

47. Masamha, C. P. et al. CFIm25 links alternative polyadenylation to glioblastoma tumour suppression. Nature 509, 412416 (2014).

48. Vogelstein, B. et al. Cancer genome landscapes. Science 339, 15461558 (2013).49. Bohnert, R. & Ratsch, G. rQuant.web: a tool for RNA-Seq-based transcript quantitation. Nucleic Acids Res. 38, W348W351 (2010).

50. Li, J., Jiang, H. & Wong, W. H. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 11, R50 (2010).

51. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 122 (2010).

52. Grimson, A. et al. MicroRNA targeting specicity in mammals: determinants beyond seed pairing. Mol. Cell 27, 91105 (2007).

53. Lewis, B. P., Burge, C. B. & Bartel, D. P. Conserved seed pairing, often anked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 1520 (2005).

54. John, B. et al. Human MicroRNA targets. PLoS Biol. 2, e363 (2004).55. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139140 (2010).

Acknowledgements

We would like to thank Benjamin Rodriguez for critical reading of this manuscript and members of the Li laboratory for helpful discussions. This work was partially supported by NIH R01HG007538, CPRIT RP110471 and DOD W81XWH-10-1-0501 (to W.L.), NIH grants CA167752 and CA166274 (to E.J.W.). We gratefully acknowledge the contributions from TCGA Research Network and its TCGA Pan-Cancer Analysis Working Group.

12 NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications

NATURE COMMUNICATIONS | DOI: 10.1038/ncomms6274 ARTICLE

Author contributions

Z.X. and W.L. conceived the project, performed the analysis and wrote the manuscript. Z.X. developed the DaPars algorithms. Z.X., L.A.D., T.A.C., J.R.N., D.A.W., E.J.W. and W.L. interpreted the results and edited the manuscript.

Additional information

Supplementary Information accompanies this paper at http://www.nature.com/naturecommunications

Web End =http://www.nature.com/ http://www.nature.com/naturecommunications

Web End =naturecommunications

Competing nancial interests: The authors declare no competing nancial interests.

Reprints and permission information is available online at http://npg.nature.com/reprintsandpermissions/

Web End =http://npg.nature.com/ http://npg.nature.com/reprintsandpermissions/

Web End =reprintsandpermissions/

How to cite this article: Xia, Z. et al. Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 30-UTR landscape across seven tumour types. Nat. Commun.

5:5274 doi: 10.1038/ncomms6274 (2014).

NATURE COMMUNICATIONS | 5:5274 | DOI: 10.1038/ncomms6274 | http://www.nature.com/naturecommunications

Web End =www.nature.com/naturecommunications 13

Word count: 10972

Show less

Abstract

Translate

Alternative polyadenylation (APA) is a pervasive mechanism in the regulation of most human genes, and its implication in diseases including cancer is only beginning to be appreciated. Since conventional APA profiling has not been widely adopted, global cancer APA studies are very limited. Here we develop a novel bioinformatics algorithm (DaPars) for the de novo identification of dynamic APAs from standard RNA-seq. When applied to 358 TCGA Pan-Cancer tumour/normal pairs across seven tumour types, DaPars reveals 1,346 genes with recurrent and tumour-specific APAs. Most APA genes (91%) have shorter 3'-untranslated regions (3' UTRs) in tumours that can avoid microRNA-mediated repression, including glutaminase (GLS), a key metabolic enzyme for tumour proliferation. Interestingly, selected APA events add strong prognostic power beyond common clinical and molecular variables, suggesting their potential as novel prognostic biomarkers. Finally, our results implicate CstF64, an essential polyadenylation factor, as a master regulator of 3'-UTR shortening across multiple tumour types.

Details

Title

Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3'-UTR landscape across seven tumour types

Author

Xia, Zheng; Donehower, Lawrence A; Cooper, Thomas A; Neilson, Joel R; Wheeler, David A; Wagner, Eric J; Li, Wei

Pages

5274

Publication year

2014

Publication date

Nov 2014

Publisher

Nature Publishing Group

e-ISSN

20411723

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1038/ncomms6274

ProQuest document ID

1626171535

Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3'-UTR landscape across seven tumour types

Jump to:

Full Text

Abstract

Details

Suggested sources