INTRODUCTION
Transcriptomic analyses have been central to insights into the biology and pathogenesis of eukaryotic pathogens. The best-characterized eukaryotic pathogen transcriptomes are those of the phylum
Early transcriptomic experiments sought to utilize techniques such as microarrays and Sanger sequencing of complementary DNA (cDNA) or expressed sequence tag (EST) libraries to understand changes in gene expression that define the pathogenesis of the parasites. These studies reveal that the timing of appearance and abundance of individual mRNAs follow developmentally distinct patterns (4), even for many predicted housekeeping genes. For example, the expression of the actin gene family in
RNA splicing provides one such source of co- and posttranscriptional regulation. In this process, introns are removed from the pre-mRNA and the exons retained to form one contiguous molecule that is then translated by the ribosome. However, for complex mRNAs, alternative splicing either of untranslated regions or the exonic chain can add additional complexity. Through this process, pre-mRNA species can be differentially spliced to create multiple distinct mature mRNAs from a single gene. This can alter regulation of the gene, e.g., by removing small-RNA binding sites (13), or diversify the proteome, as individual genes may encode multiple protein isoforms with altered structure or function (14). Indeed, proteomic analyses have revealed widespread protein isoforms arising from single genes, corresponding with various activity, stability, localization, and posttranslational modifications (15, 16). With advances in genome and transcript sequencing, it has become apparent over the last decade that alternative splicing of pre-mRNA occurs to a great extent. For example, more than 95% of human genes are alternatively spliced, and many transcript isoforms are specific to tissues or cellular states (17). Such observations suggest that RNA diversity is more complex than previously appreciated (18).
Although alternative splicing appears to play a major (though debated) role in posttranscriptional control in metazoans, the process is less understood in apicomplexans. Studies have identified apicomplexan genes with crucial alternative splicing outcomes (19). For example, alternative splicing is required for attaching a protein trafficking presequence onto two adjacent gene coding sequences (20), and normal multiorganellar targeting of the
The lack of data for apicomplexan gene isoforms is a major obstacle to dissecting the complexity of transcript outcomes. Traditionally, transcriptomic studies employing RNA-seq have relied on short-read technologies such as Illumina, 454, and Ion-Torrent (25). Despite the power of very high sequencing depth and low error rates, the short reads present a limitation in that simultaneously occurring alternative-splicing events within individual transcripts cannot be unambiguously detected or linked. Previously developed computational methods for full-length transcript assembly from short read sequencing data are often computationally intensive and can produce ambiguous or conflicting results between different algorithms (26). In addition, sequencing on cDNA strands amplified by PCR has a propensity to introduce biases in relative transcript abundances and rare isoform identification (27). Hence, it is difficult to draw functional relationships between simultaneous alternative splicing events and observable phenotypes. In apicomplexan parasites, simultaneously occurring alternative splicing events within a specific transcript isoform do occur (28). However, the studies that unearthed these transcript isoforms relied on cDNA probes and reverse transcription-PCR, and the wider extent of this phenomenon is unknown.
Recently developed third generation sequencing platforms, such as those developed by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), are capable of producing significantly longer reads at the single-molecule level. These technologies have been used in various applications such as resolving genomic and transcriptional landscapes (29, 30), single-cell transcriptome sequencing (31), and DNA or RNA methylation pattern profiling (32–34). PacBio has recently been used to generate an amplification-free transcriptome from
In this study, we evaluate the ability ONT direct RNA sequencing to characterize the alternative splicing landscape of two parasitic apicomplexans,
(This article was submitted to an online preprint archive [40].)
RESULTS
Direct RNA sequencing of
We generated ONT sequencing reads of poly(A)-selected RNA from asynchronous
FIG 1
Summary of the ONT direct RNA sequencing data from
To evaluate the capability of ONT sequencing to generate full-length transcript reads, we remapped the reads to the parasites’ annotated transcriptome and calculated the fraction of full-length transcripts per gene. We define full-length transcripts as reads that cover more than 95% of the predicted canonical transcript based on the annotation file and only considered genes with at least three mapped reads. As illustrated in Fig. 1C, many transcripts were observed to have full-length reads, particularly for the
ONT sequencing is comparable with traditional RNA-seq for quantifying gene expression levels.
To investigate the utility of ONT data to measure transcript abundance, we computed read count correlations between our ONT data set and reanalyzed published Illumina-based RNA-seq data sets on comparable parasite samples. In theory, ONT RNA sequencing reads directly correspond to complete transcripts, so quantifying the expression of genes can be done by simple counting of the assigned reads. This is dissimilar to traditional short read RNA-Seq, which necessitates a further normalization step (e.g., reads or fragments per kilobase of transcript or transcripts per million) to account for the higher number of reads that would be generated from longer transcripts. For
FIG 2
Comparisons between ONT direct RNA sequencing and Illumina data sets. (A) Correlation between transcriptome mapped read counts for
For both parasites, a higher number of gene transcripts were detected in the Illumina data sets than in the ONT data (see Table S1). We detected at least 1 Illumina read for 8,379 genes in
Intron retained transcripts are prevalent and generally nonproductive.
A major goal of mature full-length transcript sequencing is the identification of splicing isoforms. Alternative splicing can be broadly classed into four types: intron retention, alternative 3′ splice site selection, alternative 5′ splice site selection, and exon skipping (46). Of these, intron retention is the least-studied form of alternative splicing despite the numerous studies implicating the significance of the event (47–49). This is in part due to the limitations of short-read sequencing but also the relatively long and low-complexity introns in metazoan genomes, which impose limitations on sequencing and assembly. For example, the intronic sequence in the human genome is several magnitudes longer than the length of the exonic sequence (50, 51). In contrast, the compact genomes of
To monitor levels of intron retention, we identified junctions and reads that overlapped annotated intronic regions of a gene based on the annotated coordinates using FeatureCounts (54), and we tallied the proportion of reads mapping to that intron to the total reads for the same gene. Proportion scores are represented using the metric percent intron retention (PIR). Based on this analysis, 17.65% of the mapped reads were considered to have intron overlaps for
FIG 3
Analysis of intron retention using ONT direct RNA sequencing reads. Levels of intron retention are represented as the percent intron retention (PIR). (A) Distribution of intron retention levels over each gene, represented as total counts. (B) Distribution of intron retention levels over each junction, represented as the proportion of each bin count over the sum all intronic junctions. (C) Table of transcript productivity based on intron retention events as analyzed using the FLAIR pipeline. Numbers in bracket are the transcript counts (D) Relationship between levels of intron retention and gene expression. Genes were classified into 10 bins of equal number based on expression. The median expression of the genes in each bin is used to represent expression for each bin. Error bars represent the 95% confidence intervals. (E) Comparisons of intron retention quantification between ONT and Illumina data sets.
When we only consider full-length or near-full-length reads (as previously defined), the absolute quantification of intron retained genes is expectedly reduced, but the proportions of intron retained genes to expressed genes do not differ significantly (
Unusually, there are a considerable number of genes where none of the transcripts appears to have all of their annotated introns removed (
The most extreme cases of conflict between the junctions we detect and canonical gene models often highlight potential annotation errors, but there are still a strikingly large number of genes where genuine introns are retained in a high (>50%) proportion of transcripts (
By taking advantage of the full-length reads made possible by ONT, we are able to predict the protein-coding productivity of the alternate transcripts. We performed productivity analysis on full-length intron-retained reads using the FLAIR (57) pipeline, which corrects and defines unproductive transcripts as transcripts with an annotated start codon and a termination codon that is 55 nucleotides or more upstream of the 3′-most splice junction. The rationale for this definition is based on previous evidence suggesting that only premature terminating transcripts following that 55-nucleotide rule mediate an effect on mRNA turnover (58). This is a conservative estimate of productivity since it does not consider intron retention within the 3′-most splice junction. The Flair method identified >70% of the intron-retained reads to be nonproductive for either parasite (Fig. 3C), suggesting that the high level of observed intron retention only rarely corresponds to alternative protein products, and that most intron-retaining transcripts may instead be targets for nonsense mediated decay. Intron retention is known to fine-tune protein expression through this pathway in mammalian systems (59). A related prediction from other studies (47) is that the most highly expressed transcripts should have low levels of intron retention. In our analysis, we do observe a negative relationship between intron retention and gene expression levels, although it is less pronounced for
To validate the identification of intron retention events, we looked at whether retained introns apparent in the ONT data were directly supported by Illumina RNA-seq data. We normalized read counts by junction length and only considered intronic data that spanned the full junction. Based on the analysis, 77.88% for
Based on our sequencing of poly(A) tailed material combined with the sequencing chemistry of ONT direct RNA sequencing, which in theory only captures RNA with a poly(A) tail, and previous kinetic studies on RNA processing (62, 63), we do not expect the intron retained transcripts to simply be unprocessed transcripts. To confirm this, we looked for evidence that each transcript had at least been partially processed. On average, 92.76% of multi-intron genes identified as having intron retention within their transcripts had at least one junction that was canonically spliced in all the transcripts, demonstrating that our findings cannot be attributed to sequencing of pre-mRNA. To exclude the possibility that the efficacy of poly(A) tail selection of RNA in
Alternate junction splicing is often proximal and nonproductive.
Having previously identified intron-retained, read junctions using an annotated gene model approach, we used RSeQC (see Materials and Methods) to identify and quantify the other three classes of alternative splicing read junctions (exon skipping and 5′- and 3′-splice site changes) based on a similar methodology. The levels of alternative spliced junctions are calculated as the proportion of alternate junction over the total junction reads and are represented using the metric percent spliced (PS) value. Here, we filtered out junctions unsupported by a minimum coverage of three reads. Using the same threshold as before (≥10%), we identified a total of 1,138 genes for
FIG 4
Summary analysis of alternative splicing from the ONT direct RNA sequencing data sets. Pie charts show the number, proportion, and categorization of genes with alternatively spliced transcripts equaling or exceeding 10% of its total transcript. An example of each event is presented. Red and blue represents sense and antisense transcripts, respectively. Uncategorized genes represent genes where there are major mismatches between the RNAseq data and the annotated gene model.
Having identified junctions subject to alternative splicing, we then quantified what proportion of the transcripts produced at those junctions represented the noncanonical isoform. For alternative splicing involving 5′ or 3′ changes and for intron retention, substantial proportions of isoforms were represented by the alternative transcript, but the median abundance remained below 50% (data shown in Fig. 5A). However, while exon skipping was a relatively rare event across the genome (Fig. 4), for genes where exon skipping did occur, it represented a higher proportion (approximately two-thirds) of transcript isoforms for those genes than observed for the other forms of alternative splicing (Fig. 5A). We had previously published a list of genes with alternative splicing excluding intron retention in
FIG 5
(A) Levels of each major alternative splice type, represented as the percent splice (PS). (B) Distribution of alternate 5′/3′ splice positions within five bases to the dominant splice site.
To further explore consequences of alternate 5′ or 3′ junction splicing events, we investigated the length distribution of the change in intron length. We found a surprisingly high proportion (∼50%) of alternate 5′/3′ splicing to occur proximal (<6 bases) to the expected canonical site. We graphed the distribution of splicing change positions in Fig. 5B and found a substantial spike of splicing changes occurring at the position four bases inside the canonical intron boundary. We tested these junction reads for productivity for a subset of 50 of these “near-miss” alternate splice events and found all reads to prematurely terminate (unsurprisingly, given the necessary frameshift). This striking over-representation of isoforms departing from the canonical model by specifically four bases has been previously identified in metazoans (66). Very small movements in splice site usage have been described as junction wobbling, and this has been proposed as minor splicing noise, or alternatively, as an additional mechanism of regulation through the NMD pathway (66), although the reason for the specific peak of AS four bases away from the canonical junction is unknown.
DISCUSSION
Despite the apparent significance of alternative splicing in metazoans, very little is known about the process in apicomplexans. A number of targeted experiments highlight single splicing events and their impact on parasite survival, but global splicing networks have been poorly described. Untargeted RNA-seq experiments have mainly focused on whole transcriptome assembly and/or gene expression. Those studies that do monitor alternative splicing reveal its occurrence in multiple genes and stages, but the extent of these events and their phenotypic significance remains unknown. The lack of a robust methodology in defining transcript isoforms from short-read data are a particular challenge in dissecting whole-transcriptome splicing. In this study, we investigated whether ONT direct RNA sequencing could be used to explore the splicing landscape of two apicomplexan parasites:
To our knowledge, ONT direct RNA sequencing has not been previously described in apicomplexans and so our first objective was to evaluate the capability of ONT sequencing in generating sequencing reads from these parasites. We successfully obtained high-quality sequencing reads for both parasites that were comparable to that previously described in the literature for other organisms (29, 67). In particular, we obtained read lengths that exceeded 1 kb on average, many of them predicted to represent full-length or near-full-length transcripts. Interestingly, although we obtained a higher number of sequencing reads for
The quantification of gene expression is one important goal of RNA-seq. Traditional, short-read sequencing requires the generation and amplification of cDNA, which can introduce artifacts and biases. Transcriptional amplification or repression is a commonly overlooked bias (72), where the levels of global mRNA, rather than specific mRNA, may vary between different samples. Thus, using a standard amount of total RNA, as is commonly done, can mask actual detection of specific mRNA levels, even after normalization (72). Direct RNA sequencing allows these caveats to be bypassed because a standard amount of isolated mRNA instead is used as the sequencing material. However, because there is no amplification step, direct RNA sequencing is limited by the amount of mRNA that can be practically obtained and used in the sequencing process. Without sufficient sequencing material, it can be difficult to achieve the high levels of sequencing depth that is needed to analyze gene expression (73). In line with the literature, our analysis shows that the current protocol for ONT direct RNA sequencing is comparable to Illumina for quantifying gene expression in the organisms we analyzed. It can be noted, however, that sequencing depth remains the main limitation of ONT sequencing in our study. The reduced throughput and sequencing depth from ONT sequencing compared to Illumina sequencing means that genes or transcript isoforms with low expression may not be captured.
Several previous analyses have reported differences in alternative splicing types and levels among different organisms (56, 74, 75). Notably, the increase in intron number and its retention correlates strongly with multicellular complexity (as defined by numbers of distinct cell types) (76, 77). In apicomplexans, the splicing machinery appears to be largely conserved, but features of gene structure such as intron number, length, and distribution can be highly variable (19). In our study, the difference in intron number between
In addition to the difference in alternative splicing levels, there are differences in the composition of alternative splicing types between different organisms as reported by Kim et al. (75) and McGuire et al. (79). For example, exon skipping is the predominant form of alternative splicing in metazoans, and intron retention the rarest (75). Our analysis reveals that the opposite occurs in
Consistent with our observation of alternative splicing playing a minor role in generating true proteome diversity in apicomplexans, many splicing events in other eukaryotes contribute little to the protein isoform repertoire. In particular, many transcripts contain premature termination codons (PTCs), at least in humans and yeast (82, 83). Often, PTC transcripts are the result of the retention of intronic sequences that contain PTCs (84), but translational frameshifts from active splicing events such as alternative splice site selections have been similarly implicated (83, 85). PTC transcripts are not normally translated but rather targeted for degradation through the nonsense-mediated decay (NMD) pathway (86, 87). This is vital because the transcripts encode altered or truncated proteins which may exhibit deleterious activity (88). Some studies postulate that the predicted alternative splice events are the result of either experimental or transcriptional noise (89) or that a substantial portion of such transcripts are contaminating pre-mRNA molecules and so do not represent true alternative splicing (59, 90). Nevertheless, many RNA-seq-based analyses operate on the assumption that PTC transcripts are biologically significant or relevant (91). Congruously, studies focusing on mature mRNA isoforms in other organisms suggest that nonproductive transcripts mediate an additional layer of posttranscriptional regulation, through downstream RNA processing changes such as mRNA turnover, export, and microRNA silencing (47, 92, 93). Alternative splicing in apicomplexans may also play a role in these processes. Strikingly, unique PTC transcript signatures are associated with distinct cell lineages (48, 94, 95) in multicellular eukaryotes, which may be analogous to the essential role of alternative splicing observed in stage differentiation observed in
Nonproductive transcripts are typically degraded though the NMD pathway, and this has been shown to regulate gene expression at the posttranscriptional level (96). However, it is difficult to conclusively define the function of the nonproductive transcripts without experimentally testing the proteomic fates of these transcripts. In metazoans, nonproductive transcripts often highlight genes that were downregulated following a transition of cellular states (95). Our study is consistent with this, given the observation that the number of nonproductive transcripts generally decreased with increasing transcript number. This, in association with NMD, has been shown to be crucial to the maintenance and differentiation of many cell types (97, 98). In contrast, in organisms such
Genome annotation is a crucial element of RNA-seq data analysis. For
Conclusions.
In this study, we have performed the first direct transcriptomic analyses on
We demonstrated that alternative splicing is widespread in the two parasites, particularly intron retention. ONT direct RNA sequencing enabled us to determine the productivity of these transcripts without complex computational methodologies, and we show that most the transcripts are premature terminating. This has implications for the quantification of gene expression, since it is highly unlikely for the wealth of transcript diversity that we identified to directly translate to protein isoforms.
MATERIALS AND METHODS
Cell culture and RNA extraction.
Library preparation and Nanopore sequencing.
Libraries for the direct RNA sequencing were generated using the recommended protocol for the SQK-RNA001 kit (Oxford Nanopore Technologies). We loaded and sequenced the libraries on MinION R9.4 flow cells (Oxford Nanopore Technologies) for 48 h. Base calling was performed concurrent with sequencing using Albacore (v 2.0), which was integrated within the MinION software (MinKNOW, v1.10.23). Only “pass” reads as designated by the tool were used for subsequent analyses.
Mapping and qualitative analysis.
ONT sequencing data were first checked for quality with FastQC (v.0.11.7) (102). We then utilized Minimap2 (v2.1) (41) to map raw reads to the parasite genome and transcriptome from ToxoDB and PlasmoDB (r. 39), using the recommended preset commands except that intron length thresholds were set at 5,000 and 1,500 bases for
We correlated the ONT sequencing data with the Illumina data sets as previously described (36) using the wub package (v.0.2) (111). The genome coverage of sequencing data sets were generated using bedtools genome coverage (v2.27) (112) and visualized via J-Circos (v1) (113). Log2-fold ratios were calculated using deepTools bamCompare (v.2.5.1) (114). Gene body coverage was analyzed with the geneBody_coverage.py script of RSeQC (v.2.6.4) (115) using the default settings. For additional details and command lines used in these analyses, see Text S1 in the supplemental material.
Alternative splicing analysis.
We applied two approaches to analyzing alternative splicing. We first identified intron retained junctions and transcripts using featureCounts (v.1.6.2) (54) on the genome mapped reads. featureCounts matches features specified in an annotation file (gff) to mapped reads. The annotation files used in the analyses were obtained from ToxoDB and PlasmoDB (r.39), and preprocessed via ToolShed (v.1.0) (116) to specifically extract intron coordinates and gene IDs. We set a minimum threshold requiring mapping to at least six bases of the intron feature and a minimum threshold of three reads mapping to the junction/transcript to be considered for further analysis. PIR scores were calculated as the proportion of alternative splicing events to the sum of reads for each junction/gene. Productivity of full-length transcripts was analyzed using the Flair pipeline (57) using default parameters.
For the second approach, we used the junction_annotation.py script of RSeQC (v.2.6.4) (115) to identify novel or partial-novel junctions from the genome mapped reads based on the unmodified annotation file. Again, we filtered out junctions that had fewer than three supporting reads. The junctions were summarized into a table based on coordinate matching to the 5′ and/or 3′ of the expected canonical junction. We identified alternative 5′/3′ splicing and exon skipping based on the coordinates and strandedness of junctions identified by RSeQC that were either consistent with or conflicted with the annotated junctions. We manually validated the data, matched junctions to available gene IDs, and again calculated the proportion of alternative splicing events to the sum of reads for each junction. Using the final data set, we re-curated the list of intron-retained junctions to exclude for alternate 5′/3′ splice changes. Proportional Venn diagrams were drawn using BioVenn (117).
Gene set enrichment analyses were carried out by ranking the genes based on their alternative splicing levels and using the first and third quartiles of the ranked list as input for GO enrichment analysis via ToxoDB/PlasmoDB based on curated and computed assigned associations. We required the adjusted
Data availability.
Our data are viewable through the ToxoDB and PlasmoDB web resources (101), and raw data are available from the SRA (PRJNA606986).
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Copyright © 2021 Lee et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
ABSTRACT
Alternative splicing is a widespread phenomenon in metazoans by which single genes are able to produce multiple isoforms of the gene product. However, this has been poorly characterized in apicomplexans, a major phylum of some of the most important global parasites. Efforts have been hampered by atypical transcriptomic features, such as the high AU content of
IMPORTANCE We have used a novel nanopore sequencing technology to directly analyze parasite transcriptomes. The very long reads of this technology reveal the full-length genes of the parasites that cause malaria and toxoplasmosis. Gene transcripts must be processed in a process called splicing before they can be translated to protein. Our analysis reveals that these parasites very frequently only partially process their gene products, in a manner that departs dramatically from their human hosts.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer