Introduction
The phylum Apicomplexa is comprised of eukaryotic, unicellular, obligate intracellular parasites, infecting a diverse range of hosts from marine invertebrates, amphibians, reptiles, birds to mammals including humans. More than 5000 species have been described to date, and over 1 million apicomplexan species are estimated to exist (Adl et al., 2007; Pawlowski et al., 2012). Clinically and economically important apicomplexan pathogens, for example, Babesia, Cryptosporidium, Eimeria, Neospora, Theileria, Toxoplasma (Tenter et al., 2000), and the malaria-causing parasite Plasmodium wreak profound negative impacts on animal and human welfare.
Despite their diverse host tropism (Roos, 2005) and life cycle strategies, apicomplexans possess several unifying molecular and cellular features, including the abundance of specific classes of nucleic acid-binding proteins with regulatory functions in parasitic processes (Campbell et al., 2010; Flueck et al., 2010; Radke et al., 2013; Kafsack et al., 2014; Sinha et al., 2014), extracellular proteins for interactions with the host (Templeton et al., 2004a; Anantharaman et al., 2007), an apical complex comprising a system of cytoskeletal elements and secretory organelles (Hu et al., 2006), an inner membrane complex (IMC) derived from the alveoli (Eisen et al., 2006; Kono et al., 2012; Shoguchi et al., 2013), and a non-photosynthetic secondary plastid, termed the apicoplast (McFadden et al., 1996). How and when these features arose is unclear, owing to the lack of suitable outgroup species for comparative analyses.
Chromerids comprise single-celled photosynthetic colpodellids closely associated (and likely symbiotic) with corals (Cumbo et al., 2013; Janouškovec et al., 2013). Phylogenetic analysis demonstrates that these algae are closely related to Apicomplexa (Janouškovec et al., 2013), confirming the long-standing hypothesis that apicomplexan parasites originated from a free-living, photosynthetic alga (McFadden et al., 1996; Moore et al., 2008). Two known chromerid species, Chromera velia and Vitrella brassicaformis (Moore et al., 2008; Oborník et al., 2011, 2012), can be cultivated in the laboratory, and their plastid (Janouškovec et al., 2010) and mitochondrial genomes (Flegontov et al., 2015) have been described. We explored whole nuclear genomes of Chromera and Vitrella to understand how obligate intracellular parasitism has evolved in Apicomplexa.
Results and discussion
Genome assembly and annotation
A shotgun approach was used to sequence and assemble the Chromera and Vitrella nuclear genome into 5953 and 1064 scaffolds totaling 193.6 and 72.7 million base-pairs (Mb). The disparity in genome size is attributable largely to the presence of transposable elements (TEs) totaling ∼30 Mb in Chromera vs only 1.5 Mb in Vitrella, as the predicted number of protein-coding genes is almost the same at 26,112 and 22,817, respectively. Detailed characterizations of the two genomes and their gene structures are described in Appendix 1 and Supplementary files 1, 2.
Ancestral gene content of free-living and parasitic species
We constructed a phylogenetic tree of 26 species, comprising Chromera, Vitrella, 15 apicomplexans, 2 dinoflagellates, 2 ciliates, 4 stramenopiles, and a green alga. On the phylogenetic tree (Figure 1A), Chromera and Vitrella formed a group closest to the apicomplexan clade, consistent with previous phylogenies (Moore et al., 2008; Janouškovec et al., 2010, 2013, 2015; Oborník et al., 2012). The long branches from their common node are consistent with drastic differences in morphology, life cycle (Oborník et al., 2012), plastid (Janouškovec et al., 2010) and mitochondrial genomes (Flegontov et al., 2015) between the two chromerids (Figure 1A). Likewise, despite common origins, apicomplexans show extensively diverse lifestyles, including host tropism and invasion phenotypes (Figure 1B).
Figure 1.
Phylogenetic, parasitological, and genomic context of chromerids.
(A) Phylogenetic tree of 26 alveolate and outgroup species (see Figure 1—source data 1 for the list of species). Multiple sequence alignments of 101 genes, which have 1:1 orthologs across all species (Figure 1—source data 2) were concatenated to a single matrix of 33,997 aligned amino acids. A maximum likelihood tree was inferred using RAxML with 1000 bootstraps, with Chlamydomonas reinhardtii as an outgroup. All clades are supported with bootstrap values of 100% except one node (*) with 99%, and also with 1.00 posterior probability from a bayesian phylogenetic tree based on PhyloBayes (Lartillot and Philippe, 2004) (CAT-GTR). (B) Lifestyles of the apicomplexan and chromerid species under investigation. ‘?’: uncertainty due to lack of relevant data.
DOI: http://dx.doi.org/10.7554/eLife.06974.003
We reconstructed the parsimonious gene repertoires for the ancestors of the 26 species, at the nodes of the phylogenetic tree (Figure 2A; Figure 2—figure supplement 1). We note five key nodes on the evolutionary paths to present-day apicomplexans: the alveolate ancestor; the common ancestor of Apicomplexa and chromerids, termed the proto-apicomplexan ancestor; the apicomplexan ancestor; the ancestor of apicomplexan lineages, for example, coccidia and hematozoa; and extant apicomplexans (Figure 2A). Protein-coding genes from the 26 species were clustered by OrthoMCL (Li et al., 2003) into groups of homologous genes, hereafter defined as orthogroups. We note that an orthogroup could have homologous genes from different species (putative orthologs) or from the same species (putative paralogs arising from gene duplications). Gains or losses of orthogroups are displayed as green or red sections of a pie on the phylogenetic tree in Figure 2A. Divergence of the proto-apicomplexan ancestor from the alveolate ancestor (Stage I) was accompanied by losses of 1668 and gains of 2197 orthogroups (sum of the two ‘pies’ in Stage I). Transition of the free-living proto-apicomplexan ancestor to the apicomplexan ancestor (Stage II) is accompanied by many gene losses (3862 orthogroups) but few gains (81 orthogroups) (Figure 2A). Divergence of coccidians, for example, Toxoplasma gondii, from the apicomplexan ancestor (Stage III) is characterized by modest changes (537 losses; 414 gains), whereas divergence of hematozoans, for example, Plasmodium spp., is marked by drastic losses (1384 losses; 77 gains) (Figure 2A). Further divergence of apicomplexan taxa beyond Stage III is characterized by modest, lineage-specific gains (Figure 2A). Functional composition of gained genes at various stages will be discussed in later sections. Paucity of gained genes (81 orthogroups) during Stage II indicates that the genome of the free-living ancestor possessed most of the genes that were present in the common ancestor of apicomplexans and survived in their present-day descendants.
Figure 2.
Gene content changes during apicomplexan evolution.
(A) Gains and losses of orthogroups inferred based on Dollo parsimony (Csuros, 2010). Analysis based on a gene birth-and-death model provided similar results (Figure 2—figure supplement 1A). Stages I, II, and III (shown in blue, pink and green, respectively) represent groups of branches from the alveolate ancestor to apicomplexan lineage ancestors. Stage III could not be determined for Cryptosporidium lineage because of sparse taxon sampling. The area of a green or red section in a pie is proportional to the number of gained or lost orthogroups, respectively. (B, C) Overview of metabolic capabilities (B) and endomembrane components (C) in apicomplexan and chromerid ancestors. Gains and losses of enzymes and components were inferred, based on Dollo parsimony (Csuros, 2010). The pie charts are color-coded based on the fraction of enzymes or components present. Additional results from analysis of individual components and enzymes can be found in Figure 2—figure supplements 2,3,4,5, Supplementary file 3. Individual components and enzymes are listed in Figure 2—source data 1, 2. Similar analyses were performed for components encoding flagellar apparatus (Figure 2—figure supplement 5B).
DOI: http://dx.doi.org/10.7554/eLife.06974.006
Figure 2—figure supplement 1.
Gene gains and losses across the hypothetical ancestors of the 26 species under study.
Triangles pointing upward indicate gains, triangles pointing downward losses, and the total number of orthogroups at that particular node are proportional to the length or the darkness of the shade of the gray boxes. Gene gains and losses were inferred with gene birth-and-death model with posterior probability >0.3 (‘Materials and methods’).
DOI: http://dx.doi.org/10.7554/eLife.06974.009
Figure 2—figure supplement 2.
Overview of chromerid Carbamoyl Phosphate Synthetase (CPS) and Fatty Acid Synthase I (FAS I).
(A) Phylogenetic tree of CPS amino acid sequences demonstrates that Chromera and Vitrella contain only cytosolic CPS involved in pyrimidine biosynthesis, which has been duplicated in Vitrella. An additional gene coding for CPS was identified only in the Vitrella genome assembly (marked by *) was found to be bacterial contamination. None of the sequences encode a mitochondrial leader at the N-terminus of the corresponding protein. (B) Structures of selected multi-modular enzymes in Apicomplexa and chromerids. (C) Treatment of Chromera by Triclosan, an inhibitor of FASII. FASI is responsible for synthesis of short saturated FAs, while FASII mediates their modifications and synthesis of structural lipids. Production of short unsaturated FAs (C14:0; C16:0; C18:0) was not affected by Triclosan, suggesting that, in Chromera, short saturated FAs are produced by FASI and are likely modified by FASII.
DOI: http://dx.doi.org/10.7554/eLife.06974.010
Figure 2—figure supplement 3.
Summary of metabolic pathways based on KEGG Assignments.
Schematic comparison of metabolism between Chromera, Vitrella, and selected species from Apicomplexa. Phyletic patterns for conservation of metabolic function are color-coded as shown in the panel on the right. A key for the abbreviations and the details of each enzymatic reaction are found in Appendix 2 and Figure 2—source data 1.
DOI: http://dx.doi.org/10.7554/eLife.06974.011
Figure 2—figure supplement 4.
An overview of endomembrane trafficking components.
Coulson plot representation of the retention/loss of genes encoding trafficking gene complement of the Retromer, Clathrin, ESCRT, AP, and MTC family proteins amongst the 26 species. The fill colors indicate different phyla, for example, red Coulson plots for apicomplexans. Legends at the top of each column denote subunit components of complexes. For each organism, filled sectors of the pie represent presence of the corresponding protein, whereas empty sectors represent a failure to identify the corresponding protein in the genome (the method is described in Appendix 3). In cases where multiple copies of the protein are present, and can confidently be ascribed to unique genes, numbers indicate relevant paralog counts. The 26 species are shown on the left side with a phylogenetic tree. For simplicity, all subunits are listed as per yeast nomenclature, and only revert to human nomenclature when no homologous yeast gene exists. Abbreviations: CHC, Clathrin heavy chain; CLC, Clathrin light chain; V, Vps; C, CHMP; Vt, Vta1; B, Beta, M, Mu; S, Sigma, G, Gamma; A, Alpha; D, Delta; E, Epsilon; Z, Zeta; T20, Tip20; D1, Dsl1; S39, Sec39; T, Trs; T17, Tca17; C, COG; S, Sec; E, Exo; ESCRT, Endosomal Sorting Complex Required for Transport; MCT, multi-subunit tethering complex; AP, Adaptor Protein. IDs of genes encoding the components are listed in Figure 2—source data 2.
DOI: http://dx.doi.org/10.7554/eLife.06974.012
Figure 2—figure supplement 5.
Evolutionary history of genes encoding cytoskeleton across 26 species.
(A) Heatmap showing the phyletic pattern of 25 known flagella-related genes (vertical) across the 26 species (horizontal). Gene copy numbers are displayed as numerals on each cell. Black, blue, and orange bars on the right indicate intraflagellar transport, basal body, and striated fiber assemblin (SFA), respectively. The IDs of genes encoding flagellar components are listed in Figure 2—figure supplement 5—source data 1. (B) Schematic representation of losses along the evolutionary paths. See Figure 2B,C for legend. Blue and Brown colored boxes denote presence of basal body and IFT proteins. (C) Heatmap showing distribution of actin and actin-regulatory proteins across the 26 species. They were annotated based on previously defined classification rules with Pfam domain or based on orthology (OrthoMCL clustering) with known actin and actin-related genes. The numbers of genes are shown as numerals within each cell. (D) Phylogenetic tree of SFA genes, identified with the canonical SF-assemblin domain (PF06705) (closed circles) and those with the variant SF-assemblin domain (open circles) for our downstream analyses in Figure 4. The variant SF-assemblin domain, where some amino acid sequences were rearranged, was confirmed by manual inspection of the alignment (data not shown). The gray shade indicates alveolate-specific SFAs. (E) A network view of amino acid sequence homology between ISP family genes. Edges are drawn depending on the strength of the sequence homology: dotted (BLASTP E value <10−20) or solid (BLASTP E value <10−30). The two letters within the node refer to acronyms of the species name and the node color species group: red (Plasmodium); green (coccidians); magenta (Cryptosporidia); orange (piroplasms); yellow (chromerids); and navy (dinoflagellates). ISP3 has been duplicated and diverged from ISP1 after the common ancestor of coccidians, piroplasms, and Plasmodium spp. split from Cryptosporidia.
DOI: http://dx.doi.org/10.7554/eLife.06974.013
Progressive, lineage-specific losses during apicomplexan evolution
Parasite evolution has been associated with genome reduction across several branches of the tree of life (Keeling, 2004; Sakharkar et al., 2004; Morrison et al., 2007). Examples also exist, however, where parasite genomes are not reduced (Pombert et al., 2014) but expanded (Raffaele and Kamoun, 2012), underscoring the fact that the genome reduction process during parasite evolution is not completely understood. We sought to characterize in detail the dynamics of gene loss across apicomplexan evolution, particularly for components of molecular processes that are hallmarks of free-living lifestyle. We performed a systematic analysis of the cellular components involved in: (1) cellular metabolic pathways; (2) the endomembrane trafficking systems, regulating the movement of molecules across intracellular compartments in eukaryotes (Leung et al., 2008); and (3) the flagellum, a highly conserved apparatus for motility in aqueous environment (Silflow and Lefebvre, 2001).
The inferred proto-apicomplexan ancestor, like present-day chromerids, possessed complete metabolic pathways for sugar metabolism, assimilation of nitrate and sulfite, and photosynthesis-related functions (Figure 2B, Figure 2—figure supplement 3, Appendix 2, and Supplementary file 3). Unlike in other photosynthetic algae, both Chromera and Vitrella initiate heme synthesis in the mitochondrion using aminolevulinate synthase (C4 pathway), which thus far has been found only in a few eukaryotic heterotrophs, such as Euglena gracilis, dinoflagellates, and apicomplexans (Kořený et al., 2011; van Dooren et al., 2012; Danne et al., 2013) (Appendix 2 and Supplementary file 4). Both chromerids and apicomplexans encode modular multi-domain fatty acid synthase I (FASI)/polyketide synthase enzymes and single-domain FASII components (Figure 2—figure supplement 2A,B). Treatment of Chromera with a FASII inhibitor triclosan showed decreased production of long chain fatty acids (Figure 2—figure supplement 2C and Appendix 2), suggesting that Chromera synthesizes short-chain saturated fatty acids using the FASI pathway, which are then elongated using the FASII pathway. This was previously demonstrated in Toxoplasma, an apicomplexan that possesses both FASI and FASII (Mazumdar and Striepen, 2007). Likely, the proto-apicomplexan ancestor was a phototrophic alga harboring characteristic metabolic features previously found only in apicomplexan parasites, especially with regard to plastid-associated metabolic functions (see above and other examples in Appendix 2) (Kořený et al., 2011; van Dooren et al., 2012; Danne et al., 2013).
Transition to an apicomplexan ancestor (Stage II) was accompanied by the loss of metabolic processes including photosynthesis and sterol biosynthesis (Figure 2B and Figure 2—figure supplement 3). The apicomplexan ancestor appeared to possess a significant complement of enzymes in various pathways (Figure 2B) (Lim and McFadden, 2010). The differentiation of apicomplexan lineages (Stage III) was accompanied by further lineage-specific losses: for example, loss of FASI in Plasmodium spp, loss of FASII in Cryptosporidium spp., which has also lost the apicoplast, and loss of enzymes mediating polyamine biosynthesis in all lineages except Plasmodium (Figure 2B and Figure 2—figure supplement 3). These support the notion that enzymes involved in cellular metabolism critical for free-living organisms were not completely lost during the transition to the apicomplexan ancestor, but were further lost during subsequent differentiation and host-adaptation of apicomplexan lineages.
The proto-apicomplexan had a nearly complete repertoire of the endomembrane trafficking complexes, and much of this repertoire persisted through to the apicomplexan ancestor (Stage II) (Hager et al., 1999; Klinger et al., 2013a) (Figure 2C, Figure 2—figure supplement 4 and Appendix 3). Differentiation of apicomplexan lineages (Stage III) was accompanied by lineage-specific losses, for example, loss of the Endosomal Sorting Complex Required for Transport II (ESCRTII) in all lineages except in piroplasms, whereas some components were retained across all lineages, such as the retromer complex components and clathrin, both systems implicated in invasion processes (Pieperhoff et al., 2013; Tomavo et al., 2013) (Figure 2C, Figure 2—figure supplement 4 and Appendix 3). These lineage-specific losses have led to diverse, reduced sets of endomembrane trafficking components in present-day apicomplexans (Hager et al., 1999; Klinger et al., 2013a). Some of these components that were present in chromerids were absent in specific apicomplexan lineages as well as in dinoflagellates and ciliates, further clarifying that these losses are independent, lineage-specific events rather than ancient, shared events.
All known components of flagella were present in the proto-apicomplexan ancestor (Figure 2—figure supplement 5A,B). Most of the components were retained in the apicomplexan ancestor (Stage II), but losses occurred as apicomplexan lineages differentiated (Stage III). Components of intraflagellar transport, which are typically essential for assembling flagella, were lost in the other lineages except in coccidians (Figure 2—figure supplement 5A,B). The basal body proteins, which support an organizing center for microtubules, were lost from piroplasms. Some striated fiber assemblin (SFA) proteins, typically associated with basal body rootlets, were maintained in all apicomplexan lineages including piroplasms (Figure 2—figure supplement 5A,B,D); their presence has been hailed as evidence that some flagellar-proteins are repurposed for new functions in apicomplexans (see below) (Francia et al., 2012).
In summary, one of the major events during apicomplexan evolution is progressive, continued loss of components important for free-living organisms. While Stage II was accompanied by a massive loss of such components including those implicated in photosynthesis, the apicomplexan ancestor still possessed many proteins, which were lost later during differentiation of lineages with diverse life strategies.
Emergent features of apicomplexans
Evolution of present-day apicomplexan parasites was accompanied not only by gene losses as noted above (Figure 2) but also by gene gains. We sought to determine if genes gained at a particular stage of apicomplexan evolution, as depicted by the gray violin in Figure 3, would be over-represented with those involved in parasitic processes such as intracellular invasion into and egress from host cells. For Plasmodium falciparum and T. gondii, we compiled three classes of protein-coding genes directly or indirectly involved in parasitic processes of apicomplexans based on in silico prediction or information from previous functional studies (‘Materials and methods’). Extracellular proteins are secreted by the apicomplexans for various parasitic processes, for example, some of them are targeted to the host cytoplasm, nucleus, and plasma membrane to modulate parasite–host interactions (Mundwiler-Pachlatko and Beck, 2013; Bougdour et al., 2014). Cytoskeletal proteins provide structural support to the cell and also the molecular machinery for motility and intracellular invasion (Baum et al., 2006; Soldati-Favre, 2008). Proteins with DNA-binding domains (DBDs) or RNA-binding domains (RBDs) can regulate various molecular processes of apicomplexan parasites. Indeed, proteins with AP2 (apiAP2) DBD have been shown to act as genetic control switches for diverse apicomplexan processes (Balaji et al., 2005; Campbell et al., 2010; Flueck et al., 2010; Radke et al., 2013; Sinha et al., 2014; Kaneko et al., 2015).
Figure 3.
Evolutionary history of Plasmodium falciparum and Toxoplasma gondii genes.
Violin plots showing distribution of evolutionary ages of genes (Y-axis: from species-specific (bottom) to deeply conserved (top)) in P. falciparum (A) and T. gondii (B). Evolutionary age of a gene is defined as the earliest node on the evolutionary path of the phylogenetic tree where homolog can be detected (‘Materials and methods’). The horizontal thickness of a violin is proportional to the number of genes (gray) or the fraction of genes (yellow) in a functional category (X-axis) out of all with the same evolutionary age. Selected functional sub-categories are overlaid with red, green, or blue violin plots. The maximum width of each violin is scaled to be uniform across categories. Inner boxes in the gray violins indicate inter-quartile ranges and circles indicate medians. Colored shades along the X-axis indicate Stages I–III (Figure 2). Extracellular proteins include proteins targeted to host cytoplasm, nucleus, and plasma membrane (‘exportome’) and all other proteins, which are secreted or localized on the parasite surface (‘others’). Cytoskeletal proteins include proteins associated with ‘actomyosin motor complex’ and ‘IMC’. All extracellular and cytoskeletal proteins are listed in Figure 3—source data 1, 2. Nucleic acid-binding proteins are predicted in silico based on presence of DNA-binding domains (DBDs) and RNA-binding domains (RBDs). See ‘Materials and methods’ for details on how these genes are defined and compiled. Domain architectures of representative extracellular proteins in apicomplexans and chromerids are displayed as schematics in Figure 3—figure supplement 4. Sequence homology networks (Figure 2—figure supplement 5E and Figure 3—figure supplements 1B, 2B, 3B) and gene gains and losses on the phylogenetic tree (Figure 3—figure supplements 1A, 2A, 3A) provide complementary views on the evolutionary history of these genes.
DOI: http://dx.doi.org/10.7554/eLife.06974.015
Figure 3—figure supplement 1.
Evolutionary history of apiAP2 genes.
(A) Gains and losses of apiAP2 genes inferred with Dollo parsimony. Triangles pointing upward indicate gains, triangles pointing downward losses, and the total number of orthogroups at that particular node are proportional to the length or the shade thickness of the gray boxes. (B) Network view of amino acid sequence homology among all apiAP2. Edges are drawn depending on the strength of the sequence homology: dotted (BLASTP E value <10−20) or solid (BLASTP E value <10−30). The two letters within the node refer to acronyms of the species name and the node color species group: red (Plasmodium); green (coccidians); magenta (Cryptosporidia); orange (piroplasms); yellow (chromerids); and navy (dinoflagellates). For example, nodes from P. falciparum are shaded red and lettered with ‘pf’. Connected nodes with different or the same species names indicate putative orthologs or paralog, respectively. Nodes without any edges, likely to be species-specific genes without other paralogs, were not displayed. Connections between nodes of different colors indicate deep evolutionary conservation. For example, connections between red and yellow nodes indicate orthologs shared between Plasmodium spp. and chromerids, which means that they have been gained by the proto-apicomplexan ancestor after its split from dinoflagellates (Stage I). (C) Gains and losses of DBD genes, excluding apiAP2, inferred with Dollo parsimony. (D) Bar chart showing putative apiAP2 paralogs (light gray), and singletons (black). We note a paucity of duplicate apiAP2 genes in apicomplexans compared to their abundance in chromerids.
DOI: http://dx.doi.org/10.7554/eLife.06974.018
Figure 3—figure supplement 2.
Evolutionary history of alveolins.
(A) Gains and losses of alveolins inferred with Dollo parsimony. Triangles pointing upward indicate gains, triangles pointing downward losses, and the total number of orthogroups at that particular node are proportional to the length or the shade thickness of the gray boxes. (B) Network view of amino acid sequence homology among all alveolins. Edges are drawn depending on the strength of the sequence homology: dotted (BLASTP E value <10−20) or solid (BLASTP E value <10−30). The two letters within the node refer to acronyms of the species name and the node color species group: red (Plasmodium); green (coccidians); magenta (Cryptosporidia); orange (piroplasms); yellow (chromerids); and navy (dinoflagellates).
DOI: http://dx.doi.org/10.7554/eLife.06974.019
Figure 3—figure supplement 3.
Evolutionary history of RAP genes.
(A) Gains and losses of RAP genes inferred with Dollo parsimony. Triangles pointing upward indicate gains, triangles pointing downward losses, and the total number of orthogroups at that particular node are proportional to the length or the shade thickness of the gray boxes. (B) Network view of amino acid sequence homology among all RAP genes. Edges are drawn depending on the strength of the sequence homology: dotted (BLASTP E value <10−20) or solid (BLASTP E value <10−30). The two letters within the node refer to acronyms of the species name and the node color species group: red (Plasmodium); green (coccidians); magenta (Cryptosporidia); orange (piroplasms); yellow (chromerids); and navy (dinoflagellates).
DOI: http://dx.doi.org/10.7554/eLife.06974.020
Figure 3—figure supplement 4.
Domain architectures of extracellular proteins in chromerids and apicomplexans.
Examples of domain architectures of predicted chromerid and apicomplexan extracellular proteins and their phyletic distribution. (A) Extracellular proteins with apparent orthologs in coccidians (for example, Toxoplasma) and Cryptosporidium. (B) Extracellular proteins shared with Cryptosporidium, but not identified in Toxoplasma and other apicomplexans. (C) Extracellular proteins conserved as apparent orthologs throughout the Apicomplexa. (D) An example of an apparent alveolate extracellular protein conserved in chromerids, Perkinsus and ciliates, but absent in apicomplexans. (E) Lineage-specific extracellular proteins identified only in one or both chromerids. (F) Examples of the diversity of domain architectures within chromerids for CRMP family proteins. (G) Examples of apicomplexan-specific extracellular proteins. Descriptions of domains and representative genes are provided in Appendix 5. A yellow rectangle indicates predicted signal peptide sequence. TM and GPI denote predicted transmembrane domain and glycosylphosphatidylinositol (GPI anchor), respectively. Protein lengths are not drawn to scale.
DOI: http://dx.doi.org/10.7554/eLife.06974.021
Genes encoding extracellular proteins exported into the host environments were over-represented among those gained after Stage III (Figure 3), suggesting that adaptation to specific hosts was accompanied by expansion of extracellular proteins mediating host–parasite interactions (Templeton et al., 2004a; Anantharaman et al., 2007). Stage III was accompanied by gains of those encoding DBD proteins, mostly apiAP2 proteins (Figure 3 and Figure 3—figure supplement 1A,B), suggesting extensive regulatory changes mediated by apiAP2 proteins during lineage differentiation. We note that losses of other canonical DBD proteins, for example, proteins with HSF_DNA-bind (Pfam: PF00447) domain during transition to apicomplexan ancestor (Stage II) and proteins with Tub (Pfam: PF01167) domain along the piroplasm lineage, contribute to further dominance of apiAP2 among the DBD proteins (Figure 3—figure supplement 1C). Stage II was accompanied by over-represented gains of various cytoskeletal components, including alveolins, those of the actomyosin complex (e.g., myosins) and glideosome-associated proteins with multiple membrane spans 1 and 3 (GAPM1 and GAPM3), suggesting that the molecular machinery powering gliding motility, which is essential for host cell invasion arose during evolution to apicomplexans (Frenal et al., 2010) (Figure 3, Figure 3—figure supplement 2, and Appendix 4). Gene gains during Stage I were over-represented by proteins with ‘RBD abundant in Apicomplexans’ (RAP, Pfam: PF08373) (Lee and Hong, 2004), many of which were conserved as one-to-one orthologs across descending lineages, suggesting development of evolutionarily conserved functions before apicomplexans and chromerids diverged (Figure 3, and Figure 3—figure supplement 3). Chromerid genomes encode many orthologs of apicomplexan cytoskeletal proteins (Appendix 4), including GAPM2, a member of an important protein family for apicomplexan cytoskeletal structure and gliding motility (Bullen et al., 2009), and the IMC sub-compartment protein family (ISP), implicated in establishing apical polarity and coordinating the unique cell cycle of apicomplexans (Poulin et al., 2013) (Figure 2—figure supplement 5E). These data suggest that some components existed in the free-living proto-apicomplexan ancestor and were subsequently repurposed for parasitic processes of apicomplexans.
The Chromera and Vitrella genomes encode many proteins that are specific to chromerids yet contain functional domains implicated in molecular processes of apicomplexan parasites. For example, there are chromerid-specific proteins with domain architectures similar to those in apicomplexan extracellular proteins, including those previously implicated in host interactions and described in apicomplexans only (Figure 3—figure supplement 4 and Appendix 5, and Supplementary file 5). Presence of such chromerid proteins implies some commonality in extracellular recognition and cross-species interactions and this correlates well with the presumed associations with the coral holobiont (Janouškovec et al., 2012, 2013; Cumbo et al., 2013). Importantly, chromerid genomes encode numerous apiAP2 proteins, more abundant than dinoflagellates, suggesting that they have expanded in the proto-apicomplexan ancestor after it split from dinoflagellates (Figure 3—figure supplement 1D). Many of the chromerid apiAP2 proteins belong to putative paralogous clusters, suggesting that their expansion was driven by gene duplication (Figure 3—figure supplement 1D; Appendix 6). Only a small subset of the apiAP2 proteins are shared across apicomplexans, suggesting that the large apiAP2 complement in the proto-apicomplexan ancestor has diversified independently in descending lineages (Figure 3—figure supplement 1A).
In summary, genes encoding critical components of the parasitic lifestyle of apicomplexans were gained at different stages of apicomplexan evolution, some implying subsequent specialization to particular host niches, but others suggesting early adaptations before committing to parasitic lifestyle. This is evident by chromerid orthologs of many such proteins, for example, RAP proteins and specialized cytoskeletal components. Further, chromerid genomes encode chromerid-specific proteins that are not detected as orthologs of apicomplexan proteins but still have functional domains implicated in parasitic processes in apicomplexans. Together, these data imply that a molecular transition had occurred in free-living ancestors of apicomplexans, providing a foundation for host–parasite interactions and further adaptation.
Conserved gene expression programs in the proto-apicomplexan ancestor
Chromera and Vitrella genomes allowed us to reconstruct the gene content of the free-living ancestor of apicomplexans. To infer their putative functions using genome-wide gene expression information (Hu et al., 2010), we cultured Chromera under 36 different combinations of temperatures, iron and salt concentrations, and generated their gene expression profiles by RNA-seq (Box et al., 2005). In addition, we have obtained a publicly available growth perturbation data set for P. falciparum (Hu et al., 2010). There were 1918 orthogroups shared between the two species. We identified pairs of orthogroups that are co-expressed, that is, showing similar expression patterns across the various conditions, in both species (‘Materials and methods’) (Figure 4—figure supplement 1A). Such an orthogroup pair, that is, those with conserved co-expression between the two species, would include candidate genes that have been co-regulated together during apicomplexan evolution, from the free-living ancestor to present-day parasites due to conserved functions. This approach, successfully utilized by several studies in the past (Stuart et al., 2003; Mutwil et al., 2011; Gerstein et al., 2014), led to the following two observations in this study.
Many RAP genes appeared during Stage I and have been conserved across the descending phyla (Figure 3 and Figure 3—figure supplement 3), but their precise cellular roles are unknown. For 11 out of 12 orthogroups with RAP domains, co-expressed orthogroups overlapped significantly (Fisher's exact test, p < 0.05) between P. falciparum and Chromera, suggesting involvement of RAP proteins in cellular processes evolutionarily conserved across apicomplexans and chromerids (Figure 4A). RAP and their co-expressed orthogroups encode proteins with putative mitochondrial import signals more often than expected by chance in Chromera and P. falciparum (Fisher's exact test, p < 0.05) (Figure 4B), and also in other apicomplexans and chromerids (Figure 4—figure supplement 1B). We have randomly chosen three Toxoplasma RAP genes with predicted mitochondrial localization signals (Supplementary file 6) and confirmed experimentally by 3′ endogenous gene-tagging with reporter epitopes that all three are localized to the organelle (Figure 4C). Some of the orthogroups co-expressed with orthogroups containing RAP domains encode protein products predicted to be metabolic enzymes, implying possible involvement of RAPs in mitochondrial metabolism (Figure 4—figure supplement 1C). Consistent with this, the Cryptosporidium lineage that has a highly reduced mitochondrion lacking both the genome and most canonical metabolic pathways (Abrahamsen et al., 2004; Xu et al., 2004) is the only apicomplexan group to have also lost its RAP repertoire (Figure 4—figure supplement 1D). Loss of RAPs along with a set of mitochondrial functions in this lineage is consistent with a mitochondrial role for RAPs. We speculate that the free-living proto-apicomplexan ancestor possessed within its mitochondrion a regulatory process mediated by RNA-binding activities of the RAP proteins, which has been retained by the extant apicomplexans and chromerids.
Figure 4.
Conserved transcriptional programs in apicomplexans and chromerids.
(A) Boxplot showing the extent of evolutionary conservation of transcriptional programs for all orthogroups or those with RAP domains. X-axis: ‘All’ (all orthogroups excluding RAP); ‘RAP’ (orthogroups with RAP domains). Y-axis: log-transformed odds-ratio, representing, for each orthogroup, the degree of overlap between its co-expressed orthogroups in Chromera and those in P. falciparum. (B) Bar chart showing the fraction of orthogroups (Y-axis) predicted to be targeted to mitochondria in both species (‘Materials and methods’). The number of genes are displayed below each bar. X-axis: ‘All’ (all orthogroups excluding the other two categories); ‘Coexpr’ (orthogroups co-expressed with RAP in both species); ‘RAP’ (orthogroups with RAP domains). The fractions in 'Coexpr' and 'RAP' groups were compared against the fraction in 'All', and p-values based Fisher's exact test are displayed above the bar. Files deposited in European Nucleotide Archive are listed in Figure 4—source data 1 with corresponding conditions. (C) Sub-cellular localization of RAP proteins encoded by TGME49_237010, TGME49_269830, and TGME49_289200 was tested in T. gondii by 3′ tagging of the endogenous genes with the coding sequence for the hemagglutinin epitope, together with a mitochondrial marker Tom40. See Supplementary file 6 for details of the localization predictions. (D) Distributions of Spearman's rank correlation coefficients of gene expression between all possible pairs from the 80 orthogroups implicated in invasion processes in apicomplexans (black outline) were compared against those from 80 randomly selected ones (histogram). The p value indicates statistical significance of the difference based on 10,000 random samplings. The 80 orthogroups and corresponding genes in Chromera and P. falciparum are listed in Figure 4—source data 2. (E) Heatmap showing a matrix of correlation coefficients amongst the 80 orthogroups. Based on a hierarchical clustering, we classified them into six co-expression modules, labeled as numeral 1–6. (F) Heatmap showing correlation coefficients with striated fiber assemblin (SFA) (Cvel_872). The color scheme is the same as in (E). (G) Heatmap indicating statistical significance of conserved transcriptional program, that is, the odds-ratio as defined in (A) (Fisher's exact test, p < 0.05 (gray); p < 0.005 (black)).
DOI: http://dx.doi.org/10.7554/eLife.06974.022
Figure 4—figure supplement 1.
Mitochondrial targeting of RAP and its putative role in mitochondrial metabolism.
(A) Heatmap displaying the extent of association between correlation coefficients of orthogroup-pairs in P. falciparum (Y-axis) and those in Chromera (X-axis). The color scale represents the percentile of the observed frequency amongst randomly expected frequencies when the orthology were shuffled. We observed high percentiles along the 45° diagonal, indicating that the number of orthogroup-pairs that are co-expressed in both species is greater than expected by chance. (B) Bar chart showing the fraction of orthogroups (Y-axis) predicted to be targeted to mitochondria in selected species. The orthogroups and the three categories, that is, ‘All’, ‘Coexpr’, and ‘RAP’ (X-asis) are based on those from Plasmodium and Chromera (Figure 4B). (C) Chromera expression profiles under diverse growth conditions (‘Materials and methods’) are shown for mitochondria targeted RAPs and co-expressed orthogroups. Expression levels were scaled to have a mean of 0 and a standard deviation of 1. Y-axis: orthogroups ordered based on the hierarchical clustering of their expression patterns. X-axis: combinations of different salt and iron (Fe) concentrations and temperatures in which the Chromera cultures were grown. The color scale ranged from red (low expression) to green (high expression). The asterisk (*) denotes genes encoding NADH-dependent oxidoreductase and mitochondrial acidic matrix protein 33, involved in mitochondrial oxidative phosphorylation. (D) Abundance of RAP proteins in alveolate species.
DOI: http://dx.doi.org/10.7554/eLife.06974.025
As discussed earlier, the proto-apicomplexan ancestor appears to have possessed genes implicated in invasion processes of present-day apicomplexans (Figure 3). Among the 1918 orthogroups, we identified 80 orthogroups comprising genes functionally annotated as implicated in invasion processes. The frequency of co-expression amongst them in the free-living Chromera was significantly higher than expected by chance (p < 0.0005), suggesting pre-existing functional relationships before transitioning to parasites (Figure 4D). We identified several modules or groups of co-expressed orthogroups (Figure 4E). In one of the co-expression modules (numbered 1 in Figures 4E), 9 out of 10 orthogroups are co-expressed with a gene encoding SFA (Cvel_872), a key protein for organizing the basal bodies of the flagellar apparatus in algae and the apical complexes in apicomplexans (Kawase et al., 2007; Francia et al., 2012) (Figure 4F). We note that SFAs are the only flagellar components found in all apicomplexans tested (Figure 2—figure supplement 5A). Also in this module, for 9 out of 10 orthogroups, their co-expressed orthogroups in Chromera overlapped significantly with those in P. falciparum (Fisher's exact test, p < 0.05), indicating that their regulatory programs have been evolutionarily conserved (Figure 4G). This module include various types of genes implicated in host cell invasion processes of apicomplexans such as genes encoding rhoptry protein ROP9, apical sushi protein ASP, and gliding motility components GAP40 and GAPM2. The apical complex has been postulated to have emerged from the flagellar apparatus and associated cellular transport systems in free-living algae, based on ultrastructural evidence (Okamoto and Keeling, 2014; Portman et al., 2014). These results suggest that, in the free-living ancestor, some of the genes implicated in the invasion process of present-day apicomplexans were functionally associated with those implicated in flagellar motility, providing the much-needed genetic evidence for the postulate. We speculate that a group of functionally related proteins associated with the flagellar apparatus was repurposed as a module of the apical complex and became a foundation for the invasion machinery.
Conclusion
Analysis of Chromera and Vitrella genomes has enabled insights into how apicomplexan parasites have evolved from free-living ancestors. The transition to parasitism was accompanied by massive genomic loss that continued as its descendants became specialized intracellular parasites infecting diverse hosts. The genome of free-living photosynthetic ancestors encodes many component proteins previously assumed to be restricted to the parasitic apicomplexan lineages. Such pre-existing components, including those of what would later become part of the invasion machinery, were co-opted during evolution to facilitate a successful parasitic lifestyle in multiple hosts. The genome of the proto-apicomplexan ancestor served as a molecular blueprint for evolution of the most successful group of eukaryotic parasites known to date.
Data access
Sequencing data have been deposited in the European Bioinformatics Institute under the European Nucleotide Archive (ENA) sample accession number ERP006228 for C. velia and ERP006229 for V. brassicaformis for all DNA- and RNA-seq experiments. The assembly and the annotations were submitted under accession numbers CDMZ01000001-CDMZ01005953 for C. velia and CDMY01000001-CDMY01001064 for V. brassicaformis. Some of the Vitrella DNA-seq experiments were done at Broad Institute and are deposited at Short Read Archive under accession numbers SRX152523 and SRX152525. The annotations and assemblies can be viewed and queried in EupathDB (http://cryptodb.org/cryptodb/).
Materials and methods
DNA preparation and sequencing
Genomic DNA of C. velia CCMP2878 (subsequently referred to as Chromera) and V. brassicaformis CCMP3155 (subsequently referred to as Vitrella) was extracted and then sheared into short fragment size libraries (300–500 base pair (bp)) and large fragment size libraries (3–8 kbp fragments) by focused-ultrasonication (Covaris Inc., Woburn, USA). The last 3–8 kb libraries were prepared following Nextera mate pair protocol, following manufacturer's instructions. We used three different methods to generate the library: the Illumina (Illumina, San Diego, CA) TruSeq DNA protocol LT Sample Prep Kit (catalog no. #FC-121-2001), an amplification-free method (Kozarewa et al., 2009) (TruSeq DNA PCR-Free LT Sample Preparation Kit catalog no. #FC-121-3001) and the Illumina Nextera Mate Pair Sample Preparation Kit (catalog no. #FC-132-1001). The libraries were sequenced on an Illumina HiSeq2000 platform following the manufacturers standard cluster generation and sequencing protocols (Bentley et al., 2008; Quail et al., 2012). Image analysis, base-calling, and quality filtering were processed by Illumina software.
RNA preparation and sequencing
For isolation of RNAs, Chromera and Vitrella were grown under standard culture conditions (Oborník et al., 2012). Total RNA was extracted from the cells using TRIzol. The polyA+ RNA fraction was selected using oligo(dT) beads, and RNA-seq libraries were prepared using TruSeq RNA Sample Prep kit (catalog no. FC-122-1001). Strand-specific RNA-seq libraries were prepared using TruSeq Stranded mRNA LT Sample Prep Kit (catalog no. RS-122-2101) and sequenced as paired-end (2 x 100 bp) reads on a HiSeq2000 platform.
We performed additional RNA sequencing of Chromera subject to various environmental perturbations, to construct a global gene expression network based on transcriptomes under various perturbation conditions during in vitro growth. Chromera cultures were exposed to a combination of stresses (Figure 4—figure supplement 1C). First, six different media were prepared from the combinations of salt concentration (16.7 g/l, 33.3 g/l, 66.6 g/l) and iron deficiency by chelation (Sutak et al., 2010). After seeding, the cultures were maintained in the normal temperature and light condition for eleven days (Oborník et al., 2011). After randomization, the cultures were incubated at 26°C, 37°C, or 14°C for 0 (control), 0.5, or 2 hr. There were two biological replicates of each, in total 66 flasks of the cultures. Then, the cultures were processed with centrifugation at 3500 RPM for 15 min at 4°C to precipitate the cells. Total RNA was extracted from the 66 cultures after the treatments using Norgen RNA Extraction kit based on manufacturer's protocol (Norgen Biotek Corporation, Canada). RNA quality was assessed using Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA). RNA concentration was determined with a Qubit (Invitrogen, Carlsbad, CA). Strand-specific RNA-seq libraries were prepared from extracted high-quality RNAs (RIN ≥8.0 as measured on an Agilent Bioanalyser 2100) using the Illumina TrueSeq LT stranded RNA sample kit according to manufacturer's instructions. Prior to cluster generation, concentration and size of libraries were assayed using the Agilent DNA1000 kit. Libraries from all samples were sequenced as single-end (1 x 50 bp) reads on the Illumina HiSeq 2000. The RNA-seq reads were aligned to the reference genome using tophat (version 2.0.8, default parameters) and cufflinks (version 2-1.0.2, default parameters) (Trapnell et al., 2012). The FPKM values were log2 normalized with an offset of 1 and were further corrected for different distributions across the samples using the quantile normalization method (Bolstad et al., 2003).
Genome assembly
For Vitrella, the reads were corrected and assembled followed by several base correction, scaffolding and gap filling steps as briefly described below. As first step, the short insert libraries were corrected with SGA (Simpson and Durbin, 2012) (version 0.9.19). The corrected reads were assembled with velvet (Zerbino and Birney, 2008) (version 1.2.08). Iterating through different parameter settings, we choose a k-mer of 75 bp as the best parameter set. The resulting scaffolds (larger than 1 kb) were further scaffolded with SSPACE (Boetzer et al., 2011) using first the Illumina library (insert = 550 bp) and larger insert (1 kb) Illumina library reads. Sequencing gaps were closed with Gapfiller (Boetzer and Pirovano, 2012) (version 1.1.1) with two iterations, using the bowtie mapping option and PCR-Free libraries. Base pair call errors were corrected in three iterations of ICORN (Otto et al., 2010), using the amplification-free library. Furthermore, sequencing gaps were closed, using IMAGE (Tsai et al., 2010) with the amplification-free library. The assembly was quality-controlled using REAPR (Hunt et al., 2013), breaking the contigs at possible miss-assemblies, using the mate pair libraries. This was followed by another scaffolding step. We systematically removed 620 scaffolds containing 25.65 Mb representing the bacterial contamination. The Vitrella CCMP3155 assembly contains 72.7 Mb (including 931,689 N's) in 1064 scaffolds (ENA accession numbers CDMY01000001-CDMY01001064). The scaffolds were constructed from 4177 contigs.
For Chromera, the assembly pipeline and the algorithms used were the same as Vitrella, but due to the larger size, higher amount of low-complexity regions, and difficulties in generating high-quality large insert size libraries, additional steps were included to the assembly process. First, the reads of the PCR-Free library were corrected with SGA (Simpson and Durbin, 2012) and then assembled with velvet and using a k-mer of 71 (version August 2011). Next, the contigs were scaffolded, gapfilled, and corrected with ICORN, as described earlier. We mapped the reads of all large insert size libraries using SMALT (ftp://ftp.sanger.ac.uk/pub/resources/software/smalt/). We excluded scaffolds smaller than 1 kb. Different iterations with SSPACE were undertaken and the assembly was quality-checked with REAPR. After scaffolding, gapfiller and IMAGE were run as above, followed by ICORN. The 1725 scaffolds (spanning 16.02 Mb) representing bacterial contamination were removed. The final assembly of Chromera CCMP2878 contains 193.66 Mb (including 582,995 N's) in 5953 scaffolds (ENA accession numbers CDMZ01000001-CDMZ01005953). The scaffolds are constructed from 13,987 contigs.
Gene prediction
We used Augustus (Stanke et al., 2006) (version 2.5.5) for gene prediction. We manually curated 716 and 245 gene models for Chromera and Vitrella, respectively, using BLAST similarity-based approaches, and we also generated automated gene models using Cufflinks (Trapnell et al., 2012) from RNA-seq data sets, in order to use them as a ‘training gene model set’ for Augustus prediction. The strand-specific RNA-Seq, mapped with TopHat2 (Kim et al., 2013), was used as evidence in Augustus for intron evidence.
In summary, from the Chromera and Vitrella genome, we ab initio predicted 30,478 and 23,503 protein-coding genes, respectively, of which 18,829 and 18,240 were detected as being expressed from RNA-seq evidence as poly A+ transcripts (Supplementary file 1). Excluding putative TEs, 26,112 and 22,817 genes were predicted as protein-coding genes in Chromera and Vitrella. We annotated partial genes, when a gene probably spans more than one scaffold, located at the borders of a scaffold. We demarcated and annotated as pseudo genes if they contain in frame stop codons. We flagged gene models as transposon elements, if they overlap with the predicted TE regions and had no more than three and two intron for Chromera and Vitrella, respectively. To annotate untranslated regions (UTRs) of the predicted protein-coding genes, we used CRAIG (Bernal et al., 2007) with default parameters with mapping of the RNA-Seq data as computed by GSNAP (Wu and Nacu, 2010) (version 2013-08-19, default parameters). The annotation of both genomes has the ENA accession numbers CDMZ01000001-CDMZ01005953 and CDMY01000001-CDMY01001064 and is also available in EuPathDB (Aurrecoechea et al., 2013).
Functional annotations
The predicted genes were assigned putative functions based on BLASTP (E value <10−6) matches against UNIPROT (version March 2012). The predicted protein products were assigned protein domains using hmmsearch (HMMER 3.1b1, May 2013) for Pfam A v26.0. Statistical threshold defined by the Pfam (Finn et al., 2014) database was used. We aligned AP2 sequences in apicomplexan species based on PfamA AP2 (PF00847), and built apicomplexan-specific AP2 (apiAP2) hidden Markov model (HMM), and scanned the predicted protein-coding genes for apiAP2 domains; we annotated api-AP2 DNA-binding transcription factor genes with both domain and sequence E values to be less than 10−3. The following Pfam RBDs were used to define RNA-binding proteins: ‘CAT_RBD’, ‘dsRNA_bind’, ‘S1’, ‘DEAD’, ‘KH_1’, ‘KH_2’, ‘KH_3’, ‘KH_4’, ‘KH_5’, ‘RRM_1’, ‘RRM_2’, ‘RRM_3’, ‘RRM_4’, ‘RRM_5’, ‘RRM_6’, ‘SET’, ‘PUF’, and ‘RAP’. The list of DBDs was downloaded from a database of DBDs (Wilson et al., 2008). Transmembrane domains and signal peptides were assigned with the tools TMHMM 2.0 (Krogh et al., 2001) and signalP 4.0 (Petersen et al., 2011), respectively, with default parameters.
We collected several categories of genes implicated in parasitic processes in apicomplexans for two archetypal apicomplexan parasites, Toxoplasma and Plasmodium. We primarily obtained annotations from PlasmoDB (Bahl et al., 2003) and ToxoDB (Gajria et al., 2008). Information for sub-cellular localization of genes is obtained from GeneDB (Logan-Klumpler et al., 2012) and ApiLoc, a database of published protein sub-cellular localization for apicomplexan species (http://apiloc.biochem.unimelb.edu.au/apiloc/apiloc). Some putative parasite genes were inferred based on orthology by OrthoMCL clustering (Li et al., 2003) with closely related species with results from functional studies. We performed exhaustive literature searches to manually curate individual genes, to define rules for in silico searches across the proteomes of this study, and to categorize the identified genes based on their localization and function. The categories of parasite genes are defined as follows.
Cytoskeleton
The cytoskeleton of an organism provides the necessary structural framework for the maintenance of cell shape and integrity. We compiled two groups of cytoskeletal proteins, IMC associated proteins and actomyosin complex. First, IMC associated proteins, comprises alveolin proteins, a membrane occupation and recognition nexus protein (MORN), which associate with IMC and spindle poles and are indispensible for asexual and sexual development (Ferguson et al., 2008). IMC sub-compartment proteins (ISPs) are critical for establishing apical polarity in the parasite (Poulin et al., 2013). Second, components of actomyosin motor complex, which powers the characteristic gliding motility (Soldati-Favre, 2008), comprises actin, myosin, tubulin, gliding associated proteins (GAPs), aldolase, and various actin-regulatory proteins, which will assist actin in the process of quick polymerization–depolymerization cycles between F-actin and G-actin during this process. Examples of actin-regulatory proteins are Arp2/3 complex and formins (FH2) for nucleation; F-actin capping for filament regulation; coronin for cross-linking/bundling and profilin, CAP, cofilin/ADF and gelsolin for monomer treadmilling (Baum et al., 2006).
Extracellular proteins
Extracellular proteins are defined as parasite proteins, which are localized either on the surface or secreted off the parasite. They are released in a concerted manner to ensure successful adhesion to the surface, entry into the host cell, multiplication, and escape. Extracellular proteins can be categorized as (1) ‘exportome’ are proteins translocated to the host cytoplasm, membranes, and nucleus crossing the boundary membrane parasitophorous vacuole (PV); and (2) ‘others’, which stay on the parasite surface or released from the parasite, but not into the host intracellular space. The exportome genes are released mostly from the parasite's secretary organelles such as rhoptries and dense granules (Ravindran and Boothroyd, 2008; Treeck et al., 2011; Mundwiler-Pachlatko and Beck, 2013; Bougdour et al., 2014). Some of these genes possess host targeting or also known as the Plasmodium export element (PEXEL). Many PEXEL-negative proteins have been identified too (Hsiao et al., 2013; Mundwiler-Pachlatko and Beck, 2013). These genes are sorted and targeted through a specialized structure known as Maurer's cleft formed in the host cytoplasm (Mundwiler-Pachlatko and Beck, 2013). These genes are mostly kinases, proteases, and surface molecules, which modulate the host and hijack the host machinery in favor of parasitic growth and host immune evasion (Treeck et al., 2011; Li et al., 2012; Bougdour et al., 2014). The ‘other’ extracellular proteins consist of surface antigens (e.g., MSPs), SERAs, TRAPs, AMA-1, microneme proteins, ROPs and RONs etc.
TEs
Repeat annotation was done by using the REPET pipeline (Flutre et al., 2011) and LTR finder (Xu and Wang, 2007). The overall pipeline comprises of two steps: de novo detection and classification. In the first step, the scaffolds are split into smaller batches (∼1000 batches of 200 kb each). These genomic fragments were aligned against each other to detect the HSPs (High-scoring pairs) using BLASTER (Quesneville et al., 2003). HSPs are then clustered using a combination of three methods such as GROUPER (Quesneville et al., 2003), RECON (Bao and Eddy, 2002), and PILER (Edgar and Myers, 2005). Structure-based LTR retrotransposons (RTs) detection tools such as LTRharvest (Ellinghaus et al., 2008) and LTR finder, which are based on 100–1000 bp long terminal repeats with a 1 kb–15 kb separation and target site duplication site at vicinity of 60 bp to the two terminal repeats. These LTRs detected are clustered using BlastClust. Multiple sequence alignment of each cluster was performed using MUSCLE (Edgar, 2004). Each cluster aligned was searched against Repbase (Jurka et al., 2005) using BLASTER (Quesneville et al., 2003) and HMMER (Johnson et al., 2010). A consensus feature was detected for each aligned cluster. Further PASTEC (Flutre et al., 2011), which is based on the Wicker classification, was used for consensus classification.
The repeats were annotated as follows. The genomic chunks were randomized and HSPs were detected using BLASTER (Quesneville et al., 2003), CENSOR (Jurka et al., 1996), and RepeatMasker (Tempel, 2012). These HSPs were filtered and combined. Again, full-length genomic scaffolds were compared to Repbase using MATCHER. Satellite and simple repeats were detected using the mreps (Kolpakov et al., 2003), TRF (Benson, 1999), RMSSR (RepeatMasker). Finally, a long-join procedure was followed to combine the nested repeats. The whole annotation was exported to a genome-browser readable GFF3 file.
Clustering homologous genes
OrthoMCL 2.0 (Li et al., 2003) was used with a default inflation parameter (I = 1.5) (Chen et al., 2006) to generate groups of homologous genes (defined as orthogroups), which could have homologs from different species (putative orthologs) or from the same species (putative paralogs from gene duplications). For some genes of high interest, we manually inspected the alignments of the protein sequences within the orthogroup, which were done with MAFFT (Katoh and Standley, 2013). We assigned Pfam domains to an orthogroup if more than half of the genes in an orthogroup were assigned the Pfam domains.
Sub-cellular localization prediction
There are several tools available for a general eukaryotic sub-cellular localization prediction (Du et al., 2011), but they are not applicable to alveolates due to its unique chloroplast membrane arising from secondary endosymbiosis. Therefore, HECTAR (Gschloessl et al., 2008), which was developed for the bipartite sub-cellular prediction, was used. There is no stand-alone version of HECTAR, and the online version allows only one sequence at a time. We implemented a modified HECTAR algorithm as a PERL script for batch prediction of the whole proteomes. Each protein sequence was predicted for signal sequence using SignalP 3.0 (Bendtsen et al., 2004), the signal sequence is cleaved, and the remaining amino acid sequence was used as input for the transit peptide prediction by TargetP (Emanuelsson et al., 2000). Sequences with both signal peptide and the transit peptide (either chloroplast or mitochondria) are predicted to be in the chloroplast. Sequences without the signal peptide but with the transit peptide (either chloroplast or mitochondria) are predicted to be in mitochondria. Sequences with signal peptide, without transit peptide, and predicted by TargetP to be secretory are classified as secretory proteins.
For the RAP proteins, we tested the validity of our sub-cellular localization prediction in two ways. First, we compared our in-house algorithm with other published tools: TargetP (Emanuelsson et al., 2000), MitoProt2 (Claros and Vincens, 1996), iPSORT (Bannai et al., 2002), and PredSL (Petsalaki et al., 2006) (Supplementary file 6, only mitochondrial prediction is shown). We found that our mitochondrial prediction for RAP genes is in concordance with other methods. Second, we experimentally verified mitochondrial localization in T. gondii by 3′ tagging of the endogenous genes with the coding sequence for the hemagglutinin epitope for three RAP proteins that were predicted to target to mitochondria with high probability.
Statistical analysis
A statistical environment software R was used for most of the analyses and generating parts of figures. An R package vioplot was used to generate the violin plot (Hintze and Nelson, 1998). A ward algorithm on the distance matrix based on (1- correlation coefficients) in an R function hclust was used for all hierarchical clustering of gene expression patterns unless noted otherwise.
Evolutionary analysis
We compiled the reference proteomes of 26 alveolate and stramenopile species (Figure 1—source data 2) from public databases such as EupathDB (Aurrecoechea et al., 2013) and NCBI Genome database (http://www.ncbi.nlm.nih.gov/genome/).
We generated a phylogenetic species tree using a data set composed of 101 one-to-one orthologs across the 26 species (see Figure 1—source data 1 for gene IDs). Amino acid sequences were aligned using MAFFT (Katoh and Standley, 2013), highly variable sites were edited by trimAL (Capella-Gutierrez et al., 2009) and after manual inspection. The resulting alignment of 33,997 amino acid positions was used to construct trees by a maximum likelihood (ML) method and Bayesian inference. The ML tree was computed using RAxML 8.1.16 by gamma corrected LG4X model (Stamatakis, 2014; Le et al., 2012). Robustness of the tree was estimated by bootstrap analysis in 1000 replicates. Bayesian tree was constructed by PhyloBayes (Lartillot and Philippe, 2004) using two-infinite mixture model CAT-GTR as implemented in PhyloBayes 3.3f. Two independent chains were run until they converged (i.e., maximum observed discrepancy was lower than 0.2), and the effective number of model parameters was at least 100 after the first 1/5 generation was omitted from topology and posterior probability inference. All clades in the tree were supported with posterior probability 1.00 and 100% bootstraps, except for one node, which representing the common ancestor of human Plasmodium spp. was supported by 99% bootstrap.
We performed the gene gain and loss analysis based on Dollo parsimony using Count software (Csuros, 2010). This approach allows reconstructing gene contents at observed species and at hypothetical ancestors, and gene gains and losses at branching points. The Dollo parsimony strictly prohibits multiple gains of genes. To test for validity of this assumption, we repeated analyses based on parsimony settings allowing multiple gene gains or on a phylogenetic birth-and-death model (Csuros, 2010) and reached the same conclusion (Figure 2—figure supplement 1). We have also repeated the analysis using Wagner's parsimony, allowing multiple gains per tree with gain penalty of 2 or greater, and obtained similar results (data not shown). For the analysis of metabolic enzymes, endomembrane trafficking system components, and flagellar apparatus components, the ancestral presence was inferred based on Dollo parsimony from the presence of components in the observed species. For the endomembrane trafficking component analysis, we assumed that the last common ancestor had a complete repertoire of the components.
We have inferred the evolutionary age of P. falciparum and T. gondii genes as the early node on the phylogenetic tree where the most distant species have genes with significant sequence homology (reciprocal BLASTP E value <10−10 and clustering with OrthoMCL).
Comparison of gene expression network between Chromera velia and Plasmodium falciparum
We studied if orthologs of Chromera and P. falciparum show similar gene expression changes to physiologically equivalent growth conditions. Identifying equivalent conditions is difficult as the two species have completely different lifestyles and live in different environments. Instead, we tested if a given gene and its ortholog would show correlated expression patterns with the same set of genes (and orthologs), allowing a way to compare gene expression behavior measured under different conditions. To uncover gene-to-gene co-expression relationships, the organisms from whom transcriptomes are sampled must be exposed to various growth conditions. This approach has been successfully used in other eukaryotes (Stuart et al., 2003; Hu et al., 2010; Mutwil et al., 2011). For Chromera, we generated RNA-seq-based transcriptome under combinations of varying salt concentrations, iron concentrations, and temperature changes, resulting in 36 unique combinations (see ‘Materials and methods’ and Figure 4—figure supplement 1C). For P. falciparum, we obtained previously published microarray-based gene expression data sets of 144 unique conditions from 23 time series, representing stresses from various growth-inhibiting compounds (Hu et al., 2010). It has been shown that gene expression data generated using different molecular platforms are reproducible and accurate enough for cross-platform comparisons (Woo et al., 2004). Based on each data set, we calculated Spearman correlation coefficients rho between all possible pairs from the 1918 orthogroups shared between Chromera and P. falciparum (1918 × 1918 matrix). We also calculated a 1918 × 1918 weighted adjacency matrix using CLR algorithm (Faith et al., 2007) as implemented in an R package minet (with parameters of method = ‘clr’, estimator = ‘mi.shrink’, and disc = ‘equalfreq’) (Meyer et al., 2008). Expression level of multiple genes in a given orthogroup was averaged. To rule out any potential systematic biases associated with averaging expression levels of homologous, yet distinct genes, we repeated some of the analyses with 1560 orthogroups that have one-to-one orthologs between the two species and reached the same conclusions (data not shown). A pair of genes (or orthogroup) were determined as co-expressed if the Spearman's correlation coefficient rho is greater than 0.3 and if the value from the weighted adjacency matrix of the network is greater than 0.01. We calculated an odds-ratio to measure the extent of conservation of co-expressed genes: (# of genes co-expressed in both species) × (# of genes co-expressed in none of the species)/([# of genes co-expressed in P. falciparum only] × [# of genes co-expressed in C. velia only]), and Fisher's exact test was used to assess the statistical significance. For calculation of the odds-ratios, co-expression was determined based on correlation coefficient to minimize count granularity in the two-by-two table.
2 Parasite Genomics, Wellcome Trust Sanger Institute , Wellcome Trust Genome Campus , Cambridge , United Kingdom
3 Department of Cell Biology , University of Alberta , Edmonton , Canada
4 Canadian Institute for Advanced Research, Department of Botany , University of British Columbia , Vancouver , Canada
5 Institute of Parasitology, Biology Centre , Czech Academy of Sciences , České Budějovice , Czech Republic
6 Faculty of Sciences , University of South Bohemia , České Budějovice , Czech Republic
7 Biochemical Sciences Division , CSIR National Chemical Laboratory , Pune , India
8 Ecology and Evolutionary Biology Section , Institut de Biologie de l'Ecole Normale Supérieure, CNRS UMR8197 INSERM U1024 , Paris , France
9 Bioscience Core Laboratory , King Abdullah University of Science and Technology , Thuwal , Saudi Arabia
10 Department of Biology , University of Pennsylvania , Philadelphia , United States
11 Life Science Research Centre, Faculty of Science , University of Ostrava , Ostrava , Czech Republic
12 School of Botany , University of Melbourne , Parkville , Australia
13 Seattle Biomedical Research Institute , Seattle , United States
14 Centro de Biología Molecular Severo Ochoa , CSIC/Universidad Autónoma de Madrid , Madrid , Spain
15 IE Business School, IE University , Madrid , Spain
16 Centre for GeoGenetics, Natural History Museum of Denmark , University of Copenhagen , Copenhagen , Denmark
17 European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton , Cambridge , United Kingdom
18 Wellcome Trust Centre For Molecular Parasitology, Institute of Infection, Immunity and Inflammation, College of Medical, Veterinary and Life Sciences , University of Glasgow , Glasgow , United Kingdom
19 Broad Genome Sequencing and Analysis Program , Broad Institute of MIT and Harvard , Cambridge , United States
20 Department of Microbiology , Monash University , Clayton , Australia
21 Department of Microbiology and Immunology , Weill Cornell Medical College , New York , United States
22 Department of Protozoology, Institute of Tropical Medicine , Nagasaki University , Nagasaki , Japan
23 Department of Biochemistry , University of Cambridge , Cambridge , United Kingdom
24 Canadian Institute for Advanced Research , Toronto , Canada
25 Institute of Microbiology , Czech Academy of Sciences , České Budějovice , Czech Republic
Vienna Biocenter , Austria
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2015, Woo et al. This work is licensed under the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/3.0/ ) (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
The eukaryotic phylum Apicomplexa encompasses thousands of obligate intracellular parasites of humans and animals with immense socio-economic and health impacts. We sequenced nuclear genomes of Chromera velia and Vitrella brassicaformis, free-living non-parasitic photosynthetic algae closely related to apicomplexans. Proteins from key metabolic pathways and from the endomembrane trafficking systems associated with a free-living lifestyle have been progressively and non-randomly lost during adaptation to parasitism. The free-living ancestor contained a broad repertoire of genes many of which were repurposed for parasitic processes, such as extracellular proteins, components of a motility apparatus, and DNA- and RNA-binding protein families. Based on transcriptome analyses across 36 environmental conditions, Chromera orthologs of apicomplexan invasion-related motility genes were co-regulated with genes encoding the flagellar apparatus, supporting the functional contribution of flagella to the evolution of invasion machinery. This study provides insights into how obligate parasites with diverse life strategies arose from a once free-living phototrophic marine alga.
DOI: http://dx.doi.org/10.7554/eLife.06974.001
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer