INTRODUCTION
The term “pan-genome” was coined to represent the entire set of genes belonging to multiple strains of a microbial species. The pan-genome includes a core set of genes shared by all strains and an accessory genome of partially shared and strain-specific genes (18). Analysis of the pan-genome from the family Vibrionaceae has revealed both the extent of horizontal gene transfer (HGT) among strains and species (up to 75% of total genes in the family Vibrionaceae) and how genetic variability influences pathogenicity and niche adaptation (19). These easily exchanged mobile genetic elements (MGEs) can influence the composition of microbial communities and contribute to environmental selection pressures (20). Many computational tools exist to streamline the study of MGEs (21, 22). However, given their frequent roles in virulence and host interactions, sequences acquired by HGT are often subject to fast evolution that compromises functional and phylogenetic interpretations (23–25).
Identifying homologs of fast-evolving proteins using their amino acid sequences has proven difficult, even for the most sensitive search methods (26–28). Protein structures are more conserved than sequences (29–31). Thus, determining structures of fast-evolving virulence proteins and identifying their relationship to known folds have traditionally informed function (32–34). Recent progress in protein structure prediction has shown that deep-learning networks such as AlphaFold can achieve atomic-resolution results, even for fast-evolving proteins (35–40). Folds rendered by these structure prediction methods can provide insight into distant homologs that have lost their sequence signals. For example, VtrA and VtrC encoded by the RIMD T3SS2 pathogenicity island serve as a co-component signal transduction system that turns on transcription of the T3SS2 virulence machine and its effectors in response to host bile acid. This system could only be identified in other enteric bacteria using their common domain and operon organization combined with structure comparison of AlphaFold models (41).
Recently, the AlphaFold Protein Structure Database (AFDB) (42) released models for over 200 million proteins, including those encoded by the RIMD genome. Such predictions should provide an invaluable resource for the functional annotation of RIMD proteins, especially for those that have evolved beyond sequence detection limits. One method for analyzing large numbers of predicted protein structures is to place them in context with an evolution-based database of domains from experimental structures. The Evolutionary Classification of Domains (ECOD) database provides such a resource, which has been used in combination with an automated Domain Parser for AlphaFold Models (DPAM) to classify the human proteome (33, 43, 44). ECOD is a hierarchical classification of protein domains from experimentally determined structures in the Protein Data Bank (PDB), and domains are partitioned into groups by overall architecture (A-group), homology (H-group), and similar topology (T-group). A similar genome-wide classification of RIMD models into the ECOD hierarchy should expand our knowledge of pathogenic adaptations in RIMD MGEs that remain largely unknown.
This study provides an evolution-based classification of AlphaFold models for the RIMD proteins. The proteome was parsed into 7,696 domains, representing 85.6% of all residues. Among these domains, 92% were classified among known folds. To gain insights into the molecular function of fast-evolving MGEs, an expanded comparative analysis of 86 complete
RESULTS AND DISCUSSION
An evolutionary classification of RIMD AlphaFold models improves functional annotation
We classified AlphaFold models from the RIMD proteome into the ECOD hierarchy using an improved version (see Materials and Methods) of the DPAM pipeline (33, 43, 45). DPAM parsed 7,696 domains from 4,822 RIMD proteins, accounting for 85.6% of all residues. Among these, 7,107 (92%) are placed into the existing ECOD hierarchy, accounting for 80.9% of all residues (Fig. S1). The unassigned set includes 589 domains with low DPAM probability (< 0.85), which could either be classified into the existing ECOD hierarchy with manual inspection (half have DPAM probability >0.7) or might adopt new folds not yet cataloged by ECOD. A public database (http://prodata.swmed.edu/ecod/index_rimd.php) provides browsable RIMD domains assigned to the ECOD hierarchy and downloadable data sets of domain ranges, sequences, and structures.
A summary of the assigned ECOD domain architectures provides an overview of the fold types in the proteome (Fig. 1A). The most populated architecture, a/b three-layered sandwiches, represents 25% of RIMD domains, significantly higher than the fraction (18%) among non-redundant ECOD domains from experimental structures in the PDB (Table S1). This well-populated architecture includes proteins with a Rossmann-like motif that typically function in metabolism (46). Notably, small protein architectures characterized by extended segments and few secondary structure elements are not well populated, as they tend to adopt flexible structures that are harder to classify and are filtered out in the process of identifying globular domains. The filtered sequences also often include signal peptides.
Fig 1
Automated structure classification of RIMD proteome into evolutionary domains. (A) A doughnut plot summarizes general fold class and ECOD architectures, which are defined by Secondary Structure Element (SSE) arrangement, parsed and assigned by the DPAM automated pipeline from RIMD AlphaFold models. (B–D) Classified RIMD domains (right) and functional insights gained by comparing to their parent domains with confident sequence similarity (left), with all fold topologies depicted in rainbow from the N-terminus (blue) to the C-terminus (red). (B) Parent domain (e6dlhA4) coordinates calcium (green sphere) with four residues (magenta stick), three of which are in a domain of VP0053 (magenta stick). (C) The parent domain (e4lvlA1) that binds DNA (black cartoon) near active site residues (magenta stick) coordinating Mn (gray sphere) is assigned to a domain of VP2137, which has identical active site residues (magenta stick). (D) The parent domain (e2r25A1) includes an active site, histidine, (magenta stick) that is also present in the assigned domain from VPA1046 (magenta stick). (E) DPAM parses the VPA1592 AlphaFold model into five domains (blue, cyan, green, yellow, and orange). (F) A domain of VPA1592 (right, conserved topology in a rainbow) is assigned to a parent domain in the same H-group as a structure (left, 2fue) bound to mannose 1-phosphate (black stick). (G) A domain in VPA1592 (right, rainbow topology) is assigned to the parent domain (1uh4) bound to malto-oligosaccharides (black stick).
Compared to the ECOD PDB domain set (Table S1), the RIMD genome encodes more α-helical domains (27%) than expected (20%). This over-representation reflects an elevated number of transmembrane transporters in the bacterial genome, including dicarboxylate/sodium symporters and ABC transporters. It could partly stem from the low fraction of experimentally determined transmembrane protein structures (47), but it can also signify the ability of bacteria to quickly adapt to their environments by importing or exporting various molecules like sugars, metals, bacteriocins, or antibiotics (48). Multiple bacterial sensing and chemotaxis domains are also over-represented, including HAMP domains and homodimeric domains of signal transducing histidine kinases.
The functions of classified proteins are more meaningful for those with close sequence relationships to known proteins. Among the assigned RIMD domains, 6,476 (90%) were classified with confident sequence relationships to known structures (HHpred probability >90%). While the functions of these proteins are mostly known or can be gleaned from close sequence relationships, a subset (278 proteins, 5.8% of the RIMD proteome) are not annotated with sequence-based domains from the comprehensive database of protein families, PFAM (49), which traditionally provides functional inference for hypothetical proteins. This previously unannotated set of RIMD proteins exemplifies the ability of AlphaFold models to inform function beyond traditional sequence-based methods, as illustrated by the following examples of uncharacterized RIMD proteins.
DPAM parses the VP0053 model into five domains. These include an immunoglobulin-related (Ig) domain with confident sequence similarity (HHpred probability 94%) to a glycoside hydrolase Ig-like calcium-binding domain. The glycoside hydrolase Ig coordinates calcium using four residues. Three are preserved in VP0053, suggesting it might also bind calcium (Fig. 1B). The remaining VP0053 domains (Fig. S2) have diverged beyond sequence recognition and include a duplicated Ig-like domain with similar calcium-binding residues, as well as two interacting lipocalin-like barrels that might bind hydrophobic ligands in the center, similar to the rest of the lipocalin superfamily (50). The last domain possesses an Ig-like topology but lacks a confident automated assignment due to an extension of several helices. VP0053 highlights the ability of AlphaFold models to resolve problems associated with sequence classification of multidomain proteins, especially those with nested insertions.
AlphaFold models also resolve domain assignments for proteins within MGEs that tend to evolve quickly. VP2137 belongs to a
The main novelty of AlphaFold domain assignments is in recognizing distant evolutionary relationships that extend beyond sequence recognition but still inform function. DPAM classified an additional 631 domains into the ECOD hierarchy without significant sequence relationships. These belong to 510 RIMD proteins (11% of the proteome). For example, DPAM parses the VPA1592 AraC-type transcription factor into five domains (Fig. 1E), including a sequence-related AraC-like helix-turn-helix (HTH) duplication. DNA binding by AraC transcription factors is governed by companion domains that sense distinct signaling molecules (51). Additional domains in VPA1592 may also function as sensors. One resembles the phosphomannomutase cap domain that recognizes the enzyme’s substrate mannose-1-phosphate (Fig. 1F), while the other two adopt Ig-like folds whose related structures function as carbohydrate-binding modules (Fig. 1G). These relationships suggest a regulatory role for the cap and Ig domains of VPA1592 in sensing carbohydrates.
Comparison of
The automated assignments described above provide new insights into function, especially for those proteins encoded by known RIMD pathogenicity islands (7–14). These islands are acquired by HGT, allow RIMD to expand ecological and pathogenic niches, and often exhibit fast evolution beyond sequence recognition (20, 41). Previous studies identifying RIMD MGEs (7–14) concentrated on genomic islands and other relatively long sequence regions that typify HGT. However, some RIMD T6SS virulence factors are encoded outside previously discovered MGEs and can exist as smaller auxiliary modules (10, 52). To comprehensively characterize MGEs in RIMD, we compared its proteome to 86 publicly available
Fig 2
Multiple
To help identify RIMD virulence factors, we compared the proteome to those encoded by 86
DistX scores for each of the 86 strains reveal similar bimodal distributions, suggesting a presence (DistX close to 0) or absence (DistX close to 1) scenario for most of the RIMD proteome. Elevated DistX indicates HGT or fast evolution, and a composite distance score summarizing all strains should indicate the degree of these evolutionary events for each RIMD protein. We calculated the average of DistX, namely, DistC, across all 86 strains and a subset of 64 strains after removing redundancy, respectively. DistC also adopts a bimodal distribution, but the scores are more dispersed (dark blue and maroon bars in Fig. 2B). This dispersed distribution suggests that
To test whether DistC scores can be used to identify the RIMD mobilome, we examined the distribution of DistC for known RIMD MGEs. The DistC scores for MGE-encoded proteins shift to higher values than the non-MGE DistC scores (Fig. 2C), with ~60% showing DistC above 0.05. Nineteen out of the 26 known MGEs show an average DistC above 0.05, and MGEs with an identified integrase or transposase show much higher DistC (above 0.4, as in Table 1). Despite the lack of an integrase/transposase, the O3:K6 antigen region encoding the capsular polysaccharide (CPS) proteins and transport apparatus also shows a high average DistC (0.46). This region is reported to evolve by recombination and gene duplication (54, 55). Regions showing high DistC (average DistC above 0.1 in a window of 11 consecutive proteins) mostly overlap with known MGEs (Fig. 2D). However, we found additional, mostly shorter regions in the RIMD genome showing elevated DistC scores (discussed in Novel Mobile Genetic Elements Identified in the RIMD Mobilome).
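The windowed scan for elevated regions can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names, the truncation of windows at the ends of the protein list, and the strict "greater than" comparison are assumptions. Only the averaging of DistX into DistC, the 11-protein window, and the 0.1 cutoff come from the text.

```python
from statistics import mean


def dist_c(distx_by_strain):
    """Composite DistC: the plain average of one protein's DistX over all strains."""
    return mean(distx_by_strain)


def windowed_mean(scores, half_width=5):
    """Mean score in a window centered on each protein.

    half_width=5 gives the 11-protein window described in the text.
    Assumption: windows are simply truncated at both ends of the list.
    """
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half_width)
        hi = min(len(scores), i + half_width + 1)
        smoothed.append(mean(scores[lo:hi]))
    return smoothed


def elevated_regions(scores, threshold=0.1, half_width=5):
    """Return inclusive (start, end) index pairs where the windowed mean exceeds threshold."""
    smoothed = windowed_mean(scores, half_width)
    regions, start = [], None
    for i, value in enumerate(smoothed):
        if value > threshold and start is None:
            start = i                      # a new elevated stretch begins
        elif value <= threshold and start is not None:
            regions.append((start, i - 1))  # the stretch ended at the previous protein
            start = None
    if start is not None:                  # stretch runs to the end of the list
        regions.append((start, len(smoothed) - 1))
    return regions
```

Because the scan uses a window average, a reported region extends a few proteins beyond the genes that actually carry high DistC scores, so region boundaries from such a sketch would still need inspection against the per-protein values.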
TABLE 1
Previously identified
RIMD open reading frames (ORFs) | Region type | Int | DistC (average) | PubMed reference number (PMID) |
---|---|---|---|---|
Chromosome I | ||||
VP0081–VP0092 | NK | – | 0.02 | 18590559 |
VP0187–VP0238 | O3:K6 Antigen | – | 0.46 | 18195030 |
VP0380–VP0403 | VPaI-1 | Int | 0.516 | 16672049 |
VP0634–VP0643 | VPaI-2 | Int | 0.447 | 16672049 |
VP1071–VP1095 | VPaI-3 | Int | 0.582 | 16672049 |
VP1386–VP1420 | T6SS1 | – | 0.341 | 18590559 |
VP1549–VP1590 | Phage f237 | Int | 0.49 | 9399511 |
VP1658–VP1702 | T3SS1 | – | 0.04 | 16672049 |
VP1719–VP1728 | Osmotolerance | – | 0.017 | 16894340 |
VP1787–VP1865 | Integron class-1 | Int | 0.519 | 18590559 |
VP2131–VP2144 | VPaI-4 | Int | 0.672 | 16672049 |
VP2900–VP2910 | VPaI-5 | Int | 0.813 | 16672049 |
Chromosome II | ||||
VPA0434–VPA0458 | Degradative | Int | 0.438 | 18590559 |
VPA0887–VPA0914 | Phage f237-like | Int | 0.418 | 18590559 |
VPA0950–VPA0962 | Biofilm | – | 0.129 | 18590559 |
VPA0989–VPA0999 | Gametolysin | – | 0.085 | 18590559 |
VPA1024–VPA1046 | T6SS2 | – | 0.015 | 22924031 |
VPA1102–VPA1115 | Osmotolerance | – | 0.023 | 16894340 |
VPA1253–VPA1270 | VPaI-6 | Int | 0.662 | 16672049 |
VPA1312–VPA1395 | VPaI-7 (T3SS2) | Tnp | 0.583 | 16672049 |
VPA1403–VPA1412 | CPS | – | 0.026 | 18590559 |
VPA1440–VPA1444 | T1SS | – | 0.074 | 18590559 |
VPA1503–VPA1521 | Type I pilus | – | 0.02 | 18590559 |
VPA1559–VPA1583 | Multidrug efflux | – | 0.091 | 18590559 |
VPA1652–VPA1679 | Ferric uptake | – | 0.07 | 18590559 |
VPA1700–VPA1709 | Mannose metabolism | – | 0.309 | 18195030 |
Presence of an integrase (Int), transposase (Tnp), or none (–) according to a previous study (7).
RIMD MGEs are enriched with phage proteins and bacterial defense domains
We consider the set of proteins with DistC above 0.05 as the RIMD mobilome that potentially originates from HGT, and we thus refer to them as HGT proteins/domains. Figure 3A highlights which types of domains are over-represented in the RIMD mobilome compared to the rest of the proteome. The enriched HGT domains include homologous groups that typically belong to phages or mediate DNA exchange, such as nucleoplasmin-like/viral coat and plasmid proteins, lambda integrase domains, resolvase-like, His-Me finger endonucleases, and phage-related HTH DNA-binding domains (Fig. 3A). The presence of phage-related proteins highlights their involvement in passing DNA from one bacterial host to another by various documented mechanisms (56). Prophage-related elements can thus indicate passenger genes acquired in bacterial genomes and contribute to antibiotic resistance and pathogenicity (56).
Fig 3
Domain assignments for the RIMD mobilome. (A) Bar graph of observed/expected frequencies calculated for H-groups (labeled on the left) of RIMD domains from its mobilome (compared to all assigned domains). This graph includes over-represented and under-represented H-groups with significance of <0.05 by Fisher’s exact tests and MGE domain count of >2. (B) Heatmap of DistX scores among non-redundant proteomes for proteins whose frequency of being present and conserved (DistX <0.05) in clinical isolates is at least two times the frequency in environmental isolates. We colored high DistX by lighter green because a high score frequently indicates the protein is absent from a proteome. (C) The ospC2-like sequence in VPA1331 (top) is a deteriorated fragment of OspC2 (bottom, green cartoon, 7wzs) which binds host calmodulin (magenta) using an N-terminal helix (orange) that is retained in VPA1331. (D) Surface-rendered AlphaFold model of MlaDEF-like putative ABC transport system (VP1361-VP1364), where color code distinguishes different domains. Amino acid positions constant between this system and a classic MlaBCDEF system of RIMD (VP2660-VP2664) are colored light, and variable positions are colored dark. These models are superimposed onto an ABC transporter (PDB: 7cge) multimeric assembly shown as a tube and colored by chain. A phospholipid bound to the ABC transporter is shown as magenta sticks and enlarged in the inset.
Additional over-represented domains among the RIMD mobilome include toxin-antitoxin (TA) systems that play roles in bacterial adaptation to phage infection, antibiotics, or oxidative stress. RNA cleavage by RelE-like toxins is inhibited by interaction with extended C-termini of the more diverse antitoxins. The antitoxin N-termini adopt different DNA-binding folds, such as HTH in RelB-like and an α + β domain in the YefM-like antitoxins. These N-domains regulate their respective TA operons. Three RelE-like TA systems reside in the RIMD integron-class1 (IntC1) pathogenicity island (VP1787–VP1865): VP1820/VP1821 (YefM-like), VP1830/VP1829 (RelB-like), and VP1843/VP1842 (RelB-like), highlighting the ability of this region to acquire mobile gene cassettes through its recombination system.
Another category of over-represented HGT domains includes restriction-modification systems (restriction endonuclease-like, His-Me finger endonucleases, and HTH domains) involved in defense against phage (57). RIMD includes 12 HGT proteins with a restriction endonuclease-like fold, with 10 encoded by the known pathogenicity islands: VPaI-1 (VP0395, VP0400, and VP0401), VPaI-3 (VP1083 and VP1087), IntC1 (VP1805 and VP1823), VPaI-4 (VP2143), VPaI-6 (VPA1256), and multidrug efflux (VPA1572). Two restriction enzymes include P-loop NTPase domain helicases like the newly described nuclease-helicase immunity (Nhi) family that targets and degrades phage-specific replication intermediates (58). One putative Nhi (VP1083) includes a zincin-like metalloprotease domain and may also cleave proximal DNA-binding proteins, while the other (VP0395) includes HTH domains that likely bind DNA. In other bacterial pathogens, restriction-modification systems have been implicated in virulence and host immune evasion through methylation-controlled phase variation, i.e., the switch of gene expression profiles (57). The VPaI-1 region includes two proteins (VP0388 and VP0394) with S-adenosylmethionine-dependent methyltransferase domains that might play similar phase variation roles.
Finally, several domains involved in modifying sugars (GfcC, UDP-glycosyltransferase/glycogen phosphorylase, and nucleotide-diphosphate-sugar transferases) of bacterial lipopolysaccharide (LPS, O antigen) and CPS (K antigen) are over-represented among the RIMD mobilome. These groups stem from the known MGE encoding O3:K6 antigen determinants (14), which are responsible for stimulating innate immunity in humans and are targets of antibacterial drugs. The region is known to include a small set of conserved core genes that likely synthesize the common LPS precursor and a large set of variable genes among 40
Patterns of presence/absence for RIMD proteins in known MGEs suggest virulence
Proteins encoded by known MGEs may contribute to the pathogenicity of RIMD. We identified 51 proteins that are more frequently present and conserved among non-redundant clinical strains than among environmental isolates (see Materials and Methods), and 37 of them are encoded by known MGEs (Fig. 3B). These proteins include a thermostable direct hemolysin (VPA1314) and the CRISPR/Cas system (VPA1388-VPA1390) from the VPaI-7 (T3SS2). Consistent with this observation, a previous comparative genomic study proposed that the T3SS2 pathogenicity island in a related MAVP-Q strain was acquired independently to gain pathogenicity from a previously benign strain (59). Additional clinical strain-associated proteins are encoded by the filamentous phage f237, whose fragment encoding VP1561 has been used as a genetic marker for pandemic strains (60), and the IntC1 region (Fig. 3B). The IntC1 phage integrase (VP1865) includes a lambda-integrase N-terminal domain followed by two C-terminal HTH domains. Similar integrases mediate recombination between an integrase proximal primary site (attI) and a secondary target site (attC) found within mobile gene cassettes encoding resistance or virulence factors (61). Thus, integrons like IntC1 play a major role in the spread of multidrug resistance and other virulence mechanisms. Clinical isolate segregating proteins acquired by this region include a potential glyoxalase/bleomycin resistance protein (VP1798) and two acetyltransferases (VP1794 and VP1827) that could potentially modify antibiotic peptides, small molecules, or other proteins.
Our domain classification of RIMD proteins provides functional insights into known MGEs without well-defined functions (Table 1). Despite the challenge of classifying such proteins, we could confidently assign domains in 483 (73%) out of 663 (Table S3). As suggested by the enrichment of RelE-like TA systems and restriction-modification system domains among MGEs (Fig. 3A), the inferred functions of many other MGE-encoded proteins might provide a selective advantage to RIMD. For example, VPaI-1 encodes a high-persistence TA system that regulates the formation of persister cells to survive various stresses, including antibiotics (62), and VPaI-4 encodes an AAA+ ATPase-containing protein (VP2142) with an adjacent MrcB-like restriction enzyme (VP2143). Homologous AAA+ ATPase and restriction enzyme systems cleave invading methylated phage DNA (63). Acquisition of these two pathogenicity islands by RIMD likely confers phage resistance to the strain, providing a potential selective advantage over competing bacteria as well as allowing persistence in stressful environmental niches like those encountered in the host gut.
Similar patterns of presence and absence across
The T6SSs were previously analyzed using pan-genome comparisons, which highlighted acquisition of the T6SS1 by HGT and omnipresence of the T6SS2 (10). Consistent with this finding, proteins encoded by T6SS1 have elevated DistC, while proteins encoded by T6SS2 have low DistC. Notably, the annotated T6SS2 island appears to be restricted to the T6SS secretion machinery, auxiliary, and regulatory proteins, with identified effector immunity pairs residing outside of the T6SS2 genomic neighborhood (66). The T6SS2 includes a single hypothetical protein of unknown function (VPA1024). VPA1024 adopts a duplicated thiolase-like fold, with the N-terminal domain having similarity to a non-canonical ketosynthase FabY and the C-terminal domain having similarity to a fatty acid synthase alpha subunit. Many thiolase-like folds represent modules of polyketide synthases, whose metabolites include diverse chemical structures and biological activities including antibiotic and predator defense properties (67). The association of polyketide or fatty acid modification with the T6SS remains enigmatic, but an unknown protein (VP1399) belonging to the more recently acquired T6SS1 adopts a similar thiolase-like fold duplication.
Numerous fragmented proteins in RIMD genome MGEs have high DistX due to their small size. One such protein, VPA1331, encoded by the T3SS2 resembles a portion of the outer
Novel mobile genetic elements identified in the RIMD mobilome
Given the tendency of proteins from known MGEs to exhibit elevated DistC values, we propose that genes with similar elevated DistC might also be mobile. Potential novel MGEs with DistC above 0.05 encode an additional 314 proteins. We assigned 444 domains from 235 of these proteins (Table S4). These MGEs form small clusters:
The region
A neighborhood from
Orthologous genome neighborhoods distinguish fast-evolving proteins from HGT
Elevated DistC scores do not distinguish between proteins arising from HGT and those evolving rapidly. To better discriminate between these events, we compared the RIMD genomic neighborhood of each protein to its neighbors in the pre-pandemic environmental BB22OP strain. A protein displaying an elevated DistX (>0.05) score between RIMD and BB22OP but with similar genomic neighbors in both strains was considered fast-evolving. Among 102 fast-evolving proteins (Table S5), 132 domains in 73 proteins are confidently classified.
The elevated DistX scores may reflect differences in protein lengths (e.g., Fig. 3C), lowered sequence identity, or a combination of both. For example, the RIMD effector VopV (VPA1357) from the T3SS2 pathogenicity island is nearly twice the size (1,622 residues) of the orthologous sequence (VPBB_A1234, 876 residues) in the BB22OP T3SS2. Both sequences include intrinsically disordered glycine-rich repeats. Another fast-evolving RIMD protein (VPA1455) includes a 66-residue polyQ repeat that is longer than the corresponding BB22OP protein (39 residues). An orthologous protein from
To determine over-represented domain types among these fast-evolving proteins relative to the entire RIMD proteome, we performed enrichment analysis for ECOD homologous groups appearing more than once in the fast-evolving protein set (Fig. 4A). This comparison revealed pili subunits, lipocalins, immunoglobulin-related domains, and porins, among others. Typically, such domains mediate interactions at the bacterial cell surface that might evolve to adapt to environmental changes and evade unfavorable factors such as the host immune system. Therefore, analysis of fast-evolving RIMD proteins can provide insights into RIMD’s adaptation to its environment.
Fig 4
Fast-evolving RIMD proteins promote competitive traits
The T6SS1 pathogenicity island in RIMD and BB22OP strains injects toxic proteins into target bacteria, fungi, or host cells to gain a competitive advantage. The T6SS1 includes an operon consisting of a secreted co-effector (VP1388), an immunity protein (VP1389), and a coregulated toxin (VP1390) responsible for antibacterial toxicity (71, 72). All three proteins diverged rapidly between RIMD and BB22OP (DistX >0.4). Divergence in VP1388 is not uniform across the protein structure (Fig. 4B, spheres depict differences). The conserved N-terminus includes a deteriorated helix-extension-helix motif followed by an immunoglobulin-related domain representing part of a MIX secretion signal (72). Three divergent C-terminal immunoglobulin repeats resemble structures that mediate bacterial adhesion (VP1388 domains 3 and 4) and carbohydrate recognition domains (VP1388 domain 5). The fast-evolving C-terminal domains are likely co-evolving with the interacting toxin, while the invariant MIX secretion signal interacts with the conserved T6SS1 machinery.
Several RIMD pili domains are experiencing fast evolution, including chitin-regulated pilus PilA (VP2523), mannose-sensitive hemagglutinin (MSHA) biogenesis pilus (VP2696), and an isolated pilin (VPA0747). The globular heads of pili have been shown to undergo rapid adaptation to accommodate variation at the recipient cell surface for functioning in antigenicity, adhesion, and colony formation (73). Accordingly, the MshA pilus of RIMD binds human host intestinal endothelial cells, while both PilA and MshA mediate biofilm formation and adherence to chitin polymers abundant in estuarine waters serving as
Multiple outer membrane porins are also experiencing fast evolution, including maltoporin (VPA1644) and its two neighboring genes, maltose periplasmic protein MalM (VPA1642) with an assigned concanavalin A-like domain, and a presumed periplasmic protein with a starch specific carbohydrate-binding module (VPA1643). Interestingly, maltoporin also functions as a lambda phage receptor. Thus, the rapid evolution might be due to selection against this co-habiting phage. In support of this hypothesis, residues altered in the maltoporin line the extracellular surface of the protein, while conserved residues line the pore and the periplasmic surface (Fig. 4D, identical light cyan and variable dark cyan). Thus, the fast-evolving structure may maintain the essential maltose-transporting function of the protein while masking its surface from phage.
Several other fast-evolving domains belong to the RIMD O3:K6 antigen or CPS islands that determine the composition of the polysaccharides on the bacterial surface. These domains include homologs of the bacterial polysaccharide co-polymerase FepE (VP0221 and VPA1406) (76), the GfcC protein (duplicated in VP0216) that is essential for assembly of the O-antigen capsule and may be necessary for secreting biofilm-forming exopolysaccharides (77), and the UDP-glycosyltransferase/glycogen phosphorylase (VP0211 and VP0212) enzymes that transfer activated sugars to a variety of substrates. The fast evolution of these LPS island proteins likely correlates with altered substrate specificity for polysaccharides in RIMD (O3:K6) compared to BB22OP (O4:K8).
Conclusions
Evolution-based domain classification established in ECOD provided the basis for enhanced functional insight into the RIMD proteome. Results are deposited as an online database for further investigation by the scientific community. Comparisons with complete
MATERIALS AND METHODS
Domain parsing and classification of the RIMD proteome
AlphaFold models from the proteome of
In this study, we discovered that DPAM might split the decorations of core domains as separate domains because these decorations are frequently not tightly packed against the cores. Therefore, we modified our DPAM pipeline to examine neighboring domains assigned to the same ECOD T-group and decide if they should be merged into a single domain. Neighboring domains are detected by sequence (separated by ≤5 residues) and three-dimensional structure (with ≥9 residue pairs that are ≤8 Å away). We counted the fraction of ECOD hits with confidence comparable to the top hit (DPAM probability > top DPAM probability − 0.1) that support merging two neighboring domains, i.e., both query domains mapping to different regions of the same ECOD domain with overlap of <25% of the mapped residues for either query domain. We merge the two neighboring domains if this fraction is above 50%.
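The merging rule can be illustrated with a short Python sketch. It is a hypothetical rendering of the description above, not the DPAM source: the function names, the input data structures, the use of one coordinate per residue, the requirement that both the sequence and the structure criteria hold, and the reading of "comparable to the top hit" as "within 0.1 of the top probability" are all assumptions.

```python
from math import dist


def are_neighbors(range_a, range_b, coords, max_gap=5, min_pairs=9, max_dist=8.0):
    """Decide whether two domains are neighbors.

    range_a, range_b: inclusive (start, end) residue indices, range_a first.
    coords: dict mapping residue index -> (x, y, z) in angstroms
            (assumption: one representative atom per residue).
    Assumption: both the sequence and the structure criteria must hold.
    """
    gap = range_b[0] - range_a[1] - 1
    if gap > max_gap:
        return False
    close_pairs = sum(
        1
        for i in range(range_a[0], range_a[1] + 1)
        for j in range(range_b[0], range_b[1] + 1)
        if dist(coords[i], coords[j]) <= max_dist
    )
    return close_pairs >= min_pairs


def supports_merge(mapped_a, mapped_b, max_overlap=0.25):
    """A hit supports merging when both query domains map to the same ECOD
    domain and the shared template residues are <25% of either mapping.

    mapped_a, mapped_b: sets of template residue positions (empty if unmapped).
    """
    if not mapped_a or not mapped_b:
        return False
    shared = len(mapped_a & mapped_b)
    return shared / len(mapped_a) < max_overlap and shared / len(mapped_b) < max_overlap


def should_merge(hits, margin=0.1, min_fraction=0.5):
    """hits: list of (dpam_probability, mapped_a, mapped_b) for one domain pair."""
    if not hits:
        return False
    top = max(prob for prob, _, _ in hits)
    comparable = [hit for hit in hits if hit[0] > top - margin]
    votes = sum(supports_merge(a, b) for _, a, b in comparable)
    return votes / len(comparable) > min_fraction
```

Voting over all hits near the top score, rather than trusting the single best hit, makes the decision less sensitive to one template that happens to cover only the core or only the decoration.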
Additionally, domains detected by the DPAM pipeline were further evaluated by the number of secondary structure elements [≥6 residues predicted as H, G, or I by DSSP (80) were considered a helix, and ≥3 residues predicted as E or B by DSSP were considered a strand] and the completeness of a domain. A domain with fewer than three secondary structure elements and without confident (HHsuite probability ≥95% and coverage ≥80%) ECOD hits by HHsuite is considered a “simple topology.” A domain that is too short compared to its ECOD parent domain (<1/3 residues aligned) or the median length of the ECOD T-group (<1/3 length) it is assigned to is considered a “partial domain.” Domains passing these filters are regarded as “low confidence” and cannot be automatically assigned if their DPAM probabilities are below 0.85, while the rest are considered “good domains” with confident assignments. The statistics of domains belonging to each category are shown in Fig. S1. The set of good domains is presented and studied subsequently. We present the results as an online database using the ECOD framework that operates on a PostgreSQL database which serves a PHP-based F3 front end.
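The filters above can be expressed compactly. The following Python sketch is illustrative only: the function names, the treatment of any uninterrupted run of H/G/I (or E/B) codes as a single element, and the order in which the filters are applied are assumptions rather than details confirmed by the pipeline.

```python
from itertools import groupby

HELIX_CODES = set("HGI")   # DSSP helix states
STRAND_CODES = set("EB")   # DSSP strand and bridge states


def _state(code):
    """Collapse a DSSP code into helix, strand, or other."""
    if code in HELIX_CODES:
        return "helix"
    if code in STRAND_CODES:
        return "strand"
    return "other"


def count_sses(dssp, min_helix=6, min_strand=3):
    """Count helices (runs of >=6 H/G/I) and strands (runs of >=3 E/B)
    in a per-residue DSSP string. Returns (helices, strands)."""
    helices = strands = 0
    for state, run in groupby(dssp, key=_state):
        length = sum(1 for _ in run)
        if state == "helix" and length >= min_helix:
            helices += 1
        elif state == "strand" and length >= min_strand:
            strands += 1
    return helices, strands


def categorize(n_sse, has_confident_hhsuite_hit, aligned_fraction,
               length_ratio, dpam_prob):
    """Assign one of the four domain categories described in the text.

    has_confident_hhsuite_hit: HHsuite probability >=95% and coverage >=80%.
    aligned_fraction: fraction of the ECOD parent domain that is aligned.
    length_ratio: domain length / median length of the assigned T-group.
    Assumption: filters are checked in the order they appear in the text.
    """
    if n_sse < 3 and not has_confident_hhsuite_hit:
        return "simple topology"
    if aligned_fraction < 1 / 3 or length_ratio < 1 / 3:
        return "partial domain"
    if dpam_prob < 0.85:
        return "low confidence"
    return "good domain"
```

For instance, `sum(count_sses(dssp_string))` gives the element count that would be passed as `n_sse`.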
We compared the distribution of good domains in the RIMD proteome across ECOD A-groups against that of experimental structures filtered by 99% sequence identity (PDB F99 set). The fold enrichment of RIMD domains in an A-group is calculated as the ratio between the fraction of RIMD domains in this A-group and the fraction of domains from the PDB F99 set in the same group. The statistical significance of enrichment or depletion (Table S1) of RIMD domains in an A-group is evaluated by Fisher’s exact test (scipy.stats.fisher_exact).
Identifying the
GenBank (GCA) and RefSeq (GCF) genomes for
We searched every Uniprot RIMD protein against each
We averaged the DistX scores for different
To distinguish between fast evolution and HGT for RIMD proteins showing elevated distances between strains, the genome neighborhood of each RIMD protein was compared to that of the BB22OP strain. We compared these two strains using the Rapid Annotation of microbial genomes using Subsystems Technology (RAST) server (RASTtk annotation scheme, fix errors, fix frameshifts, and backfill gaps) with the SEED database sequence-based comparison (83, 84). Fast evolution was defined as the RIMD protein having a DistX of >0.05 against the BB22OP proteome together with equivalent RIMD and BB22OP genome neighborhoods according to the RAST genome neighborhood viewer (83, 84). Some RIMD proteins with elevated BB22OP-specific DistX were filtered out because of short and/or low-complexity structures.
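The decision rule above can be written as a small function. This is only a sketch: neighborhood equivalence was judged in the RAST genome neighborhood viewer, so it enters here as a boolean supplied by the user, and the labels for the other two outcomes ("not elevated" and "possible HGT") are hypothetical names chosen for illustration.

```python
def classify_divergence(distx_vs_bb22op, neighborhood_equivalent, threshold=0.05):
    """Label an RIMD protein from its DistX against BB22OP and its genome context.

    A DistX exactly equal to the threshold is treated as not elevated (assumption).
    """
    if distx_vs_bb22op <= threshold:
        return "not elevated"
    if neighborhood_equivalent:
        return "fast-evolving"
    return "possible HGT"
```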
In-depth analyses of the RIMD mobilome and fast-evolving proteins
We compared the RIMD mobilome against the known MGEs (7) by plotting the DistC scores for proteins in the order encoded by the RIMD genome using Jupyter Notebook (Fig. 2D). Clusters of proteins with elevated DistC scores were identified by computing the average DistC score in windows of 11 proteins (a protein plus five on each side). To identify RIMD proteins that tend to be present and conserved (DistX <0.05) among clinical isolates, we classified the non-redundant
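The windowed averaging of DistC scores mentioned above can be sketched as follows. The handling of proteins near the ends of the list is not described in the text; this sketch assumes the window is simply truncated there. The function name is hypothetical.

```python
def windowed_mean(scores, flank=5):
    """Average each score with up to `flank` neighbors on each side.

    With flank=5 an interior protein is averaged over an 11-protein window.
    Windows are truncated at the ends of the list (assumption).
    """
    means = []
    for i in range(len(scores)):
        window = scores[max(0, i - flank): i + flank + 1]
        means.append(sum(window) / len(window))
    return means
```

Given a list of per-protein DistC scores in genome order, stretches where the windowed mean stays high would mark candidate clusters for closer inspection.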
Proteins of the RIMD mobilome were studied manually using the AlphaFold models and with the help of ECOD assignments and the ECOD parent domains derived from experimentally characterized PDB entries. We analyzed the distribution of RIMD domains of proteins in the RIMD mobilome among different ECOD H-groups using all domains in the RIMD proteome as the background. Fold enrichment in an H-group is calculated as observed/expected frequencies:
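The equation is not reproduced in this text. One reading consistent with the definition given, with all domains in the RIMD proteome as the background, is sketched below. The symbols are introduced here for illustration and are not the authors' notation: n_H^mob is the number of mobilome domains in H-group H, N^mob is the total number of mobilome domains, and n_H^all and N^all are the corresponding counts over all RIMD domains.

```latex
\[
\text{fold enrichment}(H)
  = \frac{\text{observed frequency}}{\text{expected frequency}}
  = \frac{n_H^{\mathrm{mob}} / N^{\mathrm{mob}}}{n_H^{\mathrm{all}} / N^{\mathrm{all}}}
\]
```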
Copyright © 2023 Kinch et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
ABSTRACT
IMPORTANCE
The pandemic