INTRODUCTION
There is mounting evidence that multiple within-species variants coexist in microbiomes [, ]. Coexisting microbial cells of the same species can have substantial variations in their gene contents (i.e., accessory genome), which is largely generated by horizontal gene transfer (HGT) []. The intraspecies variation in the accessory genome can lead to substantial phenotypic differences (e.g., nutrient utilization, pathogenicity, and antibiotic resistance) and plays an important role in microbial adaptation across environments []. Moreover, many health outcomes linked to host-associated microbiomes have been found to be consequences of the function of individual strains [, ].
Metagenomic sequencing has revolutionized microbiome studies by providing a culture-independent approach to studying the composition and function of complex microbial communities. Commonly used tools for metagenomic analysis, known as metagenomics profilers, typically provide species-level taxonomic composition []. In parallel with the rapid increase of sequenced microbial isolates from culturomics studies [, , , ], high-resolution analyses of metagenomic data have revealed notable within-species variations [, ]. Methods that enable strain-level analysis of metagenomes have been used for tracking strain transmission or dispersal [, ], studying the population genetics of microbial strains [], and typing strains of specific interest [].
The gene content profile of a microbial strain determines its biological function. To date, the majority of strain-level analysis methods use single nucleotide variants (SNVs) to identify strain composition [, , , , ]. By assuming an association between SNV haplotypes and gene content profiles [], SNV-based methods can indirectly profile the within-species gene content variation. However, for many species, it has been shown that SNV haplotypes cannot capture microbial genetic diversification resulting from HGT []. Alternatively, the current pangenome-based method can infer the gene content of the dominant strain in a metagenomic sample [] but fails to provide the abundance and gene contents of coexisting strains within the sample. Establishing the linkage of composition and gene contents of coexisting within-species variants can provide crucial insights into microbial adaptation and microbiome–host interactions, but is inaccessible from currently available reference-based bioinformatics tools.
To meet the increasing needs of strain-level functional inference from metagenomics data, we developed a novel method known as Strain-level Pangenome Decomposition Analysis (StrainPanDA) to simultaneously reconstruct the composition and gene contents of coexisting strains using the pangenome coverage profile from metagenomic data (Figure ). We validated the performance of StrainPanDA with a comprehensive collection of synthetic data sets and showed that StrainPanDA was able to accurately infer strain composition and gene content profiles from metagenomic data. To demonstrate the practical use of StrainPanDA in human microbiome studies, we analyzed longitudinal gut microbiome samples of mother–infant pairs [] and patients treated with fecal microbiota transplantation (FMT) [, ]. We found that StrainPanDA was able to identify the association between strain-specific functions and microbial adaptation (or host phenotypes), leading to novel biological insights at the infraspecific level and testable hypotheses of molecular mechanisms.
[IMAGE OMITTED. SEE PDF]
RESULTS
Decomposition of the pangenome coverage profile to infer strain composition and gene content
The pangenome coverage profile of a microbial species from metagenomic data is composed of the gene contents of all coexisting strains. If there are multiple metagenomic samples with varying strain compositions, in principle it is possible to infer the composition of strains within the sample as well as the gene contents of each strain from the pangenome coverage profile []. Building on this intuition, the main algorithm of StrainPanDA aims to decompose the gene family abundance data matrix D into the product of two matrices, the gene content profile matrix P, and the strain composition matrix S (Figure , see Methods section for details). Here the pangenome coverage profile from metagenomic data is represented by matrix D, where is the normalized count of gene family in metagenomic sample . The gene contents of coexisting strains are represented by binary matrix , where the element indicates the presence/absence of gene family in strain . The composition of coexisting strains across samples is represented by , where is the relative abundance of strain in sample . In the implementation of StrainPanDA, the gene family abundance matrix D is decomposed by nonnegative matrix factorization (NMF) [] to solve for matrices P and S. This processing allows StrainPanDA to simultaneously delineate the composition and gene contents variation of coexisting strains.
The StrainPanDA software provides a fully automated workflow of strain analysis (Figure ), which imports raw sequencing data from multiple metagenomic samples, and performs reads mapping, strain decomposition, and downstream annotations (see Methods section for details). To assist the interpretation of gene content variation in strains, StrainPanDA incorporates functional annotation from several databases, including but not limited to Kyoto Encyclopedia of Genes and Genomes (KEGG) [], Carbohydrate-Active enZYmes (CAZy) [], and Virulence Factor Database (VFDB) [].
StrainPanDA provides accurate predictions of strain composition and gene family profiles in synthetic data
We validated the performance of StrainPanDA using synthetic metagenomic data (Methods section). For synthetic mixtures of Escherichia coli strains (ranging from 2 to 8 strains, see Methods section), the strain composition predicted by StrainPanDA was overall close to the actual composition (Ground Truth; Figure ), and its performance was better at a lower number of coexisting strains (2 and 4 strains). For quantitative comparison, we calculated the Jensen–Shannon divergence (JSD) and Matthews Correlation Coefficient (MCC) between the predicted and actual strain composition of simulated samples. JSD and MCC have been widely used in the evaluation of strain analysis tools [, , ]. At a lower number of coexisting strains (2 and 4 strains), the predicted strain composition by StrainPanDA was better than the state-of-the-art SNV-based methods, including StrainEst [] and PStrain [] (the latter was modified based on ConStrains []). While the results of StrainEst tended to include false positives at a lower number of coexisting strains, its performance was better at 6 and 8 strains. Furthermore, we generated synthetic mixtures of E. coli strains with varying levels of sequencing errors (Supporting Information Figure ), sequencing depths (Supporting Information Figure ), and different background noises (mixed with different metagenomic data sets, Supporting Information Table and Figure ). In comparison to SNV-based methods, the performance of StrainPanDA in predicting strain composition was robust.
[IMAGE OMITTED. SEE PDF]
To evaluate the performance of StrainPanDA in different bacterial species, we generated synthetic data for common human gut bacterial species (Bifidobacterium longum, Clostridium difficile, Enterococcus faecalis, Faecalibacterium prausnitzii, and Prevotella copri; Supporting Information Table , see Methods section). In comparison to other methods, StrainPanDA made the most accurate prediction of strain composition across all species when strain number was 4 (with JSD as 0.021 0.006; Figure ).
While current SNV-based methods can reconstruct the composition of coexisting strains from metagenomic samples, they could not directly provide the gene contents of the predicted strains. Here we show that StrainPanDA allows simultaneous reconstruction of strain composition in each metagenomic sample and the gene content variations among strains. In synthetic mixtures of E. coli strains, the predicted gene family profiles by StrainPanDA were close to the actual profiles (Figure , Supporting Information Figure ; precision = 0.91–0.96, recall = 0.87–0.96, for the four strains in a synthetic mixture). In particular, we note that StrainPanDA is able to infer the gene family profile of strains not included in the prebuilt reference genome database. The area under the Precision-Recall Curve (AUPRC) was over 0.95 for all E. coli strains, indicating that StrainPanDA was able to reconstruct the gene contents of microbial strains with high sensitivity and precision (Supporting Information Figure ).
We further evaluated the predicted gene family profiles of the human gut bacterial species included in the synthetic data. The AUPRC was on average above 0.9 and significantly better than random guesses (Figure ). The predicted gene family profiles were robust to sequencing errors, sequencing depths, and the background of real metagenomic data (Supporting Information Table ). Moreover, to demonstrate the ability of StrainPanDA to identify strain-specific genes, the pathogenic E. coli outbreak strain O104 [] was introduced in a synthetic mixture with other E. coli strains (Supporting Information Table ). All outbreak-related gene families were successfully recovered by StrainPanDA (Supporting Information Figure ). Finally, the performance of StrainPanDA and Pangenome-based Phylogenomic Analysis (PanPhlAn)/PanPhlAn3 in inferring the gene content profiles were comparable (Supporting Information Figure ). We note that PanPhlAn and PanPhlAn3 can only report the gene content profile of the “dominant strain” in a particular metagenomic sample (i.e., the strain with the highest relative abundance); in contrast, StrainPanDA can identify the gene content profiles of all coexisting strains.
Taken together, our benchmarking results demonstrate that StrainPanDA provides accurate predictions of compositional profiles and gene contents of coexisting strains from metagenomic samples. In the following sections, we will demonstrate the application of StrainPanDA in two longitudinal metagenomic data sets to elucidate the diversity of the gut microbiome at the subspecies level.
Succession of B. longum subspecies in infant gut microbiome is associated with breastfeeding patterns and the selection of nutrient utilization
The direct inference of both the population structure and gene content variation at the strain level is crucial to understanding the ecology of microbial communities. Here we apply StrainPanDA to study the adaptation of coexisting bacterial subspecies in the infant gut microbiomes. We analyzed a previously published data set that includes gut metagenomic samples from ~100 mother–infant pairs (infants were sampled at three time points: newborn, 4 months, and 12 months) [] (Supporting Information Table ). At the species level, the authors found that the composition of the infant gut microbiome had distinctive features at each sampled time point, and the cessation of breastfeeding was clearly associated with the maturation of an infant gut microbiome into an adult-like microbiome [].
We focused on the infraspecific analysis of B. longum, which is known to play an important role in the development of the infant gut microbiome [, ] and was found to be enriched in 4-month infant samples in this study (Supporting Information Figure ). Interestingly, we discovered a clear pattern of succession in the subspecies composition of B. longum over time, that is, a shift in the dominant subspecies (Figure ). Among the three B. longum subspecies predicted by StrainPanDA, B. longum subspecies 2 was dominant in the gut microbiomes of mothers. In the gut microbiomes of infants, B. longum subspecies 3 was most prevalent for newborns, while subspecies 1 transiently increased at the intermediate time point (at 4 months) and then was outcompeted by subspecies 2 (at 12 months). On the basis of the diet history provided in the original study, we further grouped infant samples into two different categories: “discontinued breastfeeding” and “continued breastfeeding” between successive time points (Methods section). It was evident that the relative abundance of B. longum subspecies 1 was enriched in infants that continued breastfeeding (Figure ). By contrast, once breastfeeding was discontinued, B. longum subspecies 1 was taken over by subspecies 2.
[IMAGE OMITTED. SEE PDF]
The association between B. longum subspecies composition and breastfeeding patterns suggests within-species competition in nutrient utilization functions (Figure ). On the basis of functional annotations of KEGG [] and CAZy [], we found clear variations in nutrient utilization genes among the predicted B. longum subspecies. B. longum subspecies 1 had unique gene families (marked in red, Figure ) that are key enzymes related to human milk oligosaccharide (HMO), including galactosidase, α-l-fucosidase, sialidase, and their corresponding CAZy groups (GH33, GH29, and GH95; Supporting Information Figures and and Table ). In addition, the urease-related gene families were only found in the gene family profile of subspecies 1. Therefore, the unique functional potential of B. longum subspecies 1 in utilizing HMO and urea [] from breast milk could confer a competitive advantage under breastfeeding, consistent with our observations of its transient dominance at 4 months in infant gut metagenomes.
Our finding is consistent with previous reports [] on HMO utilization genes in some B. longum strains as well as observed changes in the frequency of B. longum subspecies infantis after weaning [, , , , ]. The functional profile of subspecies 1, in comparison to B. longum reference genomes, suggests that it may correspond to the B. longum subsp. infantis (Supporting Information Figure ), which has been isolated from infants and known to be associated with breastfeeding [, , ].
Overall, we show that StrainPanDA is able to identify associations between strain-specific functions (via reconstruction of gene contents) and adaptation (via reconstruction of strain composition), leading to novel biological insights and testable hypotheses about microbial ecology at the subspecies level.
Analysis of post-FMT gut metagenomes reveals individualized subspecies profiles and subspecies-specific functions
FMT introduces bacterial strains from healthy donors into recipients and has a profound impact on the structure and function of the recipient's gut microbiota [, ]. Here we apply StrainPanDA to analyze the metagenomic samples from FMT recipients in a clinical trial to treat Crohn's disease [], including 17 patients (eight in the FMT group and nine in the sham group) and multiple samples (4–8 time points) for each patient. We analyzed the subspecies composition of commonly observed bacterial species in the human gut metagenome (Supporting Information Table ). Hierarchical clustering of the predicted subspecies compositional profiles revealed strong individual signatures, which remained stable throughout 24 weeks (Figure ). The pairwise distance in subspecies composition profiles among samples of the same individual (sampled at multiple time points, that is, the “intrasubject” group) was significantly lower than the pairwise distance among samples of different individuals (i.e., the “intersubject” group; Figure , p < 10−15), similar to the pattern at the species level. Furthermore, we found that the dissimilarity in subspecies composition between paired FMT donors and recipients (i.e., the “donor–recipient” group) was significantly lower than the “intersubject” group (p < 0.001), indicating the engraftment of donor strains and coexistence of donor and recipient strains []. We noted that the engraftment of donor gut bacteria was more obvious at the subspecies level (effect size = 1.4) than at the species level (effect size = 0.78; Figure ). Similarly, we applied StrainPanDA to analyze an independent FMT data set of metagenomic samples from patients with C. difficile infection []. We observed a clear pattern of subspecies engraftment in post-FMT gut metagenomes, consistent with SNV-based strain analysis in the original study [] (Supporting Information Figure ). Overall, we show that StrainPanDA is able to delineate the difference in subspecies composition among individuals and track the transmission of strains.
[IMAGE OMITTED. SEE PDF]
To elucidate the potential role of the gut microbiome in the maintenance of remission in Crohn's disease patients, we further investigated the strain-level genetic signatures associated with post-FMT clinical outcomes. The original study showed that the enrichment of Bacteroidetes species in patients relapsed after FMT []. We focused our analysis on Bacteroides ovatus, which was found to be enriched in relapsed individuals (false discovery rate [FDR]-adjusted p = 0.2; Figure ). Among the predicted subspecies of B. ovatus, the relative abundance of subspecies 2 was positively correlated with the abundance of species-level B. ovatus in gut metagenomes (Spearman correlation = 0.64, and FDR-adjusted p < 10−6, Supporting Information Figure ). We found substantial gene content variation among different B. ovatus subspecies (Figure ). Interestingly, we found that B. ovatus subspecies 2 had more CAZy family genes than others, indicating its functional potential to utilize diverse carbon sources and potential competitive advantages (Figure ). Thus, the strain-specific metabolic functions of B. ovatus subspecies 2 may explain its dominance within the species (~20%) as well as its positive correlation with species abundance (Figure and Supporting Information Figure ). In addition, B. ovatus subspecies 2 carried several strain-specific virulence factor genes (e.g., type IV secretion system and cholesterol-dependent cytolysin), which may contribute to the positive association between B. ovatus and post-FMT relapse (Supporting Information Figure ). For example, cholesterol-dependent cytolysin is a pore-forming toxin that can disrupt the host plasma membrane [], whose integrity has been linked to inflammatory bowel disease []. In addition, we noted that B. ovatus subspecies 4 was more abundant in the remission group (Figure , FDR-adjusted p < 10−5); whether this B. ovatus subspecies contributes to post-FMT remission remains to be validated in future studies. Similarly, we performed StrainPanDA analysis for Bacteroides vulgatus, which was also enriched in relapsed individuals, and found clear functional variation among its subspecies (Supporting Information Figure ).
In summary, we show that the linkage of strain composition and gene contents provided by StrainPanDA can greatly facilitate our understanding of microbial ecology beyond the species level. For microbes closely related to host health, this linkage helps formulate testable hypotheses on the association between molecular functions (e.g., pathogenicity) and clinical outcomes, which can be directly tested in experiments of isolated microbial strains.
DISCUSSION
Here we report a novel method, StrainPanDA, to simultaneously profile the composition of coexisting strains and their corresponding gene content from metagenomics data. Our benchmarking results showed that StrainPanDA provided accurate and robust predictions from synthetic data. The predicted strain composition was better than or comparable to state-of-the-art methods; meanwhile, the predicted gene content profile was close to the actual profile, even for strains not included in the prebuilt reference genome database. Furthermore, we applied StrainPanDA to metagenomic data sets to resolve within-species variation of bacterial taxa of interest. For example, we found that the composition of B. longum subspecies in infant gut microbiomes was associated with dietary shifts, and the unique functional potential of certain B. longum subspecies in utilizing nutrients from breast milk might confer a competitive advantage. We demonstrated that the linkage of strain abundance and gene contents could lead to direct functional interpretations and testable hypotheses.
To study within-species gene content variation, current SNV-based methods implicitly assume the association between SNV haplotypes and gene content. However, many microbial genomes with high similarity in the core genome have less than 70% of genes in common [], indicating that the indirect inference of gene content by SNV-based methods may be insufficient. In contrast, StrainPanDA adopts the pangenome-based approach to directly infer the gene content of multiple coexisting within-species variants. Our current method relies on the pangenome constructed from a collection of genomes for a given microbial species, thus it does not account for gene content transfers between species. To account for interspecies gene transfer, one possible solution is to include a pool of “putative mobile elements” to expand the pangenome for each species. The prediction of StrainPanDA relies on the pangenome database, but it is not limited to the profiles of available reference genomes; thus, StrainPanDA can also be used to identify novel strains, as long as the relevant gene families are included in the pangenome. Although we focused on the comparison of StrainPanDA to other reference-based methods, it is worth noting that complementary approaches based on metagenome-assembled genomes (MAGs) can identify novel strains from metagenomic data. For example, DESMAN [] can provide the predicted draft genome of each strain; other recently developed MAG-based methods include mixtureS [], STRONG [], and so forth. In contrast to reference-based methods, MAG-based methods can identify novel species and genes, yet the quality of MAG will greatly affect the results. In addition, the MAG-based methods require much higher sequencing depth than reference-based methods, which prohibits their application to species with low abundance. In comparison, sequencing depth is not a limiting factor for StrainPanDA (Supporting Information Figures and ).
StrainPanDA is most suitable for the analysis of multiple metagenomic samples with shared within-species variants, such as longitudinal studies. While the analysis in this study focused on the human gut microbiome, StrainPanDA is broadly applicable to microbiomes in different environments, as long as the pangenomes of the target species are available. The performance of StrainPanDA, including the accuracy of predicted strain composition and gene content profiles, improves with sample size (Supporting Information Figure ) and sequencing depth (Supporting Information Figure ). To apply StrainPanDA on a typical metagenomic data set, it would be desirable to have at least 10 samples and the relative abundance of the species of interest to be above 1%. Due to the nature of StrainPanDA's algorithm, it may be difficult to disentangle within-species variants with genetic mosaic or highly similar gene content profiles (i.e., lacking strain-unique features), thus StrainPanDA is most suitable for analysis at the level of subspecies []. Finally, in comparison to MAG-based methods, StrainPanDA has minimal requirements for computing resources (Supporting Information Figure ) and can be scaled to process multiple species in parallel.
CONCLUSION
In summary, we show that StrainPanDA is able to provide accurate profiling of strain composition and gene content from metagenomic data. We envision that the application of StrainPanDA to the rapidly increasing metagenomic data sets, especially in the context of spatiotemporal characterization of microbiomes [], will help elucidate novel associations between molecular functions and microbial/host phenotypes as well as microbial ecology at the infraspecies level.
METHODS
Generation of pangenome database and mapping of metagenomic data
The pangenome database of bacterial species analyzed in this study was created following the steps recommended by PanPhlAn (version 1.2.8) []. For each bacterial species, genomes were downloaded from National Center for Biotechnology Information (NCBI). Average Nucleotide Identity (ANI) between genomes was calculated by mash (version 1.1) []. Representative strains (pairwise ANI ≤ 99%) were selected and used as reference genomes for pangenome construction. The annotated genes were extracted from the reference genomes and clustered into gene families at 95% identity by usearch (v7) [] to create the pangenome database. Shotgun metagenomic data were mapped to the pangenome database by PanPhlAn [] (version 1.2.8), which used Bowtie2 (version 2.4.1) [] and SAMtools (version 0.1.19) [] to map and count the reads, respectively. A gene family profile was generated by summing up the read counts of genes (normalized by reads per kilobase million [RPKM]) belonging to the same gene family. The gene family profiles of all metagenomic samples were grouped into a single gene family profile matrix. To account for potential noise in reads mapping, the gene family abundance was trimmed to 0 if the RPKM value was below the cutoff (10, by default). After trimming, gene families absent in all samples were removed from further analysis. In addition, samples were filtered out if the number of gene families detected was below 0.9 × gmin (gmin is the minimum number of gene families found in all reference genomes).
StrainPanDA algorithm
The core algorithm of StrainPanDA decomposes the gene family abundance data matrix () of the microbial species of interest into the product of two matrices (Figure ):
The gene family abundance data matrix is an N × S nonnegative matrix, where is the normalized count of gene family in metagenomic sample . The gene content profile matrix is an N × K binary matrix, where is 1 if the gene family is present in strain and 0 otherwise. The strain composition matrix is a K × S matrix, where is the relative abundance of strain in the sample . N is the number of gene families in the pangenome of the microbial species of interest, S is the number of metagenomic samples, and K is the number of strains (i.e., factorization rank).
To estimate P and S, we approximate the solution P′ and S′ using NMF, considering the nonnegative constraints on both matrices (optimized using the “snmf/r” algorithm implemented in the R package “NMF” [, ], version 0.21.0). The addition of sparsity constraints (i.e., regularization terms in the objective function) ensures the uniqueness of factorization [, ]. The S′ matrix is then scaled into relative abundances. We binarize the approximated P′ matrix, following the assumption that the matrix elements corresponding to “present” gene families should have higher values than “absent” gene families, and the matrix elements should have a tight distribution due to the expectation that P is a binary matrix (see Supporting Information Figure ). Briefly, we find the peak of the probabilistic density curve (pmax) for each strain j, where the number of matrix elements on the right of the peak ( > pmax) is equal to the expected number of gene families of the species of interest (i.e., averaged over all reference genomes in the pangenome database). We then cut the density curve at θ between the selected peak and 0 (θ = 0.5 × pmax, by default), where the gene families with a weight greater than θ are considered as present. The confidence score Cij for gene family i in sample j was assigned to every gene family:
The confidence scores were used to rank gene presence predictions for generating the Precision-Recall curves in the benchmarking experiments
To select the proper number of strains (i.e., the rank of NMF), we parsimoniously select the least number of strains from a range of 1–12 (by default) satisfying the following criteria: (1) The mean relative abundance across all the samples of any strain should be greater than τ2 (τ2 = 0.1 by default), (2) the number of gene families of all strains should be greater than τ3 × gmin (τ3 = 0.5 by default), gmin is the minimum number of gene families found in all reference genomes, and (3) the gene family profiles between a pair of strains should have Jaccard distance larger than τ1 (τ1 = 0.1 by default). The program also provides an option to accept a user-specified number of strains set. In this study, we did not set the number of strains a priori in benchmarking and applications of StrainPanDA.
Benchmarking StrainPanDA with synthetic data
Synthetic data of E. coli strains
We generated four types of simulated sequencing reads: (1) Error-Free (ErrFree): pick random fragments from the reference genome of E. coli by read simulator ART [] (version 2016.06.05; parameter: -ef -ss HS25 -l 150 -m 270 -s 27); (2) ART with sequencing errors (ARTErr): use ART to add sequencing errors on top of ErrFree reads (parameter: -ss HS25 -l 150 -m 270 -s 27); (3) pure whole genome sequencing (pWGS): randomly draw reads from the WGS data of selected strains by seq-tk (, version 1.3; default parameter); and (4) pWGS data mixed with a real background metagenomic data set (whole genome sequencing background [WGSBG]): Three different metagenomic data sets (see Supporting Information Table ; BG1, IBD; BG2, FMT; BG3, MI; as shown in Figure ) were used to mix with the pWGS data of E. coli at different ratios (1-, 5-, 10-, 25-, and 100-fold). Metagenomic samples were analyzed by Kraken2 (version 2.1.1; database: miniKraken2_v2_8GB_201904) to ensure a minimal abundance of E. coli. Strains of E. coli with pairwise genome-wide ANI between 95%–99% were selected to represent different subspecies (Supporting Information Table ). In each synthetic data set of mixed strains, 20 combinations of strain composition were generated by Dirichlet distribution (Supporting Information Table ). All synthetic data sets were generated by the SimStr pipeline (). For each strain, its genome size was considered 1× sequencing depth and used to calculate the number of reads to generate. The minimum relative abundance (i.e., frequency) of a strain was set as 5% and as one unit. For example, for E. coli synthetic data of 1× sequencing depth that we refer to in this study, the data size of the strain with 5% frequency was ~4.5 megabases (MB), while the total depth of each sample in this data set was always 20×, and ~90 MB in size (1× sequencing depth as a unit and 20 units in total). To evaluate the effect of sample size, 400 synthetic mixtures of four strains were generated by Dirichlet distribution. The synthetic data were separated into 10 runs (40 samples each) and further downsampled to 20, 15, 10, and 5 samples in each run.
Synthetic data of gut bacterial species
Synthetic data of sixspecies, including B. longum, C. difficile, E. coli, E. faecalis, F. prausnitzii, and P. copri, were generated separately (Sync6, pWGS at 5× sequencing depth; Supporting Information Table ). For each species, the relative abundances of four strains (5%, 10%, 25%, and 60%) were permutated to generate 24 samples in total (Supporting Information Table ). All six species were in the prebuilt databases of StrainPanDA and PStrain. Only B. longum, E. coli, and E. faecalis were in the prebuilt database of StrainEst, so the other species were excluded in the comparison to StrainEst (Figure ).
Evaluation of predicted strain composition
StrainEst (v1.2.4 through docker) and PStrain (downloaded from GitHub on May 23, 2021) were run with their prebuilt database and default parameters (StrainEst, ; PStrain, ). The strain compositional profiles predicted by different methods were evaluated and compared by SimStr. For the predicted strain composition shown in Figure (stacked bar plots), strains with relative abundance below 0.01 were filtered and the remaining strains were sorted by their relative abundance (rescaled to 1) in decreasing order. After sorting, the predicted strains in the lower tail exceeding the number of simulated strains were grouped into “Extras.”
Two commonly used metrics were used to evaluate the performance of predicted strain composition of different methods:
- 1.
JSD []: JSD between the predicted strain composition and actual strain composition is calculated by the distance function in phyloseq [] (R package) on the sorted relative abundance (in decreasing order). If the number of predicted strains is different from the actual number of strains, zeros were appended to the vector with a lower dimension. The JSD is symmetric and is in the interval of [0, 1]. It reflects the dissimilarity in compositional profiles of strains, that is, JSD = 0 represents an exact prediction.
- 2.
MCC []:
Owing to the lack of strain annotations from PStrain, we only computed MCC for strain composition predicted by StrainPanDA and StrainEst. For StrainEst, the predicted strains were directly annotated by reference genomes. For StrainPanDA, based on the predicted gene family profile, the Jaccard distance () between the predicted strain and all reference genomes was calculated. The reference genome with the smallest Jaccard distance to the predicted strain was used for annotation. If a strain in the synthetic mixture is included in the prebuilt database of reference genomes, the ID of the annotated reference genome is directly compared with the actual strain to determine if the predicted strain is a true positive. If a strain in the synthetic mixture is not included in the prebuilt database of reference genomes, we used the phylogenetic tree to decide whether a predicted strain is a true positive (Supporting Information Table ). Briefly, we generated a phylogenetic tree by parsnp [] (version 1.5.1, default parameter) including genomes of the strains used in synthetic mixtures and all the reference genomes. If the annotated reference genome of a predicted strain is within the cutoff of phylogenetic distance (cutoff = 0.05 for E. coli, corresponding to ANI ~ 99%) from an actual strain, it is considered a true positive.
Evaluation of predicted gene family profiles
For each microbial strain evaluated in benchmarking data sets, ErrFree reads at 5× sequencing depth were generated by ART simulator [] (Sync-Single data set) based on its reference genome downloaded from NCBI. The Sync-Single data set contained three replicates for each strain and was used to generate the actual gene family profile of each strain by PanPhlAn and PanPhlAn3-v3.1 (default parameters for sensitive mode). The gene families found in two or more replicates were considered “present.” The actual gene family profile (reference) of each strain was compared with the predicted gene family profile (Figure ). The Jaccard distances between microbial strains' predicted gene family profiles and their reference profiles (or the gene family profile of a randomly sampled reference genome) were computed (Supporting Information Figure ). The Precision-Recall curve of gene family profiles for each strain was generated by R package PRROC [] using the confidence scores to rank the gene families predicted (Supporting Information Figure ). For random guesses, 1000 random gene family profiles were generated by sampling N gene families from the pangenome as “present,” where N is the average number of gene families present in reference genomes. To demonstrate the ability of StrainPanDA to identify strain-specific genes, the pathogenic E. coli strain O104 (GCF_002983645) was introduced in a synthetic data set of four strains (Sync O104 data set, pWGS, 5× sequencing depth). The outbreak-related genes curated from Scholz et al. [] were used to evaluate StrainPanDA's gene content prediction (Supporting Information Table ). We also compared the predicted gene family profiles from StrainPanDA to the prediction from PanPhlAn and PanPhlAn3 (Supporting Information Figure ).
Runtime evaluation
The runtime of StrainPanDA was measured with the time command in Linux. All these tests were run on a workstation of Intel(R) Xeon(R) Gold 6238 CPU @ 2.10 GHz and 16 GB memory. The runtime (seconds) as a function of sample size was estimated by running StrainPanDA with the downsampled synthetic mixture of four E. coli strains (See Synthetic data of E. coli strains in Methods section). The runtime (seconds) as a function of strain number was calculated by running StrainPanDA with the pWGS and 25-fold WGSBG data set.
Applications of StrainPanDA in metagenomic data
Case study: Mother–infant gut metagenomes
All available samples of ERP005989 were downloaded from EBI (eight samples failed, Supporting Information Table ). On the basis of the diet history, infants without diet history were filtered. The rest of the 84 infants were split into three different groups (B_F_F, discontinued breastfeeding at 4 months; B_B_F, discontinued breastfeeding at 12 months; B_B_B, continued breastfeeding; the F_B_M sample was excluded; Supporting Information Table ). Samples without enough coverage on B. longum gene families were filtered by StrainPanDA at the preprocessing step and excluded from the downstream analysis. In the “continued breastfeeding” group, infants that kept breastfeeding between successive time points (i.e., between newborn and 4 months, or between 4 and 12 months) were included. In the “discontinued breastfeeding” group, infants who stopped breastfeeding by 4 or 12 months were included. For functional interpretation of the subspecies, we grouped the gene families annotated to the same KEGG (downloaded September 1, 2021) ortholog or CAZy (downloaded July 31, 2019) family. To further analyze the key functions related to breastfeeding, we curated a set of KEGG orthologs from related references [, ]. All the KEGG orthologs were further grouped into subclasses and classes based on the literature [, ] (Supporting Information Table ). The gene family coverage was calculated as the fraction of detected genes belonging to the subclass (Figure ).
Case study: FMT donor–recipient metagenomes
Raw sequencing reads were downloaded from ENA (Accession: PRJNA625520 for the study on Crohn's disease [], PRJEB23524 for the study on C. difficile infection study []). For Crohn's disease data set, species relative abundances were estimated using Kraken2 [] (version 2.0.8-beta) with the miniKraken database (v2_8GB_201904_UPDATE). To identify species associated with remission or relapse, the Wilcoxon rank-sum test was conducted to select differentially abundant species using the mean relative abundances across different time points for each subject (samples collected before FMT or after relapse were discarded). The relative abundances of subspecies were predicted by StrainPanDA and normalized by the centered-log-ratio transformation for calculating Spearman correlation with the species abundances. For functional annotation of the subspecies, virulence factors and CAZy annotations were taken from the species-specific databases constructed as described above.
Functional annotation of gene families
To annotate the gene families by KEGG, gene family representative sequences were mapped against KEGG orthologs (release 2020-07-20) using usearch [] (v11.0.667; “-ublast”). Alignments with identity >50% and query coverage >50% are kept. To annotate the gene families by CAZy, SeqKit [] (0.15.0) was used to translate the gene family centroids into six open reading frames. The translated amino acid was used as the input of run_dbcan [] (2.0.11), which used DIAMOND [] (2.0.8), HMMer [] (3.3.2), and Hotpep [] (2.0.8) with default parameters to predict the CAZy annotation. The CAZy annotations were selected only if it was predicted by at least two programs. If a gene family was assigned by more than one CAZy annotation, only the first annotation was used. To annotate virulence factors, gene family centroids were mapped against the VFDB (April 9, 2021) by DIAMOND [] (2.0.8; blastp, query coverage >50% and identity >50%).
AUTHOR CONTRIBUTIONS
Han Hu, Yan Tan, and Lei Dai conceived and supervised the study. Han Hu, Yuxiang Tan, and Chenhao Li developed the algorithm and performed the analysis on simulated and real metagenomic samples. Lei Dai, Han Hu, Yuxiang Tan, and Chenhao Li wrote the manuscript with inputs from Zhenjiang Zech Xu, Yang-Yu Liu, Yan Kou, and Yan Tan.
ACKNOWLEDGMENTS
We would like to thank Haokui Zhou, Ramnik Xavier, and members of the Lei Dai lab for constructive comments on the manuscript. We would like to thank Jinhui Tang and Yong Liang for their assistance in bioinformatics analysis; Xi Wang and Dongdong Xu for their support in maintaining the computing environment; and Reese Hitchings for proofreading the manuscript. This study was supported by the National Key R&D Program of China (No. 2019YFA0906700) and the National Natural Science Foundation of China (Nos. 31971513 and 32061143023).
CONFLICTS OF INTEREST
The results described in this manuscript support pending patent application CN202011146154.3A. Yan Tan is cofounder and shareholder with a personal financial interest in Xbiome. Han Hu and Yan Kou are employees of Xbiome. Lei Dai received research grant support from Xbiome and serves as an unpaid consultant to the company. The remaining author/authors declares/declare no conflicts of interest.
DATA AVAILABILITY STATEMENT
The source code of StrainPanDA is available on GitHub () and Zenodo (). The experiment source code used in the manuscript is available from . Supplementary materials (figures, tables, scripts, graphical abstract, slides, videos, Chinese translated version, and update materials) may be found in the online DOI or iMeta Science .
Poyet, Mathilde, Mathieu Groussin, Sean M. Gibbons, Julian Avila‐Pacheco, Xiaofang Jiang, Sean M. Kearney, Alli son R. Perrotta, et al. 2019. “A Library of Human Gut Bacterial Isolates Paired with Longitudinal Multiomics Data Enables Mechanistic Microbiome Research.” Nature Medicine 25: 1442–52. [DOI: https://dx.doi.org/10.1038/s41591-019-0559-3]
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Microbial strains of variable functional capacities coexist in microbiomes. Current bioinformatics methods of strain analysis cannot provide the direct linkage between strain composition and their gene contents from metagenomic data. Here we present Strain‐level Pangenome Decomposition Analysis (StrainPanDA), a novel method that uses the pangenome coverage profile of multiple metagenomic samples to simultaneously reconstruct the composition and gene content variation of coexisting strains in microbial communities. We systematically validate the accuracy and robustness of StrainPanDA using synthetic data sets. To demonstrate the power of gene‐centric strain profiling, we then apply StrainPanDA to analyze the gut microbiome samples of infants, as well as patients treated with fecal microbiota transplantation. We show that the linked reconstruction of strain composition and gene content profiles is critical for understanding the relationship between microbial adaptation and strain‐specific functions (e.g., nutrient utilization and pathogenicity). Finally, StrainPanDA has minimal requirements for computing resources and can be scaled to process multiple species in a community in parallel. In short, StrainPanDA can be applied to metagenomic data sets to detect the association between molecular functions and microbial/host phenotypes to formulate testable hypotheses and gain novel biological insights at the strain or subspecies level.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details

1 Bioinformatics Department, Xbiome, Scientific Research Building, Tsinghua High‐Tech Park, Shenzhen, China
2 CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
3 Center for Computational and Integrative Biology, Massachusetts General Hospital and Harvard Medical School, Richard B. Simches Research Center, Boston, Massachusetts, USA
4 Department of Food Science and Technology, State Key Laboratory of Food Science and Technology, Nanchang University, Nanchang, China
5 Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA