INTRODUCTION
In the last few decades, high-throughput sequencing technologies have boosted our comprehension of microbial communities, such as those present in the human gut (1, 2), in natural environments (3–5), and those relevant for biotechnological applications (6). The most widely used sequencing approaches for microbiome research are metataxonomics and metagenomics (7). Metataxonomics consists of the targeted sequencing of marker genes, usually focusing on hypervariable regions present in the 16S rRNA gene. This approach allows the clustering of highly similar reads into operational taxonomic units or amplicon sequence variant clusters from which the microbiome composition and diversity can be derived (7–9). By contrast, metagenomics relies on shotgun sequencing of the total DNA extracted from a microbial sample. Its popularity derived from its ability to not only delineate community composition and diversity but also to functionally characterize the microbes present in the sample (7). Moreover, different studies have highlighted the higher taxonomic resolution and precision of shotgun DNA sequencing as compared to 16S amplicon sequencing (7, 10). Notably, it was demonstrated that metagenomic approaches can go beyond the species level allowing the investigation of single strains present in a microbiome (11–13).
Different approaches allow for defining the microbial composition of a given sample
starting from shotgun metagenomic sequencing data. In a typical genome-centric
workflow, metagenome-assembled genomes (MAGs) are reconstructed by assembling the
sequencing reads and clustering the obtained contigs (14). Sometimes, the sequencing coverage of the low-abundant
organisms might be insufficient to support a
Alternatively, assembly-free profiling can be used to classify shotgun reads from microbial samples using a set of publicly available genomes or MAGs as reference and specific indexing schemes for the respective sequence database (17). This approach is becoming established since the number of MAGs is fastly growing and they often represent a significant fraction of microbial diversity in well-studied environments (16). Sequencing reads of each sample of interest can be aligned against the available genomes to define the sample-specific microbial community composition and the relative abundance of the associated taxa (6, 18).
With each MAG potentially representing a microbial species, alignment-based approaches can be used to infer the microbial composition and the species abundance. There are different tools that can convert the BAM files obtained from the reads assembly into coverage values or relative abundance, including checkM (19) and coverM (20). For the sake of having a clear representation of the microbial composition in a sample, however, it is necessary to apply a rationale to discriminate between present and absent species. This apparently trivial problem is made complex by the fact that homologous sequences, horizontal gene transfer events, prophage sequences, and even misassemblies can cause an incorrect assignment of reads to genomes, potentially resulting in the erroneous identification of species in the sample under investigation (21). Several metagenomic studies use coverage values of the reference genomes, or other parameters like the number of mapped reads, to calculate relative abundance and to call species as present or absent based on arbitrary thresholds (6, 21). However, since reads can be mismapped, this approach can lead to a high number of false positives, or an overestimation of the number of species present when setting a low threshold of relative abundance. Conversely, a higher threshold may cause the exclusion of low-abundant species, especially for complex microbial communities (21).
Metrics such as coverage, and the directly derived relative abundance, are genome-wide averages, which cannot be used to assess the precise genomic distribution of the mapped reads. This limitation hinders the discrimination between genomes having homogeneous coverage, and those with reads mapped on specific regions only (22). A highly uneven coverage is usually representative of species not present in the sample, but sharing genomic regions with others. The use of relative abundance alone can thus result in an erroneous species identification. The use of additional criteria such as the evenness of the mapped reads on a genome represents a more accurate criterion to assess their presence in the sample. As already suggested by Olm and colleagues (22), the sequencing breadth—the fraction of the genome covered by at least one read—can be used for this purpose. By comparing the real breadth with its expected value at a given coverage, it is possible to infer whether the reads span on the entire genome or cover only specific regions—the latter being representative of a lower sequencing breadth than expected.
In the present study, we investigated the use of different metrics for the identification of the species present in a metagenomic sample, confirming the assessment of mapped read distribution as an accurate approach to discriminating them from false positives. Specifically, we tested the ratio of the coverage breadth to its expected value as a metric to evaluate such a distribution, along with a newly developed metric based on the distance between consecutive mapped reads. Through multiple experiments on synthetic and real data, we showed that the comparison between breadth and expected breadth is not a reliable discriminating criterion at low coverages, while consecutive read distance is more robust. By combining the two metrics, we demonstrated that truly present species can be detected down to as few as tens of reads mapping on a microbial genome.
We included our approach into a freely available Python tool—Metapresence—allowing an easy and reproducible calculation of the two metrics starting from a set of genomes and an alignment file. Our tool allows a reliable definition of the species composition of a given microbial sample.
RESULTS
Definition of the approach
As a natural application of the classic Lander and Waterman’s model (23), a random genome fragmentation in shotgun sequencing is well approximated by a one-dimensional homogeneous Poisson point process. Hence, when shotgun sequencing reads are aligned back on the genome they originated from, their mapping positions can be seen as realizations of a Poisson process. When reads are paired-end, each of the two groups of mates can be represented as two different Poisson processes.
Based on this assumption, the number of times a given genomic position is covered
by a read is a random variable that follows a Poisson distribution with mean
where
Therefore, at a given
The 0.883 value was empirically derived by Olm and colleagues (22). In particular, the authors generated synthetic reads from two microbial genomes, subsampled them to different total read numbers, and aligned them back on the corresponding genome. Finally, they fit a curve to the resulting coverage and breadth data to derive the 0.883 value.
We define the metric Breadth-Expected breadth Ratio (BER) as the ratio of the observed breadth
If the sequencing reads are generated from all the regions of a genome, when aligned back, they distribute across its entire length, resulting in similar values of observed and expected breadth. On the contrary, when sequencing reads map only to a fraction of the genome, the observed breadth is lower than expected. Thus, if the GC content and other properties influencing the sequencing efficiency are homogeneous on the entire genome, BER should be close to one, and this result is a reliable confirmation of the presence of the genome in the sample.
However, as demonstrated in the next section, BER is not reliable at small values
of
In a one-dimensional Poisson point process, the inter-point distance follows an exponential distribution. Therefore, the distribution of the distances between the mapping position of consecutive non-paired reads follows an exponential distribution with a parameter
Thus, we can define the nearest integer number to the expected distance between two consecutive reads as follows:
Given that the number of distances between consecutive non-paired reads corresponds to the number of reads within each set of non-paired reads minus one, we define
where
We therefore define FUG as:
FUG is essentially an approximation of the cumulative distribution function of the exponential distribution for X > Δ. Thus, when the reads are uniformly distributed on the genome, the expected value of FUG,
When considering paired-end reads, Metapresence separately calculates the FUG value for two groups of non-paired reads, which are constituted by either all the mates mapping as “first in the genome” (upstream), or those that map “as second” (downstream). For each group of mates, two arbitrary reads are added, one mapping to the first position of the genome, and the other to the last. This artifice is performed since we observed situations where reads map uniformly on genomes not really present, yet only on a single region. The two arbitrary reads are necessary, from an algorithmic point of view, to identify these genomes as absent.
While the exponential distribution is continuous, genomic positions are obviously discrete values. The definition of the expected distance Δ is therefore an approximation since the ratio of the genome length to the number of reads is represented by a real number. When the number of mapped reads is low (Δ is high), this approximation is negligible. However, when the number of mapped reads is high (Δ is low), the FUG values deviate from the expected. As we show in the next section of the work, FUG is not reliable at high coverage values.
A schematic representation of the various mapping patterns that the metrics can distinguish is reported in Fig. 1. In all the following analyses, we set a detection limit of 80 mapped reads. This limit is necessary to ensure a minimum number of reads per group of mates for an accurate approximation of the distance distribution, and, consequently, a reliable calculation of FUG. Thus, all the genomes with fewer than 80 mapped reads are, by default, considered absent.
Fig 1
Schematic representation of different read mapping patterns. In both A and B, reads mapping on the green genome spread across its entire length, whereas reads mapping on the red one are confined in one or more delimited regions. (A) Although the number of mapping reads is the same, the green genome has a higher breadth of coverage than the red one. (B) The bidirectional arrows graphically show the distance between consecutive reads, which is quantified as the distance between initial read positions, depicted as hairpins. Large gaps between consecutive reads are more frequent for the red genome.
Fig 6
Species identified in AAD-related samples. (A) Histogram representing the number of AAD-associated species identified in each sample derived from affected patients or healthy donors. The y-axis represents the number of samples with a given number of AAD-associated species. (B) Histogram representing the total number of species identified in samples from affected patients and healthy donors. The y-axis represents the number of samples where the number of species is within a given range.
DISCUSSION
The importance of this study lies in the need to address the inherent biases associated with utilizing relative abundance thresholds for defining the species-level composition of a metagenomic sample. To address this challenge, the main target was to explore the utilization of mapped read distribution on genomes as a reliable method for discriminating the species present from those absent.
The results obtained from synthetic and mock communities indicate that this approach is highly precise and holds significant potential. Furthermore, the findings from real case studies from AAD samples align with established knowledge (35), suggesting that the proposed method can effectively define the species-level composition of complex microbial communities, such as those found in the human gut.
Comparison with established methods for species identification in metagenomic data revealed that the approach investigated in this work can outperform existing ones, especially in the analysis of communities with a high prevalence of rare species. Interestingly, the integration of the proposed coverage evenness metrics in the Metaphlan4 workflow (15) increased its precision. This result suggests that the evaluation of mapping read distribution may improve the accuracy of methods that rely on read alignment to perform inferences. More thorough analyses are needed to investigate the potential of coverage evenness metrics in test cases going beyond species identification in metagenomic samples.
The most critical point of the proposed approach is the selection of the metric thresholds. According to the tests performed, we defined 0.77 and 0.5 as the optimal thresholds for BER and FUG, respectively. Nevertheless, users have the flexibility to choose a more stringent value for these metrics.
In this work, we have shown that the BER value for a given genome is dependent on the
similarity between that genome and the one from which the sequencing reads were
obtained. The test performed for the identification of the
Given the TNR observed in the analyses performed on synthetic and mock communities, the described approach can improve the outcomes beyond those achieved solely with relative abundance thresholds. In fact, irrespective of the coverage value (or the relative abundance), high BER and FUG values for a genome representative of a given species can reinforce the conclusion that the species is indeed present in a sample. Hence, the proposed method is more suitable than using relative abundance thresholds and particularly for the identification of rare species and those with moderate occurrence.
Moreover, given that higher values of the metrics correspond to increased sequence similarity, we hypothesize that the described approach may also be used in a reliable way for defining the strain-level composition of metagenomes. This could be achieved by employing fine-tuned BER and FUG thresholds for distinguishing genomes representing different strains of the same species within the same sample.
In general, FUG can be calculated through the theoretical and practical description that we have included in this manuscript, and BER value can be readily calculated starting from the output of tools such as inStrain (22) and coverM (20). However, to facilitate the calculation of BER and FUG we developed the Python tool Metapresence.
Overall, we demonstrated the potential of using the distribution of mapped reads on genomes as a criterion to define the presence of species in a metagenomic sample and we envision further development of this approach and its implementation in metagenomic pipelines.
MATERIALS AND METHODS
BER and FUG calculation
To calculate the BER metric, it is necessary to calculate the coverage breadth and the average coverage, which are derived from the per-base sequencing depth. More in detail, for a given contig, to calculate the depth at each position, Metapresence generates an array as long as the length of the contig, where each position of the contig is represented by the value in the array with an index equal to that position. All the values are initially set to 0. If a read maps on a given position, +1 is added to the corresponding value in the array, while −1 is added to the value of the array corresponding to the mapping position plus the read length. Multimappers, cigar operations, Phred quality scores, and alignment scores are ignored. Finally, sequencing depth at each position is calculated as the sum of all values in the array preceding the position. For a given genome, the average coverage is given by the average depth across all positions of each contig, while the breadth is given as one minus the ratio of the number of positions with zero depth to the genome length.
The BER metric is given by the ratio of the calculated breadth to the expected breadth, where the latter is obtained using the formula mentioned in the first section of the results.
To calculate the FUG metric, all the contigs of a given genome are arbitrarily joined together in one contiguous sequence. The mapping positions on all the contigs of a given genome are stored in an array. If the reads are paired-end, two arrays are generated: one for the first mates encountered in the sorted bam file, the other for the second. For each contig, the values that are stored are given by the mapping position on the contig plus the sum of the lengths of all the contigs preceding it in the arbitrarily generated contiguous sequence. One arbitrary read is added at the starting position of the contiguous sequence, and one at the ending position minus the average read length.
For each array, the expected distance is calculated as the ratio of genome length to the number of reads in the array, and the FUG metric for each group of mates is then calculated as described in the first section of the results. The artifice of joining the contigs together is necessary for calculating FUG on low-coverage and fragmented genomes. Analyses not included in this manuscript have shown that this approach does not bias the FUG value. Furthermore, the addition of one read at each end of the contiguous sequence not only avoids biasing the FUG value but is also essential in situations where reads uniformly map to a single contiguous region of the genome.
The calculation of the two metrics is implemented in Metapresence. Computational time and resource requirements were measured relatively to different data set scales. This analysis is shown in Fig. S3. When only one process is used, results show that, even when the number of reads is higher than one hundred million and the total size of the reference database is on the order of billions of bases, time consumption remains on the order of minutes and peak memory on the order of a few gigabytes. Parallelization significantly speeds up the computation, increasing however memory consumption.
Synthetic test data sets
To test the metrics on synthetic data, the Critical Assessment of Metagenome Interpretation (CAMI) medium complexity data set was used (24). This consists of 132 bacterial and archeal genome sequences and 100 bacterial plasmids, together with almost one hundred million 150 bp paired-end reads in fastq format, synthetically generated from the 232 sequences mentioned above assuming an Illumina HighSeq error profile. In particular, all the analyses were performed using the two sets of paired-end reads with an insert size of 270 nucleotides. This data set was chosen since it simulates a microbial community with a known species composition.
To evaluate the effect of read mismapping and to test the sensitivity and specificity of the metrics, the data set was expanded by adding 170 genomes selected among the bacterial representative genomes in the microbial genome NCBI database, and with an average nucleotide identity (ANI) between 0.80 and 0.95 with at least one of the CAMI genomes. The NCBI sequence identifier for each one of these genomes is reported in Table S1, section 1. ANI was calculated using the fastANI software with default options (37), and by setting the CAMI genomes as references and the representative genomes as queries. The plasmid sequences were not included in the alignment step. The reads were aligned using Bowtie2 with default options (38).
To generate more complex synthetic communities, we used CAMISIM (26), giving it as input 395 genomes. These genomes were species-specific and, according to fastANI (37), they had ANI values lower than 95%. We set the genomes_total and num_real_genomes parameters to 395, while a specified log_sigma parameter was used to define ten distinct communities. In particular, we used the following values: 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, and 3.0. For each simulated community, 50 million pairs of reads were generated. All other options were kept to default.
To expand the reference genome data set, we downloaded from NCBI 821 genomes with ANI values between 85% and 95% with at least one genome among those used to generate the synthetic reads. Information about the species forming the synthetic communities and the corresponding sequence identifiers is shown in Table S1, section 5.
Evaluation of TPR and TNR at different coverage values
To evaluate the potential influence exerted by the coverage on a given genome on
the precision of BER and FUG metrics, we downsampled the mapped reads of both
CAMI medium’s sample 1 and sample 2 using the option
We separated original and artificially introduced genomes in two distinct groups, and we clustered them into bins depending on their coverage. Since the coverage value ranged between 10-3 and 102 for the original genomes, bin boundaries were defined over logarithmic intervals as follows. For TPR, we clustered all original genomes in all subsamples in bins. Each genome, if detectable in multiple subsamples, could appear in multiple bins depending on the coverage value. The boundaries of the i-th bin were thus given by the following function:
where
The TPR calculation was performed independently for the genomes of each bin, and the result was plotted using a logarithmic scale along the x-axis to facilitate visualization.
For the TNR calculation, the artificially introduced genomes were used. The procedure was identical, with the exception that in this case
BER and sequence similarity
The sequencing reads were generated from the RefSeq genomic sequence of
-ss HS25, -p, -l 150, -m 1000, -s 50, -c 50000000.
The genomes included in the database for read alignment were selected taking into
account ANI as follows. The ANI values were calculated using fastANI (37) and all the representative genomes in
the NCBI database belonging to the
For the genomes used in the analysis, the NCBI genome sequence identifiers and the corresponding ANI values are reported in Table S1, section 4.
Benchmark tools
The reference sketch for YACHT (version 1.2.2) (25) was built using the “yacht sketch ref” command, using as input our expanded CAMISIM reference genome data set, comprising the 395 genomes used to generate the synthetic reads and the 821 added genomes. The following options were used: --kmer 31 --scaled 1,000. All other options were kept to default. The sketch for the sequencing reads from each synthetic community was built using the command “yacht sketch ref” and the following options: --kmer 31 --scaled 1,000. All other options were kept to default. For the command “yacht train,” we used an ANI threshold (--ani_thresh) of 0.975, which was the lowest value avoiding the merging of multiple genomes used for generating the sequencing reads. For the command “yacht run,” we tested multiple values with the --min_coverage_list option and we selected the min_cov value resulting in the highest balanced accuracy across samples (min_cov = 0.05), that is, the mean between TPR and TNR. Moreover, we tested both 0.99 and 0.95 as significance thresholds, selecting 0.99 as the one giving more accurate results.
Concerning Metaphlan4 (version 4.0.6) (15), the sequencing reads were aligned against the Metaphlan database (version: October 2022) using Bowtie2 with the following options: --sam-no-hd --sam-no-sq --no-unal --sensitive-local. Metaphlan4 was launched with the option “--input_type sam.” All other options were kept to default. From the Metaphlan4 output, we considered “present” in the sample all the species with relative abundance above a given threshold. Several thresholds in the range of 0%–0.1% were tested, and the one resulting in the highest balanced accuracy was selected (0.002%). To calculate the TPR, we did not take into account all the species used to generate the synthetic communities which were defined later than 2021 according to the List of Prokaryotic Names with Standing in Nomenclature (27), and in particular, to the web version of the database and the “Name” section of the page of each species. Moreover, we searched in the Metaphlan4 output all the synonyms of each species that are indicated in the same database. A true positive was considered present in the sample if at least one of its synonyms was identified in the Metphlan4 output. The number of false positives was calculated by counting the number of species-level genome bins identified in the Metaphlan4 output and above the relative abundance threshold not corresponding to species used to generate the sequencing reads. The TNR was obtained by dividing this number by the total number of species-level genome bins in the database according to the documentation of the tool, that is 26,970.
Concerning the integration of FUG and BER metrics with the Metaphlan4 workflow, from the alignments performed against the Metaphlan4 database we used Metapresence to calculate BER and FUG values for each marker gene present in the database. We modified the Python script of Metaphlan4 (version 4.0.6) by changing the function “map2bbh”—defined at lines 835–895—so that all marker genes with BER value smaller than 0.7 and FUG value (average of the two mate groups) smaller than 0.4 were considered to have zero reads mapping on them. We chose more flexible values of the metrics to account for the small size of the marker genes. The definition of TPR and TNR was performed from the output of the modified Metaphlan4 as described above, and without setting any relative abundance threshold.
Mock community data set
To test the metrics on real shotgun sequencing data, the mock community defined by Singer and colleagues (28) was used. The sequencing reads were subsampled to three groups of 10 million read pairs, ensuring each pair could appear in only one subsample. The NCBI sequence identifier of the genomes selected as representative for each species of the mock community is shown in Table S1, section 2.
The expansion of the data set with new genomes and the read alignment was performed in the same manner used for the synthetic data sets (Methods, Synthetic test data set). The NCBI sequence identifier of the added genomes is shown in Table S1, section 2.
AAD data set
The sequencing reads of samples derived from 63 AAD-affected patients and 61 healthy donors were recovered from multiple studies (30–32). The study and the accession number associated with each sample are shown in Table S1, section 3. The sequencing reads were aligned against the human gut database generated by Almeida and colleagues (2) using Bowtie2 and keeping all options to default.
To identify the MAG present in the database corresponding to each AAD-associated bacterial species, we selected the MAGs with ANI values higher than 95 with the NCBI representative genome of each species. The ANI was calculated using the software fastANI (37) keeping all options to default. According to the calculated values, all AAD-associated species have one and only one corresponding MAG in the human gut database.
The comparison between the numbers of species identified in affected patients or
healthy individuals was performed using the
Normalized read count metric
To compare the abundances of AAD-associated species in healthy or affected patients, we implemented a normalized read count metric (
where
To compare the distributions of
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Copyright © 2024 Sanguineti et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
ABSTRACT
Shotgun metagenomics allows comprehensive sampling of the genomic information of microbes in a given environment and is a tool of choice for studying complex microbial systems. Mapping sequencing reads against a set of reference or metagenome-assembled genomes is in principle a simple and powerful approach to define the species-level composition of the microbial community under investigation. However, despite the widespread use of this approach, there is no established way to properly interpret the alignment results, with arbitrary relative abundance thresholds being routinely used to discriminate between present and absent species. Such an approach can be affected by significant biases, especially in the identification of rare species. Therefore, it is important to develop new metrics to overcome these biases. Here, we present Metapresence, a new tool to perform reliable identification of the species in metagenomic samples based on the distribution of mapped reads on the reference genomes. The analysis is based on two metrics describing the breadth of coverage and the genomic distance between consecutive reads. We demonstrate the high precision and wide applicability of the tool using data from various synthetic communities, a real mock community, and the gut microbiome of healthy individuals and antibiotic-associated-diarrhea patients. Overall, our results suggest that the proposed approach has a robust performance in hard-to-analyze microbial communities containing contaminated or closely related genomes in low abundance.
IMPORTANCE
Despite the prevalent use of genome-centric alignment-based methods to characterize microbial community composition, there lacks a standardized approach for accurately identifying the species within a sample. Currently, arbitrary relative abundance thresholds are commonly employed for this purpose. However, due to the inherent complexity of genome structure and biases associated with genome-centric approaches, this practice tends to be imprecise. Notably, it introduces significant biases, particularly in the identification of rare species. The method presented here addresses these limitations and contributes significantly to overcoming inaccuracies in precisely defining community composition, especially when dealing with rare members.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer