Understanding the epidemiology and pathogenesis

Full text

Turn on search term navigation

Introduction

Tuberculosis (TB) took 1.3 million lives in 2022 globally, making it the second deadliest infectious disease just after COVID-19.[1] The World Health Organization (WHO) has set the goal to eliminate tuberculosis as a public health problem by 2035, which requires reducing TB incidence by 90% and mortality by 95% compared to 2015. New vaccines, better drugs and improved diagnostic tools are key advances needed to end the TB pandemic. However, only one now old vaccine with limited efficacy, the so-called BCG developed from attenuated M. bovis, has been licensed for TB. Although recently new drugs, including bedaquiline, pretomanid and delamanid, have been added to the list of drugs available to treat TB in the past decades, strains extremely resistant even to these new drugs have already been observed. This stunted progress is partially due to the lack of incentives and underfunding, but also underlines our lack of understanding of the biological process and mechanisms relating to issues such as host-pathogen interaction and drug resistance development.

The first whole genome sequence of MTB (Mycobacterium tuberculosis) strain H37Rv was published in 1998 and has been used as the backbone for mapping short reads from high-throughput sequencing platforms to identify genetic variations, primarily single nucleotide polymorphisms (SNPs) and small indels described in most studies.[2] This information can be used to infer the phylogeny and evolution of epidemic strains, to monitor recent transmission and outbreaks, and to predict drug resistance phenotypes.[3–6] These milestone findings suggest that SNPs and small indels indeed correspond to a large proportion of the observed diversities in clinical MTB isolates. However, the limitation of this pipeline is also obvious since genetic variations in the regions not presented in the reference genome are not included. Comparative genome analysis has already shown that large sequence polymorphisms (LSPs) between strains are not only related to the genealogical delineation, such as Lineage 2 of MTB defined by RD105, but are also associated with important biological processes.[7,8] For example, two genes mmpS6 and mmpL6 are both deleted or truncated in “modern” MTB strains due to the deletion of TbD1 region, which may facilitate the successful expansion of modern strains globally through a hypoxia-induced copper response and increased virulence.[9,10] Another region of difference, RD1, harbors several open reading frames (ORFs) encoding ESX-1 secretion system proteins including CFP-10 and ESAT-6 which are crucial to virulence and phagosomal escape.[11,12] The RD1 deletion in M. bovis is related to the attenuation of the BCG vaccine.[13] Besides these LSPs which result in complete deletions of genes, other types of structural variations (SVs), such as IS6110 insertions in coding sequences, partial deletions of genes by large deletions (>100 bp) or interruptions by large insertions, are also missed by the standard pipeline and may have important consequences related to important biological processes.[12,14,15] By focusing only on SNPs and small indels, it’s assumed that at least a certain proportion of genetic diversity information in clinical strains of MTB is missed, thus we propose a more comprehensive approach to study genomic data.

Accumulation of high through-put sequencing data has allowed the development of a new study field: pangenome analysis.[16,17] The term pangenome refers to all the genetic content of a given dataset, usually a species. The shared genomic content or genes in all genomes are referred to as core genes; other genes presented in several but not all genomes are called dispensable or accessory genes which may be introduced by horizontal gene transfer, gene duplication/deletion or other SVs. Accessory genes are supposed to account for much of the diversity among clinical strains of microorganisms.[18] This is especially true for species with a large number of accessory genes, such as E. coli, which has high genome plasticity and a very large gene pool.[19] MTB is known to have a conserved genome with no plasmids detected and horizontal gene transfer is very rare in modern strains if it exists at all. Thus, it is questionable how much of the diversity observed in clinical practice could be explained by the presence/absence of accessory genes. The drawback of the presence/absence approach is that the variations within genes are ignored, which might contribute to the genetic diversity at different levels. Thus, for a microorganism with a conserved genome such as MTB, a combined approach by integrating the pangenome and within gene variation information is desirable.

In this study, we developed a pipeline to generate a non-redundant pangenome of MTB addressing the problems outlined above from the available short read sequencing data.[20] Based on the pangenome matrix and variations detected with dedicated tools, we described the genetic variation at nucleotide, gene and protein levels. Based on this comprehensive profile of genetic diversity in the pangenome, we characterized the evolutionary patterns of all coding genes under the influence of selective pressure, population structure and purifying selection, which reflects the complex interactions between these factors. Gene ontology analysis of genes showing different evolutionary patterns has revealed some primary characteristics of the MTB biology and the interactions between MTB and the environment.

Method

Sampling and WGS

The 420 clinical MTB isolates used in this retrospective study were selected from the collection of the first national drug resistance baseline survey in China from April 1 to December 31, 2007.[21] Written informed consent was obtained from each participant. All participants were anonymized by uniformed participants identifier. Drug susceptibility testing was performed for six drugs including isoniazid, rifampin, ofloxacin, streptomycin, ethambutol and kanamycin.[21] Selection criteria were based on drug resistance patterns, geographic location of isolation, and genotypes such as spoligotype and MIRU-VNTR profile, in order to represent the diversity in different geographic regions and the population structure of the epidemic strains. The phylogenetic and evolutionary analysis based on whole genome sequencing were published previously by our laboratory and all sequencing data is available in project PRJNA573798 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA573798) on the NCBI server.[20] Briefly, all samples were sequenced on an Illumina platform using single end or paired end library. After filtering out repeat regions, PE/PPE_PGRS genes and insertion sequences, phylogenetic trees were constructed with MEGA using a maximum likelihood algorithm.[22] For the following analysis, the reference genome of H37Rv (NC_000962.3) was used for annotation and variation detection. This study has been approved within the project (2022YRC2305200) by the Ethics Review Board of China CDC (No. 202223) and the data was accessed for the pangenome analysis in November 30, 2022.

Assembly and annotation

The same sequencing data was used to produce the assembly of each strain using the program Spades with k-mer size auto-detected and coverage cut-off set to off and trying to reduce the number of mismatches and short indels.[23] The resulting contigs were first blasted against the NCBI nucleotide database to remove possible contamination from the previous steps. All blast hits of e-value > 1e-5 were excluded. Preserved contigs were excluded if aligned to species other than MTBC. The quality of assemblies was evaluated with QUAST.[24] All contigs longer than 200 bp that passed the contamination filtering were annotated by Prokka [25] to predict coding sequences (CDSs) with default parameters. The produced gbk file was used to search for homologues genes.

Constructing non-redundant pangenome

Two approaches are combined to generate the full-set of non-redundant pangenome: CDS prediction and blastn. CDSs predicted by Prokka were clustered by GET_HOMOLOGUES using COG algorithm.[26] When searching for clusters, the max e-value for blast was set to 0.01, the max intergenic size was set to 1,000 bp, minimum coverage was using the default 75%. At least one sequence was required for a cluster to be called. Only CDSs were considered for clustering. Clusters which were clustered or overlapped with Rv by>=60 bp were considered as redundant to Rv and were removed from candidate new genes list.[27] The left candidate new-/non-Rv genes went through self-overlapping filter by length to keep the longer CDSs if the overlap exceeds 60 bp.

Because Prokka can only predict valid ORFs, for those with interruptions such as frameshift mutations or other structure mutations, these genes are not predicted in the output or predicted with alternative ORFs, but could provide information on phylogeny and evolution. To further detect this part of information missed by prediction software, we applied blastn using consensus sequences of each cluster as the query sequence. Consensus sequences of the new candidate genes and sequences of all 3,906 CDSs and 112 other genetic features from H37Rv genome, including pseudogenes (30), tRNAs (45), ncRNA (20), rRNA (3), misc RNA (2), fragment of putative small regulatory RNA (10), and two other feature genes, were blasted to all contigs of 420 samples and the H37Rv genome. Blast hits with e-value < 1e-5, identity > 90% and query coverage > 30% (for Rv genes)/ > 90% (new candidate genes) were kept. Blastn results went through redundant checking using synteny and overlapping information. When two blast hits for the same gene in one sample have no overlap of their locations, they would be considered as gene duplications and only represented one homologues gene.

Finally, candidate new genes in complementary sequences to H37Rv genome were manually checked and filtered according to expert opinion by manual checking the similarity of complementary sequences and locations of the candidate new genes in the complementary sequences. All contigs were blasted to the H37Rv genome sequence and blast hits with e-value < 1e-5 and identity > 90% with the highest score were kept. Complementary sequences absent in the H37Rv genome were extracted based on blast results. All complementary sequences longer than 100 bp were compared to each other and those sharing more than 90% identity and more than 95% query coverage with e-value < 1e-5 were considered as homologous sequences. This step could filter alternative ORFs predicted in different samples that were not filtered out in previous steps. The whole pipeline is summarized in S1 Fig.

Pangenome composition

The presence or absence of an individual gene was analyzed at different levels separately. At DNA level, if blastn identified a non-redundant best hit for a gene, it was considered as the presence of this gene. Conversion of gene sequence to protein sequence was done with Biopython package.[28] Because genetic variations such as SVs or non-sense point mutations could result in failure of in silico translation, only coding sequences that could be translated into protein sequences in silico were considered as valid ORFs. At protein level, only when a gene sequence had a valid ORF was it considered as present. Non-synonymous mutations in start/stop codon resulting in loss of start/stop codons, gain of an extra stop codon, frameshift mutations, and IS6110 insertions in CDSs were considered as interruptions of translation. In addition, if > 90% of the gene sequence was deleted or not detected by blastn, it was considered as complete deletion.

We define core genes/proteins as genes/proteins presenting in all 420 strains and the reference genome, soft-core genes/proteins as genes/proteins which were deleted/interrupted in less than 5% (n = 21) of the whole sample population, shell genes/proteins as presenting in between 3 (inclusive) to 400 genomes, and cloud genes/proteins in less than 3 genomes.

Detection of structural variations and high impact SNPs

Eleven open source software packages were used to call insertions and deletions, including assemblytics [29], bcftools [30], breakdancer [31], delly [32], lumpy-smoove [33], minimap2 [34], svaba [35], softsv [36], tiddit [37], unimap [38], and wham [39]. Called variants were first filtered based on the parameters with each method such as number of supporting reads. Inversions were not analyzed in this study. Deletions and insertions > 100 kb were also filtered without further check. Variants detected for each sample were merged by SURVIVOR only if supported by more than one method (for deletions > 1,000 bp, more than four methods were required) with restrain to the same variation type and breakpoints within 100 bp.[40] High impact SNPs were defined as SNPs that cause loss of the start/stop codon or gain of an extra stop codon within ORFs. The standard pipeline based on bwa and bcftools was used to detect the SNPs and the effect was predicted by snpEff.[41]

The pangenome matrix at gene level from composition analysis was compared with the results from SURVIVOR. Deletions >1,000 bp without corresponding gene deletion in the pangenome matrix were checked manually with IgV.[42] The gene deletions in the pangenome matrix that were not detected by the 11 methods were checked by viewing in IgV and added to the dataset if confirmed. IS6110 insertions were detected in paired end samples with ISMapper.[43]

Genetic diversity, selective pressures, positive selection and population stratification

Tajima’s D [44] and nucleotide diversity π [45] for each coding gene were calculated using Dnasp (v6).[46] Interrupted gene copies were excluded before calculating the statistics under the assumption that these sequences evolve faster than valid ORFs. Significance for Tajima’s D was given by the table in the original paper by Tajima.[44] Homoplasy was detected by homoplasyFinder using the default parameters.[47] Homoplasy in SNPs were only detected in valid CDSs. For each gene tree used to find homoplasy SNPs in the specific gene, the phylogeny tree produced previously was re-used and genomes with interrupted CDSs were removed using the python package ete because homoplasyFinder requires all genomes in the tree presenting sequences for analysis.[20,48] SVs resulting in interruptions of the same CDS in the same genome were considered as homoplasy ignoring the type and the position information. Interquartile range (IQR) for homoplasy events was calculated by substracting the 25^th percentile from the 75^th percentile. A high homoplasy level was defined as more than the sum of IQR and the 75^th percentile. Fixation index Fst for SNPs was calculated using command basic.stats from R package Hierfstat according to the equation 7.38-7.43 in page 164–5 of Nei (1987).[49,50] All negative Fsts were considered as zero. Significance for positive Fsts were calculated by permutating all samples among all groups for 1,000 times and counting the number of Fsts greater than or equal to the observed Fst using the same R package Hierfstat.

Gene ontology and overrepresentation analysis

Coding genes under different evolutionary patterns were subjected to gene over-representation analysis using enrichGO function in clusterProfiler (version 4.14.4) package.[51] The background annotation database was constructed from annotations from PANTHER server (accessed in January 27, 2025) using AnnotationForge package (version 1.48.0) and GO.db package (version 3.20).[52] Bonferroni correction and Fisher exact test were implemented to find over-represented GO items in each group as compared to the background annotation database. The minimum annotated gene set size for analysis was set to 3. P-value, adjust P-value and q-value were all set to 0.05 to filter insignificant results. Only the most specific GO items were retrieved.

Results

M. tuberculosis has a conserved and closed pangenome

The data sample used in this study has been analyzed in a previous study which described the phylogeny and evolution of the epidemic MTB strains in China.[20] This sample set contained 344 Beijing family strains, 14 Lineage 4.2 strains, 15 Lineage 4.4 strains, 40 Lineage 4.5 strains, 6 Lineage 3 strains and one Lineage 1 strain, which were representative of the epidemic MTB strains in China and these strains were also prevalent in other regions globally.

The number of coding genes annotated for each genome varied from 4,167–4,337 (median = 4,194), resulting in a total of 1,707,532 coding gene sequences by de novo annotation from 420 strains and the H37Rv genome. After pair-wise comparison and clustering, the initial pangenome set automatically predicted by software consisted of 10,097 coding gene clusters, which included 3,754 singleton clusters. More than two thirds (1,167,634; 68.38%) of the predicted coding gene sequences were identical to the blast results. Besides Rv gene sequences, 66,412 sequences in 436 clusters remained after removing redundant (overlapping >=60 bp with Rv genes) CDSs predicted by Prokka. To deal with alternative ORFs annotated in different genomes, another round of blast using these 436 clusters as query sequences was done. After blasting consensus sequences of these 436 clusters to all contigs and repeating the process of removing redundance, 288 clusters were left. At this step, one cluster was removed because there were only two sequences of very different lengths in this cluster and no consensus sequence could be produced.

Among the remaining 288 clusters, 20 clusters harbored repeat sequences or were in insertion sequences such as IS6110 and were removed from the final pangenome. There were 175 clusters which had non-redundant homologues sequences in the H37Rv genome and were considered as new Rv genes. Among the other 93 clusters, 22 clusters were in previously reported LSPs, such as RvD1 (four genes), RvD2 (4), RvD5(4), and TbD1 (1), and some less known LSPs, such as the RvD4494 (inserted at 2,219,418, six genes). Another 40 clusters were in complementary sequences (sequences not present in the H37Rv genome) adjacent to PPE/PE_PGRS genes or IS6110/IS1532 insertion sequences or both. Seven genes were in complementary sequences adjacent to other highly polymorphic regions, such as Rv2082. These 69 genes were not detected in H37Rv genomes and were considered as non-Rv genes. Another 16 genes were not in complementary sequences to H37Rv but were interrupted by various kinds of SVs in H37Rv, such as 15427_dxs interrupted by IS6110–15 insertion. These 16 clusters were also considered as non-Rv genes, which resulted in a total of 85 non-Rv genes. The eight genes left were excluded from the pangenome because of overlapping >= 60 bp in H37Rv (though in some genomes < 60 bp) or being in low quality short contigs (about 300 bp). Finally, 260 new genes were added to the final pangenome in addition to the 4,018 genes/features in H37Rv genome, including 175 new Rv genes and 85 non-Rv genes.

The final non-redundant pangenome constituted 4,278 genes, 4,183 of which were protein coding genes. Besides the 3,906 CDSs and 175 new Rv genes in the H37Rv genome, another 17 pseudo Rv genes interrupted in H37Rv might produce translation products in other genomes, so the total coding genes in H37Rv were 4,098 (Fig 1).

[Figure omitted. See PDF.]

Pangenome composition at different levels and with different methods.

Genetic variations in the pangenome

Among the 4,183 protein coding genes in this pangenome, 3,438 genes presented in all genomes, but only 1,651 genes had valid ORFs in all 421 genomes. We combined pangenome matrix with presence/absence information and genetic variations detected to see how the presence or absence at gene level and at protein level in the pangenome could be categorized by different genetic variations. Because most of the available tools to detect SVs and SNPs with short reads sequencing data require a linear reference genome, genetic variations such as SNPs and SVs were only detected for the 4,098 coding genes (including pseudo-Rv genes) in H37Rv genome. We also included insertions of IS6110 to categorize the interruptions.

With dedicated tools, we detected 76,671 SVs (7,334 unique SVs) in the 4,098 coding genes in total, including 49,655 deletions, 26,700 insertions and 316 inversions. Inversions were not included in the following analysis. For the 325 paired-end samples, 1,860 IS6110 insertions at 411 positions in 311 samples were detected. Using the standard pipeline, there were 16,151 high impact point mutations in start/stop codons, such as those causing loss of a start/stop codon or gain of an extra stop codon, detected in 1,007 genes. Some gene copies contained multiple types of variations which were referred to as “complex” in this study. A “complex SV” composed of “indels” plus an “IS6110” insertion was detected 35 times in our dataset; a “complex” variation of “indel” and “stop gained” was detected five times. 6,035 unique SVs only occur in single isolates in either coding genes or intergenic regions. In the 4,098 coding genes, 5,111 (69.69% of 7,334) only occur in single isolates. This level is comparable to the reported level of 47.5% only for small indels(1–40 bp) in the study of Coll.[53]

Based on the pangenome matrix, there are 99,694 interruptions detected in 420 strains affecting 2,447 coding genes. There were 23,221(23.29%) copies of 744 genes which could not be detected by blastn, of which 50.40% (11,704, including complex variations) could be attributed to large deletions detected resulting in the complete deletions of genes. Other SVs added another 46.26% (42,110) of the gene interruptions in coding genes which were not completely deleted. High impact SNPs alone caused 12.65% (12,615) interruptions. Small indels (< 100 bp) contributed the largest part of interruptions (31.22%) by a single category. Together, SVs including insertions of IS6110 and high impact point mutation in start/stop codons classified 67,315 (67.52%) interruptions in the pangenome in total, which left 32.48% of the interruptions unclassified. (Table 1)

[Figure omitted. See PDF.]

We further investigated the distribution of un-classified interruptions to identify possible causes. A total of 32,379 un-classified interruptions were observed in 1,328 genes. There were 253 genes having more than 21 (5% of total samples) un-classified interruptions. The un-classified interruptions in these 253 genes corresponded to 88.29% (28,587) of all the un-classified interruptions detected, indicating these genes were tend to produce low quality data with our pipeline. On the other hand, there were 33 (out of 421) samples each with > 50 un-classified interruptions of in total 3,635 (39.16%) un-classified interruptions, indicating these samples might have yielded lower sequencing quality and more errors in assembly, but this bias was less obvious as compared to the 253 genes with low quality mentioned above. After excluding the 253 low quality genes, there were 46,024 interruptions in 2,194 genes and only 8.24% (3,792) of the interruptions were not classified by genetic variations.

Some of the genetic variations were phylogenetically specific, such as TbD1 present only in Lineage 1, which harbored two ORFs mmpS6 and mmpL6, and RvD2 absent in Lineage 2, Lineage 3 and H37Rv which harbored four ORFs including pimC, suoX, 15952_hypothetical_protein and 15953_transmembrane_transp (Fig 2 and S1 table). In addition, we have identified one LSP which was lineage specific but not reported before and one LSP only recently reported.[7] A fragment of 3,138 bp (inserted between 1,414,558-1,415,891 in H37Rv) deleted in Lineage 2 and Lineage 4 strains harbored two ORFs. This fragment of insertion sequence to H37Rv could be fully matched to other MTBC genomes in NCBI database but not to the H37Rv genome, possibly another LSP yet to be reported. The another LSP of 4,494 bp (inserted to H37Rv at 2,219,418) was (partially) deleted in all Lineage 4 strains but presented in all other genotypes, which contained six ORFs and has been reported in a previous study.[7]

[Figure omitted. See PDF.]

128 shell genes in the pangenome with information of presence and absence at gene level and protein level in each sample. Detailed information about these 128 shell genes could be found in S1 table. Low quality genes and non-Rv genes were not shown. The phylogeny tree on the left side was constructed in our previous study using 33,220 SNPs detected in the whole genome with reference to H37Rv using max likelihood method.[20] The genetic distance information was removed for better presentation of the topology of the tree. Bright red boxes represent genes completely (>90% of length) deleted in that genome; dark grey boxes represent genes presenting in the genome but were interrupted by SVs or high impact SNPs; blue boxes represent genes with valid ORFs; light grey represent genes with un-classified interruptions.

On the other hand, some LSPs were found in highly polymorphic regions where multiple overlapping SVs were identified in different genomes. For example, RD3 (Fig 2, from Rv1573 to Rv1586c), which harbored 15 genes (including Rv1575 which was interrupted by indels or complete deletions in all other genomes except H37Rv), was deleted in Lineage 2 strains; however, RD3 was also deleted in 21 Lineage 4 strains. In the region where RD152 (Fig 2 from Rv1754c to Rv1762c) was located, multiple deletions were observed in Lineage 2 and Lineage 4 strains but with different borders: seven genes (including cut1 of low quality) were absent in Lineage 2 strains due to the deletion from 1,986,638–1,998,622; five genes (including cut1 of low quality) were deleted in 21 Lineage 4 strains due to the deletion from 1,987,702–1,998,657.

High impact SNPs or other types of SVs could also result in interruptions of genes in specific lineages. For example, Rv0197 was interrupted by an insertion of 2 bp at position 234,496 in Lineage 2 strains but also in all other strains except H37Rv; ephF was interrupted by the deletion of one nucleotide at site 162,152 in Lineage 2 strains; Rv0061c was interrupted by a point mutation from G to T at position 65,150 which resulted in the gain of an extra stop codon in Lineage 2 strains. 15427_dxs was interrupted by IS6110–15 insertion in H37Rv; in our dataset, no other genomes had this IS6110 insertion but at least 351 genomes had an insertion of 5 bp in this gene, so only 66 genomes (all Lineage 4) had valid ORFs of this gene in our dataset.

Landscape of genetic diversity shaped by selective pressures, population structure and purifying selection

Many factors could shape the landscape of genetic diversity in MTB, such as selective pressure, population structure and purifying selection. To investigate how genes evolve under selective pressures, we calculated Tajima’s D, nucleotide diversity π and detected homoplasy events in coding genes. The 253 coding genes that might have inferior sequencing quality and the 85 non-Rv genes that were prone to errors with short reads sequencing were excluded from this analysis, thus 3,845 genes were analyzed out of the total 4,183 coding genes.

Nucleotide diversity (π) measures the average number of nucleotide differences per site between any two DNA sequences chosen randomly from the population.[54] It’s a measure of genetic variation within a population. Nucleotide diversity π was calculated for 3,818 coding genes with more than one CDSs of valid ORF which varied between 0 to 7.09e-3. The average nucleotide diversity was 1.51e-4 in this pangenome. Stratified by the value of Tajima’s D, the values of π distributed almost symmetrically around the average. With reference to the average nucleotide diversity, there were 957 (25.07%) genes of higher nucleotide diversity and 2,861 (74.93%) genes of lower nucleotide diversity, indicating that the nucleotide diversity was concentrated in a small proportion of coding genes (Fig 3).

[Figure omitted. See PDF.]

= 3,742). Only high quality Rv genes with > 3 CDSs of valid ORFs and at least one segregating nucleotide site are shown. Each circle represents a gene. x axis is Tajima’s D, and y axis is nucleotide diversity π. Count of homoplasy events are shown by the size of the circles. The colors show the different levels of significance of Tajima’s D. In the legend, the first number in parenthesis is the number of genes at that level of significance for Tajima’s D, and the second number shows the corresponding number of counts of homoplasy events as shown by corresponding circle size. The red line shows the average nucleotide diversity across the pangenome. The blue line shows the predicted π by linear regression model for genes above the average nucleotide diversity, and the purple line shows the predicted π below the average nucleotide diversity.

Tajima’s D is a statistical test used to compare the number of segregating sites to the average number of pairwise differences in a DNA sequence for a given sample set. A significantly negative value of Tajima’s D indicates recent selective sweep, population expansion after a recent bottleneck, or linkage to a swept gene; a significant positive value of Tajima’s D indicates balancing selection or sudden population contraction; and a zero value of Tajima’s D indicates no selection taking place.[44] Tajima’s D was obtained for 3,742 coding genes with > 3 valid ORFs and at least one segregating nucleotide site, 2,349 of which have shown significant signal of selective pressure (adjusted P-value < 0.05) and 192 showed strong signal of selective pressure (adjusted P-value < 0.001) (Fig 3). In our previous study [20], we have demonstrated that the same sample set in this study had experienced a population contraction recently. Thus, the observed significant negative value of Tajima’s D suggests recent selective sweeps or directional evolution, and the existence of selective pressures rather than population expansion. Most of the drug resistance related genes including gyrA, priA, rpoB, rpoC, katG, pncA, and embB, showed signals of selective pressure, except rpsL, rplC and ddn. While rplC and ddn were conserved in sequences (π < 1e-5), rpsL showed high nucleotide diversity (π = 0.001356). The genetic variation in rpsL in our dataset was constrained to four nucleotide loci and mostly in codon 43 (72.92% of K, 0.24% of M, and 26.37% of R). On the contrary, in other drug resistance related genes under selective pressures, such as in katG, many low frequency alleles were observed (Fig 4). There were 71 genes with > 3 CDSs of valid ORFs having no segregating sites thus no genetic variation detected, 40 of which were core genes, suggesting these core genes may carry critical functions which bear no mutation at all.

[Figure omitted. See PDF.]

(b).

Homoplasy occurs when a trait or nucleotide sequence is shared by two or more taxa due to convergent evolution, parallel evolution, or evolutionary reversals, rather than shared ancestry.[55] Because MTB has a very stable genome and no horizontal gene transfer, the possibility of parallel evolution and evolutionary reversals are estimated to be very low. In this case, homoplasy can be used as an index of positive selection (convergent evolution) due to survival advantages. Only 1,745 (19.91%) coding genes show signals of positive selection detected as homoplasy. We have identified 6,695 homoplasy events including 3,159 SVs/high impact SNPs homoplasy events and 3,536 homoplasy events with non-synonymous SNPs. There were 129 genes showing higher level of homoplasy (>= 9 homoplasy events). The most significant non-synonymous homoplasy SNPs in our dataset was detected in codon 315 in katG, which was estimated to have mutated 92 times (Fig 5). Interruptions of esxR were observed 82 times due to different deletions and insertions, which was the most frequently interrupted gene by SVs (Fig 2.)

[Figure omitted. See PDF.]

Phylogeny tree shows 404 genomes with valid katG ORFs. The phylogeny was modified from the same one used in Fig 2 but with genomes of in valid katG ORFs removed. Nucleotide site 947 corresponds to 944 in H37Rv due to a 3 bp insertion in some genomes upstream of this position. Strips show the sites with homoplasy with nucleotide in different colors: A = red, C = blue, G = grey, T = orange.

MTB has a typical clonal population structure with no horizontal gene transfer which introduces systematic genetic variations between sub-populations known as population stratification.[56] Positive Fst values are positively correlated with the level of population stratification. In total, 1,267 coding genes had significant positive values of Fst (P-value < 0.05) based on SNP information in coding genes varying between 0.0004 to 1, indicating these genes were subject to the impact of population stratification at different levels. In addition, complete deletions or interruptions of genes in one sub-group would result in lineage-specific genes in other sub-groups, which is another type of systematic difference between genotypes and could be considered as population stratification. We have detected at least 70 lineage-specific genes absent in Lineage 2 or Lineage 4 in this sample set. In Lineage 2 strains, 40 genes were absent due to LSPs, including 14 phage related genes located around the RvD3 deletion and five CRISPR associated genes located in RD207.

Over-represented gene categories following different evolutionary patterns

As shown in Fig 2, the relationship between selective pressure and nucleotide diversity was not mono-linear but bilateral. For genes of higher nucleotide diversity than average, the nucleotide diversity was negatively correlated with the strength of selective pressures (or positively correlated with the value of Tajima’s D, correlation coefficient = 0.26); for genes with low nucleotide diversity, the nucleotide diversity was positively correlated with the strength of selective pressure (correlation coefficient = -0.33) (Fig 3). When the data was stratified by the level of selective pressure and nucleotide diversity, these trends were confirmed by the average nucleotide diversity in each sub-group. For example, in the group of genes with higher nucleotide diversity than average, the average nucleotide within each subgroup increased almost mono-directionally as the selective strength decreased; while in the group of genes with lower nucleotide diversity than average, the trend was the opposite. (Table 2)

[Figure omitted. See PDF.]

Under selective pressure (adjusted P-value for Tajima’s D < 0.05), 1,310 genes have shown signals of positive selection as indicated by homoplasy. The level of positive selection was positively correlated with the strength of selective pressures irrelevant to nucleotide diversity, as the average number of homoplasy events detected increased as the strength of selective pressures became stronger in both the high and low nucleotide diversity groups. (Table 2) For example, among the 192 genes under strong selective pressure (adjusted P-value < 0.001), 155 (80.73%) genes showed homoplasy, and the average number of homoplasy events was 15.29 and 4.42 for high and low nucleotide diversity groups, respectively; among the 1,393 gene showing no signal of selective pressures, only 406 (29.15%) genes showed homoplasy, and the average number of homoplasy events was 1.09 and 0.69 for high and low nucleotide diversity groups, respectively. There were 35 genes showing high levels of homoplasy under strong selective pressure, suggesting on-going intensive adaptive evolution in these genes.

In the low nucleotide diversity group, when no signal of selective pressure was detected, 988 genes showed the lowest nucleotide diversity (average π = 5.07e-5), the lowest level of homoplasy (0.69 homoplasy events on average), and the lowest level of fixation between subpopulations (average Fst = 0.10), indicating extreme evolutionary conservation of these genes in MTB which were assumed to be subject to purifying selection. (Table 2) There were 500 core genes (at protein level) among these highly conserved genes with no positive selection or selective pressures signal detected. As mentioned above, there were 40 core genes showing no genetic variation which could not be calculated for Tajima’s D and are not included in this table. Together this makes a total of 540 conserved core genes.

In the high nucleotide diversity group, 405 genes showing no signal of selective pressure had the highest nucleotide diversity (average π = 5.14e-4) and also the highest level of fixation between subpopulations (average Fst = 0.56) and a moderate level of homoplasy (1.09 homoplasy events on average), indicating that population stratification was the major contributor to the nucleotide diversity for this group and a relatively low level of conservativeness in evolution for these genes. (Table 2)

A gene’s characteristics in the population genetics and presence in the pangenome can provide insights into the role of a gene in evolution. Based on the above described characteristics, we divided coding genes into five groups: 1) genes under selective pressures (n = 2,349) including the 192 genes under strong selective pressures and the 35 genes showing high levels of homoplasy under strong selective pressures; 2) the highly conserved core genes (n = 540) with no homoplasy or selective pressures detected; 3) genes showing no signal of selective pressures but with homoplasy detected (n = 435); 4) genes only influenced by population stratification (n = 202) showing neither signal of selective pressures nor positive selection, including lineage-specific genes; and 5) the other genes (n = 319) in the pangenome which showed no signal of selective pressures, no homoplasy, and no population stratification. We further investigated the over-represented gene categories in some characteristic group/sub-groups. (Table 3)

[Figure omitted. See PDF.]

Critical biological processes, such as regulation of growth (GO:0040008), gene expression (GO:0010467), and translation (GO:0006412) were over-represented in the group of conserved core genes. Genes encoding cellular components of organelles such as ribosome (GO:0005840), and genes with toxin sequestering activity (GO:0097351) were also over-represented in this group.

Genes involved in actinobacterium-type cell wall biogenesis (GO:0071766) were over-represented in the group of genes under strong selective pressures. Especially, genes of fatty acid synthase activity (GO:0004312) and of phosphopantetheine binding (GO:0031177) activity were over-represented in this group also potentially related to cell wall biogenesis (Fig 6a).

[Figure omitted. See PDF.]

a) Enriched ontologies in the group of genes under high selective pressures. Only genes enriched in actinobacterium-type cell wall biogenesis (GO:0071766) and related ontologies were shown. b) Enriched ontologies in the group of genes showing high homoplasy levels. Only genes enriched in disruption of anatomical structure in another organism (GO:0141060) and related ontologies were shown. Enriched ontology nodes were in red with gene nodes in grey.

Disruption of host anatomical structure (GO:0141060) was over-represented by five genes in the group of high-level homoplasy genes, four of which had phospholipase activity (GO:0004620) including all three proteins of sphingomyelin phosphodiesterase activity (GO:0004767) reported so far (Fig 6b).

In the group of genes under strong selective pressures and showing high homoplasy levels, fatty acid synthase activity (GO:0004312) was over-represented by five genes in this group; phosphopantetheine binding (GO:0031177) activity was over-represented by six genes in this group as well. Besides, drug resistance related genes were gathered in this group including rpoB (421 presences in the pangenome), rpoC (407), katG (406), embB (421), and gyrA (420).

Discussion

Our study has shown that MTB has a highly conserved pangenome with only 85 new genes not presented in the reference genome H37Rv in a diversified sample set. The pangenome size of 4,278 estimnated here is very close to the value of 4,252 reported in a recent study by Behruznia et al.,.[57] This is expected as MTB has “conserved” genomes and a clonal population structure. By applying further filtering steps, the number of accessory genes was reduced almost by two thirds of the automatically predicted number. The inflated accessory gene number is mostly caused by alternative ORFs predicted by annotation software and platform-wide and workflow specific bias.[58,59] This result emphasizes the importance of establishing a standardized and optimized analytical process for pangenome analysis.

The size of pangenome of MTB is only one-fifth of E. coli, but the proportion of core genes is three times of that in E. coli at protein level and eight times at gene level.[60] This small pangenome size is presumably a result of the obligate intracellular lifestyle.[61] The average nucleotide diversity for coding genes is at least one order lower than that in other organisms.[62,63] These findings agree with and further confirm the notion that MTB is evolutionarily conserved as reported by previous studies using a standard reference genome or focusing on a specific region.[64–66]

Due to interruptions of ORFs caused by various types of genetic variation, about half of the core coding genes were potentially inactivated in different genomes. Structural variation accounts for more than half of the observed interruptions in coding genes while high impact SNPs only correspond to about ten percent, suggesting SVs are a major source of genetic diversity and is the major contributor to adaptive loss in evolution.[67] Although the number of unique SVs detected is only one fourth of the number of SNPs (7,334 v.s. 33,220) in the same sample set, the impact of SVs is more fundamental than SNPs, as SNPs in inactivated genes are assumed to have no impact on the biological functions. This bias may influence the analysis pipeline relying on SNP information, such as genome wide association analysis, which may require refinement of these methodologies.

Though genetic variations in LSPs are relatively limited in MTB, we have identified two genes, pimC and suoX, which are deleted due to the LSP of RvD2, may have crucial impact on metabolism. PimC is an amannosyltransferase involved in cell wall biosynthesis and some of its products are important virulence factors in MTB.[68,69] Searching literature, we have not found other homologues gene in MTB so far, we assume there is an alternative pathway involved in the biosynthesis of Ac_nPIM₃ in Lineage 2 which has not yet been discovered. SuoX oxidizes sulfite to sulfate directly in sulfur metabolism. Due to its nucleophilicity and strong reductive capacity, sulfite can be toxic for bacteria but can be detoxified by oxidation to sulfate.[70] Direct oxidation of sulfite seems to be the exclusive way to detoxify sulfite via oxidation, yet the gene suoX encoding sulfite oxidase in MTB is deleted in Lineage 2 strains and no other sulfite oxidase has been identified in MTB so far, which means the direct oxidation of sulfite in Lineage 2 may be compromised.[70] In the reverse direction, the assimilation pathway of sulfate, sulfite could be further reduced to sulfide, which is the form of sulfur required for biosynthesis of sulfur containing metabolites such as cysteine and acetate.[71] Sulfate assimilation is important for bacteria persistence. For example, H₂S has been shown to stimulate respiration, growth and pathogenesis of MTB in vivo.[72] MTB can produce endogenous H₂S from cysteine by a desulfhydrase (probably Rv3684).[73] The absence of suoX in Lineage 2 might contribute to the higher virulence of Beijing strains observed, as compromised oxidation capacity would increase the concentration of sulfite, which in turn moves the balance towards the direction of assimilation and consequently increases the production of H₂S. Besides pimC and suoX, the five CRISPR associated genes in the RD207 region deleted in Lineage 2 strains are reported to be associated with drug resistance.[74] These observations indicate that although LSPs only have a limited contribution to the difference in genetic content, some of these differences may have important consequences that their absences in certain genotype could result in differentiation in metabolism that might have contributed to the prevalence of certain genotypes.

Our data has shown that the intensity of positive selection is positively correlated to the strength of selective pressure in spite of other factors, indicating though it may be slow, the conservative MTB genome is still evolving. One important evolutionary force might be the interaction between MTB and the host immune system, as genes involved in disruption of host structures are over-represented in the group with a high level of homoplasy, indicating these selected mutations might help survive host immune attacks. Especially, the metabolism of fatty acids may have played a key role as indicated by the over-representation of genes with phospholipase activity and sphingomyelin phosphodiesterase activity. For example, Rv0888, an extracellular bifunctional enzyme with both nuclease and sphingomyelinase activities, can enhance the colonization ability of Mycobacterium smegmatis in the lungs of mice.[75] Deletion of plcC, the phospholipase C in MTB, was related to a persistent tuberculosis outbreak in the UK.[76] plcC was reported to be a hotspot of deletions in another study as well.[59] Further investigation of these genes may deepen our understanding of the pathogenicity of MTB.

A few exceptions were observed for the correlation between selective pressure and positive selection. One example is the rpsL gene, which shows a high level of homoplasy (104 homoplasy events) indicating positive selection of advantageous mutations, but no signal of selective pressure. Mutations in codon 43 in rpsL confer a high-level resistance to streptomycin. Streptomycin is the first anti-TB drug used in monotherapy of tuberculosis but has been replaced by oral drugs for almost two decades now.[77] The disappearance of signal of selective pressure reflects the changes in the standard treatment regime. Though this historical selective pressure on rpsL no longer exists the signal of positive selection could still be detected in our data. Another factor that might have contributed to the persistent signal of positive selection is fitness cost. Mutations in other drug resistance related genes, such as rpoB, usually cause different levels of fitness cost.[78] While strains with mutations in rpsL conferring streptomycin resistance are not affected by such deficiency.[79] Epistatis may also play a role in the high frequency of rpsL mutantions as shown in a study with E.coli which reported higher fitness in some double mutants of rpoB/rpsL than wild genotypes.[80] Also in our previous analysis using the same sample set, the association between drug resistance patterns and genotypes suggests that certain genotypes which developed resistance to streptomycin seem not to show any deficiency in further developing rifampin resistance.[81] However, this type of highly advantageous mutation without fitness cost in genes is very rare, as we only observed 32 genes in this category with no signal of selective pressures and a high level of homoplasy. Further checking the function of these genes might be helpful to identify historical selective pressures in the evolution of MTB. These observations suggest that the current landscape of genetic diversity in MTB is the outcome of complicated interactions between changing selective pressures, purifying selection, clonal population structure, and other factors such as fitness cost of mutations.

Our study has shown the necessity and advantages to switch from an approach using a standard reference genome focusing on SNPs and small indels to a more comprehensive pangenome approach with detailed genetic variation information. By combining pangenome, population genetics and gene ontology, new insights into the biology and the interaction of MTB with its environment may be gained. Future studies with more advanced sequencing techniques like long read sequencing and more versatile bioinformatic tools such as graph pangenome will remove the drawbacks in current methodology and improve our understanding about the epidemiology and pathogenicity of MTB.[82].

Supporting information

S1 Fig. Pangenome pipeline.

https://doi.org/10.1371/journal.pone.0324152.s001

(TIF)

S1 Table. Information of 128 shell genes in Fig 2.

https://doi.org/10.1371/journal.pone.0324152.s002

(XLSX)

References

1. 1. World Health Organization. Global tuberculosis report 2023. 2023 [cited 20 Nov 2023]. Available: https://www.who.int/publications-detail-redirect/9789240083851

* View Article

* Google Scholar

2. 2. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998;393(6685):537–44. pmid:9634230

* View Article

* PubMed/NCBI

* Google Scholar

3. 3. Coll F, McNerney R, Guerra-Assunção JA, Glynn JR, Perdigão J, Viveiros M, et al. A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nat Commun. 2014;5:4812. pmid:25176035

* View Article

* PubMed/NCBI

* Google Scholar

4. 4. Napier G, Campino S, Merid Y, Abebe M, Woldeamanuel Y, Aseffa A, et al. Robust barcoding and identification of Mycobacterium tuberculosis lineages for epidemiological and clinical studies. Genome Med. 2020;12(1):114. pmid:33317631

* View Article

* PubMed/NCBI

* Google Scholar

5. 5. Walker TM, Kohl TA, Omar SV, Hedge J, Del Ojo Elias C, Bradley P, et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect Dis. 2015;15(10):1193–202. pmid:26116186

* View Article

* PubMed/NCBI

* Google Scholar

6. 6. Walker TM, Ip CLC, Harrell RH, Evans JT, Kapatai G, Dedicoat MJ, et al. Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. Lancet Infect Dis. 2013;13(2):137–46. pmid:23158499

* View Article

* PubMed/NCBI

* Google Scholar

7. 7. Liu Z, Jiang Z, Wu W, Xu X, Ma Y, Guo X, et al. Identification of region of difference and H37Rv-related deletion in Mycobacterium tuberculosis complex by structural variant detection and genome assembly. Front Microbiol. 2022;13:984582. pmid:36160240

* View Article

* PubMed/NCBI

* Google Scholar

8. 8. Zeng X, Kwok JS-L, Yang KY, Leung KS-S, Shi M, Yang Z, et al. Whole genome sequencing data of 1110 Mycobacterium tuberculosis isolates identifies insertions and deletions associated with drug resistance. BMC Genomics. 2018;19(1):365. pmid:29769016

* View Article

* PubMed/NCBI

* Google Scholar

9. 9. Ma R, Farrell D, Gonzalez G, Browne JA, Nakajima C, Suzuki Y, et al. The TbD1 locus mediates a hypoxia-induced copper response in Mycobacterium bovis. Front Microbiol. 2022;13:817952. pmid:35495699

* View Article

* PubMed/NCBI

* Google Scholar

10. 10. Bottai D, Frigui W, Sayes F, Di Luca M, Spadoni D, Pawlik A, et al. TbD1 deletion as a driver of the evolutionary success of modern epidemic Mycobacterium tuberculosis lineages. Nat Commun. 2020;11(1):684. pmid:32019932

* View Article

* PubMed/NCBI

* Google Scholar

11. 11. Lewis KN, Liao R, Guinn KM, Hickey MJ, Smith S, Behr MA, et al. Deletion of RD1 from Mycobacterium tuberculosis mimics bacille Calmette-Guérin attenuation. J Infect Dis. 2003;187(1):117–23. pmid:12508154

* View Article

* PubMed/NCBI

* Google Scholar

12. 12. Zondervan NA, van Dam JCJ, Schaap PJ, Martins Dos Santos VAP, Suarez-Diez M. Regulation of three virulence strategies of Mycobacterium tuberculosis: A success story. Int J Mol Sci. 2018;19(2):347. pmid:29364195

* View Article

* PubMed/NCBI

* Google Scholar

13. 13. Mahairas GG, Sabo PJ, Hickey MJ, Singh DC, Stover CK. Molecular analysis of genetic differences between Mycobacterium bovis BCG and virulent M. bovis. J Bacteriol. 1996;178(5):1274–82. pmid:8631702

* View Article

* PubMed/NCBI

* Google Scholar

14. 14. Phelan JE, Coll F, Bergval I, Anthony RM, Warren R, Sampson SL, et al. Recombination in pe/ppe genes contributes to genetic variation in Mycobacterium tuberculosis lineages. BMC Genomics. 2016;17:151. pmid:26923687

* View Article

* PubMed/NCBI

* Google Scholar

15. 15. Periwal V, Scaria V. Insights into structural variations and genome rearrangements in prokaryotic genomes. Bioinformatics. 2015;31(1):1–9. pmid:25189783

* View Article

* PubMed/NCBI

* Google Scholar

16. 16. Medini D, Donati C, Rappuoli R, Tettelin H. The Pangenome: A Data-Driven Discovery in Biology. The Pangenome: Diversity, Dynamics and Evolution of Genomes. Springer International Publishing; 2020. pp. 3–20. Available: https://pubmed.ncbi.nlm.nih.gov/32633919//

17. 17. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci U S A. 2005;102(39):13950–5. pmid:16172379

* View Article

* PubMed/NCBI

* Google Scholar

18. 18. Köstlbacher S, Collingro A, Halter T, Schulz F, Jungbluth SP, Horn M. Pangenomics reveals alternative environmental lifestyles among chlamydiae. Nat Commun. 2021;12(1):4021. pmid:34188040

* View Article

* PubMed/NCBI

* Google Scholar

19. 19. Horesh G, Blackwell GA, Tonkin-Hill G, Corander J, Heinz E, Thomson NR. A comprehensive and high-quality collection of Escherichia coli genomes and their genes. Microb Genom. 2021;7(2):000499. pmid:33417534

* View Article

* PubMed/NCBI

* Google Scholar

20. 20. Zhou Y, Anthony R, Wang S, Ou X, Liu D, Zhao Y, et al. The epidemic of multidrug resistant tuberculosis in China in historical and phylogenetic perspectives. J Infect. 2020;80(4):444–53. pmid:31972213

* View Article

* PubMed/NCBI

* Google Scholar

21. 21. Zhao Y, Xu S, Wang L, Chin DP, Wang S, Jiang G, et al. National survey of drug-resistant tuberculosis in China. N Engl J Med. 2012;366(23):2161–70. pmid:22670902

* View Article

* PubMed/NCBI

* Google Scholar

22. 22. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: Molecular evolutionary genetics analysis across computing platforms. Battistuzzi FU, editor. Molecular Biology and Evolution. 2018;35: 1547–1549. . pmid:29722887

* View Article

* PubMed/NCBI

* Google Scholar

23. 23. Nurk S, Bankevich A, Antipov D, Gurevich A, Korobeynikov A, Lapidus A, et al. Assembling genomes and mini-metagenomes from highly chimeric reads. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer: Berlin, Heidelberg. 2013. pp. 158–170. https://doi.org/10.1007/978-3-642-37195-0_13

24. 24. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5. pmid:23422339

* View Article

* PubMed/NCBI

* Google Scholar

25. 25. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30(14):2068–9. pmid:24642063

* View Article

* PubMed/NCBI

* Google Scholar

26. 26. Contreras-Moreira B, Vinuesa P. GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl Environ Microbiol. 2013;79(24):7696–701. pmid:24096415

* View Article

* PubMed/NCBI

* Google Scholar

27. 27. Pallejà A, Harrington ED, Bork P. Large gene overlaps in prokaryotic genomes: Result of functional constraints or mispredictions?. BMC Genomics. 2008;9:335. pmid:18627618

* View Article

* PubMed/NCBI

* Google Scholar

28. 28. Biopython . [cited 11 Dec 2023]. Available: https://biopython.org//.

* View Article

* Google Scholar

29. 29. Nattestad M, Schatz MC. Assemblytics: A web analytics tool for the detection of variants from an assembly. Bioinformatics. 2016;32(19):3021–3. pmid:27318204

* View Article

* PubMed/NCBI

* Google Scholar

30. 30. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):giab008. pmid:33590861

* View Article

* PubMed/NCBI

* Google Scholar

31. 31. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, et al. BreakDancer: An algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009;6(9):677–81. pmid:19668202

* View Article

* PubMed/NCBI

* Google Scholar

32. 32. Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. DELLY: Structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28(18):i333–9. pmid:22962449

* View Article

* PubMed/NCBI

* Google Scholar

33. 33. Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: A probabilistic framework for structural variant discovery. Genome Biol. 2014;15(6):R84. pmid:24970577

* View Article

* PubMed/NCBI

* Google Scholar

34. 34. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100. pmid:29750242

* View Article

* PubMed/NCBI

* Google Scholar

35. 35. Wala JA, Bandopadhayay P, Greenwald NF, O’Rourke R, Sharpe T, Stewart C, et al. SvABA: Genome-wide detection of structural variants and indels by local assembly. Genome Res. 2018;28(4):581–91. pmid:29535149

* View Article

* PubMed/NCBI

* Google Scholar

36. 36. Bartenhagen C, Dugas M. Robust and exact structural variation detection with paired-end and soft-clipped alignments: SoftSV compared with eight algorithms. Brief Bioinform. 2016;17(1):51–62. pmid:25998133

* View Article

* PubMed/NCBI

* Google Scholar

37. 37. Eisfeldt J, Vezzi F, Olason P, Nilsson D, Lindstrand A. TIDDIT, an efficient and comprehensive structural variant caller for massive parallel sequencing data. F1000Res. 2017;6:664.

* View Article

* Google Scholar

38. 38. lh3/unimap: A EXPERIMENTAL fork of minimap2 optimized for assembly-to-reference alignment. [cited 11 Dec 2023]. Available: https://github.com/lh3/unimap

* View Article

* Google Scholar

39. 39. Kronenberg ZN, Osborne EJ, Cone KR, Kennedy BJ, Domyan ET, Shapiro MD, et al. Wham: Identifying Structural Variants of Biological Consequence. Prlic A, editor. PLOS Computational Biology. 2015;11: e1004572. pmid:26625158

* View Article

* PubMed/NCBI

* Google Scholar

40. 40. Sedlazeck F. fritzsedlazeck/SURVIVOR. 2023. Available: https://github.com/fritzsedlazeck/SURVIVOR

* View Article

* Google Scholar

41. 41. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80–92. pmid:22728672

* View Article

* PubMed/NCBI

* Google Scholar

42. 42. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24–6. pmid:21221095

* View Article

* PubMed/NCBI

* Google Scholar

43. 43. Hawkey J, Hamidian M, Wick RR, Edwards DJ, Billman-Jacobe H, Hall RM, et al. ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data. BMC Genomics. 2015;16(1):667. pmid:26336060

* View Article

* PubMed/NCBI

* Google Scholar

44. 44. Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123(3):585–95. pmid:2513255

* View Article

* PubMed/NCBI

* Google Scholar

45. 45. Nei M. Molecular evolutionary genetics. New York: Columbia University Press; 1987. Available: http://archive.org/details/molecularevoluti0000neim_u8c9

46. 46. Rozas J, Ferrer-Mata A, Sánchez-DelBarrio JC, Guirao-Rico S, Librado P, Ramos-Onsins SE, et al. DnaSP 6: DNA sequence polymorphism analysis of large data sets. Mol Biol Evol. 2017;34(12):3299–302. pmid:29029172

* View Article

* PubMed/NCBI

* Google Scholar

47. 47. Crispell J, Balaz D, Gordon SV. HomoplasyFinder: A simple tool to identify homoplasies on a phylogeny. Microb Genom. 2019;5(1):e000245. pmid:30663960

* View Article

* PubMed/NCBI

* Google Scholar

48. 48. Huerta-Cepas J, Serra F, Bork P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol Biol Evol. 2016;33(6):1635–8. pmid:26921390

* View Article

* PubMed/NCBI

* Google Scholar

49. 49. Goudet J, Jombart T, Kamvar ZN, Archer E, Hardy O. hierfstat: Estimation and Tests of Hierarchical F-Statistics. 2022. Available: https://CRAN.R-project.org/package=hierfstat

* View Article

* Google Scholar

50. 50. Nei M. Molecular Evolutionary Genetics. Columbia University Press. 1987. https://doi.org/10.7312/nei-92038 http://dx.doi.org/10.7312/nei-92038

51. 51. Xu S, Hu E, Cai Y, Xie Z, Luo X, Zhan L, et al. Using clusterProfiler to characterize multiomics data. Nat Protoc. 2024;19(11):3292–320. pmid:39019974

* View Article

* PubMed/NCBI

* Google Scholar

52. 52. AnnotationForge . In: Bioconductor [Internet]. [cited 23 Jan 2025]. Available: http://bioconductor.org/packages/AnnotationForge//.

* View Article

* Google Scholar

53. 53. Coll F, Preston M, Guerra-Assunção JA, Hill-Cawthorn G, Harris D, Perdigão J, et al. PolyTB: a genomic variation map for Mycobacterium tuberculosis. Tuberculosis (Edinb). 2014;94(3):346–54. pmid:24637013

* View Article

* PubMed/NCBI

* Google Scholar

54. 54. Nei M, Li WH. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci U S A. 1979;76(10):5269–73. pmid:291943

* View Article

* PubMed/NCBI

* Google Scholar

55. 55. Wake DB, Wake MH, Specht CD. Homoplasy: from detecting pattern to determining process and mechanism of evolution. Science. 2011;331(6020):1032–5. pmid:21350170

* View Article

* PubMed/NCBI

* Google Scholar

56. 56. Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet. 2003;361(9357):598–604. pmid:12598158

* View Article

* PubMed/NCBI

* Google Scholar

57. 57. Behruznia M, Marin M, Farhat M, Thomas JC, Domingo-Sananes MR, Meehan CJ. The Mycobacterium tuberculosis complex pangenome is small and driven by sub-lineage-specific regions of difference. eLife. 2024;13.

* View Article

* Google Scholar

58. 58. Modlin SJ, Robinhold C, Morrissey C, Mitchell SN, Ramirez-Busby SM, Shmaya T, et al. Exact mapping of Illumina blind spots in the Mycobacterium tuberculosis genome reveals platform-wide and workflow-specific biases. Microb Genom. 2021;7(3):mgen000465. pmid:33502304

* View Article

* PubMed/NCBI

* Google Scholar

59. 59. Marin MG, Wippel C, Quinones-Olvera N, Behruznia M, Jeffrey BM, Harris M, et al. Analysis of the limited M. tuberculosis accessory genome reveals potential pitfalls of pan-genome analysis approaches. bioRxiv. 2024;:2024.03.21.586149. pmid:38585972

* View Article

* PubMed/NCBI

* Google Scholar

60. 60. Tantoso E, Eisenhaber B, Kirsch M, Shitov V, Zhao Z, Eisenhaber F. To kill or to be killed: pangenome analysis of Escherichia coli strains reveals a tailocin specific for pandemic ST131. BMC Biol. 2022;20(1):146. pmid:35710371

* View Article

* PubMed/NCBI

* Google Scholar

61. 61. Bobay L-M, Ochman H. Factors driving effective population size and pan-genome evolution in bacteria. BMC Evol Biol. 2018;18(1):153. pmid:30314447

* View Article

* PubMed/NCBI

* Google Scholar

62. 62. Dutoit L, Burri R, Nater A, Mugal CF, Ellegren H. Genomic distribution and estimation of nucleotide diversity in natural populations: perspectives from the collared flycatcher (Ficedula albicollis) genome. Mol Ecol Resour. 2017;17(4):586–97. pmid:27717155

* View Article

* PubMed/NCBI

* Google Scholar

63. 63. Ali F. Patterns of change in nucleotide diversity over gene length. Genome Biol Evol. 2024;16(4):evae078. pmid:38608148

* View Article

* PubMed/NCBI

* Google Scholar

64. 64. Comas I, Chakravartti J, Small PM, Galagan J, Niemann S, Kremer K, et al. Human T cell epitopes of Mycobacterium tuberculosis are evolutionarily hyperconserved. Nat Genet. 2010;42(6):498–503. pmid:20495566

* View Article

* PubMed/NCBI

* Google Scholar

65. 65. Frothingham R, Hills HG, Wilson KH. Extensive DNA sequence conservation throughout the Mycobacterium tuberculosis complex. J Clin Microbiol. 1994;32(7):1639–43. pmid:7929750

* View Article

* PubMed/NCBI

* Google Scholar

66. 66. Jia X, Yang L, Dong M, Chen S, Lv L, Cao D, et al. The bioinformatics analysis of comparative genomics of Mycobacterium tuberculosis complex (MTBC) provides insight into dissimilarities between intraspecific groups differing in host association, virulence, and epitope diversity. Front Cell Infect Microbiol. 2017;7:88. pmid:28377903

* View Article

* PubMed/NCBI

* Google Scholar

67. 67. Gatt YE, Margalit H. Common adaptive strategies underlie within-host evolution of bacterial pathogens. Mol Biol Evol. 2021;38(3):1101–21. pmid:33118035

* View Article

* PubMed/NCBI

* Google Scholar

68. 68. Bansal-Mutalik R, Nikaido H. Mycobacterial outer membrane is a lipid bilayer and the inner membrane is unusually rich in diacyl phosphatidylinositol dimannosides. Proc Natl Acad Sci U S A. 2014;111(13):4958–63. pmid:24639491

* View Article

* PubMed/NCBI

* Google Scholar

69. 69. Pirson C, Engel R, Jones GJ, Holder T, Holst O, Vordermeier HM. Highly purified mycobacterial phosphatidylinositol mannosides drive cell-mediated responses and activate NKT cells in cattle. Clin Vaccine Immunol. 2015;22(2):178–84. pmid:25499010

* View Article

* PubMed/NCBI

* Google Scholar

70. 70. Kappler U, Dahl C. Enzymology and molecular biology of prokaryotic sulfite oxidation. FEMS Microbiol Lett. 2001;203(1):1–9. pmid:11557133

* View Article

* PubMed/NCBI

* Google Scholar

71. 71. Pinto R, Harrison JS, Hsu T, Jacobs WR Jr, Leyh TS. Sulfite reduction in mycobacteria. J Bacteriol. 2007;189(18):6714–22. pmid:17644602

* View Article

* PubMed/NCBI

* Google Scholar

72. 72. Saini V, Chinta KC, Reddy VP, Glasgow JN, Stein A, Lamprecht DA, et al. Hydrogen sulfide stimulates Mycobacterium tuberculosis respiration, growth and pathogenesis. Nat Commun. 2020;11(1):557. pmid:31992699

* View Article

* PubMed/NCBI

* Google Scholar

73. 73. Kunota TTR, Rahman MA, Truebody BE, Mackenzie JS, Saini V, Lamprecht DA, et al. Mycobacterium tuberculosis H2S functions as a sink to modulate central metabolism, bioenergetics, and drug susceptibility. Antioxidants (Basel). 2021;10(8):1285. pmid:34439535

* View Article

* PubMed/NCBI

* Google Scholar

74. 74. Bhalla N, Nanda RK. Pangenome-wide association study reveals the selective absence of CRISPR genes (Rv2816c-19c) in drug-resistant Mycobacterium tuberculosis. Microbiol Spectr. 2024;12(8):e0052724. pmid:38916315

* View Article

* PubMed/NCBI

* Google Scholar

75. 75. Dang G, Cui Y, Wang L, Li T, Cui Z, Song N, et al. Extracellular sphingomyelinase Rv0888 of Mycobacterium tuberculosis contributes to pathological lung injury of Mycobacterium smegmatis in mice via inducing formation of neutrophil extracellular traps. Front Immunol. 2018;9:677. pmid:29670633

* View Article

* PubMed/NCBI

* Google Scholar

76. 76. Farzand R, Haigh RD, Monk P, Haldar P, Patel H, Pareek M, et al. A persistent tuberculosis outbreak in the UK is characterized by hydrophobic fadB4-deficient Mycobacterium tuberculosis that replicates rapidly in macrophages. mBio. 2022;13(6):e0265622. pmid:36374090

* View Article

* PubMed/NCBI

* Google Scholar

77. 77. MedicalResearchCouncil A. Streptomycin treatment of pulmonary tuberculosis. Br Med J. 1948;2:769–82.

* View Article

* Google Scholar

78. 78. Meftahi N, Namouchi A, Mhenni B, Brandis G, Hughes D, Mardassi H. Evidence for the critical role of a secondary site rpoB mutation in the compensatory evolution and successful transmission of an MDR tuberculosis outbreak strain. J Antimicrob Chemother. 2016;71(2):324–32. pmid:26538504

* View Article

* PubMed/NCBI

* Google Scholar

79. 79. Spies FS, von Groll A, Ribeiro AW, Ramos DF, Ribeiro MO, Dalla Costa ER, et al. Biological cost in Mycobacterium tuberculosis with mutations in the rpsL, rrs, rpoB, and katG genes. Tuberculosis (Edinb). 2013;93(2):150–4. pmid:23276692

* View Article

* PubMed/NCBI

* Google Scholar

80. 80. Durão P, Trindade S, Sousa A, Gordo I. Multiple resistance at no cost: Rifampicin and streptomycin a dangerous liaison in the spread of antibiotic resistance. Mol Biol Evol. 2015;32(10):2675–80. pmid:26130082

* View Article

* PubMed/NCBI

* Google Scholar

81. 81. Zhou Y, van den Hof S, Wang S, Pang Y, Zhao B, Xia H, et al. Association between genotype and drug resistance profiles of Mycobacterium tuberculosis strains circulating in China in a national drug resistance survey. PLoS One. 2017;12(3):e0174197. pmid:28333978

* View Article

* PubMed/NCBI

* Google Scholar

82. 82. Li H, Wang S, Chai S, Yang Z, Zhang Q, Xin H, et al. Graph-based pan-genome reveals structural and sequence variations related to agronomic traits and domestication in cucumber. Nat Commun. 2022;13(1):682. pmid:35115520

* View Article

* PubMed/NCBI

* Google Scholar

Citation: Zhou Y, Anthony R, Wang S, Xia H, Ou X, Zhao B, et al. (2025) Understanding the epidemiology and pathogenesis of Mycobacterium tuberculosis with non-redundant pangenome of epidemic strains in China. PLoS One 20(5): e0324152. https://doi.org/10.1371/journal.pone.0324152

About the Authors:

Yang Zhou

Roles: Conceptualization, Formal analysis, Methodology, Visualization, Writing – original draft

Affiliations: National Center for Tuberculosis Control and Prevention, Chinese Center for Disease Control and Prevention, Changping District, Beijing, China, Radboudumc Research Institute, Radboud University, Houtlaan XZ, Nijmegen, The Netherlands

ORICD: https://orcid.org/0000-0002-8522-2141

Richard Anthony

Roles: Conceptualization, Writing – review & editing

Affiliation: National Tuberculosis Reference Laboratory, Centre for Infectious Disease Control, National Institute for Public Health and the Environment (RIVM), Bilthoven, The Netherlands.

Shengfen Wang

Roles: Data curation