Introduction
Alleles do not affect fitness and other phenotypic traits independently and, instead, engage in epistatic interactions (de Visser et al., 2011; de Visser and Krug, 2014; Gillespie, 1994; Good and Desai, 2015; Kryazhimskiy et al., 2011; Maynard Smith, 1970; McCandlish et al., 2013; Povolotskaya and Kondrashov, 2010). Epistasis is pervasive at the scale of between-species differences, where it is saliently manifested by Dobzhansky-Muller incompatibilities and results in low fitness of interspecific hybrids (Callahan et al., 2011; Corbett-Detig et al., 2013; Dobzhansky, 1936; Kondrashov et al., 2002; Orr, 1995; Taverner et al., 2020). By contrast, at the scale of within-population variation, the importance of epistasis remains controversial (Crow, 2010; Hill et al., 2008; Hivert et al., 2021). This may look like a paradox, because such variation provides an opportunity to detect epistasis through linkage disequilibrium (LD), non-random associations between alleles at different loci (Beissinger et al., 2016; Boyrie et al., 2021; Garcia and Lohmueller, 2021; Wang et al., 2012; Zan et al., 2018). In the case of positive epistasis, a situation when a combination of alleles confers higher fitness than that expected from selection acting on these alleles individually, it can maintain favorable coadapted combinations of alleles at interacting sites, increasing linkage disequilibrium (LD) between them (Barton, 2010; Boyrie et al., 2021; Kouyos et al., 2007; Pedruzzi et al., 2018; Takahasi and Tajima, 2005). In sexual populations, recombination competes with epistasis, disrupting such coupling LD (Neher and Shraiman, 2009; Pedruzzi et al., 2018). Nevertheless, within a single gene, physical proximity alone may suffice to limit recombination, so sets of coadapted variants may evolve (Dobzhansky, 1950; Lewontin and Kojima, 1960). 
Such positive within-gene epistasis has been proposed to affect variation in natural populations (Arnold et al., 2020; Ragsdale, 2021), but conditions for this are expected to be restrictive (Hansen, 2013; Mäki-Tanila and Hill, 2014; Sackton and Hartl, 2016).
Perhaps the fitness landscape is complex macroscopically but smoother microscopically or, in other words, epistasis is genuinely more pronounced at a macroscopic scale (Ochs and Desai, 2015). If so, studying epistasis in hyperpolymorphic populations, where differences between genotypes can be as high as those between genomes of species from different genera or even families, holds great promise, because variation within such a population can cover multiple fitness peaks or a sizeable chunk of a curved ridge of high fitness (Bateson, 1909; Dobzhansky, 1937; Gavrilets, 1997; Kondrashov et al., 2002; Muller, 1942; Appendix 1).
Results
Elevated LD between nonsynonymous polymorphisms
In a vast majority of species, nucleotide diversity π, the evolutionary distance between a pair of randomly chosen genotypes, is, at selectively neutral sites, of the order of 0.001 (as in
In both
Figure 1.
The efficiency of epistatic selection in populations with different levels of genetic diversity.
(A–C) LD in natural populations for SNPs with MAF >0.05. (A) USA population of
Figure 1—figure supplement 1.
The efficiency of epistatic selection in populations with different levels of genetic diversity.
LD between nonsynonymous SNPs is shown in orange, and LD between synonymous SNPs is shown in blue. (A) Russian population of
Figure 1—figure supplement 2.
Linkage disequilibrium within and between exons in
LD between nonsynonymous SNPs is shown in orange, and LD between synonymous SNPs is shown in blue. Solid lines indicate LD between pairs of SNPs located within the same exon of the gene; dashed lines correspond to pairs of SNPs located in different exons of the gene. (A) USA population of
Figure 1—figure supplement 3.
Comparison of LDnonsyn and LDsyn in
For each possible minor allele count and nucleotide distance, the number of corresponding pairs of nonsynonymous variants and LDnonsyn between them are calculated. Then, the same number of pairs of synonymous variants at the same nucleotide distance and with the same minor allele count is randomly chosen to calculate LDsyn. Subsampling is performed 100 times. Filled areas show 95% intervals of LDsyn in the subsamples. (A) All SNPs, (B) SNPs with MAF >0.05.
Figure 1—figure supplement 4.
LD between SNPs with different MAF in
LD between nonsynonymous SNPs is shown in orange, and LD between synonymous SNPs is shown in blue. Filled areas indicate SE of LD calculated for each scaffold separately. (A, B) LD between all pairs of SNPs pooled together. Solid lines indicate LD between pairs of SNPs located within the same gene; dashed lines correspond to pairs of SNPs located in different genes. (C, D) Pairs of SNPs split by MAF.
Figure 1—figure supplement 5.
LD between SNPs with different MAF in
LD between nonsynonymous SNPs is shown in orange, and LD between synonymous SNPs is shown in blue. Filled areas indicate SE of LD calculated for each chromosome separately. (A) LD between all pairs of SNPs pooled together. Solid lines indicate LD between pairs of SNPs located within the same gene; dashed lines correspond to pairs of SNPs located in different genes. (B) Pairs of SNPs with MAF <0.05 (large scale). (C) Pairs of SNPs split by MAF.
Figure 1—figure supplement 6.
LD between SNPs with different MAF in
LD between nonsynonymous SNPs is shown in orange, and LD between synonymous SNPs is shown in blue. Filled areas indicate SE of LD calculated for each chromosome separately. (A) LD between all pairs of SNPs pooled together. Solid lines indicate LD between pairs of SNPs located within the same gene; dashed lines correspond to pairs of SNPs located in different genes. (B) Pairs of SNPs with MAF <0.05 (large scale). (C) Pairs of SNPs split by MAF.
A much weaker excess of LDnonsyn over LDsyn for MAF >0.05 is also observed in the less genetically diverse
The excess of LDnonsyn over LDsyn corresponds to the attraction between minor nonsynonymous alleles. This attraction can only appear due to positive epistasis between such alleles, that is, higher-than-expected fitness of their combinations (Appendix 2). Positive epistasis can be expected to cause stronger LD in more polymorphic populations (Figure 1D–F, Appendix 1) and must be more common for pairs of sites located within the same gene, which are more likely to interact with each other.
For rare SNPs with MAF <0.05 taken alone, LDnonsyn is similar to or lower than LDsyn in all three species, consistent with the effects of random drift, Hill-Robertson interference, and/or negative epistasis (Figure 1—figure supplements 4–6; Appendix 2). Decreased LD between negatively selected polymorphisms is expected due to Hill-Robertson interference between deleterious alleles (Hill and Robertson, 1966; Roze and Barton, 2006); this effect has been described previously for
Elevated LD between interacting sites
Natural selection acting on physically interacting amino acids that are located close to each other within the three-dimensional structure of a protein is characterized by strong epistasis which leads to their coevolution at the level of between-species differences (Marks et al., 2011; Ovchinnikov et al., 2014; Sjodt et al., 2018). Genome-wide elevated LD between amino acid sites within structural domains was recently demonstrated in human populations (Ragsdale, 2021). Extraordinary diversity of
To test this, we aligned
In both
Figure 2.
Excessive LD between physically interacting protein sites.
(A) Within pairs of SNPs that correspond to pairs of amino acids that are colocalized within 10 Å in the protein structure, the LD is elevated between nonsynonymous, but not between synonymous, variants. Dashed lines show the average LD between colocalized sites. Permutations were performed by randomly sampling pairs of non-interacting SNPs while controlling for genetic distance between them, measured in amino acids; pairs of SNPs closer than 5 aa were excluded. (B–D) Examples of proteins with LD patterns matching their three-dimensional structures. Heatmaps show the physical distance between pairs of sites in the protein structure; only positions carrying biallelic SNPs are shown. Black dots correspond to pairs of sites with high LD (>0.9 quantile for the gene). Dashed lines in the structure in (C) show high LD between physically close SNPs from different segments of high LD. In these examples, LD is calculated in the Russian population of
Figure 2—figure supplement 1.
Examples of proteins with LD patterns matching the three-dimensional structure in the RUS population of
Heatmaps show the physical distance between pairs of sites in the protein structure; only positions carrying biallelic SNPs are shown. Black dots correspond to pairs of sites with high LD (>0.9 quantile for the gene). Grey regions indicate the exon structure of the genes. (A) cog1523 (5Y1B:A); (B) cog2779 (1SXJ:B); (C) cog5375 (1RGI:G); (D) cog5725 (1TA3:B); (E) cog18092 (4QJY:A); (F) cog7878 (4TYW:A). LD statistics and p-values for each gene are listed in Appendix 3—table 1.
Figure 2—figure supplement 2.
Examples of proteins with LD patterns matching the three-dimensional structure in the USA population of
Heatmaps show the physical distance between pairs of sites in the protein structure; only positions carrying biallelic SNPs are shown. Black dots correspond to pairs of sites with high LD (>0.9 quantile for the gene). Grey regions indicate the exon structure of the genes. (A) cog1536 (6AHR:E); (B) cog5725 (1TA3:B); (C) cog8253 (6F87:A); (D) cog9241 (1YCD:A). LD statistics and p-values for each gene are listed in Appendix 3—table 1.
Moreover, it is possible to identify individual proteins with significant associations between the patterns of LD and of physical interactions between sites. At a 5% FDR, we found 22 such proteins in the USA population, and 87 proteins in the Russian population (Appendix 3—table 1); three examples are shown in Figure 2B–D (see also Figure 2—figure supplements 1 and 2). The alignment of ADAT2 protein contains two segments (teal and red in Figure 2B) characterized by high within-segment LD. The boundaries of these segments match those of structural units of the protein, but not the exon structure of its gene. In RadB protein, a similar pattern is observed, and LD is also elevated between pairs of SNPs from different segments on the interface of the corresponding structural units (Figure 2C). The alignment of 4CL protein can be naturally split into four high-LD segments, which also match its structure (Figure 2D).
Distinct regions of high LD
The magnitude of LD varies widely along the
In the USA population, 8.4% of the genome is occupied by 5,316 such haploblocks, 56% consist of regions with background LD level, and the rest cannot be analyzed due to poor alignment quality or low SNP density. Eighty-eight percent of the haploblocks are shorter than 1000 nucleotides, although the longest haploblocks spread for several thousand nucleotides (Figure 3—figure supplement 2). In the Russian population, there are 10,694 haploblocks, occupying 15.9% of the genome, and regions of background LD cover 39% of it. There is only a modest correlation between the USA and Russian haploblocks: the probability that a genomic position belongs to a haploblock in both populations is 2.3% instead of the expected 1.3%, indicating their relatively short persistence time in the populations (examples shown in Figure 3—figure supplement 1).
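The expected overlap quoted above can be checked with a short calculation. This sketch assumes that the expectation was obtained by treating haploblock membership in the two populations as independent, which the text does not state explicitly; the variable names are illustrative only.

```python
# Fractions of the genome occupied by haploblocks, as reported in the text.
usa_fraction = 0.084
rus_fraction = 0.159

# Assumption: under independence, the chance that a position lies in a
# haploblock in both populations is the product of the two fractions.
expected_overlap = usa_fraction * rus_fraction

# Observed probability reported in the text.
observed_overlap = 0.023

# Fold enrichment of the observed overlap over the independence expectation.
enrichment = observed_overlap / expected_overlap

print(f"expected overlap: {expected_overlap:.4f}")
print(f"fold enrichment:  {enrichment:.2f}")
```

Under this assumption the product is about 1.3%, in line with the reported expectation, and the observed overlap is less than twice as large.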
LD within a haploblock is usually so high that most genotypes can be attributed to one of just two distinct haplotypes, which carry different sets of alleles (Figure 3—figure supplement 3). This results in a bimodal distribution of the fraction of minor alleles in a genotype within a haploblock, because some genotypes belong to the major haplotype and, thus, carry only a small fraction of minor alleles, and other genotypes belong to the minor haplotype and, thus, possess a high fraction of minor alleles (Figure 3A). Polymorphic sites within haploblocks are characterized by higher MAF than that at sites that reside in non-haploblock regions (t-test p-value <2e-16), and in the USA population MAFs within a haploblock are positively correlated with its strength of LD (Figure 3B, Pearson correlation estimate = 0.07, p-value <2e-6).
Figure 3.
Patterns of linkage disequilibrium in the USA population of
(A) Distribution of the fraction of polymorphic sites that carry minor alleles in a genotype within haploblocks. Black line shows the distribution of fraction of minor alleles in genotypes in non-haploblock regions. (B) Distributions of the average MAF within a haploblock for haploblocks with different average values of LD. The average MAF in non-haploblock regions is shown as a horizontal black line for comparison. (C) LD between nonsynonymous and synonymous SNPs within individual genes. Linear regression of LDnonsyn on LDsyn is shown as the red line. To control for the gene length, only SNPs within 300 nucleotides from each other were analyzed. Genes with fewer than 100 such pairs of SNPs were excluded. (D,E) The positive correlation between pn/ps of the gene and its average LD (D) or the difference between LDnonsyn and LDsyn (E). Here, the data on the USA population of
Figure 3—figure supplement 1.
Examples of haploblocks in two populations of
The heatmaps show LD between polymorphic SNPs in the same genomic regions in the USA and RUS populations of
Figure 3—figure supplement 2.
Distribution of haploblock lengths (nt) in the two populations of
Figure 3—figure supplement 3.
Example of the
Region 3097200–3097500 of scaffold 4 in the USA population of
Figure 3—figure supplement 4.
Patterns of linkage disequilibrium in the RUS population of
(A) Bimodal distribution of the fraction of polymorphic sites carrying minor alleles per genome within the haploblocks. Each count corresponds to a genotype within a haploblock. Black line shows the background distribution of minor alleles in the non-haploblock regions. (B) The increased average minor allele frequency within haploblocks as compared to the non-haploblock regions (dashed line, t-test p-value <2e-16). (C) LD between nonsynonymous and synonymous SNPs within single genes. Each dot represents an individual gene. Linear regression of LDnonsyn on LDsyn is shown as the red line. To control for the gene length, only SNPs within 300 bp from each other were analyzed. Genes with fewer than 100 such pairs of SNPs were excluded. (D,E) The positive correlation between pn/ps of the gene and its average LD (Spearman correlation p-value = 4e-16) (D) or the difference between LDnonsyn and LDsyn (Spearman correlation p-value = 2e-5) (E).
Figure 3—figure supplement 5.
Comparison of LDnonsyn and LDsyn in the genes of
(A) The USA population, (B) the RUS population. The genes are stratified by their average LD (the panels) and by the pn/ps. Only pairs of SNPs within 300 bp from each other are analyzed; genes with fewer than 100 such pairs of nonsynonymous or synonymous SNPs are excluded. Spearman correlation p-values are shown.
Figure 3—figure supplement 6.
The difference between LDnonsyn and LDsyn under pairwise epistasis and balancing selection.
(A) The excess of LDnonsyn over LDsyn under different models of epistasis between two deleterious mutations A → a and B → b without balancing selection and in the presence of negative frequency-dependent selection (NFDS) or associative overdominance (AOD) acting at the linked sites. The height of columns shows fitness of the corresponding genotypes. Plus signs (+) indicate simulations where the excess of LDnonsyn is reproduced. (B) The average LD in the simulations. (C) The difference between LDnonsyn and LDsyn in the simulations.
Figure 3—figure supplement 7.
Criteria for haploblocks in
Red lines show the distribution of LD (r2) in windows of 250 nucleotides in two populations. Black line corresponds to the lognormal distribution with the same mean and variance. The windows with LD higher than the threshold value defined as the intersection point of the two lines (dashed) are attributed to haploblocks.
There is no one-to-one correspondence between haploblocks and genes, which are, on average, longer. Still, different genes are covered by haploblocks to different extents, which leads to wide variation in the strength of LD and other characteristics among them. Genes with high LD, that is, those that contain haploblocks, have the largest excess of LDnonsyn over LDsyn (Figure 3C). The positive correlation between the overall LD within a gene and the excess of LDnonsyn in this gene indicates that the attraction between nonsynonymous variants, driven by epistasis, is stronger if combinations of epistatic alleles persist within the population for a long time, comprising haplotypes within a haploblock. Since both haplotypes tend to be common in a haploblock (Figure 3), this excess is much stronger for loci with MAF >0.05.
LD between alleles of all kinds is higher within genes with large ratios of nonsynonymous and synonymous polymorphisms pn/ps (Spearman correlation p-value <2e-16, Figure 3D). Genes with elevated pn/ps also have a stronger excess of LDnonsyn over LDsyn (Figure 3E, Spearman correlation p-value = 4.4e-17). This excess is the strongest for genes with high overall LD, but its correlation with pn/ps holds even when the overall LD is controlled for (Figure 3—figure supplement 5).
There can be multiple non-exclusive mechanisms by which epistasis could lead to the observed positive associations between pn/ps, overall LD, and excess LDnonsyn. First, genes under weaker selection, and therefore higher pn/ps, could be characterized by a higher overall amount and/or strength of epistasis. Second, epistasis, as estimated by excess LDnonsyn, can contribute to increased pn/ps by allowing nonsynonymous polymorphisms to segregate in the population when maintained in coadapted combinations, therefore weakening negative selection against them. Third, epistasis can be more potent in genes with lower overall recombination rate due to competition between epistasis and recombination: recombination breaks positively interacting combinations of alleles, disrupting linkage between them and interfering with epistasis. Fourth, existence of cosegregating combinations of mutually beneficial alleles could select for reduced local recombination rate.
Excess of LDnonsyn requires stable polymorphism
Simulations show that positive epistasis alone cannot lead to the observed large excess of LDnonsyn over LDsyn, for which two extra conditions need to be satisfied. The general reason for this is simple: in order for a substantial LD between not-too-rare alleles to appear, these alleles must persist in the population for a long enough time.
First, positive epistasis must lead to a full compensation of deleterious effects of individual alleles. In other words, the fitnesses of at least the two most-fit genotypes that are present in the population at substantial frequencies must be (nearly) the same (Figure 3—figure supplement 6). If this is not the case, selection favoring the single most-fit genotype leaves too little genetic variation, which persists only due to recurrent mutation. It is natural to assume that the two major haplotypes that are common within a haploblock correspond to high-fitness genotypes. High-fitness genotypes can represent either isolated fitness peaks of equal heights (corresponding to a situation when two out of the four allele combinations confer high fitness) or a flat, curved ridge of high fitness (corresponding to a situation when three out of four combinations confer high fitness). The available data are insufficient to distinguish between these two options. Of course, with complete selective neutrality of all allele combinations there is no reason for LDnonsyn >LDsyn, so that at least some mixed genotypes, carrying alleles from different high-fitness genotypes, must be maladapted.
Second, there must be some kind of balancing selection that specifically works to maintain variation, because otherwise random drift does not allow genetic variation to persist for a long enough time even if some, or even all, genotypes are equally fit (Figure 3—figure supplement 6). Here, there are at least two options. On the one hand, a ‘real’ negative frequency-dependent selection (NFDS) can act either directly at loci that display high LD or at some other tightly linked loci (Charlesworth, 2006; Olendorf et al., 2006). On the other hand, variation can be maintained due to associative overdominance (AOD), resulting from selection against recurrent deleterious mutations at linked loci (Gilbert et al., 2020; Ohta, 1971; Zhao and Charlesworth, 2016).
Balancing selection is also necessary for the presence of haploblocks, because a pair of divergent haplotypes can evolve in a panmictic population only if they coexist for a considerable time. A single locus under NFDS is enough to maintain a haploblock comprising the region of the genome around it. By contrast, if variation is maintained by AOD, it is more likely that selection against recessive mutations acts at a number of tightly linked loci (Gilbert et al., 2020). Long coexistence of diverged haplotypes that comprise a haploblock enables accumulation of co-adapted combinations of nonsynonymous alleles within them. Thus, it is not surprising that a pronounced excess of LDnonsyn over LDsyn in
Correlated LDs in two populations
Although a high excess of LDnonsyn is observed only within haploblocks, a signature of epistasis can also be seen outside of them in the form of a correlation between LDs in the two populations. This correlation can be high even if LDs by themselves are low.
The USA and the Russian populations share a large proportion of their SNPs. Given the high divergence between the two populations, few such shared SNPs are expected to have common origin in the ancestral population, and instead they are likely to have arisen from recurrent mutation. Since the haploblocks show little correlation between the two populations, we assume that they arose after their divergence. The high prevalence of coincident SNPs is not surprising because SNPs comprise 0.28 and 0.13 of all the aligned nucleotide sites in the USA and Russian populations, respectively (Baranova et al., 2015, Appendix 3—figure 2). We identified pairs of shared biallelic SNPs located within 2 kb from one another and calculated the LD between them in both populations. To avoid the effects of strong within-population linkage and the occasional co-occurrence of haploblocks between populations, we excluded SNPs located within haploblocks or within genes under high LD (>0.8 LD quantile for the corresponding population) in either population.
The values of LD in the two populations are strongly correlated only for pairs of nonsynonymous SNPs located within the same gene, and only if both populations carry the same pairs of amino acids in the same sites (Figure 4). The correlation of LDs is the strongest if shared SNPs carry the same pairs of nucleotides, but is also observed if they encode the same amino acids by different nucleotides (Figure 4—figure supplement 1). The contrast between correlations within pairs of sites that reside in the same vs. different genes and the correlation of LDs observed for different nucleotides encoding the same amino acid cannot be explained by inheritance of LD from the common ancestral population. Moreover, synonymous SNPs are expected to be on average older than nonsynonymous ones, so that this mechanism should lead to a higher correlation of LDs for pairs of synonymous mutations. Thus, the observed pattern indicates that epistatic selection is shared between the two populations.
Figure 4.
Correlation of LD values between pairs of shared SNPs in the two
(A) Pairs of SNPs with the same alleles in both sites, (B) pairs of SNPs differing by at least one allele. Asterisks indicate Spearman correlation p-values <0.001.
Figure 4—figure supplement 1.
Association of LD values between pairs of shared nonsynonymous SNPs encoding the same amino acids in the two
(A) All pairs of SNPs pooled together. A pair of SNPs is considered to carry different alleles if at least one allele differs in at least one site. (B) Pairs of SNPs stratified by distance between them. Asterisks indicate Spearman correlation p-values <0.01.
Figure 4—figure supplement 2.
Association of LD values between pairs of shared SNPs within haploblocks in the two
(A) Pairs of SNPs with the same major and minor alleles in both sites, (B) pairs of SNPs differing by at least one allele. Asterisks indicate Spearman correlation p-values <0.001.
The correlation of LDs between SNPs located within haploblocks in both populations is high regardless of whether they reside in the same or different genes, apparently because of occasional coincidence of haploblocks between populations (Figure 4—figure supplement 2).
Discussion
On top of its most salient property, an exceptionally high π, genetic variation within
The second feature is the excessive attraction between nonsynonymous alleles polarized by frequency. This pattern is much stronger within haploblocks, indicating that they were shaped by both balancing and epistatic selection, so that amino acids common within a haplotype together confer a higher fitness. Polymorphisms that involve haplotypes that comprise many interacting genes, such as inversions (Charlesworth and Charlesworth, 1973; Dobzhansky and Pavlovsky, 1958; Singh, 2008; Sturtevant and Mather, 1938) and supergenes (Joron et al., 2011; Mather, 1950), have been known since the dawn of population genetics, but here we are dealing with an analogous phenomenon at a much finer scale, because haploblocks are typically shorter than genes. Thus, instead of coadapted gene complexes (Dobzhansky and Pavlovsky, 1958), haplotypes represent coadaptive site complexes within genes.
In our simulations, equally high fitnesses of two or more genotypes were a necessary condition for a large excess of LDnonsyn, because otherwise the polymorphism did not live long enough for any substantial LD to evolve. However, epistasis between loci responsible for real or apparent balancing selection and those involved in compensatory interactions probably abolished the need for this fine-tuning of fitnesses. For example, if each haploblock carries its own complement of partially recessive deleterious mutations, together with alleles engaged in compensatory interactions with each other which also make these recessive mutations less deleterious, AOD can be expected to cause stable coexistence of these alleles.
Why are haploblocks and positive LD between minor nonsynonymous alleles so common in
Excessive LDnonsyn in
In a vast majority of species, π is a small parameter <<1. This imposes a severe constraint on operation of selection and obscures signatures of its particular modes. Thus, hyperpolymorphic species where π is ~1 provide a unique opportunity to study phenomena which are traditionally viewed as belonging to the domain of macroevolution through data on within-population variation.
Materials and methods
Haploid cultures of 24 isolates, each originating from a single haplospore, were obtained from fruit bodies collected in Ann Arbor, MI, USA by T. James and A. Kondrashov and in Moscow and Kostroma regions, Russia by A. Kondrashov, A. Baykalova and T. Neretina in 2009–2015. Specimen vouchers are stored in the White Sea Branch of Zoological Museum of Moscow State University (WS). Herbarium numbers are listed in Appendix 3—table 2. To obtain isolates, wild fruit bodies were hung on the top lid of a 10 cm Petri dish with agar medium. The Petri dish was set at an angle of 60–70 degrees to the horizontal surface for 32 hr. A germinated spore was excised together with a square-shaped fragment (approximately 0.7 × 0.7 mm) of the medium from the maximally rarefied area of the obtained spore print under a stereomicroscope with 100× magnification. The obtained isolates were cultured in Petri dishes on 2% malt extract agar for a week. For storage, cultures were subcultured into 1.5 ml microcentrifuge tubes with 2% malt extract agar. To obtain sufficient biomass for DNA isolation, isolates were cultured in 20 ml 0.5% malt extract liquid medium in 50 ml microcentrifuge tubes in a horizontal position on a shaker at 100 rpm in daylight for 5–10 days. The tubes with the cultures were then centrifuged at 4000 rpm, and the supernatant was decanted. The resulting mycelium was lyophilized. DNA was extracted using Diamond DNA kit according to the manufacturer’s recommendations.
DNA libraries were constructed using the NEBNext Ultra II DNA Library Prep Kit by New England Biolabs (NEB) and the NEBNext Multiplex Oligos for Illumina (Index Primers Set 1) by NEB following the manufacturer’s protocol. The samples were amplified using 10 cycles of PCR. The constructed libraries were sequenced on Illumina NextSeq500 with paired-end read length of 151. The genomes were assembled de novo using SPAdes (v3.6.0) (Bankevich et al., 2012); possible contaminations were removed using
Together with the 30 samples sequenced previously (Baranova et al., 2015; Bezmenova et al., 2020), the obtained haploid genomes were aligned with TBA and
The phylogeny of the sequenced genomes was reconstructed with RAxML (Stamatakis, 2014; Appendix 3—figure 2). Nucleotide diversity (π) was estimated as the average frequency of pairwise nucleotide differences; π for different classes of sites is shown in Appendix 3—figure 1B. Two samples from Florida (USA population) were excluded from the further analysis to minimize the possible effect of population structure.
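The definition of π given above (the average frequency of pairwise nucleotide differences) can be illustrated with a minimal sketch. The function name, the input format (equal-length aligned strings), and the rule of skipping gap or `N` positions are assumptions made for illustration and are not taken from the authors' pipeline.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average per-site fraction of differences over all pairs of aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    total = 0.0
    n_pairs = 0
    for seq1, seq2 in combinations(sequences, 2):
        compared = 0
        differences = 0
        for x, y in zip(seq1, seq2):
            # Assumption: positions with a gap or an ambiguous base are skipped.
            if x in "-N" or y in "-N":
                continue
            compared += 1
            differences += x != y
        if compared:
            total += differences / compared
            n_pairs += 1
    if n_pairs == 0:
        raise ValueError("no comparable positions")
    return total / n_pairs


toy_alignment = ["ACGT", "ACGA", "ACGT"]
pi_toy = nucleotide_diversity(toy_alignment)
print(pi_toy)
```

In practice, π would be computed separately for each class of sites (for example, synonymous and nonsynonymous) by restricting the alignment columns before calling such a function.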
Genome sequence data are deposited at DDBJ/ENA/GenBank under accession numbers JAGVRL000000000-JAGVSI000000000, BioProject PRJNA720428. Sequencing data are deposited at SRA with accession numbers SRR14467839-SRR14467862.
Data on
We used polymorphism data from 347 phased diploid human genomes from African and 301 genomes from European super-populations sequenced as part of the 1000 Genomes project (1000 Genomes Project Consortium et al., 2015). If several individuals from the same family were sequenced, we included only one of them. As a
Estimation of LD
As a measure of linkage disequilibrium between two biallelic sites, we used r2, calculated as follows:
$$r^2 = \frac{\left(p(AB) - p(A)\,p(B)\right)^2}{p(A)\,\left(1 - p(A)\right)\,p(B)\,\left(1 - p(B)\right)},$$
where p(A) and p(B) are the minor allele frequencies at these sites, and p(AB) is the frequency of the genotype that carries minor alleles at both sites.
Singletons (sites with minor allele present only in one genotype) were excluded from the analysis if not stated otherwise.
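As an illustration of this LD measure, the sketch below computes r2 for two biallelic sites from haploid genotypes using the standard definition, the squared difference between p(AB) and p(A)p(B) divided by p(A)(1 − p(A))p(B)(1 − p(B)), and optionally skips singleton sites. The function names and the input format (one allele character per genotype) are hypothetical and chosen for clarity; they do not reproduce the authors' scripts.

```python
from collections import Counter


def minor_allele(column):
    """Return the less frequent of the two alleles at a biallelic site."""
    counts = Counter(column)
    if len(counts) != 2:
        raise ValueError("site is not biallelic")
    return min(counts, key=counts.get)


def r_squared(site_a, site_b, drop_singletons=True):
    """r2 between two biallelic sites; returns None if a site is a singleton."""
    if len(site_a) != len(site_b):
        raise ValueError("sites must cover the same genotypes")
    n = len(site_a)
    minor_a = minor_allele(site_a)
    minor_b = minor_allele(site_b)
    count_a = sum(x == minor_a for x in site_a)
    count_b = sum(y == minor_b for y in site_b)
    # Singletons: the minor allele is present in only one genotype.
    if drop_singletons and (count_a < 2 or count_b < 2):
        return None
    p_a = count_a / n
    p_b = count_b / n
    # Frequency of genotypes carrying the minor allele at both sites.
    p_ab = sum(x == minor_a and y == minor_b
               for x, y in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


linked = r_squared(list("AAAAAATT"), list("CCCCCCGG"))
unlinked = r_squared(list("AAAATTTT"), list("CCGGCCGG"))
print(linked, unlinked)
```

Because r2 is symmetric with respect to relabeling alleles, the way ties in allele frequency are resolved when choosing the minor allele does not affect the result.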
Haploblocks annotation
In order to annotate the haploblocks, we calculated LD along the
Estimation of LD between physically interacting amino acid sites
Of 16,319 annotated protein-coding genes of
To compare LD between pairs of physically close and distant sites, we used the controlled permutation test (Figure 2A): for each pair of physically close amino acid sites (within 10 Å) we sampled a pair of physically distant amino acids on the same genetic distance (measured in aa). Pairs of sites closer than 5 aa were excluded from the analysis.
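The controlled permutation test described above can be sketched as follows. The input format (tuples of sequence separation in aa, physical distance in Å, and LD), the function name, and the one-sided empirical p-value are assumptions made for illustration; the source does not specify how the comparison was summarized.

```python
import random
from collections import defaultdict
from statistics import mean


def matched_permutation_test(pairs, close_cutoff=10.0, min_separation=5,
                             n_permutations=1000, seed=0):
    """pairs: iterable of (separation_aa, distance_angstrom, ld) tuples."""
    rng = random.Random(seed)
    close = []                    # (separation, ld) of physically close pairs
    distant = defaultdict(list)   # separation -> LD values of distant pairs
    for separation, distance, ld in pairs:
        if separation < min_separation:
            continue              # pairs closer than 5 aa are excluded
        if distance <= close_cutoff:
            close.append((separation, ld))
        else:
            distant[separation].append(ld)

    # Keep only close pairs that have a control at the same separation.
    close = [(sep, ld) for sep, ld in close if sep in distant]
    if not close:
        raise ValueError("no close pair has a distance-matched control")

    observed = mean(ld for _, ld in close)
    # Each permutation draws one distant pair per close pair, matched by separation.
    null_means = [
        mean(rng.choice(distant[sep]) for sep, _ in close)
        for _ in range(n_permutations)
    ]
    exceed = sum(m >= observed for m in null_means)
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, mean(null_means), p_value


toy_pairs = [(10, 6.0, 0.9), (20, 8.0, 0.9),
             (10, 25.0, 0.1), (20, 30.0, 0.1),
             (3, 4.0, 1.0)]
observed_ld, null_ld, p = matched_permutation_test(toy_pairs)
print(observed_ld, null_ld, p)
```

Matching on sequence separation is what keeps the comparison from being driven by the general decay of LD with distance along the gene.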
To examine LD patterns within individual protein structures, we calculated contingency tables of pairs of SNPs being located in codons encoding physically close amino acids and having high LD (no less than 90% quantile for a given gene). Pairs of amino acid sites located closer than 30 aa or more distant than 100 aa from each other were excluded; genes with fewer than five pairs of physically close sites under high or low LD were also excluded. From these contingency tables, we calculated the odds ratio (OR) and chi-square test p-value for each gene. p-values were adjusted using BH correction. We identified 22 genes with pairs of adjacent sites having significantly higher LD in the USA population (out of 1286 eligible genes in total), and 87 genes in the Russian population (out of 967) under 5% FDR (Appendix 3—table 1). Examples of such genes are shown in Figure 2 and Figure 2—figure supplements 1 and 2.
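The per-gene test can be sketched with the standard library alone. The example below computes the odds ratio and a Pearson chi-square p-value (one degree of freedom, no continuity correction, which is an assumption, since the source does not say whether a correction was applied) for a 2 × 2 table, and then applies the Benjamini-Hochberg adjustment. Function names and the toy numbers are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Odds ratio and chi-square p-value for the table [[a, b], [c, d]].

    Rows: physically close / distant pairs; columns: high / low LD.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("table has an empty row or column")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return odds_ratio, p_value


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


odds, p_gene = chi_square_2x2(20, 5, 5, 20)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(odds, p_gene, adjusted)
```

Genes whose adjusted p-value falls below 0.05 would then be reported at a 5% FDR.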
Simulations of epistasis
To simulate evolution of populations with or without epistasis and balancing selection (Figure 3—figure supplement 6), we used an individual-based model implemented by
We modelled two types of mutations: neutral (with selection coefficient s_syn = 0) and weakly deleterious (s_nonsyn ≤ 0), representing synonymous and nonsynonymous variants, respectively. There were twice as many nonsynonymous as synonymous sites. Under the non-epistatic model,
Under the pairwise positive epistasis model, we assumed that one nonsynonymous mutation can be partially or fully compensated by a mutation at another site. In this model, all nonsynonymous sites were split into pairs. Each mutation of a pair occurring individually within a genotype was assumed to be deleterious, with selection coefficient s_nonsyn = −0.01; however, the fitness of the double mutant was higher than expected under the additive (non-epistatic) model. We used several models of epistasis, with different epistasis strengths and landscape shapes (Figure 3—figure supplement 6).
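One possible way to express such a pairwise compensation scheme is sketched below. It is a hypothetical illustration, not the fitness function used in the simulations: the `compensation` parameter, the additive combination across pairs, and the function names are all assumptions. Only the single-mutant coefficient of −0.01 is taken from the text.

```python
S_NONSYN = -0.01  # selection coefficient of a single mutation (from the text)


def pair_effect(n_mut, compensation):
    """Fitness effect of one pair of sites carrying n_mut mutations (0, 1 or 2).

    compensation (hypothetical parameter, 0..1):
      0 -> additive double mutant with effect 2 * S_NONSYN (no epistasis)
      1 -> full compensation, double mutant as fit as the unmutated pair
    """
    if n_mut == 0:
        return 0.0
    if n_mut == 1:
        return S_NONSYN
    return 2 * S_NONSYN * (1.0 - compensation)


def fitness(genotype, pairs, compensation=1.0):
    """Fitness of a genotype (sequence of 0/1) given a list of site pairs.

    Effects are summed across pairs (an assumption of this sketch).
    """
    w = 1.0
    for i, j in pairs:
        w += pair_effect(genotype[i] + genotype[j], compensation)
    return w
```

With `compensation=0` the sketch reduces to the additive model, so any value above 0 makes the double mutant fitter than the additive expectation, which is the defining property of the positive epistasis described above.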
In the NFDS model of balancing selection, a single mutation at a random position was subjected to frequency-dependent selection (so that it is positively selected at frequencies below 0.5, and negatively selected at frequencies above 0.5). In the AOD model, mutations in 10 random positions were fully recessive (
To simulate evolution of populations with different levels of genetic diversity under epistasis (Appendix 1), we used
© 2022, Stolyarova et al. This work is published under the Creative Commons Attribution 4.0 license (https://creativecommons.org/licenses/by/4.0/).
Abstract
It is natural to assume that patterns of genetic variation in hyperpolymorphic species can reveal large-scale properties of the fitness landscape that are hard to detect by studying species with ordinary levels of genetic variation. Here, we study such patterns in a fungus