Citation:Sesta L, Pagnani A, Fernandez-de-Cossio-Diaz J, Uguzzoni G (2024) Inference of annealed protein fitness landscapes with AnnealDCA. PLoS Comput Biol 20(2): e1011812. https://doi.org/10.1371/journal.pcbi.1011812
Editor:Sushmita Roy, University of Wisconsin, Madison, UNITED STATES
Received:June 6, 2023; Accepted:January 8, 2024; Published: February 20, 2024
Copyright: © 2024 Sesta et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability:The code used to implement the method presented in the paper, and all the employed data can be found at the publicly available GitLab repository: https://gitlab.com/luca.sesta/AnnealDCA.jl.
Funding:A.P. acknowledges funding by the EU H2020 (https://ec.europa.eu/programmes/horizon2020/) research and innovation programme MSCA-RISE- 2016 under Grant Agreement No. 734439 INFERNET, as well as financial support from FAIR (Future Artificial Intelligence Research https://future-ai-research.it/) PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1555 11/10/2022, PE00000013, and from ICSC (Centro Nazionale di Ricerca in High-Performance Computing, Big Data, and Quantum Computing https://www.supercomputing-icsc.it/) which are both funded by the European Union Next-GenerationEU. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors declare no competing interests.
This is a PLOS Computational Biology Methods paper.
Introduction
The design of proteins to perform a given task (e.g. binding a target molecule) is a paramount challenge in molecular biology and has crucial diagnostic and therapeutic applications. Several high-throughput screening technologies have been developed to systematically assess protein activity. Despite the high parallelization of many techniques, a fundamental limitation lies in the small fraction of possible molecules that can be tested compared to the huge number of possible variants. Leveraging those data using effective computational models is crucial to overcome the obstacle by exploring in-silico the sequence space for the fittest molecules for a given function. We use the term fitness generically to refer to the protein activity under selection in a screening experiment (or during the in-vivo affinity maturation process). Several molecular activities can be selected in such experiments ranging from binding to a substrate to very complex phenotypes, such as conferring antibiotic resistance or multiple unknown interactions in a tissue.
Many machine-learning methods have been proposed recently to learn the protein fitness landscape from sequencing of high-throughput screening experiments [1–7]. Here, we propose a machine learning framework to target sequencing data derived from a broad class of experiments that use selection and sequencing to quantify the activity of protein variants. These experiments include, among others: Deep Mutational Scanning (DMS), where a library of protein mutants is screened in-vitro for different activities [8–21]; Experimental Evolution (EE), where a mutagenesis step adds diversity in the library after the rounds of selection [22–24]; sampling of the in-vivo immune response as in antibodies Repertoire Sequencing (Rep-Seq) [25]. Some of these experiments serve to select the fittest variants within the screened library while providing quantitative information about the protein activity landscape.
A basic quantitative measure of protein fitness can be obtained by computing the ratio between the relative frequencies of the variants in the populations before and after selection. This ratio, called selectivity, is a proxy for the probability that a variant survives the selection process, and has been widely used in the analysis of DMS experiments [8, 26]. Other approaches leverage more efficiently the same information, by parameterizing in some way the genotype-fitness map [7], or by developing adequate denoising procedures [27–29].
All these approaches evaluate the fitness from the temporal trajectory of variant abundances through the selection rounds. Conversely, many experimental setups are incompatible with the notion of a single variant trajectory in the population. Such is the case of EE, where a mutagenesis process occurs alongside selection that modifies the pool of individual variants from one round to the next [22, 23]. Depending on the interplay between mutational drift, selection strength, and the fitness landscape, the probability to re-sample previously seen variants can be very small after some rounds. Most variants do not persist through the whole time series and are often observed only once. In other setups, a severely undersampled regime precludes the repeated observation of individual variants. In repertoire sequencing, the coverage is generally too low in comparison to the large number of receptors present in an immune repertoire, which implies that individual sequences are not sampled more than once.
For other in-vitro screening experiments, factors such as the selection strength, the number of rounds, the shape of the fitness landscape, the size of the initial library, and sequencing coverage, can limit the ability to observe a relevant fraction of the possible variants. In these cases, we cannot detect the time trajectory of the frequency of most variants and thus we cannot compute an enrichment ratio. Nevertheless, it is still possible to make inferences about the fitness landscape. Another possible approach involves a dimensionality reduction of the protein sequence space through the modeling of the evolution of the distribution of the variants as selection proceeds.
Here, we propose AnnealDCA, a simple but effective strategy to perform protein fitness landscape inference, which can be applied to different experiments and types of data. Our approach is inspired by the simulated annealing method [30] from statistical physics to solve optimization problems. The different experimental rounds can be viewed as a cooling process, where an effective temperature is gradually reduced across successive rounds, and the selective pressure becomes increasingly dominant. The general mathematical framework and the associated statistical inference method can be applied to most of the experimental cases where a population of protein variants undergoes different rounds of selection, and a subset (or all) rounds are sequenced. Datasets of this type include, among others, protein screening experiments with one or multiple panning rounds, and the collection of Rep-Seq samples at different infection times.
To demonstrate the effectiveness of our scheme, we apply the method to antibody Rep-Seq data of immunized mice and we predict the antibody affinity towards its cognate antigen. We further test the method in more controlled experiments and assess the quality of the in-silico reconstructed fitness landscape.
Method
To describe our method, we start for the sake of simplicity by considering a simple screening experiment of an initial library that takes place over several panning rounds. Other experimental setups will be described next. We define Pt(S) as the probability of observing a sequence S at round t. Eventually, Pt(S) is the quantity we want to estimate from the sequencing data. We introduce a sequence-dependent survival factor Qt(S). This quantity is a measure of the probability that sequence S survives between round t − 1 to t. Similarly to [1, 31, 32], we assume that this quantity takes the following exponential form:(1)with a time dependent factor αt, modeling the scale of the selective pressure acting at round t. The time-independent function E(S), associates a statistical energy to the protein sequence S. Thanks to Eq (1), we can then express Pt(S) as:(2)up to a normalization constant.
Using Eq (2), we can express Pt(S) as a product of the initial configuration probability P0(S) and the factor e−E(S), raised to the sum of the selective pressures of all rounds. We can redefine such a sum as:(3)
Eq (3) can be interpreted as a fictitious inverse temperature, accounting for the cumulative selective pressure up to round t. In the absence of mutations and if the experimental conditions are the same for all rounds, Fisher’s fundamental theorem of evolution states that αt is a decreasing function of time [33]. Thanks to Eq (3), we can transform Eq (2) as follows:(4)
The accumulated selection, quantified by the inverse temperature βt, tends to drive the mass of the distribution Pt(S) towards the minima of E. This mental picture is reminiscent of the simulated annealing process studied in statistical mechanics and other areas [30].
At t = 0, P0(S) is the distribution of the variants in the initial library. Since this library is randomly generated, it is supposed to be unrelated to the selection process, and consequently to fitness. We can model the distribution of the initial variants by another similar energy function G(S):(5)so that Eq (4) takes the following form:(6)where Zt = ∑{S} exp(−βtE(S) − G(S)) is a time-dependent normalization factor, and the sum runs over all possible sequences.
Fig 1 shows a pictorial representation of the overall modeling of the experimental screening process. Notably, we do not need any explicit assumption on the specific temporal dependence of the inverse temperature, as the β factors can be inferred directly from the data.
[Figure omitted. See PDF.]
Fig 1. A simplified portrayal of the modeling of the selection process.
Each color represents a different variant. Starting from the initial distribution of variants (which in this representation is uniform), the probability of observing a sequence in a subsequent round is shaped by the selection process, defined by the energy function E(S). αt encodes the selective pressure at each transition. The arrows represent transitions between rounds, and underneath each round box, the related expressions of the model probability are reported.
https://doi.org/10.1371/journal.pcbi.1011812.g001
Fitness map
The genotype-to-fitness map here is encoded in the energy function E. The choice of its functional form and the related number of parameters to be inferred are eventually a trade-off between the expressive power and the actual availability of the sequence data to train the model. One of the simplest parameterizations is an independent site model, where each amino acid contributes additively to the energy:(7)with parameters that depend on the identity of the amino acid σi, present at position i along the sequence S.
A more complex parameterization is obtained by including pairwise epistatic interactions between all pairs of amino acids and is now widely used in structural biology [34, 35] and functional biology [36–40]. The resulting energy function takes the form of a generalized Potts model:(8)In comparison to the simple independent site model in Eq (7), the parameterization in Eq (8) is characterized by additional parameters, , to model the pairwise interactions. Furthermore, in cases where there is sufficient sequence variability, pairwise models have demonstrated the capacity to deliver superior performance when reconstructing fitness landscapes [37, 41].
Model training
The model parameters are trained by maximizing the log-likelihood of the full dataset:(9)where is the normalized abundance of the sequence m at time t, , {τ0, τ1, …, τf} ⊂ {0, t1, …, tf} is the subset of sequenced rounds and θ = {θE, θG} is the set of parameters of the energy functions.
The likelihood is the product of the probabilities to sample the observed sequences and frequencies from the model at each time point. It can be interpreted as minus the cross-entropy between the predicted distribution of the variants and the observed one, the empirical frequencies. The exact maximization of the likelihood involves the computation of the partition function of the model, whose computational complexity scales as . To overcome this practical limitation, there are many approximate methods developed for specific parametrizations of the energy function. Other approaches, based on Monte-Carlo, are more general but might have convergence issues that are difficult to control in practice and are computationally costly. A very effective approach relies on the maximization of a quantity related to the likelihood, called the pseudo-likelihood function [42, 43], whose precise definition in the case of the Potts model (Eq (8)), is given in Section A of the S1 Text. A regularization term is added to the pseudo-likelihood to avoid overfitting. While Eq (9) is separately convex with respect to the energetic parameters and the inverse temperatures, this is no longer true when both are inferred simultaneously. To learn the parameters it is possible to use an iterative optimization scheme: starting from an arbitrary set of β components, the energetic parameters θ are optimized. Next, the β components are updated while the θ are kept fixed. The two steps are iterated until both sets of parameters reach convergence. Some constraints can be imposed on the β parameters without affecting the expressivity of the model. In particular, it is possible to set and to fix a scale factor setting .
Methodological advancements and limitations
The fundamental motivation and distinguishing characteristic of the AnnealDCA method lies in its independence from the reliance on variant enrichment or, more broadly, the evolution of specific variant frequencies over time. This is in contrast to Deterministic Rare Binding (DRB) [7], a method previously introduced by some of the authors, that can process only variants present in at least two consecutive sampled rounds. Such a unique attribute empowers us to glean insights from experiments where the set of overlapping variants between rounds is notably limited. Such limitations might arise due to undersampling, the emergence of novel mutations, or other contributing factors. This capability is realized by implementing a dimensionality reduction technique that captures the temporal progression of the samples. In this context, we define dimensionality reduction as the ratio of the model’s number of parameters to the total number of possible variants. For a sequence of L = 100 amino acids, our model requires approximately L2 ⋅ 202 + 20L ≃ 4 ⋅ 106 parameters, while the number of possible sequences is 20L ≃ 1.2 ⋅ 10130. This dimensionality reduction proves invaluable in effectively rectifying issues stemming from noise and undersampling. On the other hand, the AMaLa method (introduced in [6]) is explicitly designed to address experiments involving mutations. The energy function within the AMaLa framework was originally tailored to accommodate random mutations that reshape the population of variants throughout an experiment. This was achieved by incorporating an additional term derived from a generalized Jukes–Cantor model that describes the mutational step. These experiments typically start with a wild-type sequence and progress through a series of selection and mutation rounds. In the AnnealDCA approach, we do not explicitly model correlations between random mutations from one round to the next. In an undersampled regime, these correlations are expected to be weak and multiple rounds can be treated as independent samples. The approach shares similarities with various studies where a Potts model is inferred from a Multiple Sequence Alignment (MSA) of observed mutated viruses [44–48]. This inference is typically used to establish a prevalence landscape, often considered a surrogate for the intrinsic fitness landscape. However, it’s crucial to recognize that in the particular scenarios we investigate, the observed variant abundances (analogous to prevalence for viruses) are influenced by the stochastic composition of the initial library. It is important to note that the G part of the Hamiltonian serves the specific purpose of characterizing the bias introduced by the initial library and lacks a direct physical interpretation. Furthermore, G is not utilized in the subsequent analyses and validations. However, it effectively assimilates factors such as the impact of initially overexpressed variants and it remains crucial for accurately learning the energy component E related to the fitness. AnnealDCA’s applicability is subject to certain limitations, primarily due to the dataset statistics in terms of the size of the high-throughput screened library and the sequencing depth. Ensuring the accuracy of the probability model hinges on learning from sufficient statistics.
Results
Most of the computational methods used to infer the fitness landscape from screening experiments rely on the computation of the enrichment/depletion ratios for a sufficiently large set of variants to train a regression model. The enrichment ratio is, in its simpler form, the ratio between the frequency of a variant at different rounds (see Eq (4) in the S1 Text). This quantity is a proxy for the ability of a variant to be selected during the process, namely, the fitness. However, several cases exist in which the temporal trajectory of the single variant is not detected. It can happen when: (i) the experiment is dominated by noise effects; (ii) the sequencing coverage is not adequate in comparison to the broadness of the library, and under-sampling effects might dominate; (iii) some mutations are introduced along the selection process at each round of the experiment. As a consequence, most of the variants sampled at different time points could be unique or in low copies, affecting the accuracy of the enrichment ratios estimate. Conversely, the generality of our approach makes it applicable to all the above cases. We demonstrate the efficacy and the versatility of the method by applying it to three different experimental setups, which are described briefly below. All references to the experiments and the datasets used are summarized in Table 1.
* Antibody Repertoire Sequencing (Rep-Seq)
The Antibody Repertoire encompasses the diverse set of immunoglobulins present in an individual at a specific point in time. The Rep-Seq technique enables the study of a sample from this immunoglobulin repertoire. Our dataset was compiled from two sources: Khan et al. [49] and Gerard et al. [50]. In the former study, the authors sequenced IgG antibodies secreted by memory B cells and plasmablasts in non-immunized mice. In the latter, the same mouse clones were immunized against a specific antigen. Subsequently, the isolated IgG repertoire underwent high-throughput phenotypic assays using a microfluidic platform. This platform enriched the output in antigen binders, the authors estimates the final fraction of binders as 90%, as explained in detail in [50]. We can view the datasets as representing two distinct scenarios: the first as a sample before the immune response, and the second as a sample of clonally expanded antibodies responding to the antigen.
* Deep mutational scanning (DMS)
These experiments combine high-throughput screening of a mutational library with sequencing to assess the effect of mutations on protein activity [51]. An initial library of protein variants undergoes one or multiple cycles of selection for a protein function (e.g. binding to a substrate). After a number of rounds, a sample of the variants is deep-sequenced to assess their abundances over time. Typical examples are in-vitro display experiments (e.g. phage display). The experiments and datasets we used in our study are described in Fowler et al. [26], Boyer et al. [52], and Wu et al. [53].
* Experimental Evolution (EE)
EE follows a setup similar to DMS, with the difference that in this case, random mutations are repeatedly introduced before each panning round. In some cases, the experiment starts from a single wild-type protein. The experiment attempts to simulate in-vitro a natural Darwinian evolutionary process, where mutations explore the sequence space creating new genotypes whose phenotype is tested for the protein function. The experiments and datasets are described in the following two papers: Fantini et al. [22], Stiffler et al. [23].
[Figure omitted. See PDF.]
Table 1. Experimental data overview.
https://doi.org/10.1371/journal.pcbi.1011812.t001
Antibodies Repertoire Sequencing
We utilize our method on Rep-Seq data from antibodies to estimate the likelihood that a given antibody results from a specific immune response. Once the model is trained, it provides a parameterization of the probability function, which is then applied to design new antibodies. In essence, when the immune system responds to an antigen, the antibodies it produces should have a high affinity for that antigen. To achieve this, we work with two datasets, before and after the immune response: one from mice with unimmunized repertoires, referred to as the background or negative dataset, and another from mice with immunized repertoires (the positive set). In the case of the positive set, it is further enriched in binders through functional sorting using a microfluidic platform. Note that the latest experimental step increases the signal-to-noise ratio. However, in some instances, we may rely solely on samples from RepSeq data (see references [31, 32, 36]).
The fundamental concept is to model the probability of encountering an antibody in the positive set as the product of two probabilities: the background probability, which signifies the likelihood of finding an antibody in the negative set (unimmunized repertoire), and the selection factor. The selection factor describes the overall effective process of the immune response, including the impact of the enrichment platform. For a visual representation, please refer to Fig 2. The negative or background dataset contains sequences from the IgG heavy chain repertoire of three unimmunized BALB/c mice (the same type as for the positive dataset). The dataset is publicly available from the Observed Antibody Space [54] and the experimental setup is described in Khan et al. [49]. The negative dataset contains a total of 19772 unique IgG heavy chain sequences with the number of readouts. The positive datasets contain sequences of IgG heavy chain (VH) of immunized BALB/c mice repertoire sorted by a droplet microfluidics platform by the binding status of two immunogenic targets: Tetanus toxoid (TT) and Glucose-6-Phosphate Isomerase (GPI). The number of unique IgG heavy chain sequences in the two positive datasets is 3881 for TT and 3233 for GPI. All sequences were aligned using the Martin antibody numbering. The complete preprocessing pipeline is described in the Section B of the S1 Text.
[Figure omitted. See PDF.]
Fig 2. Method’s application to antibody Rep-seq data.
(a) Depicts the inference of the selection process, where the initial antibody population represents the unimmunized repertoire (negative set). After the immune response, the library undergoes selection to bind the antigen (positive set), shaping the immunoglobulin population. (b) Displays a plot of background and selection energies for both negative and positive sets. Red crosses represent the test set of antibodies with affinity measures (panel e). The selection energy effectively distinguishes antibodies in the immunized repertoire from those in the unimmunized repertoire. (c) Classification Task: Demonstrates the model’s ability to discriminate between binders and non-binders by presenting results on a random test set composed of negative and positive antibodies. ROC curves for the GPI and TT cases yield area under the curve (AUC) values of 0.89 and 0.98, respectively. (d) and (e) Model energy vs. Experimental Affinity: Show scatter plots comparing the selection energy (y-axis in panel b) with affinity measures for a set of antibodies. Specifically, EC50 values for TT and Kd measures for GPI. Notably, a significant correlation (Spearman coefficient 0.76) is observed in the latter case. The GPI test set is indicated by red crosses in panel (b).
https://doi.org/10.1371/journal.pcbi.1011812.g002
We analyzed the similarity of positive and negative datasets in terms of different sequence statistics such as distances from consensus sequence, site conservation and covariance, alignment PCA, and germline distributions. These preliminary analyses however were not capable of revealing sensible differences between the two datasets (see more details in Section B and Fig A of the S1 Text).
We test the model for two distinct tasks: one is the classification of binders and not binders and the second is the model estimate of the binding affinity.
The first test assesses the ability of the model to discriminate between binders and not binders. For this purpose, we split the positive and background datasets into a training set (to learn the model) and a test set to validate the classification predictions. This validation procedure was chosen due to the lack of a large list of IgG labeled as binders of the two antigens (TT and GPI). Thus, we use the positive and background set as a proxy for binders/not binders labels. Fig 2 shows the results of the classification task. The sequences with low selection energy are likely to be part of the immune response to the specific antigen. The model can discriminate remarkably well the binders (of the positive set) and not binders (of the background set), as demonstrated by the ROC curve in the test sets of both targets (AUC 0.98 for TT and 0.89 for GPI).
In Gerard et al. [50], the authors reported the experimental measures of the dissociation constant Kd with GPI of a small set of antibodies (14 binders and 2 not-binders) and the EC50 values against TT for another small set (42 binders and 4 not-binders). Using this test set, we can test whether the inferred selection energy correlates with the antibody affinity or the neutralization power. As shown in Fig 2, (panel b), the antibodies in the test set (crosses in Fig 2, (panel b)) are evenly sampled from the sequence space from the positive ensemble and lay on the high-selectivity model energy region. The results show that inferred selection energy correlates with Kd GPI measures, while there is no significant correlation between selection energy and EC50 in the TT case (see Fig 2 panels d,e). Using our statistical model to quantitatively predict the activity of binding sequence variants in terms of binding affinity turns out to be a more challenging task, compared to the classification task. Although we do not have a clear-cut explanation for why we failed on the TT dataset (while doing a pretty decent job on the GPI dataset), we speculate that: (i) The activity measurements in the two experiments are different. For the GPI case, Surface Plasmonic Resonance (SPR) was used to establish the dissociation constant Kd, while in the TT the EC50, i.e. the concentration required to obtain a 50% maximum antibody-ligand binding, was measured using ELISA. It is known that SPR measurements, albeit more complex, are generally more accurate compared to ELISA [55] because SPR measures the association Kon and dissociation rate Koff for the calculation of equilibrium dissociation constant (Kd = Koff/Kon), a more reliable measure for binding affinity. (ii) Although it is known that the immune response to TT is orchestrated by a complex interplay between the heavy and light chain [56], we could not take into account the contribution of antibodies’ light chains to neutralization, as the light chain of the background dataset was not sequenced. In other terms, if the contribution of the heavy chain alone seems to be sufficient to discriminate binder vs. not binder, it is possible that the contribution of both chains would be necessary for our model energy to better correlate with the binding affinity to TT.
Deep mutational scanning (DMS)
Deep mutational scanning experiments are explicitly designed to quantify mutation effects on fitness. The broadness of the library and the sequencing depth are chosen to compute reliable enrichment measures for the variants ([8, 26]). Thus, approaches that leverage the enrichment ratios are more suitable to address these datasets. Nevertheless, it provides an interesting controlled case to assess the inference procedure and compare it to other tools. The screening experiment described in [26] probes the binding affinity of the human WW-domain with its peptide ligand. More than 6 × 105 unique variants are generated in the initial library, which comprises almost all single point mutations, a fourth of the double mutations, and almost 2% of all three point mutations. Then, six rounds of phage display screening are performed, and rounds 3 and 6 are sequenced. In Boyer et al. [52], the library variability leans on a short sequence segment of the CDR3 region of an antibody’s heavy chain (L = 4). The library (chosen among 24 different scaffolds around the CDR3 region) is subsequently screened for three rounds of panning against a polyvinylpyrrolidone target. In the experiment described in Wu et al. [53], the variants library contains all possible mutations of four residues of the IgG-binding domain (GB1). The library is then screened to bind an immunoglobulin fragment target in a single round of selection.
We perform a 5-fold cross-validation to test the inference method: for each dataset, a random selection of 4/5 trains the model while 1/5 operates as a benchmark. We compare the model energy function with the empirical log-selectivity, which is computed from the enrichment ratios and serves as a proxy for the variants’ fitness. The performance is then evaluated by the Pearson correlation between the model energies E and the log-selectivities in the test set. On all datasets, we observe high correlations as shown in panel (a) of Fig 3, where we report their trend with the number of sequences in the test set. Specifically, moving from right to left of the horizontal axis the sequences characterized by a higher uncertainty of the log-selectivity are progressively pruned. Several approaches can be employed in order to estimate selectivity uncertainties, ranging from bare Poisson counting, denoising procedures [28], or fit over different experiment replicas [29] (if available). Here we follow the approach outlined in [7], where the uncertainty is estimated as the error related to the regression procedure for estimating empirical selectivities θm (see Eq (4) in the S1 Text). Finally, we compare the results with the Deterministic Rare Binding (DRB) inference method developed in [7]. As expected, the DRB method performs better in all three datasets, as it uses the enrichment information. Nevertheless, we underline that we are still able to obtain an energy function highly correlated with selectivity, close to the best performance. Furthermore, for the Wu et al. dataset [53], we notice how the discrepancy between the two methods becomes very shallow, due to the broad coverage of sequence space. Lastly, we remark that DRB is unable to handle other datasets considered in this work.
[Figure omitted. See PDF.]
Fig 3. Comparative analysis on DMS and EE data.
Panel (a) shows the overall performance of the method on DMS data: the Pearson correlation coefficient between inferred selective energies E and empirical log-selectivities (Eq (4) of the S1 Text). The correlations are reported as a function of data fraction pruned for the level of noise. The selectivity as a proxy for the fitness is more reliable for variants less affected by the noise, and consistently, for those variants the correlation with the energy is greater. Results are compared with the DRB method, which gains by using the enrichment information. Panel (b): comparison between AnnealDCA and AMaLa [6] for the reconstruction of the fitness landscape of TEM-1 from [22] data. Accuracy is quantified via the Pearson correlation between inferred energies E and independent fitness measurements [15, 16], as a function of the threshold discrepancy between the two datasets x. Panel (c): contact prediction sensitivity plot for the protein PSE-1. AnnealDCA (blue), AMaLa (green), and pseudo-likelihood DCA (orange) are inferred using data of [23] (DCA uses the last round only, as in [5]). On the y-axis, the positive predicted value is reported as a function of the first L residue pairs, sorted in decreasing order of the Frobenius norm (see Section A in the S1 Text). The vertical solid line coincides with L/2 residue pairs. Panel (d): Contact map related to panel (c). The plot is an L × L representation of the possible contacts between protein residues. The prediction of DCA (lower-left) and AnnealDCA (upper-right) are compared; correctly/incorrectly predicted contacts are respectively reported in green/red for DCA and blue/orange for AnnealDCA. Panel (e): scatter plot between selective energies inferred on [22] and fitness measurements of Firnberg et al [16] for a specific threshold value x = 0.8. Energies are rescaled with respect to the wild-type sequence ΔE = E(S) − E(Swt).
https://doi.org/10.1371/journal.pcbi.1011812.g003
Experimental Evolution (EE)
Due to its flexibility, we can apply the method to experiments in which new protein variants appear via a mutagenesis step at each new round. In this case, as discussed in [6], we cannot compute the enrichment ratios and selectivity. The G Hamiltonian in Eq (6), although being time-independent, accounts for the mutational step in an effective manner, as is demonstrated by the high correlation between G and the Hamming distance from the wild-type sequence (see Fig I of the S1 Text). The E part, on the other hand, corresponds to the selection process.
We collect data from three experiments described in Fantini et al. and Stiffler et al. [22, 23]. The authors screen proteins responsible for antibiotic resistance in bacteria: TEM-1 and PSE-1 variants of the β-lactamase family and AAC6 protein of the acetyltransferase family. Starting from a wild-type protein, error-prone PCR creates new mutants at each round. Subsequently, the library undergoes a selection step in which bacteria equipped with the mutants are exposed to an antibiotic-rich environment. This cycle of mutagenesis and screening is repeated multiple times and for a subset of the panning rounds a sample of the library is sequenced. Specifically, 20 rounds of EE at an ampicillin concentration of 6μg/mL are performed for PSE-1, among which rounds 10 and 20 are sequenced, whereas AAC6 mutants are subjected to 8 rounds at a kanamycin concentration of 10μg/mL, of which rounds 2, 4 and 8 are sequenced. Finally, in [22] TEM-1 mutants are exposed to two different antibiotic concentrations: 25μg/mL for all rounds but 5 and 12, for which the concentration is raised to 100μg/mL. Out of the 12 experimental cycles, rounds 1, 5, and 12 are sequenced.
We performed two different validations to assess the inferred model:
1. (i) In the case of TEM-1 β-lactamase, we directly compare the model energy with independent fitness measurements related to antibiotic resistance, collected from [15, 16]. In [15] variants fitness is quantified in terms of minimum inhibitory concentration (MIC), that is, the minimum antibiotic concentration necessary to neutralize bacteria equipped with that variant. On the other hand, in [16], the authors directly measured the gene fitness (see Section A of the S1 Text). For our analysis, we mapped the measurements of [16] onto those of [15], following the procedure outlined in [37]. The results show that the inferred energy correlates with the experimental fitness (see panel (b) and (e) of Fig 3). The method described in [6], specifically designed for these experiments performs systematically better.
2. (ii) In the PSE-1 and AAC6 cases, for which fitness measurements are not available, we validate the model using the prediction of protein structure contact map as prescribed by the DCA method [34, 35]. Then, the predictions are compared to the crystallographic studies of the protein structures.
The contact predictions are obtained using the coupling parameters J, which quantify the interaction between residues in the DCA framework [34, 35]. We used the Frobenius norm of the q × q matrix Jij to obtain a score quantifying the epistatic interactions between pairs of positions (see Eq (5) in S1 Text), on top of which we apply the average product correction. These residues are more likely to be found in spatial proximity in the folded structure as shown in panels (c), (d) of Fig 3.
Discussion
Several machine-learning methods have been proposed for learning a protein fitness landscape using sequencing data obtained from high-throughput screening experiments [2, 7, 31, 32]. However, these methods require observation of the trajectory in multiple rounds of selection of a statistically relevant set of variants. This presents a limitation as detecting the single variants trajectory in the population is often unfeasible in many experimental setups. To overcome this issue, we propose AnnealDCA, a novel machine-learning framework inspired by the simulated annealing method from statistical physics [30]. This approach can handle sequencing data derived from a broad range of experiments that use selection and sequencing to quantify the activity of protein variants, including Deep Mutational Scanning, Experimental Evolution, and antibodies Repertoire Sequencing (Rep-Seq), among others.
In our approach, selection acts as a cooling process where the distribution of the population on the fitness landscape is gradually peaked around regions of higher fitness. The samples before and after the selection are considered at different statistical temperatures and the inference method decouples the distribution contribution due to the initial library and the time-dependent fitness part. The general mathematical framework and the inference method can be applied to most of the experimental cases where a population of protein variants undergoes a selective process and is sequenced at different times. Such datasets include, among others, protein screening experiments with one or multiple panning rounds, and the collection of Rep-Seq samples at different infection times.
To demonstrate the effectiveness of this approach, we applied the method to antibodies Rep-Seq data of immunized mice to predict the antibody’s affinity towards the antigen. We learned the model energies from the repertoire of mice unimmunized and subjected to two antigens. The model energy was then used to accurately classify binders and not-binders to the antigen. This was supported by the fact that it correlated well with experimental measures of the Kd antibody-antigen of a set of antibodies not used in the training of the model.
To further test our approach, we applied it to more controlled experimental setups using three Deep Mutational Scans experiments. The results of 5-fold cross-validation showed a high correlation between the inferred fitness landscape and the experimental selectivity. Additionally, we applied the method to Experimental Evolution experiments of three proteins responsible for antibiotic resistance in bacteria, where mutations are added to increase the variability and explore sequence space around a wild-type sequence. The model energy precisely described the antibiotic resistance measurements of a set of variants. Moreover, the model coupling parameters were able to predict the three-dimensional contact map with a level of precision comparable to other approaches.
In summary, AnnealDCA provides a simple but effective strategy that can be applied to different experiments and data types where a population of protein variants undergoes a selective process and is sequenced at different times.
Supporting information
S1 Text. Supporting information.
Additional details about methods, datasets and further results.
https://doi.org/10.1371/journal.pcbi.1011812.s001
(PDF)
Acknowledgments
We thank Adam Woolfe, Andreas Raue, and Annabelle Gerard for helpful discussions and assistance with the Rep-seq data.
References
1. 1. Di Gioacchino A, Procyk J, Molari M, Schreck JS, Zhou Y, Liu Y, et al. Generative and interpretable machine learning for aptamer design and analysis of in vitro sequence selection. PLoS computational biology. 2022;18(9):e1010561. pmid:36174101
* View Article
* PubMed/NCBI
* Google Scholar
2. 2. Otwinowski J, McCandlish DM, Plotkin JB. Inferring the shape of global epistasis. Proceedings of the National Academy of Sciences. 2018;115(32):E7550–E7558. pmid:30037990
* View Article
* PubMed/NCBI
* Google Scholar
3. 3. Otwinowski J. Biophysical Inference of Epistasis and the Effects of Mutations on Protein Stability and Function. Molecular Biology and Evolution. 2018;35(10):2345–2354. pmid:30085303
* View Article
* PubMed/NCBI
* Google Scholar
4. 4. Otwinowski J, Plotkin JB. Inferring fitness landscapes by regression produces biased estimates of epistasis. Proceedings of the National Academy of Sciences. 2014;111(22):E2301–E2309. pmid:24843135
* View Article
* PubMed/NCBI
* Google Scholar
5. 5. Bisardi M, Rodriguez-Rivas J, Zamponi F, Weigt M. Modeling sequence-space exploration and emergence of epistatic signals in protein evolution. Molecular biology and evolution. 2022;39(1):msab321. pmid:34751386
* View Article
* PubMed/NCBI
* Google Scholar
6. 6. Sesta L, Uguzzoni G, Fernandez-de Cossio-Diaz J, Pagnani A. Amala: Analysis of directed evolution experiments via annealed mutational approximated landscape. International journal of molecular sciences. 2021;22(20):10908. pmid:34681569
* View Article
* PubMed/NCBI
* Google Scholar
7. 7. Fernandez-de Cossio-Diaz J, Uguzzoni G, Pagnani A. Unsupervised Inference of Protein Fitness Landscape from Deep Mutational Scan. Molecular Biology and Evolution. 2020.
* View Article
* Google Scholar
8. 8. Araya CL, Fowler DM, Chen W, Muniez I, Kelly JW, Fields S. A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proceedings of the National Academy of Sciences. 2012;109(42):16858–16863. pmid:23035249
* View Article
* PubMed/NCBI
* Google Scholar
9. 9. Olson CA, Wu NC, Sun R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Current Biology. 2014;24(22):2643–2651. pmid:25455030
* View Article
* PubMed/NCBI
* Google Scholar
10. 10. Starita LM, Young DL, Islam M, Kitzman JO, Gullingsrud J, Hause RJ, et al. Massively parallel functional analysis of BRCA1 RING domain variants. Genetics. 2015;200(2):413–422. pmid:25823446
* View Article
* PubMed/NCBI
* Google Scholar
11. 11. Aakre CD, Herrou J, Phung TN, Perchuk BS, Crosson S, Laub MT. Evolving new protein-protein interaction specificity through promiscuous intermediates. Cell. 2015;163(3):594–606. pmid:26478181
* View Article
* PubMed/NCBI
* Google Scholar
12. 12. Kitzman JO, Starita LM, Lo RS, Fields S, Shendure J. Massively parallel single-amino-acid mutagenesis. Nature methods. 2015;12(3):203–206. pmid:25559584
* View Article
* PubMed/NCBI
* Google Scholar
13. 13. Romero PA, Tran TM, Abate AR. Dissecting enzyme function with microfluidic-based deep mutational scanning. Proceedings of the National Academy of Sciences. 2015;112(23):7159–7164. pmid:26040002
* View Article
* PubMed/NCBI
* Google Scholar
14. 14. Starita LM, Pruneda JN, Lo RS, Fowler DM, Kim HJ, Hiatt JB, et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proceedings of the National Academy of Sciences. 2013;110(14):E1263–E1272.
* View Article
* Google Scholar
15. 15. Jacquier H, Birgy A, Le Nagard H, Mechulam Y, Schmitt E, Glodt J, et al. Capturing the mutational landscape of the beta-lactamase TEM-1. Proceedings of the National Academy of Sciences. 2013;110(32):13067–13072. pmid:23878237
* View Article
* PubMed/NCBI
* Google Scholar
16. 16. Firnberg E, Labonte JW, Gray JJ, Ostermeier M. A comprehensive, high-resolution map of a gene’s fitness landscape. Molecular biology and evolution. 2014;31(6):1581–1592. pmid:24567513
* View Article
* PubMed/NCBI
* Google Scholar
17. 17. Starr TN, Greaney AJ, Hilton SK, Ellis D, Crawford KH, Dingens AS, et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. cell. 2020;182(5):1295–1310. pmid:32841599
* View Article
* PubMed/NCBI
* Google Scholar
18. 18. Starr TN, Greaney AJ, Stewart CM, Walls AC, Hannon WW, Veesler D, et al. Deep mutational scans for ACE2 binding, RBD expression, and antibody escape in the SARS-CoV-2 Omicron BA. 1 and BA. 2 receptor-binding domains. PLoS pathogens. 2022;18(11):e1010951. pmid:36399443
* View Article
* PubMed/NCBI
* Google Scholar
19. 19. Bolognesi B, Faure AJ, Seuma M, Schmiedel JM, Tartaglia GG, Lehner B. The mutational landscape of a prion-like domain. Nature communications. 2019;10(1):4162. pmid:31519910
* View Article
* PubMed/NCBI
* Google Scholar
20. 20. Haddox HK, Dingens AS, Hilton SK, Overbaugh J, Bloom JD. Mapping mutational effects along the evolutionary landscape of HIV envelope. Elife. 2018;7:e34420. pmid:29590010
* View Article
* PubMed/NCBI
* Google Scholar
21. 21. Faure AJ, Domingo J, Schmiedel JM, Hidalgo-Carcedo C, Diss G, Lehner B. Mapping the energetic and allosteric landscapes of protein binding domains. Nature. 2022;604(7904):175–183. pmid:35388192
* View Article
* PubMed/NCBI
* Google Scholar
22. 22. Fantini M, Lisi S, De Los Rios P, Cattaneo A, Pastore A. Protein Structural Information and Evolutionary Landscape by In Vitro Evolution. Molecular Biology and Evolution. 2019;37(4):1179–1192.
* View Article
* Google Scholar
23. 23. Stiffler MA, Poelwijk FJ, Brock KP, Stein RR, Riesselman A, Teyra J, et al. Protein structure from experimental evolution. Cell Systems. 2020;10(1):15–24. pmid:31838147
* View Article
* PubMed/NCBI
* Google Scholar
24. 24. Byrne LC, Day TP, Visel M, Strazzeri JA, Fortuny C, Dalkara D, et al. In vivo–directed evolution of adeno-associated virus in the primate retina. JCI insight. 2020;5(10). pmid:32271719
* View Article
* PubMed/NCBI
* Google Scholar
25. 25. Benichou J, Ben-Hamo R, Louzoun Y, Efroni S. Rep-Seq: uncovering the immunological repertoire through next-generation sequencing. Immunology. 2012;135(3):183–191. pmid:22043864
* View Article
* PubMed/NCBI
* Google Scholar
26. 26. Fowler DM, Araya CL, Fleishman SJ, Kellogg EH, Stephany JJ, Baker D, et al. High-resolution mapping of protein sequence-function relationships. Nature methods. 2010;7(9):741. pmid:20711194
* View Article
* PubMed/NCBI
* Google Scholar
27. 27. Nemoto T, Ocari T, Planul A, Tekinsoy M, Zin EA, Dalkara D, et al. In-silico monitoring of directed evolution convergence to unveil best performing variants with credibility score. bioRxiv. 2023.
* View Article
* Google Scholar
28. 28. Rubin AF, Gelman H, Lucas N, Bajjalieh SM, Papenfuss AT, Speed TP, et al. A statistical framework for analyzing deep mutational scanning data. Genome biology. 2017;18(1):150. pmid:28784151
* View Article
* PubMed/NCBI
* Google Scholar
29. 29. Faure AJ, Schmiedel JM, Baeza-Centurion P, Lehner B. DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies. Genome Biology. 2020;21(1):1–23. pmid:32799905
* View Article
* PubMed/NCBI
* Google Scholar
30. 30. Kirkpatrick S, Gelatt CD Jr, Vecchi MP. Optimization by simulated annealing. science. 1983;220(4598):671–680. pmid:17813860
* View Article
* PubMed/NCBI
* Google Scholar
31. 31. Bravi B, Di Gioacchino A, Fernandez-de Cossio-Diaz J, Walczak AM, Mora T, Cocco S, et al. Learning the differences: a transfer-learning approach to predict antigen immunogenicity and T-cell receptor specificity. bioRxiv. 2022; p. 2022–12.
* View Article
* Google Scholar
32. 32. Isacchini G, Walczak AM, Mora T, Nourmohammad A. Deep generative selection models of T and B cell receptor repertoires with soNNia. Proceedings of the National Academy of Sciences. 2021;118(14):e2023141118.
* View Article
* Google Scholar
33. 33. Neher RA, Shraiman BI. Statistical genetics and evolution of quantitative traits. Reviews of Modern Physics. 2011;83(4):1283.
* View Article
* Google Scholar
34. 34. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks Debora S and Sander C, Zecchina R, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein fami lies. Proceedings of the National Academy of Sciences. 2011;108(49):E1293–E1301.
* View Article
* Google Scholar
35. 35. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein–protein interaction by message passing. Proceedings of the National Academy of Sciences. 2009;106(1):67–72. pmid:19116270
* View Article
* PubMed/NCBI
* Google Scholar
36. 36. Asti L, Uguzzoni G, Marcatili P, Pagnani A. Maximum-entropy models of sequenced immune repertoires predict antigen-antibody affinity. PLoS computational biology. 2016;12(4):e1004870. pmid:27074145
* View Article
* PubMed/NCBI
* Google Scholar
37. 37. Figliuzzi M, Jacquier H, Schug A, Tenaillon O, Weigt M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Molecular biology and evolution. 2016;33(1):268–280. pmid:26446903
* View Article
* PubMed/NCBI
* Google Scholar
38. 38. Hopf TA, Ingraham JB, Poelwijk FJ, Schärfe CP, Springer M, Sander C, et al. Mutation effects predicted from sequence co-variation. Nature biotechnology. 2017;35(2):128–135. pmid:28092658
* View Article
* PubMed/NCBI
* Google Scholar
39. 39. Miton CM, Tokuriki N. How mutational epistasis impairs predictability in protein evolution and design. Protein Science. 2016;25(7):1260–1272. pmid:26757214
* View Article
* PubMed/NCBI
* Google Scholar
40. 40. Starr TN, Thornton JW. Epistasis in protein evolution. Protein Science. 2016;25(7):1204–1218. pmid:26833806
* View Article
* PubMed/NCBI
* Google Scholar
41. 41. Rodriguez-Rivas J, Croce G, Muscat M, Weigt M. Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes. Proceedings of the National Academy of Sciences. 2022;119(4):e2113118119. pmid:35022216
* View Article
* PubMed/NCBI
* Google Scholar
42. 42. Ekeberg M, Lövkvist C, Lan Y, Weigt M, Aurell E. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E. 2013;87(1):012707. pmid:23410359
* View Article
* PubMed/NCBI
* Google Scholar
43. 43. Ekeberg M, Hartonen T, Aurell E. Fast pseudolikelihood maximization for direct-coupling analysis of protein structure from many homologous amino-acid sequences. Journal of Computational Physics. 2014;276:341–356.
* View Article
* Google Scholar
44. 44. Ferguson AL, Mann JK, Omarjee S, Ndung’u T, Walker BD, Chakraborty AK. Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity. 2013;38(3):606–617. pmid:23521886
* View Article
* PubMed/NCBI
* Google Scholar
45. 45. Zhang H, Quadeer AA, McKay MR. Evolutionary modeling reveals enhanced mutational flexibility of HCV subtype 1b compared with 1a. Iscience. 2022;25(1). pmid:34988406
* View Article
* PubMed/NCBI
* Google Scholar
46. 46. Mann JK, Barton JP, Ferguson AL, Omarjee S, Walker BD, Chakraborty A, et al. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS computational biology. 2014;10(8):e1003776. pmid:25102049
* View Article
* PubMed/NCBI
* Google Scholar
47. 47. Quadeer AA, Louie RH, McKay MR. Identifying immunologically-vulnerable regions of the HCV E2 glycoprotein and broadly neutralizing antibodies that target them. Nature communications. 2019;10(1):2073. pmid:31061402
* View Article
* PubMed/NCBI
* Google Scholar
48. 48. Quadeer AA, Barton JP, Chakraborty AK, McKay MR. Deconvolving mutational patterns of poliovirus outbreaks reveals its intrinsic fitness landscape. Nature communications. 2020;11(1):377. pmid:31953427
* View Article
* PubMed/NCBI
* Google Scholar
49. 49. Khan TA, Friedensohn S, Gorter de Vries AR, Straszewski J, Ruscheweyh HJ, Reddy ST. Accurate and predictive antibody repertoire profiling by molecular amplification fingerprinting. Science advances. 2016;2(3):e1501371. pmid:26998518
* View Article
* PubMed/NCBI
* Google Scholar
50. 50. Gérard A, Woolfe A, Mottet G, Reichen M, Castrillon C, Menrath V, et al. High-throughput single-cell activity-based screening and sequencing of antibodies using droplet microfluidics. Nature biotechnology. 2020;38(6):715–721. pmid:32231335
* View Article
* PubMed/NCBI
* Google Scholar
51. 51. Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nature methods. 2014;11(8):801–807. pmid:25075907
* View Article
* PubMed/NCBI
* Google Scholar
52. 52. Boyer S, Biswas D, Kumar Soshee A, Scaramozzino N, Nizak C, Rivoire O. Hierarchy and extremes in selections from pools of randomized proteins. Proceedings of the National Academy of Sciences. 2016;113(13):3482–3487. pmid:26969726
* View Article
* PubMed/NCBI
* Google Scholar
53. 53. Wu NC, Dai L, Olson CA, Lloyd-Smith JO, Sun R. Adaptation in protein fitness landscapes is facilitated by indirect paths. Elife. 2016;5:e16965. pmid:27391790
* View Article
* PubMed/NCBI
* Google Scholar
54. 54. Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires. The Journal of Immunology. 2018;201(8):2502–2509. pmid:30217829
* View Article
* PubMed/NCBI
* Google Scholar
55. 55. Heinrich L, Tissot N, Hartmann DJ, Cohen R. Comparison of the results obtained by ELISA and surface plasmon resonance for the determination of antibody affinity. Journal of immunological methods. 2010;352(1-2):13–22. pmid:19854197
* View Article
* PubMed/NCBI
* Google Scholar
56. 56. Sorouri M, Fitzsimmons SP, Aydanian AG, Bennett S, Shapiro MA. Diversity of the antibody response to tetanus toxoid: comparison of hybridoma library to phage display library. PloS one. 2014;9(9):e106699. pmid:25268771
* View Article
* PubMed/NCBI
* Google Scholar
Citation: Sesta L, Pagnani A, Fernandez-de-Cossio-Diaz J, Uguzzoni G (2024) Inference of annealed protein fitness landscapes with AnnealDCA. PLoS Comput Biol 20(2): e1011812. https://doi.org/10.1371/journal.pcbi.1011812
1. Di Gioacchino A, Procyk J, Molari M, Schreck JS, Zhou Y, Liu Y, et al. Generative and interpretable machine learning for aptamer design and analysis of in vitro sequence selection. PLoS computational biology. 2022;18(9):e1010561. pmid:36174101
2. Otwinowski J, McCandlish DM, Plotkin JB. Inferring the shape of global epistasis. Proceedings of the National Academy of Sciences. 2018;115(32):E7550–E7558. pmid:30037990
3. Otwinowski J. Biophysical Inference of Epistasis and the Effects of Mutations on Protein Stability and Function. Molecular Biology and Evolution. 2018;35(10):2345–2354. pmid:30085303
4. Otwinowski J, Plotkin JB. Inferring fitness landscapes by regression produces biased estimates of epistasis. Proceedings of the National Academy of Sciences. 2014;111(22):E2301–E2309. pmid:24843135
5. Bisardi M, Rodriguez-Rivas J, Zamponi F, Weigt M. Modeling sequence-space exploration and emergence of epistatic signals in protein evolution. Molecular biology and evolution. 2022;39(1):msab321. pmid:34751386
6. Sesta L, Uguzzoni G, Fernandez-de Cossio-Diaz J, Pagnani A. Amala: Analysis of directed evolution experiments via annealed mutational approximated landscape. International journal of molecular sciences. 2021;22(20):10908. pmid:34681569
7. Fernandez-de Cossio-Diaz J, Uguzzoni G, Pagnani A. Unsupervised Inference of Protein Fitness Landscape from Deep Mutational Scan. Molecular Biology and Evolution. 2020.
8. Araya CL, Fowler DM, Chen W, Muniez I, Kelly JW, Fields S. A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proceedings of the National Academy of Sciences. 2012;109(42):16858–16863. pmid:23035249
9. Olson CA, Wu NC, Sun R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Current Biology. 2014;24(22):2643–2651. pmid:25455030
10. Starita LM, Young DL, Islam M, Kitzman JO, Gullingsrud J, Hause RJ, et al. Massively parallel functional analysis of BRCA1 RING domain variants. Genetics. 2015;200(2):413–422. pmid:25823446
11. Aakre CD, Herrou J, Phung TN, Perchuk BS, Crosson S, Laub MT. Evolving new protein-protein interaction specificity through promiscuous intermediates. Cell. 2015;163(3):594–606. pmid:26478181
12. Kitzman JO, Starita LM, Lo RS, Fields S, Shendure J. Massively parallel single-amino-acid mutagenesis. Nature methods. 2015;12(3):203–206. pmid:25559584
13. Romero PA, Tran TM, Abate AR. Dissecting enzyme function with microfluidic-based deep mutational scanning. Proceedings of the National Academy of Sciences. 2015;112(23):7159–7164. pmid:26040002
14. Starita LM, Pruneda JN, Lo RS, Fowler DM, Kim HJ, Hiatt JB, et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proceedings of the National Academy of Sciences. 2013;110(14):E1263–E1272.
15. Jacquier H, Birgy A, Le Nagard H, Mechulam Y, Schmitt E, Glodt J, et al. Capturing the mutational landscape of the beta-lactamase TEM-1. Proceedings of the National Academy of Sciences. 2013;110(32):13067–13072. pmid:23878237
16. Firnberg E, Labonte JW, Gray JJ, Ostermeier M. A comprehensive, high-resolution map of a gene’s fitness landscape. Molecular biology and evolution. 2014;31(6):1581–1592. pmid:24567513
17. Starr TN, Greaney AJ, Hilton SK, Ellis D, Crawford KH, Dingens AS, et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. cell. 2020;182(5):1295–1310. pmid:32841599
18. Starr TN, Greaney AJ, Stewart CM, Walls AC, Hannon WW, Veesler D, et al. Deep mutational scans for ACE2 binding, RBD expression, and antibody escape in the SARS-CoV-2 Omicron BA. 1 and BA. 2 receptor-binding domains. PLoS pathogens. 2022;18(11):e1010951. pmid:36399443
19. Bolognesi B, Faure AJ, Seuma M, Schmiedel JM, Tartaglia GG, Lehner B. The mutational landscape of a prion-like domain. Nature communications. 2019;10(1):4162. pmid:31519910
20. Haddox HK, Dingens AS, Hilton SK, Overbaugh J, Bloom JD. Mapping mutational effects along the evolutionary landscape of HIV envelope. Elife. 2018;7:e34420. pmid:29590010
21. Faure AJ, Domingo J, Schmiedel JM, Hidalgo-Carcedo C, Diss G, Lehner B. Mapping the energetic and allosteric landscapes of protein binding domains. Nature. 2022;604(7904):175–183. pmid:35388192
22. Fantini M, Lisi S, De Los Rios P, Cattaneo A, Pastore A. Protein Structural Information and Evolutionary Landscape by In Vitro Evolution. Molecular Biology and Evolution. 2019;37(4):1179–1192.
23. Stiffler MA, Poelwijk FJ, Brock KP, Stein RR, Riesselman A, Teyra J, et al. Protein structure from experimental evolution. Cell Systems. 2020;10(1):15–24. pmid:31838147
24. Byrne LC, Day TP, Visel M, Strazzeri JA, Fortuny C, Dalkara D, et al. In vivo–directed evolution of adeno-associated virus in the primate retina. JCI insight. 2020;5(10). pmid:32271719
25. Benichou J, Ben-Hamo R, Louzoun Y, Efroni S. Rep-Seq: uncovering the immunological repertoire through next-generation sequencing. Immunology. 2012;135(3):183–191. pmid:22043864
26. Fowler DM, Araya CL, Fleishman SJ, Kellogg EH, Stephany JJ, Baker D, et al. High-resolution mapping of protein sequence-function relationships. Nature methods. 2010;7(9):741. pmid:20711194
27. Nemoto T, Ocari T, Planul A, Tekinsoy M, Zin EA, Dalkara D, et al. In-silico monitoring of directed evolution convergence to unveil best performing variants with credibility score. bioRxiv. 2023.
28. Rubin AF, Gelman H, Lucas N, Bajjalieh SM, Papenfuss AT, Speed TP, et al. A statistical framework for analyzing deep mutational scanning data. Genome biology. 2017;18(1):150. pmid:28784151
29. Faure AJ, Schmiedel JM, Baeza-Centurion P, Lehner B. DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies. Genome Biology. 2020;21(1):1–23. pmid:32799905
30. Kirkpatrick S, Gelatt CD Jr, Vecchi MP. Optimization by simulated annealing. science. 1983;220(4598):671–680. pmid:17813860
31. Bravi B, Di Gioacchino A, Fernandez-de Cossio-Diaz J, Walczak AM, Mora T, Cocco S, et al. Learning the differences: a transfer-learning approach to predict antigen immunogenicity and T-cell receptor specificity. bioRxiv. 2022; p. 2022–12.
32. Isacchini G, Walczak AM, Mora T, Nourmohammad A. Deep generative selection models of T and B cell receptor repertoires with soNNia. Proceedings of the National Academy of Sciences. 2021;118(14):e2023141118.
33. Neher RA, Shraiman BI. Statistical genetics and evolution of quantitative traits. Reviews of Modern Physics. 2011;83(4):1283.
34. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks Debora S and Sander C, Zecchina R, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein fami lies. Proceedings of the National Academy of Sciences. 2011;108(49):E1293–E1301.
35. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein–protein interaction by message passing. Proceedings of the National Academy of Sciences. 2009;106(1):67–72. pmid:19116270
36. Asti L, Uguzzoni G, Marcatili P, Pagnani A. Maximum-entropy models of sequenced immune repertoires predict antigen-antibody affinity. PLoS computational biology. 2016;12(4):e1004870. pmid:27074145
37. Figliuzzi M, Jacquier H, Schug A, Tenaillon O, Weigt M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Molecular biology and evolution. 2016;33(1):268–280. pmid:26446903
38. Hopf TA, Ingraham JB, Poelwijk FJ, Schärfe CP, Springer M, Sander C, et al. Mutation effects predicted from sequence co-variation. Nature biotechnology. 2017;35(2):128–135. pmid:28092658
39. Miton CM, Tokuriki N. How mutational epistasis impairs predictability in protein evolution and design. Protein Science. 2016;25(7):1260–1272. pmid:26757214
40. Starr TN, Thornton JW. Epistasis in protein evolution. Protein Science. 2016;25(7):1204–1218. pmid:26833806
41. Rodriguez-Rivas J, Croce G, Muscat M, Weigt M. Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes. Proceedings of the National Academy of Sciences. 2022;119(4):e2113118119. pmid:35022216
42. Ekeberg M, Lövkvist C, Lan Y, Weigt M, Aurell E. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E. 2013;87(1):012707. pmid:23410359
43. Ekeberg M, Hartonen T, Aurell E. Fast pseudolikelihood maximization for direct-coupling analysis of protein structure from many homologous amino-acid sequences. Journal of Computational Physics. 2014;276:341–356.
44. Ferguson AL, Mann JK, Omarjee S, Ndung’u T, Walker BD, Chakraborty AK. Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity. 2013;38(3):606–617. pmid:23521886
45. Zhang H, Quadeer AA, McKay MR. Evolutionary modeling reveals enhanced mutational flexibility of HCV subtype 1b compared with 1a. Iscience. 2022;25(1). pmid:34988406
46. Mann JK, Barton JP, Ferguson AL, Omarjee S, Walker BD, Chakraborty A, et al. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS computational biology. 2014;10(8):e1003776. pmid:25102049
47. Quadeer AA, Louie RH, McKay MR. Identifying immunologically-vulnerable regions of the HCV E2 glycoprotein and broadly neutralizing antibodies that target them. Nature communications. 2019;10(1):2073. pmid:31061402
48. Quadeer AA, Barton JP, Chakraborty AK, McKay MR. Deconvolving mutational patterns of poliovirus outbreaks reveals its intrinsic fitness landscape. Nature communications. 2020;11(1):377. pmid:31953427
49. Khan TA, Friedensohn S, Gorter de Vries AR, Straszewski J, Ruscheweyh HJ, Reddy ST. Accurate and predictive antibody repertoire profiling by molecular amplification fingerprinting. Science advances. 2016;2(3):e1501371. pmid:26998518
50. Gérard A, Woolfe A, Mottet G, Reichen M, Castrillon C, Menrath V, et al. High-throughput single-cell activity-based screening and sequencing of antibodies using droplet microfluidics. Nature biotechnology. 2020;38(6):715–721. pmid:32231335
51. Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nature methods. 2014;11(8):801–807. pmid:25075907
52. Boyer S, Biswas D, Kumar Soshee A, Scaramozzino N, Nizak C, Rivoire O. Hierarchy and extremes in selections from pools of randomized proteins. Proceedings of the National Academy of Sciences. 2016;113(13):3482–3487. pmid:26969726
53. Wu NC, Dai L, Olson CA, Lloyd-Smith JO, Sun R. Adaptation in protein fitness landscapes is facilitated by indirect paths. Elife. 2016;5:e16965. pmid:27391790
54. Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires. The Journal of Immunology. 2018;201(8):2502–2509. pmid:30217829
55. Heinrich L, Tissot N, Hartmann DJ, Cohen R. Comparison of the results obtained by ELISA and surface plasmon resonance for the determination of antibody affinity. Journal of immunological methods. 2010;352(1-2):13–22. pmid:19854197
56. Sorouri M, Fitzsimmons SP, Aydanian AG, Bennett S, Shapiro MA. Diversity of the antibody response to tetanus toxoid: comparison of hybridoma library to phage display library. PloS one. 2014;9(9):e106699. pmid:25268771
About the Authors:
Luca Sesta
Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
* E-mail: [email protected]
Affiliation: Department of Applied Science and Technology, Politecnico di Torino, Torino, Italy
https://orcid.org/0000-0002-6051-0972
Andrea Pagnani
Roles: Funding acquisition, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing
Affiliations: Department of Applied Science and Technology, Politecnico di Torino, Torino, Italy, Italian Institute for Genomic Medicine, Torino, Italy, INFN, Sezione di Torino, Torino, Italy
https://orcid.org/0000-0002-6509-0807
Jorge Fernandez-de-Cossio-Diaz
Contributed equally to this work with: Jorge Fernandez-de-Cossio-Diaz, Guido Uguzzoni
Roles: Conceptualization, Investigation, Methodology, Software, Supervision, Writing – review & editing
Affiliation: Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 & PSL Research, Sorbonne Université, Paris, France
https://orcid.org/0000-0002-4476-805X
Guido Uguzzoni
Contributed equally to this work with: Jorge Fernandez-de-Cossio-Diaz, Guido Uguzzoni
Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing
Affiliation: Italian Institute for Genomic Medicine, Torino, Italy
https://orcid.org/0000-0003-4192-2864
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2024 Sesta et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
The design of proteins with specific tasks is a major challenge in molecular biology with important diagnostic and therapeutic applications. High-throughput screening methods have been developed to systematically evaluate protein activity, but only a small fraction of possible protein variants can be tested using these techniques. Computational models that explore the sequence space in-silico to identify the fittest molecules for a given function are needed to overcome this limitation. In this article, we propose AnnealDCA, a machine-learning framework to learn the protein fitness landscape from sequencing data derived from a broad range of experiments that use selection and sequencing to quantify protein activity. We demonstrate the effectiveness of our method by applying it to antibody Rep-Seq data of immunized mice and screening experiments, assessing the quality of the fitness landscape reconstructions. Our method can be applied to several experimental cases where a population of protein variants undergoes various rounds of selection and sequencing, without relying on the computation of variants enrichment ratios, and thus can be used even in cases of disjoint sequence samples.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer