Embeddings from protein language models predict conservation and variant effects

Abstract

The emergence of SARS-CoV-2 variants underscored the demand for tools that interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) data sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation from single sequences almost as accurately as ConSeq did using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) showed our approach to be competitive with state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, either consistently or statistically significantly, regardless of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the computing and energy costs. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
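
To make the conservation metric quoted above concrete, the following is a minimal sketch of how a two-state Matthews Correlation Coefficient can be computed from per-residue labels. It is not the authors' evaluation code: the function name `two_state_mcc`, the 1/0 encoding (1 = conserved, 0 = not conserved), and the toy label lists are assumptions made purely for illustration.

```python
import math


def two_state_mcc(observed, predicted):
    """Two-state MCC for binary labels (assumed encoding: 1 = conserved, 0 = not)."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # true positives
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # true negatives
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false positives
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # false negatives

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention: MCC is set to 0 when a row or column of the
        # confusion matrix is empty (e.g. only one class was predicted).
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical toy labels for eight residues (not data from the paper).
observed = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1, 1, 0]
print(two_state_mcc(observed, predicted))
```

MCC ranges from -1 (every prediction inverted) through 0 (no better than chance) to 1 (perfect agreement), so the reported values of 0.596 and 0.608 sit on the same scale and can be compared directly.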

Details

Title

Embeddings from protein language models predict conservation and variant effects

Author

Marquet, Céline¹; Heinzinger, Michael¹; Olenyi, Tobias¹; Dallago, Christian¹; Erckert, Kyra¹; Bernhofer, Michael¹; Nechaev, Dmitrii¹; Rost, Burkhard²

¹ TUM-Technical University of Munich, Department of Informatics, Bioinformatics and Computational Biology - i12, Munich, Germany (GRID:grid.6936.a) (ISNI:0000000123222966); TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany (GRID:grid.6936.a) (ISNI:0000000123222966)
² TUM-Technical University of Munich, Department of Informatics, Bioinformatics and Computational Biology - i12, Munich, Germany (GRID:grid.6936.a) (ISNI:0000000123222966); Institute for Advanced Study (TUM-IAS), Munich, Germany (GRID:grid.6936.a) (ISNI:0000000123222966); TUM School of Life Sciences Weihenstephan (TUM-WZW), Freising, Germany (GRID:grid.6936.a)

Pages

1629-1647

Publication year

2022

Publication date

Oct 2022

Publisher

Springer Nature B.V.

ISSN

03406717

e-ISSN

14321203

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1007/s00439-021-02411-y

ProQuest document ID

2719234315

© The Author(s) 2021. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
