Abstract

Engineering stabilized proteins is a fundamental challenge in the development of industrial and pharmaceutical biotechnologies. We present Stability Oracle: a structure-based graph-transformer framework that achieves state-of-the-art (SOTA) performance in identifying thermodynamically stabilizing mutations. Our framework introduces several innovations to overcome well-known challenges in data scarcity and bias, generalization, and computation time: Thermodynamic Permutations for data augmentation, structural amino acid embeddings to model a mutation with a single structure, and a protein structure-specific attention-bias mechanism that makes transformers a viable alternative to graph neural networks. We provide training/test splits that mitigate data leakage and ensure proper model evaluation. Furthermore, to examine our data engineering contributions, we fine-tune ESM2 representations (Prostata-IFML) and achieve SOTA for sequence-based models. Notably, Stability Oracle outperforms Prostata-IFML even though it was pretrained on 2000X fewer proteins and has 548X fewer parameters. Our framework establishes a path for fine-tuning structure-based transformers to virtually any phenotype, a necessary task for accelerating the development of protein-based biotechnologies.
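
The abstract names Thermodynamic Permutations as a data augmentation but does not define it. The sketch below illustrates one plausible reading, assuming the idea rests on Gibbs free energy being a state function: if several mutations at the same position each have a measured ΔΔG relative to the wild type, the ΔΔG between any two of those residues follows by subtraction. The function name `expand_site`, its arguments, and the example values are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def expand_site(wild_type, ddg_from_wt):
    """Expand measurements at ONE position into all ordered residue pairs.

    wild_type   -- one-letter code of the wild-type residue (hypothetical input)
    ddg_from_wt -- dict: mutant residue -> measured ddG (kcal/mol) for wt -> mutant

    Assumption: ddG is path independent, so ddG(a -> b) = ddG(wt -> b) - ddG(wt -> a).
    """
    # Free energy of each residue state relative to the wild type (wild type = 0).
    relative = {wild_type: 0.0, **ddg_from_wt}

    expanded = {}
    # Every ordered pair of distinct residues observed at this position.
    for src, dst in permutations(relative, 2):
        expanded[(src, dst)] = relative[dst] - relative[src]
    return expanded


# Illustrative values only: two measured mutations at an alanine position.
augmented = expand_site("A", {"G": 1.0, "V": -0.5})
for (src, dst), ddg in sorted(augmented.items()):
    print(f"{src}->{dst}: {ddg:+.2f} kcal/mol")
```

Under this assumption, n measured mutations at a position give n + 1 residue states and therefore n(n + 1) ordered pairs, including the reversions to the wild type. Whether the published method keeps or excludes pairs involving the wild type is not stated in the abstract.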

Engineering stabilized proteins is essential for industrial and pharmaceutical biotechnologies. Here, the authors present Stability Oracle, a graph-transformer framework trained on masked protein microenvironments to predict protein thermodynamic stability, using less training data while achieving improved generalization.

Details

Title
Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations
Author
Diaz, Daniel J. 1; Gong, Chengyue 2; Ouyang-Zhang, Jeffrey 2; Loy, James M. 3; Wells, Jordan 4; Yang, David 5; Ellington, Andrew D. 5; Dimakis, Alexandros G. 6; Klivans, Adam R. 2

1 Department of Computer Science, UT Austin, Austin, USA (GRID:grid.89336.37) (ISNI:0000 0004 1936 9924); Intelligent Proteins, LLC, Austin, USA (GRID:grid.89336.37); Department of Chemistry, UT Austin, Austin, USA (GRID:grid.89336.37) (ISNI:0000 0004 1936 9924)
2 Department of Computer Science, UT Austin, Austin, USA (GRID:grid.89336.37) (ISNI:0000 0004 1936 9924)
3 Intelligent Proteins, LLC, Austin, USA (GRID:grid.89336.37); Department of Molecular Biosciences, UT Austin, Austin, USA (GRID:grid.89336.37) (ISNI:0000 0004 1936 9924)
4 McKetta Department of Chemical Engineering, UT Austin, Austin, USA (GRID:grid.89336.37) (ISNI:0000 0004 1936 9924)
5 Department of Molecular Biosciences, UT Austin, Austin, USA (GRID:grid.89336.37) (ISNI:0000 0004 1936 9924)
6 Chandra Family Department of Electrical and Computer Engineering, UT Austin, Austin, USA (GRID:grid.89336.37) (ISNI:0000 0004 1936 9924)
Pages
6170
Publication year
2024
Publication date
2024
Publisher
Nature Publishing Group
e-ISSN
20411723
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3083763767
Copyright
© The Author(s) 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.