Full Text

Turn on search term navigation

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Derived from the natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).

Details

Title
Survey of Protein Sequence Embedding Models
Author
Tran, Chau 1 ; Khadkikar, Siddharth 2 ; Porollo, Aleksey 3   VIAFID ORCID Logo 

 Department of Computer Science, University of Cincinnati, Cincinnati, OH 45219, USA 
 Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA 
 Center for Autoimmune Genomics and Etiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA; Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA; Department of Pediatrics, University of Cincinnati, Cincinnati, OH 45267, USA 
First page
3775
Publication year
2023
Publication date
2023
Publisher
MDPI AG
ISSN
16616596
e-ISSN
14220067
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2779604932
Copyright
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.