
Abstract

Proteins play an essential role in various biological and engineering processes. Large protein language models (PLMs) show great potential to reshape protein research by accelerating the determination of protein functions and the design of proteins with desired functions. The prediction and design capacity of PLMs relies on the representations learned from protein sequences. However, the lack of crucial 3D structure information in most PLMs restricts their predictive capacity in many applications, especially those heavily dependent on 3D structures. To address this issue, S‐PLM is introduced as a 3D structure‐aware PLM that utilizes multi‐view contrastive learning to align the sequence and 3D structure of a protein in a coordinated latent space. S‐PLM applies a Swin Transformer to AlphaFold‐predicted protein structures to embed the structural information and fuses it with the sequence‐based embeddings from ESM2. Additionally, a library of lightweight tuning tools is provided to adapt S‐PLM for diverse downstream protein prediction tasks. The results demonstrate S‐PLM's superior performance over sequence‐only PLMs on all protein clustering and classification tasks, achieving performance competitive with state‐of‐the‐art methods that require both sequence and structure inputs. S‐PLM and its lightweight tuning tools are available at https://github.com/duolinwang/S-PLM/.
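As a brief illustration of the alignment objective described in the abstract, the following PyTorch sketch shows a symmetric (CLIP-style) contrastive loss that pulls a protein's sequence embedding and its structure embedding together in a shared latent space. This is a minimal sketch under assumptions, not the authors' implementation: the projection heads, embedding dimensions, temperature, and the stand-in inputs (which in S‐PLM would come from an ESM2 sequence encoder and a Swin Transformer over AlphaFold-predicted structures) are illustrative placeholders; the actual training code is in the repository linked above.

# Minimal sketch (not the authors' code): symmetric InfoNCE-style loss aligning
# per-protein sequence embeddings (e.g., pooled ESM2 outputs) with structure
# embeddings (e.g., pooled Swin Transformer outputs over predicted structures).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    def __init__(self, seq_dim=1280, struct_dim=768, latent_dim=256, temperature=0.07):
        super().__init__()
        # Hypothetical linear projection heads into the coordinated latent space.
        self.seq_proj = nn.Linear(seq_dim, latent_dim)
        self.struct_proj = nn.Linear(struct_dim, latent_dim)
        self.temperature = temperature

    def forward(self, seq_emb, struct_emb):
        # Project and L2-normalize both views of the same batch of proteins.
        z_seq = F.normalize(self.seq_proj(seq_emb), dim=-1)
        z_struct = F.normalize(self.struct_proj(struct_emb), dim=-1)
        # Pairwise cosine similarities scaled by temperature.
        logits = z_seq @ z_struct.t() / self.temperature
        # Matching sequence/structure pairs sit on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: sequence->structure and structure->sequence.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# Usage with random stand-in embeddings for a batch of 8 proteins.
aligner = ContrastiveAligner()
seq_emb = torch.randn(8, 1280)    # stand-in for pooled sequence embeddings
struct_emb = torch.randn(8, 768)  # stand-in for pooled structure embeddings
print(aligner(seq_emb, struct_emb).item())

Minimizing this loss drives matched sequence/structure pairs to high similarity and mismatched pairs to low similarity, which is the general idea behind aligning the two views in a shared latent space before downstream fine-tuning.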

Details

Title
S‐PLM: Structure‐Aware Protein Language Model via Contrastive Learning Between Sequence and Structure
Author
Wang, Duolin 1; Pourmirzaei, Mahdi 1; Abbas, Usman L. 2; Zeng, Shuai 1; Manshour, Negin 1; Esmaili, Farzaneh 1; Poudel, Biplab 1; Jiang, Yuexu 1; Shao, Qing 2; Chen, Jin 3; Xu, Dong 1

1 Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
2 Chemical & Materials Engineering, University of Kentucky, Lexington, KY, USA
3 Department of Medicine and Department of Biomedical Informatics and Data Science, University of Alabama at Birmingham, Birmingham, AL, USA
Publication title
Advanced Science
Volume
12
Issue
5
Number of pages
16
Publication year
2025
Publication date
Feb 1, 2025
Section
Research Article
Publisher
John Wiley & Sons, Inc.
Place of publication
Weinheim
Country of publication
Germany
Publication subject
e-ISSN
2198-3844
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2024-12-12
First posting date
12 Dec 2024
Milestone dates
Manuscript received: 2024-04-20; manuscript revised: 2024-08-21; published online (early, unpaginated): 2024-12-12; published online (final form): 2025-02-04
ProQuest document ID
3163164692
Document URL
https://www.proquest.com/scholarly-journals/s-plm-structure-aware-protein-language-model-via/docview/3163164692/se-2?accountid=208611
Copyright
© 2025. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-08-21
Database
ProQuest One Academic