Abstract

Clathrin is a key cytoplasmic protein that serves as the predominant structural element in the formation of coated vesicles. Specifically, clathrin enables the scission of newly formed vesicles from the plasma membrane’s cytoplasmic face. Efficient and accurate identification of clathrins is essential for understanding human diseases and aiding drug target development. Recent advances in computational methods that identify clathrins from sequence data have greatly improved large-scale clathrin screening. Here, we propose a computational approach, termed PLM-CLA, to identify clathrins more accurately. In PLM-CLA, we leveraged multi-source pre-trained protein language models (PLMs), trained on large-scale protein sequences from multiple database sources, including ProtT5-BFD, ProtT5-UR50, ProstT5, and ESM-2. These models were used to encode complementary feature embeddings that capture diverse information. To the best of our knowledge, PLM-CLA is the first method to use multiple PLM-based embeddings to identify clathrins. To enhance prediction performance, we applied a feature selection method to optimize the fused feature embeddings. Finally, we employed a long short-term memory (LSTM) neural network model coupled with the optimal feature subset to identify clathrins. Benchmarking experiments, including independent tests, showed that PLM-CLA significantly outperformed state-of-the-art methods, achieving an accuracy of 0.961, MCC of 0.917, and AUC of 0.997. Furthermore, PLM-CLA achieved MCC values of 0.971 and 0.904 on two existing independent test datasets. We anticipate that the proposed PLM-CLA model will serve as a promising tool for large-scale identification of clathrins in resource-limited settings.
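
The abstract reports accuracy and the Matthews correlation coefficient (MCC) for a binary clathrin/non-clathrin classifier. As a minimal sketch of how these two metrics follow from a confusion matrix, the snippet below uses the standard definitions. The function names and the example counts are hypothetical and chosen for illustration only; they are not taken from the paper or its datasets.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when any marginal sum is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, not results from the paper.
    tp, tn, fp, fn = 48, 49, 1, 2
    print(f"ACC = {accuracy(tp, tn, fp, fn):.3f}")
    print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are imbalanced.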

Details

Title
Advancing the accuracy of clathrin protein prediction through multi-source protein language models
Author
Shoombuatong, Watshara 1 ; Schaduangrat, Nalini 1 ; Mookdarsanit, Pakpoom 2 ; Nikom, Jaru 3 ; Mookdarsanit, Lawankorn 4

1 Mahidol University, Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Bangkok, Thailand (GRID:grid.10223.32) (ISNI:0000 0004 1937 0490)
2 Chandrakasem Rajabhat University, Computer Science and Artificial Intelligence, Faculty of Science, Bangkok, Thailand (GRID:grid.443698.4) (ISNI:0000 0004 0399 0644)
3 Prince of Songkla University, Research Methodology and Data Analytics Program, Faculty of Science and Technology, Pattani, Thailand (GRID:grid.7130.5) (ISNI:0000 0004 0470 1162)
4 Chandrakasem Rajabhat University, Business Information System, Faculty of Management Science, Bangkok, Thailand (GRID:grid.443698.4) (ISNI:0000 0004 0399 0644)
Volume
15
Issue
1
Pages
24403
Publication year
2025
Publication date
2025
Publisher
Nature Publishing Group
Place of publication
London
Country of publication
United States
e-ISSN
20452322
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-07-08
Milestone dates
2025-06-23 (Registration); 2025-02-18 (Received); 2025-06-23 (Accepted)
First posting date
08 Jul 2025
ProQuest document ID
3228193313
Document URL
https://www.proquest.com/scholarly-journals/advancing-accuracy-clathrin-protein-prediction/docview/3228193313/se-2?accountid=208611
Copyright
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-07-09
Database
ProQuest One Academic