Abstract

DNA-binding proteins (DBPs) play critical roles in gene regulation, development, and environmental response across species, including plants, animals, and microorganisms. Although many machine learning and deep learning models have been developed to distinguish DBPs from non-DNA-binding proteins (NDBPs), most available tools have focused on human and mouse datasets. Consequently, few studies specifically address plant DBPs, which limits our understanding of their roles and functions in plant biology. An efficient framework for DBP prediction in plants would deepen this understanding and could enable precise control of gene expression, accelerate crop improvement, enhance stress resilience, and optimize metabolic engineering for agriculture. In this work, we developed a tool that uses a protein language model (pLM) pre-trained on millions of sequences. We evaluated several leading pLMs, including ProtT5, Ankh, and ESM-2, and leveraged their high-dimensional, information-rich representations to significantly improve the accuracy of DBP prediction in plants. Our final model, pLM-DBPs, is a feed-forward neural network classifier built on ProtT5-based representations; it outperformed existing approaches, achieving a Matthews Correlation Coefficient (MCC) of 83.8% on the independent test set. This represents a 10% improvement over the previous state-of-the-art model for plant DBP prediction.
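
For readers unfamiliar with the headline metric, the following is a minimal sketch of how the Matthews Correlation Coefficient can be computed for a binary DBP/NDBP classifier. It is not taken from the pLM-DBPs code base: the function name `matthews_corrcoef`, the label encoding (1 = DBP, 0 = NDBP), and the toy labels are assumptions made purely for illustration. The coefficient lies in [-1, 1], so the 83.8% reported above presumably corresponds to a value of 0.838 expressed as a percentage.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Compute the MCC for binary labels (assumed encoding: 1 = DBP, 0 = NDBP)."""
    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: return 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy, made-up labels (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC = {mcc:.3f} ({mcc * 100:.1f}%)")
```

Unlike plain accuracy, the MCC uses all four confusion-matrix cells, which makes it a common choice when the DBP and NDBP classes may be imbalanced.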

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

* Some minor grammatical errors have been corrected.

* https://github.com/suresh-pokharel/pLM-DBPs

Details

Title
pLM-DBPs: Enhanced DNA-Binding Protein Prediction in Plants Using Embeddings From Protein Language Models
Author
Pokharel, Suresh; Barasa, Kepha; Pratyush, Pawel; Kc, Dukka B
University/institution
Cold Spring Harbor Laboratory Press
Section
New Results
Publication year
2024
Publication date
Nov 7, 2024
Publisher
Cold Spring Harbor Laboratory Press
Source type
Working Paper
Language of publication
English
ProQuest document ID
3125865127
Copyright
© 2024. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.