Abstract

Identifying representative sequences for groups of functionally similar proteins and enzymes poses significant computational challenges. In this study, we applied submodular optimization, a method effective in data summarization, to select representative sequences for thioesterase enzyme families. We introduced and validated two algorithms, Greedy and Bidirectional Greedy, using curated protein sequence data from the ThYme (Thioester-active enzYmes) database. Both algorithms generated sequence subsets that preserved completeness (inclusion of all known family sequences) and specificity (accurate family representation). The Greedy algorithm outperformed the Bidirectional Greedy algorithm and other methods, particularly in reducing redundancy. Our study offers an efficient approach for identifying representative protein sequences within families that have significant sequence similarity, likely to deliver results close to theoretical optima in polynomial time, with the potential to improve the selection and optimization of representative sequences in protein databases.
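As an illustration of the kind of greedy selection the abstract refers to, the sketch below picks representatives by maximizing a facility-location objective over a pairwise similarity matrix. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not state the objective function used, so facility location is assumed here as a common monotone submodular choice, and the names `facility_location`, `greedy_representatives`, and `toy_sim` are hypothetical. For monotone submodular objectives under a cardinality limit, this style of greedy selection is known to reach at least a (1 - 1/e) fraction of the optimal value, which is consistent with the abstract's remark about near-optimal results in polynomial time.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], chosen: List[int]) -> float:
    """Sum, over all items, of each item's best similarity to a chosen representative."""
    if not chosen:
        return 0.0
    return sum(max(row[j] for j in chosen) for row in sim)


def greedy_representatives(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Select up to k indices, each time adding the item with the largest marginal gain."""
    chosen: List[int] = []
    current = 0.0
    for _ in range(min(k, len(sim))):
        best_gain, best_idx = 0.0, None
        for cand in range(len(sim)):
            if cand in chosen:
                continue
            # Marginal gain of adding this candidate to the current selection.
            gain = facility_location(sim, chosen + [cand]) - current
            if gain > best_gain:
                best_gain, best_idx = gain, cand
        if best_idx is None:  # no remaining candidate improves the objective
            break
        chosen.append(best_idx)
        current += best_gain
    return chosen


# Made-up similarity scores (assumption, not data from the study):
# items 0-1 form one cluster, items 2-4 another, with item 3 central to it.
toy_sim = [
    [1.0, 0.8, 0.1, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9, 0.5],
    [0.1, 0.1, 0.9, 1.0, 0.9],
    [0.1, 0.1, 0.5, 0.9, 1.0],
]
reps = greedy_representatives(toy_sim, 2)
```

On this toy matrix the first pick is the central item of the larger cluster, and the second comes from the other cluster, since adding a near-duplicate of an existing representative yields little marginal gain. In practice the similarity matrix would come from sequence comparisons, which this sketch does not cover.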

Details

1009240
Title
Identifying representative sequences of protein families using submodular optimization
Volume
15
Issue
1
Pages
1069
Publication year
2025
Publication date
2025
Publisher
Nature Publishing Group
Place of publication
London
Country of publication
United States
Publication subject
e-ISSN
20452322
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-01-07
Milestone dates
2025-01-01 (Registration); 2024-07-09 (Received); 2025-01-01 (Accepted)
First posting date
07 Jan 2025
ProQuest document ID
3152424788
Document URL
https://www.proquest.com/scholarly-journals/identifying-representative-sequences-protein/docview/3152424788/se-2?accountid=208611
Copyright
Copyright Nature Publishing Group 2025
Last updated
2025-03-11
Database
ProQuest One Academic