Abstract

Technological advances in massively parallel sequencing have led to an exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects producing new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets with applications ranging from medicine to biology. However, the low levels of sequence annotation may impair its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. This is achieved through the release of the DPCfam-UHGP50 dataset containing 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline grouping sequences into single or multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at protein and residue levels compared to the manually curated repository Pfam. In the hope that DPCfam-UHGP50 will foster future discoveries in the field of metagenomics of the human gut, we release a FAIR-compliant database of our results that is easily accessible via a searchable web server and Zenodo repository.

Details

Title
Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering
Author
Barone, Federico 1 ; Russo, Elena Tea 2 ; Villegas Garcia, Edith Natalia 1 ; Punta, Marco 3 ; Cozzini, Stefano 2 ; Ansuini, Alessio 2 ; Cazzaniga, Alberto 2

1 Padriciano, 99, Area Science Park, Trieste, Italy (GRID:grid.419994.8) (ISNI:0000 0004 1759 4706); University of Trieste, Trieste, Italy (GRID:grid.5133.4) (ISNI:0000 0001 1941 4308)
2 Padriciano, 99, Area Science Park, Trieste, Italy (GRID:grid.419994.8) (ISNI:0000 0004 1759 4706)
3 Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy (GRID:grid.5133.4) (ISNI:0000 0004 1784 8390); Unit of Immunogenetics, Leukemia Genomics and Immunobiology, Division of Immunology, Transplantation and Infectious Disease, IRCCS San Raffaele Institute, Milan, Italy (GRID:grid.18887.3e) (ISNI:0000 0004 1758 1884)
Pages
568
Publication year
2024
Publication date
2024
Publisher
Nature Publishing Group
e-ISSN
20524463
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3062956488
Copyright
© The Author(s) 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.