
Abstract

Advances in mass spectrometry (MS) have enabled high-throughput analysis of proteomes in biological systems. The state-of-the-art MS data analysis relies on database search algorithms to quantify proteins by identifying peptide-spectrum matches (PSMs), which convert mass spectra to peptide sequences. Different database search algorithms use distinct search strategies and thus may identify unique PSMs. However, no existing approaches can aggregate all user-specified database search algorithms with a guaranteed increase in the number of identified peptides and a control on the false discovery rate (FDR). To fill in this gap, we proposed a statistical framework, Aggregation of Peptide Identification Results (APIR), that is universally compatible with all database search algorithms. Notably, under an FDR threshold, APIR is guaranteed to identify at least as many, if not more, peptides as individual database search algorithms do. Evaluation of APIR on a complex proteomics standard dataset showed that APIR outpowers individual database search algorithms and empirically controls the FDR. Real data studies showed that APIR can identify disease-related proteins and post-translational modifications missed by some individual database search algorithms. The APIR framework is easily extendable to aggregating discoveries made by multiple algorithms in other high-throughput biomedical data analysis, e.g., differential gene expression analysis on RNA sequencing data. The APIR R package is available at https://github.com/yiling0210/APIR.

Details

Indexing method: Automated
Title
APIR: Aggregating Universal Proteomics Database Search Algorithms for Peptide Identification with FDR Control
Author
Chen, Yiling Elaine 1; Ge, Xinzhou 1; Woyshner, Kyla 2; McDermott, MeiLu 3; Manousopoulou, Antigoni 2; Ficarro, Scott B 4; Marto, Jarrod A 4; Li, Kexin 1; Wang, Leo David 5; Li, Jingyi Jessica 6

1 Department of Statistics and Data Science, University of California, Los Angeles, CA 90095, USA
2 Department of Immuno-Oncology, Beckman Research Institute, City of Hope National Medical Center, Duarte, CA 91010, USA
3 Department of Immuno-Oncology, Beckman Research Institute, City of Hope National Medical Center, Duarte, CA 91010, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
4 Department of Cancer Biology and Blais Proteomics Center, Dana-Farber Cancer Institute, Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02215, USA
5 Department of Immuno-Oncology, Beckman Research Institute, City of Hope National Medical Center, Duarte, CA 91010, USA; Department of Pediatrics, City of Hope National Medical Center, Duarte, CA 91010, USA
6 Department of Statistics and Data Science, University of California, Los Angeles, CA 90095, USA; Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, CA 90095, USA; Department of Computational Medicine, University of California, Los Angeles, CA 90095, USA; Department of Biostatistics, University of California, Los Angeles, CA 90095, USA
Journal abbreviation
Genomics Proteomics Bioinformatics
Grant
K08 CA201591. NCI NIH HHS. United States. 
R35 GM140888. NIGMS NIH HHS. United States. 
P30CA033572. National Cancer Institute under Cancer Center. 
K08CA201591. National Cancer Institute, USA. 
P30 CA033572. NCI NIH HHS. United States. 
R01 GM120507. NIGMS NIH HHS. United States. 
T32 LM012424. NLM NIH HHS. United States. 
Volume
22
Issue
2
Publication year
2024
Country of publication
ENGLAND
eISSN
2210-3244
Source type
Scholarly Journal
Peer reviewed
Yes
Format availability
Internet
Language of publication
English
Record type
Journal Article
Publication note
Print
Publication history
   Accepted date
28 Aug 2024
   Revised date
23 Oct 2025
   First submitted date
28 Aug 2024
Medline document status
MEDLINE
PubMed ID
39198030
ProQuest document ID
3099140675
Document URL
https://www.proquest.com/scholarly-journals/apir-aggregating-universal-proteomics-database/docview/3099140675/se-2?accountid=208611
Copyright
© The Author(s) 2024. Published by Oxford University Press and Science Press on behalf of the Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.
Last updated
2025-10-23
Database
ProQuest One Academic