APIR: Aggregating Universal Proteomics Database Search Algorithms for Peptide Identification with FDR Control

Abstract

Advances in mass spectrometry (MS) have enabled high-throughput analysis of proteomes in biological systems. The state-of-the-art MS data analysis relies on database search algorithms to quantify proteins by identifying peptide-spectrum matches (PSMs), which convert mass spectra to peptide sequences. Different database search algorithms use distinct search strategies and thus may identify unique PSMs. However, no existing approaches can aggregate all user-specified database search algorithms with a guaranteed increase in the number of identified peptides and a control on the false discovery rate (FDR). To fill in this gap, we proposed a statistical framework, Aggregation of Peptide Identification Results (APIR), that is universally compatible with all database search algorithms. Notably, under an FDR threshold, APIR is guaranteed to identify at least as many, if not more, peptides as individual database search algorithms do. Evaluation of APIR on a complex proteomics standard dataset showed that APIR outpowers individual database search algorithms and empirically controls the FDR. Real data studies showed that APIR can identify disease-related proteins and post-translational modifications missed by some individual database search algorithms. The APIR framework is easily extendable to aggregating discoveries made by multiple algorithms in other high-throughput biomedical data analysis, e.g., differential gene expression analysis on RNA sequencing data. The APIR R package is available at https://github.com/yiling0210/APIR.

Details

Journal classification

Index Medicus

MeSH subject

Humans;
Software;
Proteomics -- methods (major);
Algorithms (major);
Databases, Protein (major);
Peptides -- metabolism (major);
Peptides -- genetics (major)

Substance

Substance:

Peptides

CAS:

Supplemental data

Indexing method: Automated

Sponsor

NCI NIH HHS, NIGMS NIH HHS, National Cancer Institute under Cancer Center, National Cancer Institute, USA, NLM NIH HHS

Title

APIR: Aggregating Universal Proteomics Database Search Algorithms for Peptide Identification with FDR Control

Author

Chen, Yiling Elaine¹; Ge, Xinzhou¹; Woyshner, Kyla²; McDermott, MeiLu³; Manousopoulou, Antigoni²; Ficarro, Scott B⁴; Marto, Jarrod A⁴; Li, Kexin¹; Wang, Leo David⁵; Li, Jingyi Jessica⁶

¹ Department of Statistics and Data Science, University of California, Los Angeles, CA 90095, USA
² Department of Immuno-Oncology, Beckman Research Institute, City of Hope National Medical Center, Duarte, CA 91010, USA
³ Department of Immuno-Oncology, Beckman Research Institute, City of Hope National Medical Center, Duarte, CA 91010, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
⁴ Department of Cancer Biology and Blais Proteomics Center, Dana-Farber Cancer Institute, Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02215, USA
⁵ Department of Immuno-Oncology, Beckman Research Institute, City of Hope National Medical Center, Duarte, CA 91010, USA; Department of Pediatrics, City of Hope National Medical Center, Duarte, CA 91010, USA
⁶ Department of Statistics and Data Science, University of California, Los Angeles, CA 90095, USA; Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA 90095, USA; Department of Human Genetics, University of California, Los Angeles, CA 90095, USA; Department of Computational Medicine, University of California, Los Angeles, CA 90095, USA; Department of Biostatistics, University of California, Los Angeles, CA 90095, USA

Correspondence author

Chen, Yiling Elaine

Publication title

Genomics, proteomics & bioinformatics

Journal abbreviation

Genomics Proteomics Bioinformatics

Grant

K08 CA201591. NCI NIH HHS. United States.

R35 GM140888. NIGMS NIH HHS. United States.

P30CA033572. National Cancer Institute under Cancer Center.

K08CA201591. National Cancer Institute, USA.

P30 CA033572. NCI NIH HHS. United States.

R01 GM120507. NIGMS NIH HHS. United States.

T32 LM012424. NLM NIH HHS. United States.

Volume

Issue

Publication year

2024

Country of publication

ENGLAND

eISSN

2210-3244

Source type

Scholarly Journal

Peer reviewed

Yes

Format availability

Internet

Language of publication

English

Record type

Journal Article

Publication note

Publication history

Accepted date

28 Aug 2024

Revised date

23 Oct 2025

First submitted date

28 Aug 2024

DOI

https://doi.org/10.1093/gpbjnl/qzae042

Medline document status

MEDLINE

PubMed ID

39198030

ProQuest document ID

3099140675

Document URL

https://www.proquest.com/scholarly-journals/apir-aggregating-universal-proteomics-database/docview/3099140675/se-2?accountid=208611

© The Author(s) 2024. Published by Oxford University Press and Science Press on behalf of the Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.

Last updated

2025-10-23

Database

ProQuest One Academic
