© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Biobanks are a rich source of data for genome-wide association studies (GWAS). They store clinical data from electronic health records, with data domains such as laboratory measurements, conditions, and self-reported diagnoses. Traditionally, biobank GWAS utilize case-control cohorts built exclusively from conditions. However, because reported conditions are primarily collected for billing purposes, they face data quality issues. Consequently, incorporating additional data domains in cohort construction can improve cohort accuracy and GWAS results. Here, we assess the impact of various rule-based phenotyping algorithms on GWAS outcomes, examining factors such as power, heritability, replicability, functional annotations, and polygenic risk score prediction accuracy across seven diseases in the UK Biobank. We find that high-complexity phenotyping algorithms generally improve GWAS outcomes, including increased power, hits within coding and functional genomic regions, and co-localization with expression quantitative trait loci. Our findings suggest that biobank-scale GWAS can benefit from phenotyping algorithms that integrate multiple data domains.
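
The contrast the abstract draws, between cohorts built from conditions alone and cohorts built from several data domains, can be illustrated with a minimal sketch. This is not the authors' algorithm: the `Participant` fields, the codes, the lab name, the threshold, and the "at least two domains agree" rule are all hypothetical placeholders chosen only to show the general shape of a rule-based phenotype definition.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    # Hypothetical record layout; field names and contents are illustrative only.
    conditions: set = field(default_factory=set)     # coded conditions from health records
    self_reported: set = field(default_factory=set)  # self-reported diagnoses
    lab_values: dict = field(default_factory=dict)   # laboratory measurements by name


def conditions_only_case(p, condition_codes):
    """Low-complexity rule: a case is anyone with a qualifying condition code."""
    return bool(p.conditions & condition_codes)


def multi_domain_case(p, condition_codes, self_report_terms,
                      lab_name, lab_threshold, min_domains=2):
    """Higher-complexity rule: a case needs evidence from at least
    `min_domains` of the three data domains (an assumed rule, not the paper's)."""
    evidence = [
        bool(p.conditions & condition_codes),
        bool(p.self_reported & self_report_terms),
        p.lab_values.get(lab_name, float("-inf")) >= lab_threshold,
    ]
    return sum(evidence) >= min_domains


# Hypothetical codes, terms, and threshold.
codes = {"COND_A"}
terms = {"condition a"}

billing_only = Participant(conditions={"COND_A"})
corroborated = Participant(conditions={"COND_A"}, lab_values={"marker_x": 7.2})

for name, person in [("billing_only", billing_only), ("corroborated", corroborated)]:
    print(name,
          conditions_only_case(person, codes),
          multi_domain_case(person, codes, terms, "marker_x", 6.5))
```

Under these assumptions, a participant whose only evidence is a single condition code counts as a case under the conditions-only rule but not under the multi-domain rule, which is the kind of cohort difference whose downstream effect on GWAS outcomes the study evaluates.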

Details

Title
Multi-domain rule-based phenotyping algorithms enable improved GWAS signal
Author
Newbury, Abigail 1; Elhussein, Ahmed 1; Gürsoy, Gamze 2

1 Department of Biomedical Informatics, Columbia University, New York City, NY, USA (ROR: https://ror.org/00hj8s172) (GRID: grid.21729.3f) (ISNI: 0000 0004 1936 8729); New York Genome Center, New York City, NY, USA (ROR: https://ror.org/05wf2ga96) (GRID: grid.429884.b) (ISNI: 0000 0004 1791 0895)
2 Department of Biomedical Informatics, Columbia University, New York City, NY, USA (ROR: https://ror.org/00hj8s172) (GRID: grid.21729.3f) (ISNI: 0000 0004 1936 8729); New York Genome Center, New York City, NY, USA (ROR: https://ror.org/05wf2ga96) (GRID: grid.429884.b) (ISNI: 0000 0004 1791 0895); Department of Computer Science, Columbia University, New York City, NY, USA (ROR: https://ror.org/00hj8s172) (GRID: grid.21729.3f) (ISNI: 0000 0004 1936 8729)
Pages
499
Section
Article
Publication year
2025
Publication date
Dec 2025
Publisher
Nature Publishing Group
e-ISSN
23986352
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3235851458