Introduction
Precision medicine is an approach in healthcare that considers the individual variability in genes, environment, and lifestyle for each person1. By acknowledging these unique differences, it significantly enhances the diagnosis, treatment, and prevention of diseases, moving away from the traditional one-size-fits-all model. Biobanks are essential in this process. They offer a vast repository of data, encompassing genetic, clinical, environmental, and lifestyle information from diverse populations2,3. Biobank-scale data has proven particularly useful for genome-wide association studies (GWAS) by providing researchers with sufficient sample size and a diverse array of genetic variants to detect genotype–phenotype associations.
While biobanks provide researchers with access to a larger pool of participants than traditional approaches, they also introduce greater ambiguity in how to define cases and controls within this participant pool. Biobanks contain comprehensive electronic health records (EHRs) that include information from various domains—such as conditions, medications, and procedures4. However, data in the EHR is often incomplete and inaccurate. For example, clinicians may leave out information they assumed to be implicit, or there may be discrepancies between the clinical concept definition and the clinician’s intent5. Additionally, hospital billing practices can sometimes cause clinical concepts to be altered5,6. Thus, in order to create disease cohorts using the information in the EHR, researchers often use carefully crafted rules with inclusion and exclusion criteria, created in collaboration with clinicians7. This approach is referred to as rule-based phenotyping7. Rule-based phenotyping has been used for cohort identification in biobanks for a wide variety of diseases8,9. A common approach used in GWAS is to define cases by requiring that a condition is recorded with the relevant medical code (e.g., using the International Classification of Diseases [ICD] terminology) on at least two separate occasions (referred to as 2+ condition throughout this study). Another commonly used approach is Phecode phenotyping, which identifies cases and controls using specific sets of ICD codes10. However, the richness of EHR data allows for more complex phenotyping algorithms, which may better account for heterogeneity and missingness within the EHR and provide greater flexibility to correct for implicit biases encoded in a phenotyping algorithm5,7,11. More complex rule-based phenotyping algorithms leverage multiple EHR domains, such as observations, measurements, or procedures, in addition to participant conditions. One source of existing multi-domain algorithms is the Observational Health Data Sciences and Informatics (OHDSI) Phenotype Library, which contains over 900 phenotype definitions, some of which were developed in collaboration with clinicians12. Another source for multi-domain phenotyping algorithms is the UK Biobank’s algorithmically defined outcomes (ADO), which incorporate conditions in the EHR (ICD codes), recorded cause of death, and self-reported conditions from verbal interviews (referred to as self-report conditions)13. ADO was carefully curated by the UK Biobank Outcome Adjudication Group together with clinical experts, but it is only available for 15 diseases13.
Phenotyping algorithms with low positive predictive value in defining case and control cohorts will result in decreased power, diluted effect sizes, and substantial variability in heritability estimates for GWAS14. The impact of phenotype definition on GWAS has previously been shown for major depressive disorder, where GWAS hits (variants significantly associated with the disease) from minimal phenotyping definitions were shown to be less specific to the disease8. Previous work has also shown that Phecode phenotyping outperforms simpler ICD code phenotyping with respect to reproducing known genotype–phenotype associations and generating effect sizes of greater magnitude across 100 diseases15. For several diseases, phenotyping definitions that incorporate multiple data domains have also been shown to yield more accurate disease risk prediction with polygenic risk scores (PRS) than definitions based only on ICD codes or clinical symptoms16. Additionally, GWAS on cohorts created using a combination of ICD codes and self-reported conditions were found to have increased power while maintaining high genetic correlation with GWAS on cohorts created with either ICD codes or self-reported conditions alone17. However, a comprehensive comparison of GWAS results using different phenotyping methods is still lacking, including evaluations of replicability, accuracy of polygenic risk scores, genetic correlation, and other metrics across a diverse range of complex diseases. Furthermore, no existing work has evaluated GWAS performed on cohorts created using OHDSI and ADO phenotyping algorithms in comparison to more traditional phenotyping approaches.
Here, we compare GWAS results on cohorts created using 2+ condition, Phecode, OHDSI, and ADO algorithms for Alzheimer’s disease, asthma, chronic obstructive pulmonary disease (COPD), myocardial infarction (MI), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and type II diabetes (T2D) using genetics and EHR data from 405,811 UK Biobank participants. We selected these diseases because they are well-researched, complex conditions that impact diverse biological functions and exhibit varying prevalence across populations. We compared power, heritability, number of GWAS hits overlapping with coding and functional genomic regions, number of GWAS hits colocalized with expression Quantitative Trait Loci (eQTL) in relevant tissues, replicability, and accuracy of derived polygenic risk scores. The goal of this work is to aid researchers in selecting the appropriate off-the-shelf phenotyping algorithm for use in a GWAS. Overall, we found that complex phenotyping rules (such as the ADO and some of the OHDSI algorithms) had more power and produced more functional GWAS hits, while showing similar replicability, heritability, and accuracy of derived polygenic risk scores to other, simpler phenotyping algorithms.
Results
High complexity EHR phenotyping rules result in increased GWAS power
We used EHR data from 405,811 UK Biobank participants in the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) format18 to create cohorts using four different rule-based phenotyping algorithms (Fig. 1). All rule-based phenotyping algorithms, except for OHDSI, apply the same set of domain rules across all diseases. For example, ADO algorithms use both condition codes and self-reported conditions, while Phecode and 2+ condition rely only on condition codes. In contrast, OHDSI algorithms vary in the number and types of domains they use to define each disease. For instance, Alzheimer’s disease is defined by five domains: condition occurrence, drug exposure, procedure occurrence, measurement (i.e., lab tests, vital signs, and findings from pathology reports), and observation (i.e., clinical facts obtained in the context of an examination and data that cannot be represented by another domain, such as medical history and family history)19. Meanwhile, SLE is defined by only two domains: condition occurrence and drug exposure (see the “Methods” section for specific details on OHDSI’s rules for each disease studied).
Fig. 1 Workflow of this study and rule-based phenotyping algorithms.
a Blue arrows indicate the flow of analysis. Black box depicts the analysis performed for each of the seven diseases. Phenotype algorithm comparison is performed leveraging information from all seven diseases. b Explanation of case and control logic for the four different rule-based phenotyping algorithms. The purple icon indicates cases, while the black icon indicates controls.
To address this variability, we classify the phenotyping algorithms by their complexity level. Since self-reported conditions may reflect information from multiple domains (such as diagnosed conditions, drug exposure, and procedures), we classify ADO algorithms as “high complexity” for all diseases. Likewise, we label OHDSI algorithms for Alzheimer’s, asthma, and T2D as “high complexity”. We categorize OHDSI algorithms for COPD, myocardial infarction, rheumatoid arthritis, and SLE, as well as Phecode algorithms across all diseases, as “medium complexity”. Phecode algorithms are included as “medium complexity” as they contain curated condition sets for inclusion and exclusion and require condition occurrence on two or more distinct dates. Lastly, we categorize 2+ condition algorithms across all diseases as “low complexity” due to their simplicity. Note that an ADO algorithm is only available for Alzheimer’s disease, asthma, COPD, and MI, and there is no Phecode algorithm for COPD.
We created cohorts using all of the phenotyping algorithms present for each disease and performed sample quality control following standard GWAS sample quality control procedures (see the “Methods” section). We first compared the number of cases and controls in each cohort that remained after sample quality control. We found that the cohorts created with the high complexity algorithms had the highest number of cases and the highest number of unique cases (i.e., cases not found by any other phenotyping algorithm) (Fig. 2a, Supplementary Fig. 1). In the absence of high complexity algorithms (RA and SLE), the medium complexity OHDSI algorithm returned the greatest number of cases. The number of controls across all algorithms and diseases was similar (in the range of 343k to 405k). We next performed GWAS on each cohort (see the “Methods” section), including the covariates of age, sex, genotype array, and top 20 principal components of the genotype matrix. We first assessed each GWAS based on statistical power, the number of significant associations (hits), and liability-scale heritability. These metrics reflect, either directly or indirectly, the ability of the GWAS to detect an association if present. Power was calculated using the GAS power calculator20 and liability-scale heritability was calculated using linkage disequilibrium (LD) score regression (LDSC), which regresses each GWAS chi-squared statistic on the LD score of the variant to estimate SNP-based heritability (h2SNP) (see the “Methods” section)21. We found that cohorts created using high complexity algorithms generally resulted in GWAS with the greatest power (Fig. 2b). As the relative risk (the ratio of the probability of having the disease with one copy of the risk allele to that with zero copies) increased, for the majority of diseases, the GWAS power for all phenotyping algorithms approached one. Since the diagnostic accuracy of a phenotyping algorithm alters the effective sample size (and thus power) of downstream GWAS, we used the phenotype evaluation tool PheValuator to estimate the positive predictive value (ppv) and negative predictive value (npv) of all algorithms in order to better understand the relative power of each algorithm (see the “Methods” section)14,22. Effective sample size is altered by a dilution factor calculated using ppv and npv (ppv + npv − 1)14. Due to small sample sizes and model convergence issues, we were able to evaluate only a select number of diseases with PheValuator. We found that algorithms generally had similar npv and ppv. This suggests that high complexity algorithms generally achieved the greatest dilution-adjusted effective sample sizes, and also means that the increase in the number of cases was not due to false positives in cohort creation (Supplementary Fig. 2c). This helps to explain the observation that high complexity algorithms generally yielded more hits than lower complexity algorithms (Fig. 2c, Supplementary Table 3). Notably, the phenotyping algorithms generating the highest number of hits also produced the greatest number of unique hits (i.e., significant variants not found by any other phenotyping algorithm) (Supplementary Figs. 3, 4). Additionally, effect sizes for shared hits were highly correlated between phenotyping algorithms for each disease, with all R2 values above 0.95.
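As a concrete illustration of the dilution adjustment described above, the following minimal sketch (illustrative numbers, not values from this study) applies the dilution factor ppv + npv − 1 to the usual case–control effective sample size; the multiplicative form of the adjustment is an assumption of the sketch.

```python
# Minimal sketch: dilution-adjusted effective sample size for a case-control GWAS.
# The dilution factor (ppv + npv - 1) follows the description in the text; the
# multiplicative application to the effective sample size is an assumption here.

def dilution_adjusted_n(n_cases: int, n_controls: int, ppv: float, npv: float) -> float:
    """Shrink the effective GWAS sample size by the phenotyping dilution factor."""
    dilution = ppv + npv - 1.0
    # Standard case-control effective sample size: 4 / (1/n_cases + 1/n_controls).
    n_eff = 4.0 / (1.0 / n_cases + 1.0 / n_controls)
    return n_eff * dilution

if __name__ == "__main__":
    # Illustrative values only.
    print(dilution_adjusted_n(n_cases=20_000, n_controls=380_000, ppv=0.90, npv=0.99))
```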
Fig. 2 GWAS power, number of hits, and heritability.
Red and orange asterisks indicate high and medium complexity phenotyping algorithms, respectively. a Log10 case counts for each phenotyping algorithm. b GWAS power calculated using GAS power calculator. Overlapping lines for ADO and OHDSI algorithms for Alzheimer’s disease, Phecode and OHDSI algorithms for MI, and Phecode and 2+ condition algorithms for T2D. c Log10 count of GWAS hits, using significance level 5 × 10−8. d LDSC liability-scale SNP heritability estimates. Error bars represent the standard error of the estimate.
Next, we assessed the heritability and did not find large differences in the liability-scale heritability estimates across phenotyping algorithms (Fig. 2d). The maximum range of liability-scale heritability estimates within each disease was 6%. These trends held when excluding variants in the major histocompatibility (MHC) region from the LDSC heritability calculation (Supplementary Fig. 5). Additionally, we found that the genetic correlation between all phenotyping algorithms within each disease was above 0.93 (Supplementary Fig. 6).
While it is important to have well-powered GWAS that return a large number of hits, it is equally important that the hits returned are relevant to the disease of interest and reflect the known biology of that disease. Therefore, we investigated whether each GWAS was able to replicate known associations for the disease of interest. We first performed an 80/20 stratified split of participants from each cohort to create discovery and replication GWAS, respectively (see the “Methods” section). We found that cohorts created with high complexity algorithms had the greatest number of replicated hits (Fig. 3a). We then looked at the size of overlap between GWAS hits and variants previously linked to each disease in ClinVar (see the “Methods” section)23 and did not find any notable differences across phenotyping algorithms (Fig. 3b). The low overlap count is due, in part, to the low number of ClinVar variants for certain diseases (Supplementary Fig. 7a). However, the standardized MONDO IDs used to filter ClinVar were important in guaranteeing highly specific known ClinVar variants (see the “Methods” section). The pattern of overlap remained consistent when the set of variants linked to each disease in ClinVar was expanded to include variants in LD (Supplementary Fig. 7). Lastly, we looked at the replicability metrics calculated using the phenotype-genotype reference map (PGRM) package (Fig. 3c) (see the “Methods” section)24. The actual over expected ratio (AER) compares the observed to expected number of replicated associations, with expectations based on summed power estimates. Overall replication rate is the proportion of known associations replicated in the GWAS, while the powered replication rate reflects this proportion for associations with at least 80% power. We found that for all diseases but RA, all phenotyping algorithms had similar replicability across all three metrics, regardless of their complexity level. Interestingly, for RA, which lacks a high complexity algorithm, the low complexity algorithm (2+ condition) greatly outperformed the medium complexity algorithm (OHDSI) across all three metrics.
Fig. 3 Replication of known associations.
Red and orange asterisks indicate high and medium complexity phenotyping algorithms, respectively. a Log10 replicated hits for each phenotyping algorithm. b Number of GWAS hits overlapping with disease-specific ClinVar variants for each phenotyping algorithm. No ClinVar variants were found significant by any GWAS for MI and SLE. c Actual over expected ratio (AER), overall replication rate (RR), and powered replication rate (RR) calculated using the PGRM package.
Overall, we found that high complexity algorithms resulted in the greatest power and number of hits, while there were no notable differences in liability-scale heritability and replicability between algorithms.
High complexity EHR phenotyping rules result in an increased number of coding and functional GWAS hits
Functional annotation of GWAS hits is essential as it provides insights into the biological relevance of identified variants, helping to connect genetic associations with underlying molecular mechanisms. High complexity algorithms generally resulted in higher numbers of novel hits on the coding genome (i.e., exons) (Fig. 4a, Supplementary Table 3). Novel hits are defined as those not found to be associated with the disease in ClinVar (see the “Methods” section). Overall, we observed a trend where, in the absence of high-complexity algorithms, medium-complexity algorithms produced the highest number of novel GWAS hits on coding regions of the genome (Fig. 4a). We next investigated whether there was a difference in the number of novel hits overlapping with exons for genes most relevant to the disease of interest between phenotyping algorithms. For each disease, we selected the top 100 genes with the highest relevance score in GeneCards (see the “Methods” section)25–27. We observed that high complexity algorithms resulted in the greatest or equal number of novel hits overlapping with exons of these selected genes (Fig. 4b, Supplementary Table 4). We found that these trends remained consistent across varying numbers of top genes, from 10 to 1000 (Supplementary Fig. 8a). We also looked at the number of distinct genes per algorithm (i.e., genes not found by any other algorithm for a disease) that have novel GWAS hits overlapping with their exons and found that high complexity algorithms generally found the greatest number of distinct genes that had novel hits on their exons (Supplementary Table 5).
Fig. 4 Functional enrichment of GWAS hits.
Red and orange asterisks indicate high and medium complexity phenotyping algorithms, respectively. a Min–max scaled number of significant novel GWAS associations overlapping with genes. b Number of significant novel GWAS associations overlapping with exons of the top 100 genes for each disease. c Log10 number of significant novel GWAS associations that are also eQTLs for the top 100 genes for each disease. Significant associations that are eQTLs for multiple genes are counted once for each gene they affect. d Min–max scaled number of causal GWAS variants colocalized with an eQTL. GTEx tissues displayed on the x-axis are those with the highest number of colocalized variants with high contribution to the credible set (Prob_in_pCausalSet > 0.4) across all phenotyping algorithms.
We next extended this analysis to eQTLs by comparing the number of novel hits that were eQTLs of the top 100 relevant genes (Fig. 4c, Supplementary Table 4). We observed that, in general, high complexity algorithms resulted in the greatest or equal number of novel hits overlapping with eQTLs of the top 100 genes. These trends remained consistent across varying numbers of top genes, from 10 to 1000 (Supplementary Fig. 8b). We also found that high complexity algorithms found the greatest number of distinct genes that had novel hits on their eQTLs (Supplementary Table 6). To further investigate eQTL signal, we performed colocalization of GWAS and eQTL signals across all 48 tissues in GTEx using eCAVIAR, which estimates the colocalization posterior probability, or the posterior probability that a SNP is causal in both the GWAS and eQTL study (see the “Methods” section)28,29. We found that high complexity algorithms generally resulted in the greatest number of colocalized variants across relevant tissues (Fig. 4d, Supplementary Table 7). This might indicate that high complexity algorithms enhance the biological relevance and interpretability of GWAS results by more effectively identifying genes and tissues linked to disease mechanisms28. Tissues that do not appear directly relevant to the disease of interest (e.g., artery aorta for Alzheimer’s disease) are likely due to expression of a relevant gene (e.g., PVRL2) in a non-relevant tissue or in a tissue relevant to a disease that has an association with the disease of interest (e.g., metabolic diseases vs. Alzheimer’s)30,31 (Supplementary Table 7). Furthermore, tissues that do not appear directly relevant to the disease may have increased sample sizes compared to more disease-relevant tissues, resulting in stronger eQTL signals and greater power (e.g., artery aorta tissue has a greater sample size than many brain-related tissues in GTEx).
We also examined whether GWAS hits were significantly enriched for certain gene sets using gene-set enrichment analysis in MAGMA (see the “Methods” section)32, which aggregates gene-level association results to assess biological relevance. We found no consistent differences across algorithms, and only a subset of the GWAS returned any significant gene sets (Supplementary Table 8). Lastly, we investigated the proximity of GWAS hits to functional elements, including eQTLs, candidate cis-regulatory elements (cCREs), and transcription start sites (TSS), as proximity to these elements may indicate that GWAS hits lie in regions involved in gene regulation (see the “Methods” section). There were generally no significant differences in mean distance to cCREs, TSSs, or eQTLs across algorithms (Supplementary Fig. 9).
Overall, we found that high complexity algorithms had greater overlap with coding and functional regions of the genome, a greater number of relevant genes not discovered by other phenotyping algorithms (either through overlap with exons or eQTLs), and a greater number of variants colocalized with eQTL studies across relevant tissues, indicating that high complexity algorithms provide greater insight into disease etiology than low complexity algorithms.
Differences in EHR phenotyping algorithms have no effect on polygenic risk score accuracy
The accuracy of polygenic risk scores (PRS) is crucial for their effectiveness in predicting an individual’s genetic susceptibility to diseases, enabling better risk stratification and potentially informing personalized prevention and treatment strategies. PRS accuracy depends on several factors, including the quality and size of the initial GWAS used to identify risk variants and the similarity between the discovery and target cohorts. Notably, the disease cohort used to construct the PRS plays a critical role in its predictive power. Therefore, we next evaluated how informative each GWAS was with respect to disease prediction, as the cohorts built by different algorithms vary in composition. We trained and tested a logistic regression model for disease prediction using the PRS (generated by clumping and p-value thresholding), age, sex, genotype array, and 20 PCs as covariates. Disease status for training and testing was determined by the phenotyping algorithm used to generate the GWAS results. We found there were no large differences across phenotyping algorithms in terms of PRS AUROC (Fig. 5), which held across all p-value thresholds used to generate the PRS (Supplementary Fig. 10a). AUROC values within a disease group only differed by at most 5% across all models, indicating that predictive performance does not vary substantially between PRS models generated by different phenotype algorithms. Additionally, we looked at the PRS distributions and found that while cases typically had a slightly higher mean polygenic risk score than controls, there was no obvious trend in case/control separation between phenotyping algorithms (Supplementary Fig. 10b).
Fig. 5 Disease risk prediction.
Red and orange asterisks indicate high and medium complexity phenotyping algorithms, respectively. AUROC for predicting the disease status, using a logistic regression model including PRS (from p-value threshold 0.05), age, sex, genotype array, and principal components as covariates. AUROC averaged across folds of 5-fold CV. Error bars represent the standard error across folds. Disease status for training and testing is determined by each phenotyping algorithm.
Cohort composition across different EHR phenotyping algorithms
Given that different phenotyping algorithms lead to different GWAS results, particularly with respect to functional annotations and GWAS-eQTL colocalization, we wanted to understand how cohort composition varied between different algorithms, as these differences may reflect selection for distinct disease mechanisms. For this analysis, we focused on cases. We first compared the most common index events (events that trigger the entry of a person into a cohort) between phenotyping algorithms (Supplementary Table 9). We found that in ADO algorithms that specified self-report (asthma, COPD, and MI), a large percentage of cases (over 35%) satisfied this index event. High complexity OHDSI algorithms (Alzheimer’s, asthma, COPD, MI, and T2D) often had concepts from a combination of domains (i.e., condition, drug exposure, and observation) in the top index events. Together, these findings suggest that high-complexity algorithms rely on a more diverse set of data domains to identify cases for cohort entry. Algorithms for the same disease often shared similar condition concepts among their top index events. This overlap likely explains why low-complexity algorithms rarely identified cases that were not also captured by their high-complexity counterparts. We next compared demographics between phenotyping algorithms. We found that all algorithms had similar female:male ratios (Supplementary Fig. 2a). We compared the mean age of cases in the cohorts using pairwise t-tests (Supplementary Fig. 2b). We found that cases in cohorts built by high complexity algorithms were significantly younger than those in all other cohorts for T2D, COPD, and asthma (Bonferroni corrected significance level of 0.001). We then assessed the algorithmic fairness of each phenotyping algorithm for a given disease across case subgroups stratified by sex by applying metrics commonly used in the algorithmic fairness and epidemiology fields (see the “Methods” section)11 (Supplementary Fig. 11). We found significant differences in predicted prevalence between male and female subgroups across all algorithms for all diseases except Alzheimer’s disease. Interestingly, a greater proportion of high complexity algorithms achieved equality of sensitivity than medium or low complexity algorithms, while a lower proportion achieved equality of precision. Thus, which class of algorithms (high, medium, or low complexity) provides the fairest definition with respect to subgroups stratified by sex depends on which metric a researcher wants to prioritize (sensitivity vs. precision).
Discussion
In this study, we assessed the effects of four different rule-based phenotyping algorithms, with varying levels of complexity, on GWAS results. We compared GWAS results using the metrics of power, number of hits, heritability, replicability, number of hits overlapping with coding and functional regions of the genome, number of hits colocalized with eQTL, and PRS prediction accuracy. Our findings suggest that conducting GWAS using high complexity algorithms that incorporate multiple data domains may be more beneficial than using traditional phenotyping approaches.
Although the choice of phenotyping algorithm does not fully explain the missing heritability in GWAS, it may contribute to the limited overlap observed between GWAS and eQTL findings. This issue of limited colocalization has been well-documented; one analysis showed that fewer than half of the GWAS hits from 87 diverse studies colocalized with eQTLs33–35. Interestingly, we found that high-complexity algorithms identified a greater number of variants colocalized with eQTLs—that is, variants with potential causal roles in both GWAS and eQTL studies—compared to lower-complexity algorithms. A recent study suggested that this limited overlap could be due to systematic differences in the types of variants identified by GWAS and eQTL studies. Notably, this study used GWAS summary statistics from phenotypes defined by ICD codes36,37. Therefore, we believe that the accuracy of cohort definitions might impact the ability to detect variants involved in disease-relevant molecular pathways. However, when interpreting findings from colocalization analyses to discover relevant molecular pathways, it is important to determine whether a variant colocalizes in a certain tissue because it regulates a disease-relevant gene in a relevant tissue, or merely regulates the expression of that gene in a non-relevant tissue. Tissues not directly relevant to the disease of interest may have increased power in colocalization analyses compared with disease-relevant tissues due to technical factors such as larger sample sizes.
Notably, the accuracy of a phenotyping algorithm partly depends on the specific database queried. This is because there is variability in how clinical information is recorded across different EHR databases (referred to as “coding heterogeneity”)38. Furthermore, the accuracy of one “type” of phenotyping algorithm is dependent on which diseases are evaluated, as algorithms have been shown to have varying performance relative to one another across diseases22. Therefore, the performance of each phenotyping algorithm with respect to downstream GWAS results in this study should be interpreted in the context of the database used (UK Biobank) and the diseases examined. Nonetheless, we believe this analysis provides valuable insights for researchers, as it indicates that integrating multiple data domains with clinical expert input can enhance both the power and interpretability of GWAS findings.
Another important consideration is how these phenotyping algorithms were originally defined. OHDSI algorithms were designed to operate within the OMOP CDM, whereas Phecode and ADO algorithms were originally based on ICD codes, which are considered non-standard within the OMOP CDM framework. Consequently, we converted ICD codes to SNOMED codes to enable their use within the OMOP CDM. This conversion, however, led to some ICD codes remaining unmapped where no SNOMED equivalent existed, potentially resulting in information loss due to limited semantic interoperability between the two coding systems39. Although maintaining a consistent data source for all phenotyping algorithms was essential for this study, we note that potential mismatches between ICD and SNOMED codes may have introduced bias against algorithms originally defined by ICD codes (ADO and Phecode).
Although we evaluated GWAS replicability and observed consistent results across phenotyping algorithms, replicability serves only as an indirect measure of GWAS sensitivity. Specifically, it remains uncertain how well replicability reflects the specificity of associations to the targeted disease, given that known associations in the PGRM package may be affected by some degree of phenotype misclassification. Consequently, further research is warranted to assess the disease specificity of GWAS findings, ideally grounded in the biological context of each disease. Additionally, our results suggest that curated repositories of complex, high-quality phenotyping algorithms will be instrumental in advancing the understanding of disease etiology. Thus, future efforts should prioritize the development of these repositories, guided by clinical expertise and established guidelines.
Methods
Rule-based phenotyping algorithms
Rule-based phenotyping was performed on 502,365 participants with clinical data in the UKBB Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) v5.318,40 format. The UKBB has Research Tissue Bank (RTB) approval from the North West Multi-centre Research Ethics Committee (MREC) (21/NW/0157). This research has been conducted with the UKBB Resource under application number 100316. All participants in the UKBB provided informed consent. The OMOP CDM is an EHR data standard made up of tables linked by primary and foreign keys41. For example, the person table holds information on each participant and their demographics, while the condition occurrence table holds information on all recorded conditions for each participant41. Standardized clinical data tables include condition occurrence, drug exposure, procedure occurrence, measurement, observation, and visit occurrence41. Each record in these tables is coded with a standardized OMOP concept id, which is a part of a larger concept hierarchy4. For example, the concept for diabetes mellitus (id: 201820) is a descendant of the concept for disorder of the endocrine system (id: 31821). Additionally, OMOP concepts can be mapped back to source vocabularies such as SNOMED, ICD-9, and ICD-10 using the concept and concept relationship tables4.
2+ condition algorithm
We defined cases by participants with two or more occurrences of a condition concept or its descendants in their condition occurrence table. Controls were defined as those participants having 0 occurrences of the condition concept or its descendants in their condition occurrence table. We used OMOP condition concept id 378419 to represent Alzheimer’s disease, 317009 for asthma, 255573 for COPD, 4329847 for MI, 201826 for T2D, 80809 for RA, and 257628 for SLE.
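A minimal sketch of this rule follows, assuming an OMOP CDM database reachable through a hypothetical SQLAlchemy connection string; the table and column names (condition_occurrence, concept_ancestor, person) are standard OMOP CDM, and the concept id in the usage line is the T2D id listed above.

```python
# Minimal sketch of the 2+ condition rule against an OMOP CDM database
# (the connection string is a placeholder; concept ids are those listed in the text).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host/ukbb_omop")  # hypothetical DSN

def two_plus_condition_cohort(concept_id: int) -> tuple[set, set]:
    """Return (cases, controls) for a condition concept and its descendants."""
    counts = pd.read_sql(
        f"""
        SELECT co.person_id, COUNT(*) AS n_records
        FROM condition_occurrence co
        JOIN concept_ancestor ca
          ON co.condition_concept_id = ca.descendant_concept_id
        WHERE ca.ancestor_concept_id = {concept_id}
        GROUP BY co.person_id
        """,
        engine,
    )
    persons = pd.read_sql("SELECT person_id FROM person", engine)
    # Cases: two or more occurrences; controls: zero occurrences.
    # Participants with exactly one occurrence are neither case nor control.
    cases = set(counts.loc[counts["n_records"] >= 2, "person_id"])
    controls = set(persons["person_id"]) - set(counts["person_id"])
    return cases, controls

cases, controls = two_plus_condition_cohort(201826)  # T2D concept id from the text
```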
Phecode phenotyping
We used Phecode map 1.2 with ICD-9 codes10,42. Phecode 290.11 was used for Alzheimer’s disease, 495 for asthma, 411.2 for MI, 250.2 for T2D, 714.1 for RA, and 695.42 for SLE. There is no Phecode definition for COPD. To perform Phecode phenotyping in the OMOP CDM, we mapped all ICD-9 codes to OMOP concept ids using the OMOP concept and concept relationship tables4. A majority of the ICD-9 codes mapped to condition concept ids, though four ICD-9 codes defining control cohorts mapped to observation concept ids. As per the Phecode algorithm, we defined cases as those participants with an OMOP concept id within the inclusion set on two distinct condition start dates10. We defined controls as those participants without any OMOP concept ids in the exclusion set in their EHR10.
OHDSI phenotyping
We selected algorithms from the OHDSI PL v3.1.6 that used at least one data domain in addition to condition occurrence43. We implemented phenotype algorithm 255 for Alzheimer’s disease, which uses the condition occurrence, drug exposure, procedure occurrence, measurement, observation, and visit occurrence tables of the OMOP CDM to define the disease. We implemented phenotype 27 for asthma, which uses the drug exposure, condition occurrence, and observation tables. We used phenotype 28 for COPD, which uses the drug exposure and condition occurrence tables. We used phenotyping algorithm 71 for MI, which uses the condition occurrence and visit occurrence tables. We used phenotyping algorithm 288 for T2D, which uses the condition occurrence, drug exposure, and measurement tables. We used phenotyping algorithm 196 for RA, which uses the condition occurrence and observation tables. We used phenotyping algorithm 119 for SLE, which uses the condition occurrence and drug exposure tables.
UKBB ADO phenotyping
We constructed ADO v2.0 cohorts for Alzheimer’s, asthma, COPD, and MI from data in the OMOP CDM following the UKBB-specified guidelines13. Specifically, we first converted the ICD-9 and ICD-10 codes defining each disease into OMOP concept IDs using the concept and concept_relationship tables. As per ADO guidelines, we defined cases as those participants with either (1) an OMOP concept id in the condition_occurrence table, (2) an OMOP concept id listed as a cause of death in the death table, or (3) self-report of the disease per the baseline assessment13. Self-reported diseases were coded in the value fields of the observation table under observation_concept_id 421495618.
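A minimal sketch of the ADO case logic described above, assuming the OMOP tables have been loaded as pandas DataFrames; the disease-specific concept-id sets and the self-report observation concept id are passed in as placeholders rather than taken from the ADO code lists.

```python
# Minimal sketch of the ADO case rule: a participant is a case if the disease
# appears in any of the three sources described in the text. Concept-id sets and
# the self-report observation concept id are placeholders.
import pandas as pd

def ado_cases(condition_occurrence: pd.DataFrame,
              death: pd.DataFrame,
              observation: pd.DataFrame,
              condition_ids: set[int],
              death_cause_ids: set[int],
              self_report_value_ids: set[int],
              self_report_concept_id: int) -> set[int]:
    # (1) Disease recorded in the EHR condition table.
    ehr_cases = set(condition_occurrence.loc[
        condition_occurrence["condition_concept_id"].isin(condition_ids), "person_id"])
    # (2) Disease listed as a cause of death.
    death_cases = set(death.loc[
        death["cause_concept_id"].isin(death_cause_ids), "person_id"])
    # (3) Self-reported disease stored in the value field of the observation table.
    self_report = observation[observation["observation_concept_id"] == self_report_concept_id]
    self_report_cases = set(self_report.loc[
        self_report["value_as_concept_id"].isin(self_report_value_ids), "person_id"])
    return ehr_cases | death_cases | self_report_cases
```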
Cohort evaluation
We generated information on the most common index events (events that trigger entry of a person into a cohort) using the index event breakdown feature of OHDSI CohortDiagnostics v3.4.144,45. To obtain the count of participants satisfying the self-report index event specified by the ADO algorithms, we manually queried the OMOP CDM database, as the current version of CohortDiagnostics does not support retrieving counts for self-report criteria.
We defined age for each participant as 2012 minus the year of birth recorded in the person table, since 2012 was the year of the first UKBB baseline data release46. We performed pairwise two-sided t-tests of mean age between all cohorts for a disease. A Bonferroni-corrected significance level of 0.001, calculated from a family-wise error rate of 0.05 and N = 30 (tests performed for all possible phenotyping pairs across the seven diseases), was used for each test. Participant sex at birth was obtained from the person table, where OMOP gender concept id 8532 represents female sex at birth and 8507 represents male sex at birth. We assessed the algorithmic fairness of each phenotyping algorithm across case subgroups stratified by sex for the metrics of equality of predicted prevalence (for the case label), equality of sensitivity, and equality of precision (for the case label)11. Equality of predicted prevalence for the case label determines if P(Ŷ = 1|Sex = Male) = P(Ŷ = 1|Sex = Female). Equality of sensitivity determines if P(Ŷ = 1 | Y = 1,Sex = Male) = P(Ŷ = 1 | Y = 1,Sex = Female). Equality of precision for the case label determines if P(Y = 1 | Ŷ = 1,Sex = Male) = P(Y = 1 | Ŷ = 1,Sex = Female). As in Sun et al., we created a silver-standard ground-truth label by majority vote across all phenotyping algorithms for a disease (i.e., an individual is labeled a case if over half of the algorithms classify them as a case). To test for a significant difference in the probability of interest between male and female subgroups, we used a two-proportion Z-test and a significance level of 0.05 for each test.
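A minimal sketch of one of these checks, equality of predicted prevalence, using a two-proportion z-test from statsmodels; the counts in the usage line are illustrative.

```python
# Minimal sketch of the equality-of-predicted-prevalence check between sex
# subgroups using a two-proportion z-test (counts below are illustrative).
from statsmodels.stats.proportion import proportions_ztest

def equal_predicted_prevalence(n_pred_case_male: int, n_male: int,
                               n_pred_case_female: int, n_female: int,
                               alpha: float = 0.05) -> bool:
    """True if P(Y_hat=1 | male) and P(Y_hat=1 | female) are not significantly different."""
    stat, p_value = proportions_ztest(
        count=[n_pred_case_male, n_pred_case_female],
        nobs=[n_male, n_female],
    )
    return p_value >= alpha

print(equal_predicted_prevalence(1200, 180_000, 1500, 200_000))
```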
We performed phenotype algorithm evaluation using OHDSI PheValuator. PheValuator constructs a diagnostic predictive model from the OMOP CDM, leveraging highly sensitive and highly specific cohorts for a disease of interest22. These cohort definitions can be found in the supplementary material. Predictions from the diagnostic model act as a probabilistic gold standard and can subsequently be used to evaluate the performance of existing phenotyping algorithms22. PheValuator requires an adequate case size for model training and thus is unable to evaluate phenotyping algorithms for diseases with low prevalence. Therefore, of the seven diseases studied, PheValuator was unable to assess algorithms for Alzheimer’s, RA, and SLE. Additionally, the PheValuator predictive model was unable to converge for T2D. Note that PheValuator treats all participants who are not cases as controls, and thus may be slightly inaccurate in its evaluation of Phecode and 2+ condition algorithms, which have a third subset of participants that do not serve as cases or controls.
Genotype quality control
93M autosomal imputed variants released by the UKBB with genotype probability above 0.9 were hard-called from dosages (hard-call-threshold 0.1) using PLINKv247–49. Quality control included filtering for a genotype call rate of 95% or greater and a minor allele frequency ≥5%, removing variants with a Hardy–Weinberg equilibrium p-value ≤ 1 × 10−6, and removing all variants with duplicate rsids (reference SNP cluster ids), resulting in 4,053,867 remaining variants. We removed all insertions and deletions (INDELs) and further filtered variants for an imputation INFO score above 0.3, resulting in 3,686,405 final bi-allelic SNPs after quality control.
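The variant-level QC could be expressed as a single PLINK 2.0 invocation along the following lines; this is a sketch, not the exact commands used in the study. File names are placeholders, hard calls (threshold 0.1) are assumed to have been produced already, the INFO > 0.3 filter is assumed to be applied via a precomputed SNP list, and indels are dropped with --snps-only.

```python
# Minimal sketch of the variant QC with PLINK 2.0, using the thresholds stated
# in the text; file names are placeholders.
import subprocess

subprocess.run(
    [
        "plink2",
        "--pfile", "ukbb_imputed_hardcalls",    # hard-called genotypes (threshold 0.1)
        "--geno", "0.05",                       # variant call rate >= 95%
        "--maf", "0.05",                        # minor allele frequency >= 5%
        "--hwe", "1e-6",                        # drop variants failing HWE at p <= 1e-6
        "--rm-dup", "exclude-all",              # drop all variants with duplicate rsids
        "--snps-only", "just-acgt",             # remove insertions and deletions
        "--extract", "info_above_0.3.snplist",  # precomputed INFO > 0.3 variant list (assumed)
        "--make-pgen",
        "--out", "ukbb_qc",
    ],
    check=True,
)
```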
Sample quality control
We next performed quality control on the 487,159 samples in the imputed variants dataset. We calculated the sample missingness rate (--mind) for remaining samples on a subset of 605,836 directly assayed variants previously selected by the UKBB for missingness calculations2. These variants can be found in Resource 1955 with tag ‘in_HetMiss’50. No samples had a missingness rate above 5%.
We then performed kinship calculations with KING v2.3.2 on the 487,159 samples and a set of 93,511 directly assayed variants previously selected by the UKBB for relatedness calculations2,51. These variants can be found in Resource 1955 with tag ‘in_Relatedness’50. KING was used to determine a group of individuals that contained no pairs of 3rd degree relatives or closer. This resulted in 405,811 samples, which will be referred to as unrelated samples, used for downstream principal component and GWAS analysis.
Population stratification
To control for population stratification, we computed principal components on the 405,811 unrelated samples remaining after sample quality control. We used flashpca v2.0 to calculate 20 principal components from 147,604 directly measured SNPs previously used by UKBB for PCA calculation (in_PCA in Resource 1955)18. Specifically, this set excluded SNPs with missingness above 0.015, minor allele frequency less than 1%, or located in regions of long-range LD, and was further LD-pruned using an r2 threshold of 0.1, with a window size of 1000 and step size of 80 markers18.
Genome-wide association studies
We included age and sex as obtained in the cohort evaluation, and the 20 principal components as covariates in the genome-wide associations. Additionally, we determined the genotype array (UK BiLEVE Axiom array vs. UK Biobank Axiom array) by calculating missingness at variants unique to the UK BiLEVE Axiom array and included the array as a covariate. We performed genome-wide associations in two ways. First, we used PLINKv2.0 logistic regression (--glm) with all covariates linearly transformed to mean 0 and variance 1 (--covar-variance-standardize). Second, due to the large case-control imbalance in some phenotypes, we used SAIGE v1.4.3, which incorporates a saddlepoint approximation to calibrate test statistics52. A sparse genetic relationship matrix was derived from the 93,511 directly assayed variants selected by the UKBB for relatedness calculations50. A null logistic mixed model was fit using all covariates, and subsequently variant-level association tests were performed using SAIGE with Firth correction applied for variants with p < 0.01. GWAS hits for both PLINK and SAIGE were defined as variants with p-value less than the significance level of 5 × 10−8.
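A minimal sketch of the PLINK 2.0 arm of the association testing (the SAIGE arm is not shown); file names are placeholders, the phenotype file is assumed to hold case/control status from one phenotyping algorithm, and the covariate file holds age, sex, genotype array, and the 20 principal components.

```python
# Minimal sketch of the PLINK 2.0 logistic-regression GWAS described in the text.
import subprocess

subprocess.run(
    [
        "plink2",
        "--pfile", "ukbb_qc",                 # QCed genotypes (see previous sections)
        "--pheno", "cohort_status.pheno",     # case/control labels from one algorithm
        "--covar", "covariates.tsv",          # age, sex, array, PC1-PC20
        "--covar-variance-standardize",       # scale covariates to mean 0, variance 1
        "--glm",                              # logistic regression for a binary phenotype
        "--out", "gwas_results",
    ],
    check=True,
)
```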
For all diseases other than SLE, PLINK and SAIGE had a large overlap in hits (Supplementary Table 1). Therefore, for all diseases other than SLE, we used PLINK results for SNP–phenotype associations with no error code in subsequent downstream analyses. For SLE, we used SAIGE results for SNP-phenotype associations in subsequent downstream analyses.
GWAS evaluation—power
We estimated the power of each GWAS using the GAS Power Calculator20 under an additive model with a significance level of 5 × 10−8 and a disease allele frequency of 0.2. Case and control counts after sample quality control were used for the power estimation. We set population prevalence to 0.07 for Alzheimer’s disease, 0.12 for asthma, 0.02 for COPD, 0.02 for MI, 0.01 for RA, 0.001 for SLE, and 0.09 for T2D53–59. We varied the relative risk from 1 to 1.5 in increments of 0.01 to thoroughly capture differences in power among phenotyping algorithms.
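For intuition, the sketch below gives a simplified power approximation under a per-allele relative-risk model and a 1-df allelic chi-squared test; it is a stand-in for, not a reimplementation of, the GAS Power Calculator model, and the default prevalence and allele frequency are illustrative.

```python
# Illustrative approximation of GWAS power under a per-allele relative-risk
# model and a 1-df allelic test (simplified stand-in for the GAS calculator).
from scipy.stats import chi2, ncx2

def gwas_power(n_cases, n_controls, rr, p=0.2, prevalence=0.12, alpha=5e-8):
    # Baseline penetrance f0 chosen so the population prevalence matches.
    denom = (1 - p) ** 2 + 2 * p * (1 - p) * rr + p ** 2 * rr ** 2
    f0 = prevalence / denom
    # Risk-allele frequency among case and control alleles.
    p_case = (p ** 2 * rr ** 2 + p * (1 - p) * rr) * f0 / prevalence
    p_ctrl = (p ** 2 * (1 - f0 * rr ** 2) + p * (1 - p) * (1 - f0 * rr)) / (1 - prevalence)
    # Non-centrality parameter of the two-proportion (allelic) chi-squared test.
    ncp = (p_case - p_ctrl) ** 2 / (
        p_case * (1 - p_case) / (2 * n_cases) + p_ctrl * (1 - p_ctrl) / (2 * n_controls)
    )
    return ncx2.sf(chi2.ppf(1 - alpha, df=1), df=1, nc=ncp)

print(gwas_power(n_cases=50_000, n_controls=350_000, rr=1.1))  # illustrative counts
```

Under this approximation, power increases toward one as the relative risk grows, which is the qualitative behavior shown in Fig. 2b.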
GWAS evaluation—SNP heritability and genetic correlation
We estimated h2SNP from GWAS summary statistics using LDSC21. First, we used LDSC to calculate LD scores from imputed SNPs after quality control using a 1 centiMorgan (cM) window. The cM coordinates were filled in from 1000 Genomes haplotypes Phase 3 integrated variant set release60. We then used LDSC to estimate liability-scale h2SNP by setting the population prevalence of disease equal to the sample prevalence of each cohort. We also used LDSC to generate estimates of genetic correlation between GWAS21,61. Specifically, we estimated the genetic correlation between GWAS from different phenotyping algorithms for the same disease. Since genetic correlation estimates are not affected by the scale of the heritability estimates, we did not perform liability-scale conversion prior to calculating genetic correlation.
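For reference, the quantities involved, written as they are usually stated in the LDSC and liability-threshold literature (a summary of the cited methods, not a new derivation):

```latex
% LD score regression (ref. 21): the expected chi-squared statistic of variant j
% is a linear function of its LD score \ell_j.
\[
  \mathbb{E}\!\left[\chi^2_j\right] = \frac{N\,h^2_{\mathrm{SNP}}}{M}\,\ell_j + Na + 1,
\]
% where N is the GWAS sample size, M is the number of SNPs, and the Na term
% absorbs confounding. Observed-scale estimates are converted to the liability
% scale via
\[
  h^2_{\mathrm{liab}} = h^2_{\mathrm{obs}}\,\frac{K^2\,(1-K)^2}{P\,(1-P)\,z^2},
\]
% where K is the population prevalence, P is the sample prevalence, and z is the
% standard normal density at the liability threshold (here P is set equal to K).
```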
GWAS evaluation—known associations
We performed an 80/20 stratified split for each cohort to generate discovery and replication GWAS. This split corresponds to the split used in fold one of the PRS models. We generated GWAS results using SAIGE for SLE and PLINK for all other diseases. Variants that were significant in the discovery GWAS were tested in the replication GWAS on the remaining 20% of participants. A variant was considered replicated if it had a p-value < 0.05 and a concordant direction of effect.
We collected variants previously known to be associated with each of the seven diseases from ClinVar with the last modified date 2024-06-1823. To identify each disease in the ClinVar dataset, we manually mapped each disease to a set of MONDO IDs (Supplementary Table 2)62. Each MONDO ID represented either the disease or susceptibility to the disease. For each disease, we filtered for any of its corresponding MONDO IDs or their descendants in the PhenotypeIDS column. These MONDO IDs are important for strictly defining the disease of interest, such that any known associations are highly specific. For example, MONDO:0008834, which is common when the string ‘Asthma’ is queried for in the ClinVar database, corresponds to “asthma, nasal polyps, and aspirin intolerance”. Any known associations for MONDO:0008834 would therefore not directly represent a known association for asthma. We converted all ClinVar variants recorded from assembly GRCh38 to GRCh37 for comparison with UKBB SNPs via the Ensembl REST API63. When matching ClinVar and UKBB variants, we allowed for the possibility that recorded reference and alternate alleles were flipped. To expand the set of ClinVar variants to include those in LD, we used the publicly available LD matrices of UK Biobank participants with British ancestry64. We queried using the tool ldmat for variants with an r2 > 0.8 within base-pair windows of ±10,000, ±50,000, and ±100,00065.
We also assessed the ability to replicate known associations using the phenotype–genotype reference map (PGRM)24. Using the PGRM package, UKBB variants, aligned to the GRCh37 reference allele, were annotated with known associations in cohorts of European genetic ancestry. As per the PGRM package, associations were considered replicated in the UKBB at p < 0.05 and considered powered if their estimated power was at least 80%24. For each GWAS, we calculated the overall replication rate, powered replication rate, and actual over expected ratio (AER), which is the number of replicated associations divided by the sum of power for each tested association24.
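A minimal sketch of these three metrics, computed from a table of PGRM associations annotated with the GWAS p-value and estimated power; the column names are hypothetical.

```python
# Minimal sketch of the PGRM-style replication metrics described in the text.
import pandas as pd

def pgrm_metrics(assoc: pd.DataFrame, alpha: float = 0.05) -> dict:
    replicated = assoc["p_value"] < alpha
    overall_rr = replicated.mean()              # overall replication rate
    powered = assoc["power"] >= 0.80
    powered_rr = replicated[powered].mean()     # replication rate among well-powered associations
    aer = replicated.sum() / assoc["power"].sum()  # actual over expected ratio
    return {"overall_RR": overall_rr, "powered_RR": powered_rr, "AER": aer}
```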
GWAS evaluation—variant annotation
We annotated novel hits with gene-based, eQTL, cCRE, and TSS annotations. Novel hits were those not found to be associated with the disease in ClinVar. We produced gene-based annotation of significant variants using ANNOVAR with build version hg1966. These annotations provided the gene name for exonic/intronic/3’ UTR/5’ UTR/ncRNA variants, as well as gene names of the two neighboring genes for intergenic variants. ANNOVAR ncRNA exonic and ncRNA intronic annotations were combined with exonic and intronic annotations, respectively. For visualization purposes, we calculated the min–max scaled number of annotations and set the min–max scaled number to 0.5 in the case that the number of annotations was the same across all phenotyping algorithms. We obtained data for eQTL annotations by combining significant variant-gene associations across all tissues from GTEx single-tissue eQTL data V829. eQTL positions were mapped to the GRCh37 build using the lookup table provided by GTEx29. When matching eQTLs to UKBB variants, we allowed for the possibility that reference and alternate alleles were flipped. To subset exon or eQTL hits to those linked to genes relevant to a disease of interest, we used GeneCards25,26. GeneCards determines the relevance score for each gene with respect to the queried disease using Elasticsearch 7.11, which weights search results by term frequency, inverse document frequency, and size of the subfield within the document containing the term, while also boosting certain subfields and considering direct versus indirect links between a disease and a gene25–27. The following string searches were used for each disease: “Asthma”, “Chronic Obstructive Pulmonary Disease”, “Myocardial Infarction”, “Alzheimer’s Disease”, “Type 2 Diabetes”, “Rheumatoid Arthritis”, “Systemic Lupus Erythematosus”. We obtained cCRE annotations from the ENCODE portal annotation file set ENCSR800VNX67. We converted cCRE annotations from GRCh38 to GRCh37 using the UCSC LiftOver tool, which successfully converted 2,344,525 records and failed to convert 4329 records68. Lastly, we collected TSS coordinates from refTSS469.
To investigate how close GWAS hits were to functional regions of the genome, we calculated distance to the nearest cCRE or TSS region as the smallest distance to the start or end of the region. In the case where the variant was within the cCRE or TSS, we set the distance to zero. We additionally calculated the distance to the nearest eQTL. We did not consider strandedness in any distance calculation. We compared the mean distance of the significant GWAS hits to cCREs, TSSs, and eQTLs for different phenotyping definitions of the same disease using pairwise t-tests. We only calculated the mean distance to the functional elements for significant SNPs not within the functional element (non-zero distance). A Bonferroni-corrected significance level of 0.001 was used for all pairwise tests.
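A minimal sketch of the distance calculation, assuming per-chromosome arrays of variant positions and element intervals; strand is ignored, as in the text.

```python
# Minimal sketch of the distance-to-nearest-element calculation: zero when a
# variant lies inside an interval, otherwise the smallest gap to an interval
# start or end. Inputs are per-chromosome.
import numpy as np

def distance_to_nearest(positions: np.ndarray, starts: np.ndarray, ends: np.ndarray) -> np.ndarray:
    """positions: variant bp positions; starts/ends: element intervals on the same chromosome."""
    dists = np.empty(len(positions))
    for i, pos in enumerate(positions):
        inside = (starts <= pos) & (pos <= ends)
        if inside.any():
            dists[i] = 0
        else:
            dists[i] = np.minimum(np.abs(starts - pos), np.abs(ends - pos)).min()
    return dists
```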
GWAS evaluation—GWAS–eQTL colocalization
We performed colocalization of GWAS and eQTL studies using eCAVIAR, which computes the posterior probability that a variant in a region is causal in both studies28. eCAVIAR is able to predict multiple causal variants in a region and requires GWAS and eQTL summary statistics as well as LD between all variants in a region. We downloaded data on all SNP-gene associations (cis-eQTLs) from GTEx v7 eQTL studies for all 48 tissues29. GTEx v7 was used instead of v8 as data on all SNP-gene associations from v7 are readily available for direct download on the GTEx website, while v8 is only available on a requester-pays bucket on the Google Cloud Platform (GCP) and is large (690 GB), making it less accessible for standard research workflows. LD data were obtained from publicly available LD matrices of UK Biobank participants with British ancestry and queried using the tool ldmat64,65. Missing values within any LD matrix were filled with zero. As LD matrices were not released for regions of long-range LD, we avoided performing colocalization analysis in these regions64. Specifically, we did not include variants on chromosome 6 from position 26000001 to 35000001 (major histocompatibility complex region), chromosome 8 from 8000001 to 12000001, and chromosome 11 from 47000001 to 5700000170. For each GWAS hit, we then selected 50 variants upstream and downstream to create a region of interest. In the case that there were not 50 variants upstream and downstream of the hit, we did not perform colocalization analysis. GWAS and eQTL colocalization was then performed on the region for all egenes (genes with qval ≤ 0.05). We used the default setting of a maximum of two causal SNPs in a region. To identify variants that were causal in both the eQTL and GWAS studies, we used a colocalization cutoff threshold of 0.4, as previous work has shown that this threshold leads to high accuracy and precision with minimal loss in recall rate28. For visualization purposes, we calculated the min–max scaled number of colocalized variants and set it to 0.5 in the case that the number of colocalized variants was the same across all phenotyping algorithms. We chose the top GTEx tissues to display as those with the highest number of colocalized variants with high contribution to the credible set (Prob_in_pCausalSet > 0.4) across all phenotyping algorithms.
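A minimal sketch of the region construction and the CLPP filtering step; the column names are assumed to follow eCAVIAR's colocalization output, and the surrounding file handling is hypothetical.

```python
# Minimal sketch of the colocalization bookkeeping: a +/-50-variant window around
# each GWAS hit, and filtering of eCAVIAR output at CLPP > 0.4. Column names
# (e.g., CLPP) are assumptions based on the eCAVIAR output described in the text.
import pandas as pd

def region_around_hit(gwas: pd.DataFrame, hit_index: int, flank: int = 50) -> pd.DataFrame | None:
    """gwas must be sorted by position within a chromosome."""
    lo, hi = hit_index - flank, hit_index + flank
    if lo < 0 or hi >= len(gwas):
        return None  # skip hits without 50 variants on each side (see text)
    return gwas.iloc[lo:hi + 1]

def colocalized_variants(ecaviar_col: pd.DataFrame, clpp_cutoff: float = 0.4) -> pd.DataFrame:
    """Keep variants with colocalization posterior probability above the cutoff."""
    return ecaviar_col[ecaviar_col["CLPP"] > clpp_cutoff]
```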
Gene-set enrichment analysis
We performed gene-set enrichment analysis using MAGMA v1.1032. First, imputed SNPs after quality control were annotated with gene locations from NCBI with build 3771,72. We then performed gene-based analysis from GWAS summary statistics, where each gene test statistic was calculated as the mean statistic for all SNPs in the gene. In the case of synonymous rsids, only the first listed synonym in the synonym file provided by MAGMA was retained (synonym-dup = drop-dup). The imputed SNPs after quality control were provided as an LD reference panel for the gene-based analysis. We performed competitive gene-set analysis using 50 human MSigDB hallmark gene sets, which encompass known biological processes73. Competitive gene-set analysis determines whether genes in a gene set are more strongly associated with the phenotype than genes outside of the gene set32. We used a Bonferroni-corrected significance level of 0.001 to determine the significance of each gene-set. Internal variables of gene size, gene density, sample size, and inverse of the mean minor allele count were included in gene-set analysis per the default model.
Disease-risk prediction
To assess how well the GWAS from each phenotyping algorithm was able to predict disease risk, we trained and evaluated a PRS model. The polygenic risk score was generated using the traditional approach of clumping and p-value thresholding, with PLINK v2.049,50,74. Clumping was performed with a linkage disequilibrium r2 parameter 0.1 and a clumping window 250 kb. P-value thresholding was performed at each of the following thresholds: 5 × 10−8, 1 × 10−5, 1 × 10−3, 0.01, 0.05, 0.1, 0.2, 0.374. The PRS prediction model was a logistic regression model with polygenic risk score, age, sex, genotype array, and 20 principal components as covariates. To train the model, we performed 5-fold cross-validation, where within each fold, we used 80% of participants to generate GWAS results and train the logistic regression model, and 20% to evaluate model performance. We generated GWAS results using SAIGE for SLE and PLINK for all other diseases. A stratified split was used to maintain case prevalence in each fold. We evaluated predictive performance based on the prediction of case/control status as defined by the phenotyping algorithm used to generate the GWAS.
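A minimal sketch of the prediction model and its evaluation, assuming the per-fold PRS column has already been computed by clumping and thresholding (in the study the GWAS and PRS are re-derived within each training fold); column names are hypothetical.

```python
# Minimal sketch of the disease-risk prediction model: logistic regression on a
# precomputed PRS plus covariates, evaluated with stratified 5-fold CV and AUROC.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def prs_auroc(df: pd.DataFrame, covariates: list[str], label: str = "case") -> list[float]:
    X, y = df[["prs"] + covariates].values, df[label].values
    aurocs = []
    for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        aurocs.append(roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1]))
    return aurocs
```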
Acknowledgements
This research has been conducted using the UK Biobank Resource under application number 100316. This work uses data provided by patients and collected by the NHS as part of their care and support. This study was funded by a Roy and Diana Vagelos Precision Medicine Award, a Warren Alpert Foundation award, and an NIH grant R35GM147004 to G.G. and an NIH grant T15LM007079 to A.N.
Author contributions
A.N., A.E., and G.G. initiated and designed the study. A.N. performed phenotyping, GWAS, and GWAS evaluation and analysis. G.G. supervised the research. A.N. and G.G. wrote the manuscript. All authors approved the final version of the manuscript.
Data availability
Genotype and phenotype data used in this study are from the UK Biobank resource obtained under application number 100316.
Code availability
Code detailing phenotyping on the UK Biobank OMOP CDM can be found here: https://github.com/G2Lab/UKBBPhenotyping, while code on GWAS and evaluation can be found here: https://github.com/G2Lab/phenotyping-gwas-eval.
Competing interests
The authors declare no competing interests.
Supplementary information
The online version contains supplementary material available at https://doi.org/10.1038/s41746-025-01815-8.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Ashley, EA. Towards precision medicine. Nat. Rev. Genet.; 2016; 17, pp. 507-522.
2. Bycroft, C et al. The UK Biobank resource with deep phenotyping and genomic data. Nature; 2018; 562, pp. 203-209.
3. All of Us Research Program Investigators et al. The ‘All of Us’ Research Program. N. Engl. J. Med.; 2019; 381, pp. 668-676.
4. Reich, C. & Ostropolets, A. Standardized vocabularies. In The Book of OHDSI (eds Observational Health Data Sciences and Informatics) Ch. 5 (Observational Health Data Sciences and Informatics, 2021).
5. Hripcsak, G. & Albers, D. J. Next-generation phenotyping of electronic health records. J. Am. Med. Inform. Assoc. 20, 117–121 (2013).
6. Hersh, W. R. et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med. Care 51, S30–S37 (2013).
7. Banda, J. M., Seneviratne, M., Hernandez-Boussard, T. & Shah, N. H. Advances in electronic phenotyping: from rule-based definitions to machine learning models. Annu. Rev. Biomed. Data Sci. 1, 53–68 (2018).
8. Cai, N. et al. Minimal phenotyping yields genome-wide association signals of low specificity for major depression. Nat. Genet. 52, 437–447 (2020).
9. Patel, R. S. et al. Reproducible disease phenotyping at scale: example of coronary artery disease in UK Biobank. PLoS ONE 17, e0264828 (2022).
10. Bastarache, L. Using phecodes for research with the electronic health record: from PheWAS to PheRS. Annu. Rev. Biomed. Data Sci. 4, 1–19 (2021).
11. Sun, T. Y., Bhave, S. A., Altosaar, J. & Elhadad, N. Assessing phenotype definitions for algorithmic fairness. AMIA Annu. Symp. Proc. 2022, 1032–1041 (2022).
12. Weaver, J. et al. Best practices for creating the standardized content of an entry in the OHDSI Phenotype Library. https://www.ohdsi.org/wp-content/uploads/2019/09/james-weaver_a_book_in_the_phenotype_library_2019symposium.pdf.
13. UK Biobank Outcome Adjudication Group. UK Biobank Algorithmically Defined Outcomes (ADOs). https://biobank.ndph.ox.ac.uk/showcase/showcase/docs/alg_outcome_main.pdf (2022).
14. Burstein, D. et al. Detecting and adjusting for hidden biases due to phenotype misclassification in genome-wide association studies. Preprint at medRxiv https://doi.org/10.1101/2023.01.17.23284670 (2023).
15. Wei, W.-Q. et al. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLoS ONE 12, e0175508 (2017).
16. Song, W., Huang, H., Zhang, C.-Z., Bates, D. W. & Wright, A. Using whole genome scores to compare three clinical phenotyping methods in complex diseases. Sci. Rep. 8, 11360 (2018).
17. DeBoever, C. et al. Assessing digital phenotyping to enhance genetic studies of human diseases. Am. J. Hum. Genet. 106, 611–622 (2020).
18. Papez, V. et al. Transforming and evaluating the UK Biobank to the OMOP Common Data Model for COVID-19 research and beyond. J. Am. Med. Inform. Assoc. 30, 103–111 (2022).
19. Belenkaya, R., Reich, C., Ryan, P. & OHDSI CDM Workgroup Team. OMOP Common Data Model V5.0.1. https://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm:common_data_model (2016).
20. Johnson, J. L. Genetic Association Study (GAS) Power Calculator (2017).
21. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
22. Swerdel, J. N., Hripcsak, G. & Ryan, P. B. PheValuator: development and evaluation of a phenotype algorithm evaluator. J. Biomed. Inform. 97, 103258 (2019).
23. NIH National Library of Medicine. ClinVar. https://www.ncbi.nlm.nih.gov/clinvar/ (2025).
24. Bastarache, L. et al. The phenotype–genotype reference map: Improving biobank data science through replication. Am. J. Hum. Genet. 110, 1522–1533 (2023).
25. Weizmann Institute of Science. GeneCards - Human Genes. https://www.genecards.org/ (2025).
26. Stelzer, G. et al. The GeneCards suite: from gene data mining to disease genome sequence analyses. Curr. Protoc. Bioinform. 54, 1.30.1–1.30.33 (2016).
27. Elastic. What is Search Relevance? https://www.elastic.co/what-is/search-relevance (2025).
28. Hormozdiari, F. et al. Colocalization of GWAS and eQTL signals detects target genes. Am. J. Hum. Genet. 99, 1245–1260 (2016).
29. Broad Institute. GTEx Portal. https://www.gtexportal.org/home (2025).
30. Liang, X. et al. Association and interaction of TOMM40 and PVRL2 with plasma amyloid-β and Alzheimer’s disease among Chinese older adults: a population-based study. Neurobiol. Aging 113, 143–151 (2022).
31. Guardiola, M. et al. Metabolic overlap between Alzheimer’s disease and metabolic syndrome identifies the PVRL2 gene as a new modulator of diabetic dyslipidemia. Int. J. Mol. Sci. 24, 7415 (2023).
32. de Leeuw, C. A., Mooij, J. M. & Heskes, T. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput. Biol. 11, e1004219 (2015).
33. Tang, L. GWAS and eQTL disparity. Nat. Methods 20, 1873 (2023).
34. Mostafavi, H., Spence, J. P., Naqvi, S. & Pritchard, J. K. Systematic differences in discovery of genetic effects on gene expression and complex traits. Nat. Genet. 55, 1866–1875 (2023).
35. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
36. UK Biobank. Data-Field 41202. https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202 (2024).
37. Neale lab. UK Biobank GWAS Results. Neale lab http://www.nealelab.is/uk-biobank (2018).
38. Ostropolets, A. et al. Adapting electronic health records-derived phenotypes to claims data: lessons learned in using limited clinical data for phenotyping. J. Biomed. Inform. 102, 103363 (2020).
39. Xu, J. Mapping SNOMED CT to ICD-10-CM. (Rutgers University - School of Health Professions, 2016). https://doi.org/10.7282/T3H70HVK.
40. Odysseus Data Services, Inc. Resource 1420: OMOP Release Notes. https://biobank.ndph.ox.ac.uk/ukb/ukb/docs/omop_release_notes.pdf (2020).
41. Blacketer, C. The common data model. In The Book of OHDSI (eds Observational Health Data Sciences and Informatics) Ch. 4 (Observational Health Data Sciences and Informatics, 2021).
42. Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1110 (2013).
43. Rao, G. PhenotypeLibrary: The OHDSI Phenotype Library (2025).
44. Gilbert, J., Rao, G., Schuemie, M., Ryan, P. & Weaver, J. CohortDiagnostics: Diagnostics for OHDSI Cohorts (2025).
45. Rao, G. A. et al. CohortDiagnostics: phenotype evaluation across a network of observational data sources using population-level characterization. PLoS ONE 20, e0310634 (2025).
46. UK Biobank. Data releases. https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/past-data-releases (2025).
47. UK Biobank. Category 263. https://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=263 (2018).
48. Purcell, S. PLINK 2.0 (2025).
49. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
50. UK Biobank. Resource 1955. https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=1955 (2018).
51. Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
52. Zhou, W. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
53. NHS. Alzheimer’s disease. https://www.nhs.uk/conditions/alzheimers-disease/ (2024).
54. National Institute for Health and Care Excellence. Asthma Prevalence. https://cks.nice.org.uk/topics/asthma/background-information/prevalence/ (2025).
55. Stone, P. W., Hickman, K., Holmes, S., Feary, J. R. & Quint, J. K. Comparison of COPD primary care in England, Scotland, Wales, and Northern Ireland. NPJ Prim. Care Respir. Med. 32, 46 (2022).
56. National Institute for Health and Care Excellence. MI Prevalence. https://cks.nice.org.uk/topics/mi-secondary-prevention/background-information/prevalence/ (2025).
57. National Institute for Health and Care Excellence. Rheumatoid arthritis Prevalence and Incidence. https://cks.nice.org.uk/topics/rheumatoid-arthritis/background-information/prevalence-incidence/ (2025).
58. Rees, F. et al. The incidence and prevalence of systemic lupus erythematosus in the UK, 1999–2012. Ann. Rheum. Dis. 75, 136–141 (2016).
59. Public Health England. Diabetes Prevalence Model (Public Health England).
60. 1,000 Genomes haplotypes -- Phase 3 integrated variant set release in NCBI build 37 (hg19) coordinates (2014).
61. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015).
62. Monarch Initiative. Mondo Disease Ontology. https://mondo.monarchinitiative.org/ (2025).
63. Ensembl project. Ensembl REST API Endpoints. https://rest.ensembl.org/ (2025).
64. UK Biobank. Linkage Disequilibrium Matrices (UK Biobank).
65. Weiner, R. J., Lakhani, C., Knowles, D. A. & Gürsoy, G. LDmat: efficiently queryable compression of linkage disequilibrium matrices. Bioinformatics 39, btad092 (2023).
66. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
67. Weng, Z. ENCSR800VNX (The ENCODE Data Coordination Center, 2023).
68. Hinrichs, A. S. et al. The UCSC Genome Browser Database: update 2006. Nucleic Acids Res. 34, D590–D598 (2006).
69. Abugessaisa, I. et al. RefTSS: a reference data set for human and mouse transcription start sites. J. Mol. Biol. 431, 2407–2422 (2019).
70. LD directly from Alkes group server. Github https://github.com/omerwe/polyfun/issues/17 (2019).
71. National Center for Biotechnology Information. NCBI Gene. https://www.ncbi.nlm.nih.gov/gene (2025).
72. MAGMA. CNCR https://cncr-nl.ontw.stuurlui.dev/research/magma/ (2014).
73. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550 (2005).
74. Choi, S. W., Mak, T. S.-H. & O’Reilly, P. F. Tutorial: a guide to performing polygenic risk score analyses. Nat. Protoc. 15, 2759–2772 (2020).
© The Author(s) 2025. This work is published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Abstract
Biobanks are a rich source of data for genome-wide association studies (GWAS). They store clinical data from electronic health records, with data domains such as laboratory measurements, conditions, and self-reported diagnoses. Traditionally, biobank GWAS utilize case-control cohorts built exclusively from conditions. However, because reported conditions are primarily collected for billing purposes, they face data quality issues. Consequently, incorporating additional data domains in cohort construction can improve cohort accuracy and GWAS results. Here, we assess the impact of various rule-based phenotyping algorithms on GWAS outcomes, examining factors such as power, heritability, replicability, functional annotations, and polygenic risk score prediction accuracy across seven diseases in the UK Biobank. We find that high complexity phenotyping algorithms generally improve GWAS outcomes, including increased power, hits within coding and functional genomic regions, and co-localization with expression quantitative trait loci. Our findings suggest that biobank-scale GWAS can benefit from phenotyping algorithms that integrate multiple data domains.
Affiliations
1. Department of Biomedical Informatics, Columbia University, New York, NY, USA; New York Genome Center, New York, NY, USA.
2. Department of Biomedical Informatics, Columbia University, New York, NY, USA; New York Genome Center, New York, NY, USA; Department of Computer Science, Columbia University, New York, NY, USA.




