Abstract

The first project addresses estimating correlations among higher-level biological variables, such as proteins and gene pathways, when only lower-level measurements (e.g., peptides, genes) are directly observed. Traditional methods aggregate these lower-level data, but the choice of aggregation method varies and affects the resulting correlation estimates. A latent factor model is proposed to estimate the higher-level correlations directly, without aggregation. Additionally, a shrinkage estimator is introduced to ensure positive definiteness and enhance accuracy. The estimator's asymptotic normality is proven, facilitating efficient computation of p-values for identifying significant correlations. The method's efficacy is demonstrated via comprehensive simulations and analyses of proteomics and gene expression datasets, and the method is implemented in the R package highcor.
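
To illustrate why shrinkage can restore positive definiteness, the following is a minimal sketch of generic linear shrinkage toward the identity matrix. It is not the dissertation's estimator and does not use the highcor package; the function names, the example matrix, and the shrinkage weight of 0.5 are all assumptions chosen for illustration.

```python
import math


def shrink_toward_identity(corr, lam):
    """Return (1 - lam) * corr + lam * I for a square matrix given as nested lists."""
    p = len(corr)
    return [
        [(1.0 - lam) * corr[i][j] + (lam if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]


def is_positive_definite(mat):
    """Attempt a Cholesky factorization; it succeeds only for positive definite input."""
    n = len(mat)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                d = mat[i][i] - s
                if d <= 0.0:
                    return False  # non-positive pivot: not positive definite
                chol[i][j] = math.sqrt(d)
            else:
                chol[i][j] = (mat[i][j] - s) / chol[j][j]
    return True


# Hypothetical pairwise estimates that do not form a valid correlation matrix.
raw = [
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
]
shrunk = shrink_toward_identity(raw, lam=0.5)

print(is_positive_definite(raw), is_positive_definite(shrunk))
```

In this toy case the raw matrix has a negative eigenvalue, and pulling the off-diagonal entries halfway toward zero is enough to make every eigenvalue positive. A practical estimator would choose the shrinkage weight from the data rather than fix it by hand.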

The second project examines microbial co-metabolism within the gut microbiome, crucial for understanding host-microbiome interactions influencing human health. Recent paired microbiome-metabolome studies (PM2S) suggest strong correlations among certain gut metabolites due to shared microbial pathways. However, confounding factors complicate these observations. To address this, a microbial correlation metric based on a partially linear model is proposed, isolating microbial-driven associations while accounting for confounders. The proposed estimator achieves semi-parametric consistency, and a calibrated estimator attaining parametric efficiency is further developed using external metagenomic datasets. This calibration facilitates precise p-value computations for significant microbial co-metabolism detection. Extensive numerical analysis confirms the method's enhanced precision, advancing understanding of microbiome-driven metabolic interactions.
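
The effect of a confounder on metabolite correlations can be sketched with simulated data. The example below is a simplified stand-in, not the proposed microbial correlation metric: it removes a single confounder by ordinary linear regression and correlates the residuals, whereas a partially linear model would allow a more flexible adjustment. All variable names and simulation settings are assumptions.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, z):
    """Residuals of y after a simple least-squares regression on z."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    slope = sum((a - mz) * (b - my) for a, b in zip(z, y)) / sum(
        (a - mz) ** 2 for a in z
    )
    intercept = my - slope * mz
    return [b - (intercept + slope * a) for a, b in zip(z, y)]


# Simulated example: two metabolites share a confounder but have independent noise.
rng = random.Random(0)
confounder = [rng.gauss(0.0, 1.0) for _ in range(500)]
metab_a = [2.0 * z + rng.gauss(0.0, 1.0) for z in confounder]
metab_b = [2.0 * z + rng.gauss(0.0, 1.0) for z in confounder]

raw_corr = pearson(metab_a, metab_b)
adj_corr = pearson(residuals(metab_a, confounder), residuals(metab_b, confounder))

print(f"raw correlation: {raw_corr:.3f}, confounder-adjusted: {adj_corr:.3f}")
```

Here the raw correlation is large only because both metabolites depend on the confounder; once it is regressed out, the remaining correlation is close to zero. This is the kind of spurious association that a confounder-aware metric is meant to exclude.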

The third project focuses on detecting spatially variable genes using spatial transcriptomics, which enables comprehensive spatial profiling of gene expression. Conventional methods, such as SpatialDE and SPARK, rely on global variance measures and can therefore miss genes with localized expression peaks. A peak-centric algorithm is introduced that identifies prominent local maxima and assesses their statistical significance via Gaussian tail probabilities. Validation on a mouse olfactory bulb dataset demonstrates the method's effectiveness in highlighting genes specific to hippocampal substructures, offering novel insights into spatial gene regulation and tissue organization.
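
The general idea of scoring local expression peaks with a Gaussian tail probability can be sketched as follows. This is an assumed, simplified version, not the dissertation's algorithm: it treats expression as values on a regular grid, calls a spot a peak if it strictly exceeds all of its up to eight neighbors, and standardizes against the mean and standard deviation of the whole grid. The function names and the toy grid are hypothetical.

```python
import math
import statistics


def local_maxima(grid):
    """Return (row, col) positions whose value strictly exceeds every neighbor."""
    rows, cols = len(grid), len(grid[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(grid[r][c] > v for v in neighbors):
                peaks.append((r, c))
    return peaks


def gaussian_upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def peak_p_values(grid):
    """Map each local maximum to the upper-tail probability of its z-score."""
    values = [v for row in grid for v in row]
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: global spread as the null scale
    return {
        (r, c): gaussian_upper_tail((grid[r][c] - mu) / sd)
        for r, c in local_maxima(grid)
    }


# Toy 5x5 expression grid: flat background with one sharp peak in the center.
expression = [[1.0] * 5 for _ in range(5)]
expression[2][2] = 9.0

p_values = peak_p_values(expression)
print(p_values)
```

A single sharp peak like this barely changes a global variance summary when the tissue is large, yet it receives a very small tail probability here. A realistic implementation would need a null distribution that is not inflated by the peak itself and a correction for testing many genes and locations.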

Details

Title
Statistical Models in Metabolomics, Spatial Transcriptomics and Proteomics Data
Author
Shi, Haoran
Publication year
2025
Publisher
ProQuest Dissertations & Theses
ISBN
9798314876503
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
3202581705
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.