Advancing Causal Machine Learning for Metabolomic Biomarker Discovery and Cross-Species Gene Regulation
Abstract (summary)
The elucidation of robust, causal signals from high-dimensional “omics” data remains a central challenge in the post-genomic era. In metabolomics, traditional statistical techniques often recover confounded associations between biomarkers and disease that later fail to validate. In regulatory genomics, sequence-to-function (S2F) deep learning models are powerful yet often opaque and prone to poor out-of-distribution generalization. Both challenges stem from the same underlying issue: a reliance on observational data, in which correlative signals obscure causal mechanisms. This dissertation introduces two computational frameworks that advance causal machine learning for biomedical data. The first, the Integrative Model for Atherosclerotic Disease (IMAD), combines automated causal structure learning with statistical modeling to map dependencies among metabolomic, clinical, and demographic variables. Applied to a case-control study in Japan of cardiovascular disease (CVD), specifically atherosclerotic cardiovascular disease (ASCVD), IMAD improved classification performance, measured by area under the receiver operating characteristic curve (AUROC), relative to association-based models and isolated glutamic acid and trigonelline as putative direct effectors of these outcomes. The second, Mean and Correlation Alignment (MORALE), learns species-invariant sequence representations for transcription factor binding prediction by aligning distributional moments in latent space. This alignment can be readily embedded into any architecture and provides a foundation for a more advanced framework aimed at causally disentangling conserved regulatory signals from species-specific elements. Evaluated on liver ChIP-seq data from up to five mammals, MORALE yielded consistent gains in area under the precision-recall curve (auPRC) and avoided the performance degradation observed with adversarial domain adaptation based on a gradient reversal layer (GRL).
Collectively, these methods encourage the integration of causal principles, and of approaches compatible with them, to yield models that are both robust and generalizable, facilitating biomarker discovery and regulatory inference.
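To make the idea of aligning distributional moments in latent space concrete, the following is a minimal sketch of a mean-and-correlation alignment penalty between two batches of embeddings (for example, one batch per species). It is an illustration under stated assumptions, not the dissertation's implementation: the function name `moment_alignment_penalty`, the `eps` stabilizer, and the `1 / (4 d^2)` scaling of the correlation term are hypothetical choices, and NumPy stands in for whatever deep learning framework is actually used.

```python
import numpy as np


def moment_alignment_penalty(z_source, z_target, eps=1e-8):
    """Hypothetical moment-alignment penalty between two embedding batches.

    z_source, z_target: arrays of shape (n_samples, n_features), e.g. latent
    representations of sequences from two species. Returns a non-negative
    scalar that is zero when both batches share the same feature means and
    the same feature correlation matrix.
    """
    z_source = np.asarray(z_source, dtype=float)
    z_target = np.asarray(z_target, dtype=float)

    # First moment: squared Euclidean distance between per-feature means.
    mean_term = np.sum((z_source.mean(axis=0) - z_target.mean(axis=0)) ** 2)

    def correlation(z):
        # Standardize each feature, then form the feature correlation matrix.
        centered = z - z.mean(axis=0)
        standardized = centered / (centered.std(axis=0) + eps)
        return standardized.T @ standardized / z.shape[0]

    # Second moment: squared Frobenius distance between correlation matrices.
    diff = correlation(z_source) - correlation(z_target)
    n_features = z_source.shape[1]
    corr_term = np.sum(diff ** 2) / (4 * n_features ** 2)  # assumed scaling

    return mean_term + corr_term
```

In a training loop, a term of this kind would typically be added to the task loss with a weighting coefficient, so that the encoder is rewarded for producing embeddings whose low-order statistics match across species; the weighting and the choice of which layers to align are design decisions not specified in the abstract.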
Indexing (details)
Statistics; Artificial intelligence
0800: Artificial intelligence; 0463: Statistics