Abstract

This thesis develops a suite of novel statistical frameworks and tools for scalable symbolic regression (SR) with a focus on high-dimensional regimes. SR seeks to discover closed-form mathematical expressions that explain the relationship between a response and a set of predictors, offering both interpretability and predictive accuracy. Despite its appeal, SR remains computationally challenging, particularly in large-p settings where the combinatorial explosion of the model search space can render existing methods intractable.

The first chapter formulates SR as an ultra-high-dimensional Operator-Induced Structural linear regression problem. To navigate this vast model space efficiently, we introduce parametrics assisted by nonparametrics (PAN), an iterative framework utilizing nonparametric variable selection to enable scalable SR. We instantiate PAN as iBART, which alternates between Bayesian additive regression trees (BART)-based variable selection and feature synthesis. This iterative dimension reduction shrinks the search space to promising subspaces, significantly improving scalability and accuracy. Simulations demonstrate that iBART is reliable and efficient. In an application to single-atom catalysis, iBART identifies meaningful descriptors, offering insights into sintering-free catalyst design.
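The alternation between variable selection and feature synthesis described above can be illustrated with a minimal sketch. It is not the iBART implementation: absolute Pearson correlation is swapped in for BART-based selection so the example runs on the standard library alone, and the operator set (squares and pairwise products), the function names, and the `rounds` and `keep` parameters are all assumptions made for illustration.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation: a simple stand-in for a BART importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def select(features, y, keep):
    """Keep the `keep` highest-scoring features, best first (stand-in selection step)."""
    ranked = sorted(features, key=lambda name: abs_corr(features[name], y), reverse=True)
    return {name: features[name] for name in ranked[:keep]}


def synthesize(features):
    """Apply a small, assumed operator set (square, product) to the surviving features."""
    out = dict(features)
    names = sorted(features)
    for a in names:
        out[f"({a})^2"] = [v * v for v in features[a]]
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            out[f"({a}*{b})"] = [u * v for u, v in zip(features[a], features[b])]
    return out


def iterate(features, y, rounds=2, keep=3):
    """Alternate selection and synthesis, then do a final selection."""
    for _ in range(rounds):
        features = select(features, y, keep)
        features = synthesize(features)
    return select(features, y, keep)


# Toy data: 10 positive predictors, response is the product of the first two.
rng = random.Random(0)
n, p = 200, 10
X = {f"x{j}": [rng.uniform(1.0, 2.0) for _ in range(n)] for j in range(p)}
y = [a * b for a, b in zip(X["x0"], X["x1"])]

result = iterate(X, y)
print(list(result))  # surviving feature names, best first
```

Because each round synthesizes new candidates only from the few features that survive selection, the candidate set stays small even as expressions grow more complex, which is the dimension-reduction idea the abstract describes.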

The second chapter addresses the scalability bottlenecks of existing SR algorithms in the large-p regime, a setting increasingly common in modern scientific applications. We propose PAN+SR, a two-stage framework that integrates ab initio nonparametric variable selection with any SR algorithm. Its selection step is a novel clustering-based method operating on variable inclusion proportion ranks, which efficiently reduces dimensionality while minimizing false negatives, a key requirement for symbolic recovery. To evaluate PAN+SR in large-p settings, we design an SR benchmark comprising 35 real-world datasets and 100 synthetic datasets based on nonlinear equations in the Feynman Lectures on Physics. PAN+SR consistently improves the performance of 19 SR algorithms across diverse settings.
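One way to picture clustering on variable inclusion proportion (VIP) ranks is the sketch below. The abstract does not specify the thesis's actual procedure, so everything here is an assumption: VIPs from several repeated runs are converted to within-run ranks, the ranks are averaged, and an exact one-dimensional two-cluster split separates the well-ranked variables from the rest. The function names and the toy VIP values are hypothetical.

```python
def ranks_desc(values):
    """Rank values so the largest gets rank 1 (ties broken by index)."""
    order = sorted(range(len(values)), key=lambda j: -values[j])
    ranks = [0] * len(values)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks


def mean_ranks(vip_runs):
    """Average each variable's within-run rank across all runs."""
    p = len(vip_runs[0])
    per_run = [ranks_desc(run) for run in vip_runs]
    return [sum(r[j] for r in per_run) / len(per_run) for j in range(p)]


def sse(xs):
    """Sum of squared deviations from the mean."""
    if not xs:
        return 0.0
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def two_cluster_select(scores):
    """Exact 1-D two-cluster split; return indices in the low-score cluster."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    sorted_scores = [scores[j] for j in order]
    best_k, best_cost = 1, float("inf")
    for k in range(1, len(order)):
        cost = sse(sorted_scores[:k]) + sse(sorted_scores[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return sorted(order[:best_k])


# Hypothetical VIPs from three runs over eight predictors (each run sums to 1).
vip_runs = [
    [0.30, 0.04, 0.06, 0.28, 0.10, 0.05, 0.09, 0.08],
    [0.27, 0.09, 0.05, 0.31, 0.04, 0.10, 0.06, 0.08],
    [0.29, 0.06, 0.10, 0.30, 0.07, 0.04, 0.05, 0.09],
]

selected = two_cluster_select(mean_ranks(vip_runs))
print(selected)  # → [0, 3]
```

Working on ranks rather than raw proportions makes the split insensitive to the scale of the VIPs in any single run, and letting the data choose the cut point avoids fixing the number of retained variables in advance.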

In the third chapter, we address two fundamental challenges in BART-based variable selection: high computational burden and unstable selection accuracy. We provide a comprehensive review of existing variable importance metrics and introduce a new measure based on variable count and rank statistics. Extensive numerical experiments show that the proposed measure consistently outperforms 7 existing BART-based methods across diverse settings. Its accuracy, robustness, and efficiency make it suitable for both recall-oriented screening and precision-focused selection.

Collectively, these contributions bridge nonparametric statistics and symbolic modeling, advancing the foundations of model-free variable selection and interpretable modeling in high-dimensional regimes.

Details

Title
From Bayesian Nonparametrics to Symbols: Scalable and Accurate Model-Free Variable Selection for High-Dimensional Symbolic Regression
Number of pages
202
Publication year
2025
Degree date
2025
School code
0187
Source
DAI-B 87/4(E), Dissertation Abstracts International
ISBN
9798297610071
Advisor
Committee member
Luo, Hengrui; Senftle, Thomas; Chen, Ken
University/institution
Rice University
Department
Statistics
University location
United States -- Texas
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32331921
ProQuest document ID
3258778160
Document URL
https://www.proquest.com/dissertations-theses/bayesian-nonparametrics-symbols-scalable-accurate/docview/3258778160/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic