This thesis develops a suite of novel statistical frameworks and tools for scalable symbolic regression (SR), with a focus on high-dimensional regimes. SR seeks to discover closed-form mathematical expressions that explain the relationship between a response and a set of predictors, offering both interpretability and predictive accuracy. Despite its appeal, SR remains computationally challenging, particularly in large-p settings where the combinatorial explosion of the model search space can render existing methods intractable.
The first chapter formulates SR as an ultra-high-dimensional operator-induced structural linear regression problem. To navigate this vast model space efficiently, we introduce parametrics assisted by nonparametrics (PAN), an iterative framework that uses nonparametric variable selection to enable scalable SR. We instantiate PAN as iBART, which alternates between Bayesian additive regression trees (BART)-based variable selection and operator-induced feature synthesis. This iterative dimension reduction shrinks the search space to promising subspaces, substantially improving scalability and accuracy. Simulations demonstrate that iBART is both reliable and computationally efficient. In an application to single-atom catalysis, iBART identifies meaningful descriptors, offering insights into sintering-free catalyst design.
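As a rough illustration of the alternating structure behind PAN and iBART, the sketch below interleaves nonparametric variable screening with operator-induced feature synthesis before a final sparse linear fit. It is a minimal sketch under stated assumptions, not the iBART implementation: a random forest importance screen stands in for BART-based selection, the operator set is limited to squares, reciprocals, and pairwise products, and all function names and tuning parameters (pan_sr_sketch, synthesize, n_iter, keep) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

def synthesize(X, names):
    """Apply a small operator set (square, reciprocal, pairwise product) to current features."""
    feats, new_names = [], []
    p = X.shape[1]
    for j in range(p):
        feats.append(X[:, j] ** 2)
        new_names.append(f"({names[j]})^2")
        inv = np.zeros_like(X[:, j])
        nz = np.abs(X[:, j]) > 1e-8
        inv[nz] = 1.0 / X[nz, j]
        feats.append(inv)
        new_names.append(f"1/({names[j]})")
    for j in range(p):
        for k in range(j + 1, p):
            feats.append(X[:, j] * X[:, k])
            new_names.append(f"({names[j]})*({names[k]})")
    return np.column_stack([X] + feats), names + new_names

def pan_sr_sketch(X, y, n_iter=3, keep=5):
    """Alternate nonparametric screening and feature synthesis, then fit a sparse linear model."""
    X = np.asarray(X, dtype=float)
    names = [f"x{j}" for j in range(X.shape[1])]
    for _ in range(n_iter):
        # Screening step: keep only the most promising features (stand-in for BART selection).
        rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
        top = np.argsort(rf.feature_importances_)[::-1][:keep]
        X, names = X[:, top], [names[j] for j in top]
        # Synthesis step: grow the feature space only around the survivors.
        X, names = synthesize(X, names)
    lasso = LassoCV(cv=5).fit(X, y)
    return [(n, c) for n, c in zip(names, lasso.coef_) if abs(c) > 1e-6]

# Toy usage: the response depends on x0*x1 and x2^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] * X[:, 1] + X[:, 2] ** 2 + 0.1 * rng.normal(size=200)
print(pan_sr_sketch(X, y))  # typically surfaces terms built from x0*x1 and (x2)^2
```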
The second chapter addresses the scalability bottlenecks of existing SR algorithms in the large-p regime, a setting increasingly common in modern scientific applications. We propose PAN+SR, a two-stage framework that integrates ab initio nonparametric variable selection with any SR algorithm. At its core is a novel clustering-based selection method that operates on variable inclusion proportion ranks, efficiently reducing dimensionality while minimizing false negatives, a key requirement for symbolic recovery. To evaluate PAN+SR in large-p settings, we design an SR benchmark comprising 35 real-world datasets and 100 synthetic datasets based on nonlinear equations from the Feynman Lectures on Physics. PAN+SR consistently improves the performance of 19 SR algorithms across diverse settings.
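To convey the flavor of the screening stage, here is a hypothetical sketch of a rank-based two-cluster screen: variable inclusion proportions from repeated nonparametric fits are converted to within-replicate ranks, and 2-means clustering on the average ranks separates variables to keep from those to drop, erring toward retention so that false negatives stay rare. The function name, the use of k-means, and the replicate structure are illustrative assumptions and do not reproduce the exact PAN+SR selection rule.

```python
import numpy as np
from sklearn.cluster import KMeans

def rank_screen(inclusion_props):
    """
    inclusion_props: (n_replicates, p) array of variable inclusion proportions,
    e.g. from repeated nonparametric fits.  Returns indices of variables to keep.
    """
    props = np.asarray(inclusion_props, dtype=float)
    # Rank variables within each replicate (rank 1 = most frequently included).
    ranks = (-props).argsort(axis=1).argsort(axis=1) + 1
    mean_rank = ranks.mean(axis=0)
    # Two-cluster split on average ranks; keep the cluster containing the top-ranked variable.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(mean_rank.reshape(-1, 1))
    keep_label = km.labels_[np.argmin(mean_rank)]
    return np.where(km.labels_ == keep_label)[0]

# Example: 3 informative variables with high inclusion proportions, 7 spurious ones.
rng = np.random.default_rng(0)
props = np.hstack([rng.uniform(0.2, 0.4, size=(20, 3)),
                   rng.uniform(0.0, 0.05, size=(20, 7))])
print(rank_screen(props))  # typically keeps indices 0, 1, 2
```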
In the third chapter, we address two fundamental challenges in BART-based variable selection: high computational burden and unstable selection accuracy. We provide a comprehensive review of existing variable importance metrics and introduce a new measure based on variable count and rank statistics. Extensive numerical experiments show that the proposed measure consistently outperforms 7 existing BART-based methods across diverse settings. Its accuracy, robustness, and efficiency make it suitable for both recall-oriented screening and precision-focused selection.
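As a loose illustration of combining variable counts with rank statistics, the sketch below tallies how often each variable is used as a split across replicated tree-ensemble fits and averages its within-fit rank; variables with consistently small average ranks would be flagged. A random forest serves as a stand-in for BART here, and this is not the measure proposed in the thesis; the function name and replicate scheme are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def split_count_ranks(X, y, n_replicates=10):
    """Average within-fit rank of each variable's split count (rank 1 = most used)."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    rank_sum = np.zeros(p)
    for r in range(n_replicates):
        rf = RandomForestRegressor(n_estimators=200, random_state=r).fit(X, y)
        counts = np.zeros(p)
        for tree in rf.estimators_:
            feats = tree.tree_.feature
            used = feats[feats >= 0]                  # negative entries mark leaf nodes
            counts += np.bincount(used, minlength=p)
        rank_sum += (-counts).argsort().argsort() + 1  # within-fit rank of each variable
    return rank_sum / n_replicates

# Variables whose average rank sits well below (p + 1) / 2 are candidates for selection.
```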
Collectively, these contributions bridge nonparametric statistics and symbolic modeling, advancing the foundations of model-free variable selection and interpretable modeling in high-dimensional regimes.