
Abstract

Reinforcement learning for large models is constrained in three practical ways that this dissertation addresses in sequence. First, we study policy optimization from off-policy data and show how estimating the density ratio (via a learned behavior policy) reduces the variance of importance-weighted objectives. This estimation step is not only central to off-policy bandits; it also underpins PPO/TRPO, whose hybrid update pattern performs multiple policy-improvement steps per batch and is therefore partially off-policy. Second, we establish a theoretical foundation for PPO/TRPO under high-capacity function approximation, proving global convergence with overparameterized neural critics and actors and quantifying the cost of policy evaluation/improvement per outer iteration. Third, we move beyond algorithmic foundations to an application in language-model post-training: for structured tasks such as text-to-SQL, we resolve reward scarcity by exploiting task structure to build execution-free reward models, enabling RL at the scale of SFT corpora that lack executable databases.
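
To make the density-ratio step concrete, below is a minimal, self-contained sketch of the idea rather than the dissertation's estimator: the unknown logging/behavior policy is replaced by a learned estimate, and the resulting importance weights are clipped, a standard variance-control device added here only for illustration. All function and variable names are hypothetical, and contexts are assumed to be small integer IDs.

```python
import numpy as np

def fit_behavior_policy(contexts, actions, n_actions, smoothing=1e-3):
    """Estimate mu_hat(a | x) by smoothed empirical action frequencies.

    Illustrative only: contexts are assumed to be small integer IDs, and any
    calibrated probabilistic classifier could play the same role.
    """
    counts = np.full((contexts.max() + 1, n_actions), smoothing)
    for x, a in zip(contexts, actions):
        counts[x, a] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def importance_weighted_value(contexts, actions, rewards,
                              target_probs, behavior_probs, clip=10.0):
    """Off-policy value estimate using density ratios pi(a|x) / mu_hat(a|x).

    Clipping the ratio trades a small bias for a large variance reduction.
    """
    ratios = target_probs[contexts, actions] / behavior_probs[contexts, actions]
    return float(np.mean(np.minimum(ratios, clip) * rewards))
```

Fitting the behavior model on the logged data and plugging it into the ratio is what the abstract refers to as estimating the density ratio via a learned behavior policy.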

The second part, Global Convergence of Neural Trust-Region / Proximal Policy Optimization, turns to the theoretical foundations of the most widely used online RL algorithm in both classic settings and language-model post-training. We analyze a variant of PPO/TRPO in which both the actor and the critic are overparameterized two-layer neural networks. We show that the algorithm converges to the globally optimal policy at a sublinear rate O(1/√K) in the number of outer policy-improvement iterations, and that each iteration admits polynomial-time policy evaluation and policy improvement: O(1/ε²) TD and SGD steps suffice to keep the approximation errors within the constants of the outer rate. This closes the gap between the practical PPO-style updates used in modern systems and a nonasymptotic convergence guarantee under expressive models.
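
As a reading aid only, and not the dissertation's exact theorem statement, the shape of the guarantee described above can be sketched as follows, with ε_eval and ε_improve standing for the inner-loop approximation errors:

```latex
% Schematic form of the guarantee (illustrative; constants and conditions omitted).
\[
  \frac{1}{K}\sum_{k=1}^{K}\bigl(J(\pi^{\ast}) - J(\pi_k)\bigr)
  \;\le\; \mathcal{O}\!\Bigl(\tfrac{1}{\sqrt{K}}\Bigr)
  \;+\; \varepsilon_{\mathrm{eval}} + \varepsilon_{\mathrm{improve}},
\]
% where K is the number of outer policy-improvement iterations and, per iteration,
% O(1/\varepsilon^2) TD (critic) and SGD (actor) steps suffice to keep
% \varepsilon_{\mathrm{eval}} and \varepsilon_{\mathrm{improve}} within the
% constants absorbed by the O(1/\sqrt{K}) term.
```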

The third part, Execution-Free RL for Structured Tasks in Language Model Post-Training, tackles the reward-availability problem that arises in RL post-training of LLMs. In current text-to-SQL corpora, the main cost is not running SQL but constructing or curating the databases and test suites needed to execute and compare generated queries; most labeled text–SQL pairs simply do not come with such databases. We introduce a graph-based evaluation metric (FuncEval-GMN) that parses SQL into relational operator trees using only the schema and then predicts functional equivalence with a graph matching network. This removes the need to build per-example databases and achieves higher AUC than exact-set or execution-based metrics on Spider, with competitive accuracy on WikiSQL and BIRD. Building on this evaluator, we develop Graph-Reward-SQL, an execution-free RL fine-tuning framework that supplies GMN-based outcome rewards together with stepwise rewards over common table expressions (CTEs); on Spider and BIRD it consistently outperforms execution-based and LLM-based reward models while cutting inference time and GPU usage, making RL feasible at SFT scale.
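
The sketch below shows, under assumed interfaces, how GMN-based outcome and stepwise rewards might be combined into a single scalar reward for RL fine-tuning. The callables `score_fn` (a trained graph matching network returning a functional-equivalence score in [0, 1]) and `split_ctes` (a splitter that breaks a query into its CTE steps), as well as the weighting, are hypothetical placeholders, not the Graph-Reward-SQL API.

```python
from typing import Callable, List

def graph_based_reward(
    pred_sql: str,
    gold_sql: str,
    schema: dict,
    score_fn: Callable[[str, str, dict], float],  # e.g. a trained graph matching network
    split_ctes: Callable[[str], List[str]],       # splits a query into its CTE steps
    outcome_weight: float = 0.7,
) -> float:
    """Combine an outcome reward on the full query with stepwise rewards over CTEs.

    `score_fn` is assumed to return a functional-equivalence score in [0, 1]
    computed from relational-operator-tree representations, so no database
    execution is needed. All names here are illustrative placeholders.
    """
    outcome = score_fn(pred_sql, gold_sql, schema)

    pred_steps, gold_steps = split_ctes(pred_sql), split_ctes(gold_sql)
    if pred_steps and gold_steps:
        n = min(len(pred_steps), len(gold_steps))
        stepwise = sum(
            score_fn(p, g, schema) for p, g in zip(pred_steps[:n], gold_steps[:n])
        ) / n
    else:
        stepwise = outcome  # fall back to the outcome score when there are no CTEs

    return outcome_weight * outcome + (1.0 - outcome_weight) * stepwise
```

Because the scorer only needs the schema and the two queries, no per-example database is required at reward time.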

Scope. Parts I and II were conducted at Northwestern University, where this author contributed substantively to the problem formulation, methodology, and theoretical analysis. Part III was completed at ByteDance; in that work, this author focused primarily on project guidance and oversight, including problem scoping, methodological review, and advising on experiment design and implementation.

Details

Business indexing term
1010268
Title
From Policy Optimization Foundations to Language Model Post-Training on Structured Tasks
Author
Number of pages
177
Publication year
2025
Degree date
2025
School code
0163
Source
DAI-B 87/6(E), Dissertation Abstracts International
ISBN
9798270226558
Committee member
Yang, Zhuoran; Rhee, Chang-Han
University/institution
Northwestern University
Department
Industrial Engineering and Management Sciences
University location
United States -- Illinois
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32285780
ProQuest document ID
3283762126
Document URL
https://www.proquest.com/dissertations-theses/policy-optimization-foundations-language-model/docview/3283762126/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic