
Abstract

Reinforcement learning for large models is constrained in three practical ways that this dissertation addresses in sequence. First, we study policy optimization from off-policy data and show how estimating the density ratio (via a learned behavior policy) reduces the variance of importance-weighted objectives. This estimation step is not only central to off-policy bandits; it also underpins PPO/TRPO, whose hybrid update pattern performs multiple policy-improvement steps per batch and is therefore partially off-policy. Second, we establish a theoretical foundation for PPO/TRPO under high-capacity function approximation, proving global convergence with overparameterized neural critics and actors and quantifying the cost of policy evaluation/improvement per outer iteration. Third, we move beyond algorithmic foundations to an application in language-model post-training: for structured tasks such as text-to-SQL, we resolve reward scarcity by exploiting task structure to build execution-free reward models, enabling RL at the scale of SFT corpora that lack executable databases.
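
To make the density-ratio step concrete, below is a minimal, self-contained sketch of the idea rather than the dissertation's estimator: the unknown logging/behavior policy is replaced by a learned estimate, and the resulting importance weights are clipped, a standard variance-control device added here only for illustration. All function and variable names are hypothetical, and contexts are assumed to be small integer IDs.

```python
import numpy as np

def fit_behavior_policy(contexts, actions, n_actions, smoothing=1e-3):
    """Estimate mu_hat(a | x) by smoothed empirical action frequencies.

    Illustrative only: contexts are assumed to be small integer IDs, and any
    calibrated probabilistic classifier could play the same role.
    """
    counts = np.full((contexts.max() + 1, n_actions), smoothing)
    for x, a in zip(contexts, actions):
        counts[x, a] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def importance_weighted_value(contexts, actions, rewards,
                              target_probs, behavior_probs, clip=10.0):
    """Off-policy value estimate using density ratios pi(a|x) / mu_hat(a|x).

    Clipping the ratio trades a small bias for a large variance reduction.
    """
    ratios = target_probs[contexts, actions] / behavior_probs[contexts, actions]
    return float(np.mean(np.minimum(ratios, clip) * rewards))
```

Fitting the behavior model on the logged data and plugging it into the ratio is what the abstract refers to as estimating the density ratio via a learned behavior policy.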

The second part, Global Convergence of Neural Trust-Region / Proximal Policy Optimization, turns to the theoretical foundations of the most widely used online RL algorithm in both classic settings and language-model post-training. We analyze a variant of PPO/TRPO in which both the actor and the critic are overparameterized two-layer neural networks. We show that the algorithm converges to the globally optimal policy at a sublinear rate O(1/√K) in the number of outer policy-improvement iterations, and that each iteration admits polynomial-time policy evaluation and policy improvement: O(1/ε²) TD and SGD steps suffice to keep the approximation errors within the constants of the outer rate. This closes the gap between the practical PPO-style updates used in modern systems and a nonasymptotic convergence guarantee under expressive models.
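
As a reading aid only, and not the dissertation's exact theorem statement, the shape of the guarantee described above can be sketched as follows, with ε_eval and ε_improve standing for the inner-loop approximation errors:

```latex
% Schematic form of the guarantee (illustrative; constants and conditions omitted).
\[
  \frac{1}{K}\sum_{k=1}^{K}\bigl(J(\pi^{\ast}) - J(\pi_k)\bigr)
  \;\le\; \mathcal{O}\!\Bigl(\tfrac{1}{\sqrt{K}}\Bigr)
  \;+\; \varepsilon_{\mathrm{eval}} + \varepsilon_{\mathrm{improve}},
\]
% where K is the number of outer policy-improvement iterations and, per iteration,
% O(1/\varepsilon^2) TD (critic) and SGD (actor) steps suffice to keep
% \varepsilon_{\mathrm{eval}} and \varepsilon_{\mathrm{improve}} within the
% constants absorbed by the O(1/\sqrt{K}) term.
```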

The third part, Execution-Free RL for Structured Tasks in Language Model Post-Training, tackles the reward-availability problem that arises in RL post-training of LLMs. In current text-to-SQL corpora, the main cost is not running SQL but constructing or curating the databases and test suites needed to execute and compare generated queries; most labeled text–SQL pairs simply do not come with such databases. We introduce a graph-based evaluation metric (FuncEval-GMN) that parses SQL into relational operator trees using only the schema and then predicts functional equivalence with a graph matching network. This removes the need to build per-example databases and achieves higher AUC than exact-set or execution-based metrics on Spider, with competitive accuracy on WikiSQL and BIRD. Building on this evaluator, we develop Graph-Reward-SQL, an execution-free RL fine-tuning framework that supplies GMN-based outcome rewards together with stepwise rewards over common table expressions (CTEs); on Spider and BIRD it consistently outperforms execution-based and LLM-based reward models while cutting inference time and GPU usage, making RL feasible at SFT scale.
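
The sketch below shows, under assumed interfaces, how GMN-based outcome and stepwise rewards might be combined into a single scalar reward for RL fine-tuning. The callables `score_fn` (a trained graph matching network returning a functional-equivalence score in [0, 1]) and `split_ctes` (a splitter that breaks a query into its CTE steps), as well as the weighting, are hypothetical placeholders, not the Graph-Reward-SQL API.

```python
from typing import Callable, List

def graph_based_reward(
    pred_sql: str,
    gold_sql: str,
    schema: dict,
    score_fn: Callable[[str, str, dict], float],  # e.g. a trained graph matching network
    split_ctes: Callable[[str], List[str]],       # splits a query into its CTE steps
    outcome_weight: float = 0.7,
) -> float:
    """Combine an outcome reward on the full query with stepwise rewards over CTEs.

    `score_fn` is assumed to return a functional-equivalence score in [0, 1]
    computed from relational-operator-tree representations, so no database
    execution is needed. All names here are illustrative placeholders.
    """
    outcome = score_fn(pred_sql, gold_sql, schema)

    pred_steps, gold_steps = split_ctes(pred_sql), split_ctes(gold_sql)
    if pred_steps and gold_steps:
        n = min(len(pred_steps), len(gold_steps))
        stepwise = sum(
            score_fn(p, g, schema) for p, g in zip(pred_steps[:n], gold_steps[:n])
        ) / n
    else:
        stepwise = outcome  # fall back to the outcome score when there are no CTEs

    return outcome_weight * outcome + (1.0 - outcome_weight) * stepwise
```

Because the scorer only needs the schema and the two queries, no per-example database is required at reward time.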

Scope. Parts I and II were conducted at Northwestern University, where this author contributed substantively to the problem formulation, methodology, and theoretical analysis. Part III was completed at ByteDance; in that work, this author focused primarily on project guidance and oversight, including problem scoping, methodological review, and advising on experiment design and implementation.

Details

Business indexing term
1010268
Title
From Policy Optimization Foundations to Language Model Post-Training on Structured Tasks
Author
Number of pages
177
Publication year
2025
Degree date
2025
School code
0163
Source
DAI-B 87/6(E), Dissertation Abstracts International
ISBN
9798270226558
Committee member
Yang, Zhuoran; Rhee, Chang-Han
University/institution
Northwestern University
Department
Industrial Engineering and Management Sciences
University location
United States -- Illinois
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32285780
ProQuest document ID
3283762126
Document URL
https://www.proquest.com/dissertations-theses/policy-optimization-foundations-language-model/docview/3283762126/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic