
Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, and there is growing interest in using them as agents—systems that translate complex human goals, expressed in natural language, into sequences of actions within digital environments such as web browsers. Achieving this requires two core competencies: first, the ability to understand arbitrary and compositional language inputs; and second, the capacity to learn about unfamiliar environments so that language goals can be grounded in effective, multi-step decision-making. This thesis addresses both challenges.

In the first part, I introduce Tree Projections, a framework for understanding how transformers build compositional structure. I then present a series of results based on Tree Projections that illuminate the mechanisms behind compositional generalization, grokking, and sample-efficient learning in transformers. While Tree Projections help explain successful generalization, prior work has shown that standard transformers struggle with deep recursion due to a lack of mechanisms for unbounded hierarchical depth. To address this, I propose Pushdown Layers, an architectural augmentation that adds a stack-based memory to transformers. Pushdown Layers improve sample efficiency and generalization on tasks requiring nested or recursive reasoning.

In the second part, I introduce NNetNav and BAGEL, methods for unsupervised, open-ended exploration in web environments that enable models to automatically collect training data for new websites without human supervision. The best results come from fine-tuning LLMs on demonstrations collected via NNetNav, which uses the hierarchical structure of language to guide exploration policies. Using NNetNav, I collect 10,000 demonstrations from 20 real-world websites and fine-tune an 8B model, setting a new state of the art among unsupervised methods and outperforming zero-shot GPT-4 on multiple browser benchmarks.

Taken together, these contributions bring us closer to digital language agents that can both handle the complexity of language instructions and autonomously learn from interacting with their environments.

Details

Title: Building the Learning-From-Interaction Pipeline for Large Language Models
Number of pages: 159
Publication year: 2025
Degree date: 2025
School code: 0212
Source: DAI-B 87/1(E), Dissertation Abstracts International
ISBN: 9798288815157
Committee members: Andreas, Jacob; Hashimoto, Tatsunori
University/institution: Stanford University
University location: United States -- California
Degree: Ph.D.
Source type: Dissertation or Thesis
Language: English
Document type: Dissertation/Thesis
Dissertation/thesis number: 32201011
ProQuest document ID: 3238016459
Document URL: https://www.proquest.com/dissertations-theses/building-learning-interaction-pipeline-large/docview/3238016459/se-2?accountid=208611
Copyright: Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database: ProQuest One Academic