LLMs have demonstrated remarkable capabilities, and there is growing interest in using them as agents—systems that can translate complex human goals, expressed in natural language, into sequences of actions within digital environments like web browsers. Achieving this requires two core competencies: first, the ability to understand arbitrary and compositional language inputs; and second, the capacity to learn about unfamiliar environments so that language goals can be grounded in effective, multi-step decision-making. This thesis addresses both of these challenges.
In the first part, I introduce Tree Projections, a framework for understanding how transformers build compositional structure. I then present a series of results based on Tree Projections that illuminate the mechanisms behind compositional generalization, grokking, and sample-efficient learning in transformers. While Tree Projections help explain successful generalization, prior work has shown that standard transformers struggle with deep recursion due to a lack of mechanisms for unbounded hierarchical depth. To address this, I propose Pushdown Layers, an architectural augmentation that adds a stack-based memory to transformers. Pushdown Layers improve sample efficiency and generalization on tasks requiring nested or recursive reasoning.
In the second part, I introduce NNetNav and BAGEL, methods for unsupervised, open-ended exploration in web environments that enable models to automatically collect training data for new websites, without human supervision. Our best results come from fine-tuning LLMs with demonstrations collected via NNetNav, which uses the hierarchical structure of language to guide exploration policies. Using NNetNav, we collect 10,000 demonstrations from 20 real-world websites and fine-tune an 8B model, setting a new state-of-the-art among unsupervised methods and outperforming zero-shot GPT-4 on multiple browser benchmarks.
Taken together, these contributions bring us closer to digital language agents that can both handle the complexity of language instructions and autonomously learn from interacting with their environments.