Abstract

Pretrained Language Models (LMs) have demonstrated remarkable general-purpose capabilities by encoding vast amounts of knowledge from the internet. However, effectively steering these models to serve diverse downstream applications, such as following instructions, chatting with users, using tools, or performing complex reasoning, poses a distinct set of challenges that require diverse, high-quality, and increasingly costly training data.

This dissertation explores scalable paradigms for structuring, creating, and optimizing data to facilitate the broader generalization of language models and enhance their critical capabilities. First, through the creation of the Super-NaturalInstructions benchmark—a large-scale dataset with over 1,600 NLP tasks—I demonstrate that unifying NLP tasks via natural language instructions enables model generalization at the task level. Second, I propose Self-Instruct, a novel framework where LMs generate their own instructional data to train themselves, thereby demonstrating model self-improvement. Third, I develop HyPER, a framework that routes preference annotation tasks between humans and AI to optimize data quality and collection efficiency for preference-based learning. Finally, I systematically study the impact of diverse open instruction-tuning datasets on LM capabilities, leading to the development of the Tülu series of openly available and highly capable models.

Together, these efforts—unifying task structures, leveraging model-generated synthetic data, optimizing human-AI data partnerships, and fostering open data ecosystems—have demonstrated an effective path to building a strong, scalable, and community-driven data foundation for post-training language models. Finally, I envision future directions that can further enhance this data foundation for building more advanced and sustainable AI systems.

Details

Title
Scalable Data Paradigms for Steering General-Purpose Language Models
Number of pages
200
Publication year
2026
Degree date
2026
School code
0250
Source
DAI-B 87/1(E), Dissertation Abstracts International
ISBN
9798288834073
Committee member
Koh, Pang Wei; Steinert-Threlkeld, Shane
University/institution
University of Washington
Department
Computer Science and Engineering
University location
United States -- Washington
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32115931
ProQuest document ID
3230023261
Document URL
https://www.proquest.com/dissertations-theses/scalable-data-paradigms-steering-general-purpose/docview/3230023261/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic