Abstract

Pretrained Language Models (LMs) have demonstrated remarkable general-purpose capabilities by encoding vast amounts of knowledge from the internet. However, effectively steering these models toward diverse downstream applications, such as following instructions, chatting with users, using tools, or performing complex reasoning, poses a distinct set of challenges that demand diverse, high-quality, and increasingly costly training data.

This dissertation explores scalable paradigms for structuring, creating, and optimizing data to facilitate the broader generalization of language models and enhance their critical capabilities. First, through the creation of the Super-NaturalInstructions benchmark—a large-scale dataset with over 1,600 NLP tasks—I demonstrate that unifying NLP tasks via natural language instructions enables model generalization at the task level. Second, I propose Self-Instruct, a novel framework where LMs generate their own instructional data to train themselves, thereby demonstrating model self-improvement. Third, I develop HyPER, a framework that routes preference annotation tasks between humans and AI to optimize data quality and collection efficiency for preference-based learning. Finally, I systematically study the impact of diverse open instruction-tuning datasets on LM capabilities, leading to the development of the Tülu series of openly available and highly capable models.

Together, these efforts—unifying task structures, leveraging model-generated synthetic data, optimizing human-AI data partnerships, and fostering open data ecosystems—demonstrate an effective path toward building a strong, scalable, and community-driven data foundation for post-training language models. I conclude by envisioning future directions that can further strengthen this data foundation for building more advanced and sustainable AI systems.

Details

Title
Scalable Data Paradigms for Steering General-Purpose Language Models
Author
Wang, Yizhong
Publication year
2026
Publisher
ProQuest Dissertations & Theses
ISBN
9798288834073
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
3230023261
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.