Abstract

In the race towards Artificial General Intelligence, data is the fuel that powers our most advanced models. Vision-Language Models like LLaVA and CLIP are trained on billions of image-text pairs, while Large Language Models (LLMs) like GPT and Claude may process trillions of text samples. Despite the abundance of data, ensuring its quality and effective curation remains more of an art than a science. This process must manage real-world data that is multimodal, noisy, and lacks a guaranteed relationship to target tasks. Furthermore, the difficulty is compounded by the complex training dynamics of neural networks, where the value of each data point depends heavily on the evolving state of model training.

Without principled guidance, these challenges often create systematic blind spots, and their impact remains unclear due to a lack of theoretical understanding. My research aims to develop theoretical foundations for data curation by designing theory-inspired algorithms under realistic assumptions and by establishing systematic empirical evaluation frameworks to understand the limitations of existing methods. It covers four areas: (1) target-aware data curation in pretraining, (2) label-efficient finetuning, (3) inference-efficient data synthesis, and (4) interactive learning theories.

Details

1010268
Business indexing term
Title
Algorithmic Data Efficient Learning in the Era of Large Model
Number of pages
429
Publication year
2025
Degree date
2025
School code
0250
Source
DAI-B 87/3(E), Dissertation Abstracts International
ISBN
9798293849888
Committee member
Koh, Pang Wei; Wang, Yingfei
University/institution
University of Washington
Department
Computer Science and Engineering
University location
United States -- Washington
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32117037
ProQuest document ID
3251632236
Document URL
https://www.proquest.com/dissertations-theses/algorithmic-data-efficient-learning-era-large/docview/3251632236/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic