Abstract

Machine learning research has historically focused on algorithmic improvements, with better training methods driving innovation. Because the data available for training these models was often limited, research aimed at making better use of these relatively small datasets. More recently, the focus has shifted from iterating on algorithms to iterating on the data itself. Techniques such as data filtering, reannotation, and data mixing, among many others, are now commonly used to improve training data.

In this work, we will examine ways to increase the quality of datasets in the image-text and language domains, with a focus on dataset curation and filtering. Given the amount of readily available data on the web, these techniques can be reliably applied to increase downstream model performance, with the improvement stemming directly from the higher-quality datasets the models were trained on. We will also examine dataset curation from the perspective of synthetic dataset generation for language model fine-tuning. These synthetic datasets can be used to distill reasoning capabilities from large, high-performing reasoning models into smaller, more compact ones, improving their usability and reducing inference costs.

Details

1010268
Business indexing term
Title
Data Curation for Foundation Model Training
Number of pages
286
Publication year
2025
Degree date
2025
School code
0227
Source
DAI-A 87/6(E), Dissertation Abstracts International
ISBN
9798270231736
Committee member
Schmidt, Ludwig; Sanghavi, Sujay; Shakkottai, Sanjay; Tamir, Jonathan
University/institution
The University of Texas at Austin
Department
Electrical and Computer Engineering
University location
United States -- Texas
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32459173
ProQuest document ID
3284362875
Document URL
https://www.proquest.com/dissertations-theses/data-curation-foundation-model-training/docview/3284362875/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic