In the race towards Artificial General Intelligence, data is the fuel that powers our most advanced models. Vision-Language Models like LLaVA and CLIP are trained on billions of image-text pairs, while Large Language Models (LLMs) like GPT and Claude may process trillions of text tokens. Despite this abundance of data, ensuring its quality and curating it effectively remains more of an art than a science. Curation must handle real-world data that is multimodal, noisy, and has no guaranteed relationship to the target tasks. The challenge is further compounded by the complex training dynamics of neural networks, in which the value of each data point depends heavily on the evolving state of model training.
Without principled guidance, these challenges often create systematic blind spots, and their impact remains unclear due to a lack of theoretical understanding. My research aims to develop theoretical foundations for data curation by designing theory-inspired algorithms under realistic assumptions and by establishing systematic empirical evaluation frameworks that expose the limitations of existing methods. It spans four directions: 1/ target-aware data curation in pretraining, 2/ label-efficient finetuning, 3/ inference-efficient data synthesis, and 4/ interactive learning theory.