Machine learning research has historically focused on algorithmic improvements, with better training methods driving innovation. Because the amount of data available for training these models was often limited, research aimed at improving the way these relatively small amounts of data could be used. More recently, this focus has shifted from iteration on the algorithms to iteration on the data itself. Techniques such as data filtering, reannotation, and data mixing, among many others, are now routinely used to improve the training data directly.
In this work, we will examine ways to increase the quality of datasets in the image-text and language domains, with a focus on dataset curation and filtering. Given the amount of readily available data on the web, these techniques can be reliably applied to increase the downstream performance of models, with the improvement stemming directly from the higher-quality datasets the models are trained on. We will also examine dataset curation through the lens of synthetic dataset generation for language model fine-tuning. Such synthetic datasets can be used to distill reasoning capabilities from large, high-performing reasoning models into smaller, more compact ones, improving their usability and reducing inference costs.
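To make the filtering idea concrete, the sketch below shows one widely used instantiation: scoring image-text pairs by the cosine similarity of their CLIP embeddings and keeping only pairs above a threshold. This is a minimal sketch, not the specific pipeline used in this work; the Hugging Face `transformers` checkpoint and the threshold value of 0.28 are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP variant with image and text towers works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings of one pair."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()

def filter_pairs(pairs, threshold=0.28):
    """Keep only (image, caption) pairs whose similarity exceeds the threshold.

    The threshold here is an assumed value; in practice it is tuned on held-out
    data or set to retain a fixed fraction of the candidate pool.
    """
    return [(img, cap) for img, cap in pairs if clip_score(img, cap) >= threshold]
```

Scoring one pair at a time is shown for clarity; a web-scale pipeline would batch the forward passes and typically select a fixed top fraction of pairs rather than use an absolute cutoff.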
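For the distillation setting, the following sketch illustrates the basic loop under stated assumptions: a teacher model is prompted to produce step-by-step reasoning traces, and each (question, trace) pair becomes a supervised fine-tuning example for a smaller student. The checkpoint name `teacher-reasoning-model` and the prompt format are hypothetical placeholders, not the setup used in this work.

```python
from transformers import pipeline

# Hypothetical checkpoint name; substitute any strong reasoning model.
teacher = pipeline("text-generation", model="teacher-reasoning-model")

def make_distillation_example(question: str) -> dict:
    """Sample a step-by-step reasoning trace from the teacher for one question."""
    prompt = f"Question: {question}\nThink step by step, then give the final answer.\n"
    # return_full_text=False strips the prompt, leaving only the generated trace.
    out = teacher(prompt, max_new_tokens=512, return_full_text=False)
    return {"prompt": question, "completion": out[0]["generated_text"]}

# Each entry becomes one fine-tuning example for the student model.
questions = ["A train travels 60 km in 45 minutes. What is its average speed in km/h?"]
dataset = [make_distillation_example(q) for q in questions]
```

In practice, such pipelines usually verify or filter the sampled traces (for example, by checking final answers) before fine-tuning, which connects the distillation setting back to the curation theme of this work.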