Machine learning models are used in many fields, including robotics, cybersecurity, and healthcare. Models trained in a supervised setting often require substantial amounts of labeled data, and manual labeling is slow and costly when large unlabeled datasets must be annotated. Current automated or semi-automated labeling techniques may still require significant human intervention; therefore, novel approaches to data labeling are needed. Our goal is to build a system that generalizes well across diverse datasets, minimizes human intervention, and minimizes cost.
In this research, we developed an automated data labeling framework that uses large open-weight multimodal models as a cost-effective and reliable way to assist in generating bounding boxes. We experimented with over a dozen multimodal large language models (LLMs) that accept image inputs, and we evaluated both their multi-label classification and object detection performance on the Microsoft COCO 2017 dataset. Their multi-label classification performance was strong enough for us to build a simple recommender system that queries multimodal LLMs and proposes candidate labels for image datasets. To enhance the object detection performance of the system, we used a Vision-Language Model (VLM) called Grounding DINO to localize objects in an image for a given label. Grounding DINO is an open-vocabulary detection model that generalizes well across datasets.
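To make the label-proposal step concrete, the following is a minimal sketch of how a locally hosted multimodal LLM could be queried for candidate labels. It assumes the `ollama` Python client and a LLaVA-style model pulled locally; the model name, label vocabulary, prompt wording, and answer parsing are illustrative rather than the exact ones used in this work.

```python
# Minimal sketch of the label-proposal step. Assumptions: the `ollama` Python
# client is installed, a multimodal model such as "llava:13b" has been pulled
# locally, and candidate labels come from a fixed vocabulary (e.g., the 80 COCO
# categories). The prompt text and parsing below are illustrative.
import ollama

CANDIDATE_LABELS = ["person", "bicycle", "car", "dog", "traffic light"]  # hypothetical subset

def propose_labels(image_path: str, model: str = "llava:13b") -> list[str]:
    """Query a local multimodal LLM and return the candidate labels it reports seeing."""
    prompt = (
        "Which of the following objects appear in this image? "
        f"Answer only with a comma-separated list chosen from: {', '.join(CANDIDATE_LABELS)}."
    )
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt, "images": [image_path]}],
    )
    answer = response["message"]["content"].lower()
    # Keep only the vocabulary labels the model actually mentioned.
    return [label for label in CANDIDATE_LABELS if label in answer]

proposed = propose_labels("000000039769.jpg")  # any COCO image file, for example
print(proposed)
```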
Once these labels are proposed, we show that a user can quickly select the correct ones in a user interface and manually enter only the few labels that were missed, thereby minimizing the manual labeling effort. After the labels are validated, they are passed to Grounding DINO, which detects the corresponding objects and generates their bounding boxes and confidence scores. An active learning pipeline then trains a second VLM iteratively, using query strategies to select the next most informative samples and to assist in making predictions. Our second VLM is OpenCLIP ViT-L/14, whose training includes fine-tuning its last six transformer heads and its projection layer in addition to training a linear classifier head. It transfers what was learned from Grounding DINO's detection results into a model that can be improved over time through active learning.
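The box-generation step can be sketched as follows. This sketch assumes the Hugging Face `transformers` port of Grounding DINO with the public "IDEA-Research/grounding-dino-base" checkpoint; the detection thresholds and the example labels are illustrative, not necessarily those used in our pipeline.

```python
# Sketch of the box-generation step: validated labels are joined into a text
# prompt and Grounding DINO returns bounding boxes and confidence scores.
# Assumptions: the Hugging Face `transformers` port of Grounding DINO, the
# public "IDEA-Research/grounding-dino-base" checkpoint, illustrative thresholds.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

CHECKPOINT = "IDEA-Research/grounding-dino-base"
processor = AutoProcessor.from_pretrained(CHECKPOINT)
model = AutoModelForZeroShotObjectDetection.from_pretrained(CHECKPOINT)

def detect(image_path: str, validated_labels: list[str]):
    """Return bounding boxes, confidence scores, and matched labels for validated labels."""
    image = Image.open(image_path).convert("RGB")
    # Grounding DINO expects lower-cased phrases separated by periods.
    text = ". ".join(label.lower() for label in validated_labels) + "."
    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,   # illustrative thresholds
        text_threshold=0.25,
        target_sizes=[image.size[::-1]],
    )[0]
    return results["boxes"], results["scores"], results["labels"]

boxes, scores, labels = detect("000000039769.jpg", ["cat", "remote"])
```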
Our results show that the multi-label classification performance of the tested Ollama multimodal LLMs reaches an F1 score of up to 76% and an accuracy of up to 98%. They also show that the top multimodal LLM correctly identifies 82.5% of the labels in our dataset when proposing four times the total number of labels in the dataset. This allows a user to quickly scroll and click to select the correct labels in a user interface while manually entering only 14 of the 80 labels. Grounding DINO significantly improves object detection performance and draws accurate bounding boxes given valid labels. The resulting data labeling pipeline generalizes well to new data and yields an average precision (AP) of 52% or more on the COCO dataset.
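Returning to the active learning stage described above, the trainable-parameter configuration for OpenCLIP ViT-L/14 can be set up roughly as follows. This sketch assumes the `open_clip` library and interprets "last six transformer heads" as the final six residual attention blocks of the vision tower; the pretrained tag, class count, and optimizer settings are assumptions for illustration.

```python
# Rough sketch of the OpenCLIP ViT-L/14 trainable-parameter setup for the
# active learning stage. Assumptions: the `open_clip` library, a LAION-2B
# pretrained tag, "last six transformer heads" read as the final six residual
# attention blocks of the vision tower, and illustrative label count / optimizer.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)

# Freeze every parameter, then selectively unfreeze what is fine-tuned.
for param in model.parameters():
    param.requires_grad = False
for block in model.visual.transformer.resblocks[-6:]:  # last six vision blocks
    for param in block.parameters():
        param.requires_grad = True
model.visual.proj.requires_grad = True  # image projection layer

# Linear classifier head trained on top of the projected image embedding.
num_classes = 80  # e.g., the COCO label set
classifier = torch.nn.Linear(model.visual.output_dim, num_classes)

trainable = [p for p in model.parameters() if p.requires_grad] + list(classifier.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-5)

# In the active learning loop, features from model.encode_image(...) feed the
# classifier, and a query strategy selects the next samples to label.
```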