Abstract
Machine learning models are used in many fields, including robotics, cybersecurity, and healthcare. Models trained in a supervised setting often require substantial amounts of labeled data, and current manual labeling techniques are slow and costly when training models on large unlabeled datasets. Similarly, current automated or semi-automated labeling techniques may still require significant human intervention; novel approaches to data labeling are therefore needed. Our goal is to build a system that generalizes well across diverse datasets while minimizing both human intervention and cost.
In this research, we developed an automated data labeling framework that uses large open-weight multimodal models as a cost-effective and reliable way to assist in generating bounding boxes. We experimented with over a dozen multimodal large language models (LLMs) that accept image inputs, evaluating both their multi-label classification and object detection performance on the Microsoft COCO 2017 dataset. The strong multi-label classification performance we obtained enabled us to build a simple recommender system that queries multimodal LLMs and proposes automated labels for image datasets. To enhance the system's object detection performance, we used a Vision-Language Model (VLM) called Grounding DINO to localize objects in images for a given label. This VLM is an open-vocabulary detection model that generalizes well across various datasets.
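The recommender step described above can be sketched roughly as follows: query an Ollama-served multimodal model with an image, then keep only the returned labels that belong to a known vocabulary. The model name (`llava`), the prompt, and the helper names here are illustrative assumptions, not the exact configuration used in this work.

```python
# Sketch of the label-proposal step: ask a local Ollama multimodal model
# which objects appear in an image, then filter its reply against the
# dataset's label vocabulary. Assumes a running Ollama server with a
# multimodal model pulled (e.g. `ollama pull llava`).

COCO_LABELS = {"person", "bicycle", "car", "dog", "cat"}  # truncated for brevity

def parse_labels(reply: str, vocabulary: set[str]) -> set[str]:
    """Keep only comma-separated tokens that match the known vocabulary."""
    tokens = (t.strip().lower() for t in reply.split(","))
    return {t for t in tokens if t in vocabulary}

def query_image_labels(image_path: str, model: str = "llava") -> set[str]:
    """Propose labels for one image via an Ollama multimodal model."""
    import ollama  # pip install ollama; requires a local Ollama server

    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": ("List every object visible in this image as a "
                        "comma-separated list of single words."),
            "images": [image_path],
        }],
    )
    return parse_labels(response["message"]["content"], COCO_LABELS)
```

In a full pipeline, the proposed set would be shown in the user interface for the scroll-and-click validation described below.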
Once these automated labels are proposed, we show that a user can quickly select the correct ones in a user interface and manually enter only the few labels that were missed, thereby minimizing the manual labeling effort. After the labels are validated, they are passed to Grounding DINO, which detects the corresponding objects and generates their bounding boxes and confidence levels. An active learning pipeline then trains a second VLM iteratively, using query strategies to select the next best samples and assist in making predictions. Our second VLM is OpenCLIP ViT-L/14, whose training includes fine-tuning its last six transformer heads and its projection layer in addition to training a linear classifier head. It transfers what was learned from Grounding DINO's detection results into a model that can be improved over time through active learning.
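One common query strategy for an active learning loop like the one above is least-confidence sampling: rank unlabeled samples by the classifier's top class probability and send the least confident ones for review. The abstract does not name its exact strategies, so the sketch below is an illustrative assumption in plain Python.

```python
# Minimal sketch of least-confidence sampling, one possible query
# strategy for selecting the next samples to label in an active
# learning loop. Names and data are illustrative, not from the thesis.

def least_confident(probabilities: list[list[float]], k: int) -> list[int]:
    """Return indices of the k samples whose top class probability is lowest."""
    confidence = [max(p) for p in probabilities]
    return sorted(range(len(confidence)), key=lambda i: confidence[i])[:k]

# Example: per-sample class-probability vectors from the classifier head.
probs = [
    [0.90, 0.05, 0.05],  # confident prediction -> low priority
    [0.40, 0.35, 0.25],  # uncertain -> query the user first
    [0.60, 0.30, 0.10],  # moderately confident
]
```

Calling `least_confident(probs, 1)` on this example returns `[1]`, the most ambiguous sample, which would then be labeled and added to the next training round.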
The results from our research show that the multi-label classification performance of the tested Ollama multimodal LLMs can reach up to a 76% F1 score and 98% accuracy. They also show that the top multimodal LLM correctly identifies 82.5% of the labels in our dataset when we propose four times the total number of labels in the dataset. This allows a user to quickly scroll and click to select the correct labels in a user interface while only having to manually enter 14 of the 80 labels. Grounding DINO significantly improves object detection performance and draws accurate bounding boxes given valid labels. The resulting data labeling pipeline generalizes well to new data and has been shown to yield an average precision (AP) of 52% or more on the COCO dataset.





