Machine learning models are used in many fields, including robotics, cybersecurity, and healthcare. Models trained in a supervised setting often require substantial amounts of labeled data, and manual labeling is slow and costly when large unlabeled datasets must be annotated. Current automated or semi-automated labeling techniques may still require significant human intervention; therefore, novel approaches to data labeling are needed. Our goal is to build a system that generalizes well across diverse datasets, minimizes human intervention, and minimizes cost.
In this research, we developed an automated data labeling framework that uses large open-weight multimodal models as a cost-effective and reliable way to assist in generating bounding boxes. We experimented with over a dozen multimodal large language models (LLMs) that accept image inputs, and we evaluated both their multi-label classification and object detection performance on the Microsoft COCO 2017 dataset. Their multi-label classification performance was strong enough for us to build a simple recommender system that queries multimodal LLMs and proposes candidate labels for image datasets. To enhance the object detection performance of the system, we used a Vision-Language Model (VLM) called Grounding DINO to localize objects in an image for a given label. Grounding DINO is an open-vocabulary detection model that generalizes well across datasets.
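To make the label-proposal step concrete, the following is a minimal sketch of how a locally hosted multimodal LLM could be queried for candidate labels. It assumes the `ollama` Python client and a LLaVA-style model pulled locally; the model name, label vocabulary, prompt wording, and answer parsing are illustrative rather than the exact ones used in this work.

```python
# Minimal sketch of the label-proposal step. Assumptions: the `ollama` Python
# client is installed, a multimodal model such as "llava:13b" has been pulled
# locally, and candidate labels come from a fixed vocabulary (e.g., the 80 COCO
# categories). The prompt text and parsing below are illustrative.
import ollama

CANDIDATE_LABELS = ["person", "bicycle", "car", "dog", "traffic light"]  # hypothetical subset

def propose_labels(image_path: str, model: str = "llava:13b") -> list[str]:
    """Query a local multimodal LLM and return the candidate labels it reports seeing."""
    prompt = (
        "Which of the following objects appear in this image? "
        f"Answer only with a comma-separated list chosen from: {', '.join(CANDIDATE_LABELS)}."
    )
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt, "images": [image_path]}],
    )
    answer = response["message"]["content"].lower()
    # Keep only the vocabulary labels the model actually mentioned.
    return [label for label in CANDIDATE_LABELS if label in answer]

proposed = propose_labels("000000039769.jpg")  # any COCO image file, for example
print(proposed)
```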
Once these labels are proposed, we show that a user can quickly select the correct ones in a user interface and manually enter only the few labels that were missed, thereby minimizing the manual labeling effort. After the labels are validated, they are passed to Grounding DINO, which detects the corresponding objects and generates their bounding boxes and confidence scores. An active learning pipeline then trains a second VLM iteratively, using query strategies to select the next most informative samples and to assist in making predictions. Our second VLM is OpenCLIP ViT-L/14, whose training includes fine-tuning its last six transformer heads and its projection layer in addition to training a linear classifier head. It transfers what was learned from Grounding DINO's detection results into a model that can be improved over time through active learning.
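The box-generation step can be sketched as follows. This sketch assumes the Hugging Face `transformers` port of Grounding DINO with the public "IDEA-Research/grounding-dino-base" checkpoint; the detection thresholds and the example labels are illustrative, not necessarily those used in our pipeline.

```python
# Sketch of the box-generation step: validated labels are joined into a text
# prompt and Grounding DINO returns bounding boxes and confidence scores.
# Assumptions: the Hugging Face `transformers` port of Grounding DINO, the
# public "IDEA-Research/grounding-dino-base" checkpoint, illustrative thresholds.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

CHECKPOINT = "IDEA-Research/grounding-dino-base"
processor = AutoProcessor.from_pretrained(CHECKPOINT)
model = AutoModelForZeroShotObjectDetection.from_pretrained(CHECKPOINT)

def detect(image_path: str, validated_labels: list[str]):
    """Return bounding boxes, confidence scores, and matched labels for validated labels."""
    image = Image.open(image_path).convert("RGB")
    # Grounding DINO expects lower-cased phrases separated by periods.
    text = ". ".join(label.lower() for label in validated_labels) + "."
    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,   # illustrative thresholds
        text_threshold=0.25,
        target_sizes=[image.size[::-1]],
    )[0]
    return results["boxes"], results["scores"], results["labels"]

boxes, scores, labels = detect("000000039769.jpg", ["cat", "remote"])
```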
Our results show that the multi-label classification performance of the tested Ollama multimodal LLMs reaches an F1 score of up to 76% and an accuracy of up to 98%. They also show that the top multimodal LLM correctly identifies 82.5% of the labels in our dataset when proposing four times the total number of labels in the dataset. This allows a user to quickly scroll and click to select the correct labels in a user interface while manually entering only 14 of the 80 labels. Grounding DINO significantly improves object detection performance and draws accurate bounding boxes given valid labels. The resulting data labeling pipeline generalizes well to new data and yields an average precision (AP) of 52% or more on the COCO dataset.
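Returning to the active learning stage described above, the trainable-parameter configuration for OpenCLIP ViT-L/14 can be set up roughly as follows. This sketch assumes the `open_clip` library and interprets "last six transformer heads" as the final six residual attention blocks of the vision tower; the pretrained tag, class count, and optimizer settings are assumptions for illustration.

```python
# Rough sketch of the OpenCLIP ViT-L/14 trainable-parameter setup for the
# active learning stage. Assumptions: the `open_clip` library, a LAION-2B
# pretrained tag, "last six transformer heads" read as the final six residual
# attention blocks of the vision tower, and illustrative label count / optimizer.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)

# Freeze every parameter, then selectively unfreeze what is fine-tuned.
for param in model.parameters():
    param.requires_grad = False
for block in model.visual.transformer.resblocks[-6:]:  # last six vision blocks
    for param in block.parameters():
        param.requires_grad = True
model.visual.proj.requires_grad = True  # image projection layer

# Linear classifier head trained on top of the projected image embedding.
num_classes = 80  # e.g., the COCO label set
classifier = torch.nn.Linear(model.visual.output_dim, num_classes)

trainable = [p for p in model.parameters() if p.requires_grad] + list(classifier.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-5)

# In the active learning loop, features from model.encode_image(...) feed the
# classifier, and a query strategy selects the next samples to label.
```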