
Abstract

Recent advances in diffusion models have enabled more diverse, higher-quality image generation, opening new possibilities in graphic design, filmmaking, and advertising. However, these tasks often require precise control over the generation process to meet specific artistic, narrative, or branding goals. Such control demands conditioning inputs such as text instructions, reference images, or visual attributes, which in turn require training data that accurately reflect image-condition associations. Existing approaches to creating such training data, including manual annotation, data re-purposing, and prompt engineering, offer some utility but face notable limitations in quality and scalability, particularly for image editing applications. These limitations ultimately constrain the capabilities of the resulting models.

To address these limitations, this dissertation presents novel algorithms that automatically generate high-quality training data at scale, enabling and enhancing instruction-guided and attribute-based image editing with diffusion models. We explore two directions: refining existing datasets and developing evaluation models to guide fine-tuning.

For instruction-guided image editing, we identify semantic misalignment between text instructions and before/after image pairs as a major limitation in current training datasets. We then propose a self-supervised method to detect and correct this misalignment, improving editing quality after fine-tuning on the corrected samples. Additionally, we note that existing evaluation metrics often rely on models with limited semantic understanding. To address this, we fine-tune vision-language models as robust evaluators using high-quality synthetic data. These evaluators also act as reward models to guide editing model training via reinforcement learning.

Building on this evaluator-as-guide framework, we explore attribute-based editing with novel visual attributes. We introduce a web-crawling pipeline that curates samples for few-shot fine-tuning, making diffusion models attribute-aware. The resulting models generate diverse samples used to train an attribute scorer, which in turn directs attribute-based editing.

Finally, we apply our methods to applications such as virtual try-on and reference- or stroke-guided editing by introducing new conditioning mechanisms within diffusion models. Together, our work enables high-quality, scalable training data generation for diffusion-based conditional image editing, introducing novel conditioning mechanisms that empower users to generate high-fidelity content in a controllable manner.

Details

Title: Automated Dataset Creation for Conditional Image Editing with Diffusion Models
Number of pages: 183
Publication year: 2025
Degree date: 2025
School code: 0035
Source: DAI-B 87/4(E), Dissertation Abstracts International
ISBN: 9798297683532
Committee member: Manjunath, B. S.
University/institution: University of California, Santa Barbara
Department: Computer Science
University location: United States -- California
Degree: Ph.D.
Source type: Dissertation or Thesis
Language: English
Document type: Dissertation/Thesis
Dissertation/thesis number: 32241693
ProQuest document ID: 3264423197
Document URL: https://www.proquest.com/dissertations-theses/automated-dataset-creation-conditional-image/docview/3264423197/se-2?accountid=208611
Copyright: Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database: ProQuest One Academic