Abstract

Effective natural language processing systems typically require extensive human annotations, creating a major bottleneck for deploying models on new tasks. This thesis develops methods that reduce the dependence on human supervision by exploiting the inherent structure of the data, task, and language models themselves.

First, we present X-Class, which performs text classification using only class names by exploiting corpus-level distributional structure. Rather than requiring labeled examples, the method learns adaptive document representations that align with the given classes through clustering, allowing the corpus itself to provide supervisory signal. Specifically, X-Class estimates class representations by incrementally adding similar words, obtains document representations via class-attention mechanisms, and trains classifiers on confident pseudo-labeled documents. Extensive experiments demonstrate that X-Class can rival and even outperform seed-driven weakly supervised methods on seven benchmark datasets.
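To make the pipeline concrete, here is a minimal sketch of the class- and document-representation steps, assuming precomputed static word embeddings (the thesis derives these from a pre-trained contextualized encoder); all function and variable names are illustrative, and the later clustering and classifier-training steps are omitted.

```python
# Illustrative sketch of X-Class-style representations, not the thesis code.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def class_representation(seed_word, word_vecs, vocab, k=20):
    """Grow a class representation by incrementally adding the word
    closest to the running class average, then re-averaging."""
    chosen = [seed_word]
    rep = word_vecs[vocab[seed_word]].copy()
    for _ in range(k - 1):
        best = max((w for w in vocab if w not in chosen),
                   key=lambda w: cosine(word_vecs[vocab[w]], rep))
        chosen.append(best)
        rep = np.mean([word_vecs[vocab[w]] for w in chosen], axis=0)
    return rep

def document_representation(doc_tokens, class_reps, word_vecs, vocab):
    """Class-attention: weight each word by its similarity to the
    nearest class so that class-indicative words dominate."""
    vecs = np.stack([word_vecs[vocab[t]] for t in doc_tokens if t in vocab])
    sims = np.array([[cosine(v, c) for c in class_reps] for v in vecs])
    weights = np.exp(sims.max(axis=1))   # emphasize class-indicative words
    weights /= weights.sum()
    return weights @ vecs                # weighted average of word vectors
```

From here, the document representations are clustered (one cluster per class), and only the most confidently assigned documents are kept as pseudo-labels for training a standard classifier.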

Second, we introduce Goal-Driven Explainable Clustering (GoalEx), which exploits task structure by decomposing clustering into a propose-assign-select pipeline: language models generate candidate cluster explanations conditioned on user goals, and optimization selects the subset that best covers the corpus. This task decomposition naturally produces interpretable outputs—each cluster comes with a human-readable explanation of what it represents. Under both automatic and human evaluation, our method produces more accurate and goal-related explanations than prior methods.
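The propose-assign-select decomposition can be sketched as follows, assuming a generic `llm(prompt) -> str` completion function; the prompts are illustrative, and the greedy cover below is a simple stand-in for the subset-selection optimization described above.

```python
# Schematic propose-assign-select sketch; prompts and names are illustrative.

def propose(texts, goal, llm, n=30):
    """Ask the LM for candidate cluster explanations conditioned on the goal."""
    prompt = (f"Goal: {goal}\nSamples:\n" + "\n".join(texts[:20]) +
              f"\nPropose {n} candidate cluster explanations, one per line:")
    return llm(prompt).splitlines()

def assign(texts, explanations, llm):
    """Binary matrix: does text i satisfy explanation j?"""
    return [[llm(f"Does this text match '{e}'? Answer yes or no.\n{t}")
             .strip().lower().startswith("yes")
             for e in explanations] for t in texts]

def select(matrix, explanations, k):
    """Greedy stand-in for subset selection: repeatedly pick the
    explanation that covers the most not-yet-covered texts."""
    covered, chosen = set(), []
    for _ in range(k):
        gains = [sum(1 for i, row in enumerate(matrix)
                     if row[j] and i not in covered)
                 for j in range(len(explanations))]
        j = max(range(len(explanations)), key=lambda j: gains[j])
        chosen.append(explanations[j])
        covered |= {i for i, row in enumerate(matrix) if row[j]}
    return chosen
```

Because each selected explanation is itself natural language, the output clustering is interpretable by construction.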

Third, we present FFF-NER, a few-shot fine-tuning framework for Named Entity Recognition that exploits task structure by aligning fine-tuning with pre-training objectives. We hypothesize that fine-tuning performance improves when the fine-tuning task resembles the pre-training task. By decoupling span detection from type prediction and formulating NER as masked token prediction, our method achieves state-of-the-art few-shot NER performance on ten benchmark datasets.
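The reformulation can be illustrated with a small template-construction sketch; the bracket markers and the placement of the two mask slots (one filled with an is-entity label word, one with a type label word) are illustrative assumptions rather than the exact templates used in the thesis.

```python
# Illustrative sketch: one masked-LM fine-tuning instance per candidate span.

def build_instance(tokens, start, end, mask="[MASK]"):
    """Turn (sentence, candidate span) into a masked-LM instance:
    the first mask is trained to predict an is-entity label word
    (e.g. 'yes'/'no'), the second a type label word (e.g. 'person')."""
    span = tokens[start:end + 1]
    wrapped = (tokens[:start]
               + ["[", mask] + span + [mask, "]"]
               + tokens[end + 1:])
    return " ".join(wrapped)

print(build_instance("John lives in San Diego".split(), 3, 4))
# John lives in [ [MASK] San Diego [MASK] ]
```

Because the model fills mask slots exactly as it did during pre-training, the few labeled examples are spent adapting label words rather than learning a new output head.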

Fourth, we present Model-induced Process Supervision (MiPS), which exploits the structure of language model reasoning itself. By sampling completions of partial solutions and measuring how often they lead to a correct final answer, the method automatically generates training labels for verifying multi-step reasoning, removing the need for expensive step-by-step human annotation. Our approach significantly improves performance on math and coding tasks compared to output-supervised verifiers.
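A minimal sketch of the labeling procedure, assuming hypothetical hooks `generate_completion` (samples a continuation of a partial solution from the policy model) and `is_correct` (checks the final answer against the reference):

```python
# Illustrative MiPS-style labeling sketch; hook names are assumptions.

def step_labels(question, solution_steps, generate_completion, is_correct,
                n_samples=8):
    """For each step prefix, estimate the probability that the prefix
    can still be completed into a correct final answer."""
    labels = []
    for t in range(1, len(solution_steps) + 1):
        prefix = "\n".join(solution_steps[:t])
        wins = sum(is_correct(question, generate_completion(question, prefix))
                   for _ in range(n_samples))
        labels.append(wins / n_samples)  # soft per-step correctness label
    return labels
```

The resulting (prefix, label) pairs train a step-level verifier with no human step annotations; at inference, the verifier scores candidate solutions and the highest-scoring one is selected.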

Together, these works establish that systematic exploitation of structure—whether in data distributions, task formulations, or model behaviors—can effectively replace or augment human supervision, enabling scalable and interpretable NLP systems.

Details

Title: Exploiting Data, Task, and Model Structure for Supervision-Efficient Natural Language Processing
Number of pages: 123
Publication year: 2025
Degree date: 2025
School code: 0033
Source: DAI-A 87/6(E), Dissertation Abstracts International
ISBN: 9798270246198
Committee member: Berg-Kirkpatrick, Taylor; McAuley, Julian; Roth, Dan; Ren, Bing
University/institution: University of California, San Diego
Department: Computer Science and Engineering
University location: United States -- California
Degree: Ph.D.
Source type: Dissertation or Thesis
Language: English
Document type: Dissertation/Thesis
Dissertation/thesis number: 32399020
ProQuest document ID: 3285841983
Document URL: https://www.proquest.com/dissertations-theses/exploiting-data-task-model-structure-supervision/docview/3285841983/se-2?accountid=208611
Copyright: Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database: ProQuest One Academic