
Abstract

Document classification is essential in domains such as healthcare and education. It involves three major steps: annotation, training accurate models, and evaluation. Each step is labor-intensive and time-consuming, and each depends on substantial amounts of labeled data, which are costly to produce. This dissertation addresses these challenges by presenting methodologies that improve annotation efficiency, model training in low-resource settings, and automated scoring.

In the healthcare domain, we tackle the challenges of annotation and training with limited resources. First, we develop a visualization approach for rapid labeling of clinical notes for smoking status extraction. Because manual annotation is slow, we introduce a tool that accelerates it by clustering similar sentences and highlighting important keywords. This reduces the cognitive load on annotators, resulting in faster and more efficient labeling.
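As a rough illustration of the clustering-and-highlighting idea, the sketch below groups sentences by word overlap and marks keywords for the annotator. It is not the dissertation's tool: the keyword list, the Jaccard similarity measure, the greedy grouping strategy, and all function names are assumptions chosen for illustration.

```python
import re
from typing import List

# Hypothetical keyword list; the abstract does not specify the actual keywords.
KEYWORDS = {"smoker", "smokes", "smoking", "tobacco", "cigarettes", "quit"}


def tokenize(sentence: str) -> List[str]:
    """Lowercase a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def similarity(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two sentences."""
    tokens_a, tokens_b = set(tokenize(a)), set(tokenize(b))
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def cluster_sentences(sentences: List[str], min_similarity: float) -> List[List[str]]:
    """Greedily add each sentence to the first cluster whose first member is similar enough."""
    clusters: List[List[str]] = []
    for sentence in sentences:
        for cluster in clusters:
            if similarity(sentence, cluster[0]) >= min_similarity:
                cluster.append(sentence)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([sentence])
    return clusters


def highlight(sentence: str) -> str:
    """Wrap every keyword occurrence in ** markers so it stands out."""
    def mark(match: re.Match) -> str:
        word = match.group(0)
        return f"**{word}**" if word.lower() in KEYWORDS else word

    return re.sub(r"[A-Za-z]+", mark, sentence)
```

An annotator could then label one representative sentence per cluster and review the remaining members with the keywords already marked.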

Next, we address the problem of training accurate classifiers in low-resource settings with limited labeled data. In our first approach, MERIT (Minimal Supervision Through Label Augmentation for Biomedical Relation Extraction), we propose using the shortest dependency path (SDP) representation and specific distance thresholds to propagate labels and augment the training set with high-quality labeled data. This method improves classifier accuracy compared to using the limited labeled data alone. Our second approach extends the first with an iterative algorithm that learns the label-propagation thresholds automatically. This method is tested in various scenarios, including semi-supervised learning, supervised learning, and in-context learning, demonstrating significant improvements in model performance.
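Threshold-based label propagation of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not MERIT itself: each SDP is assumed to be a list of tokens, Jaccard distance stands in for whatever distance the dissertation uses, and the function names and fixed threshold are hypothetical.

```python
from typing import List, Optional, Tuple

Path = List[str]  # assumed representation: an SDP as a list of tokens


def jaccard_distance(a: Path, b: Path) -> float:
    """Return 1 - |A & B| / |A | B| over the token sets of two paths."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def propagate_labels(
    labeled: List[Tuple[Path, str]],
    unlabeled: List[Path],
    threshold: float,
) -> List[Tuple[Path, str]]:
    """Give each unlabeled path the label of its nearest labeled path,
    but only when that nearest distance is within the threshold."""
    augmented: List[Tuple[Path, str]] = []
    for path in unlabeled:
        best_label: Optional[str] = None
        best_dist: Optional[float] = None
        for seed_path, seed_label in labeled:
            dist = jaccard_distance(path, seed_path)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, seed_label
        # Paths that are too far from every labeled example stay unlabeled.
        if best_label is not None and best_dist is not None and best_dist <= threshold:
            augmented.append((path, best_label))
    return augmented
```

A stricter (smaller) threshold adds fewer but more reliable labels, while a looser one adds more labels at higher risk of noise. That trade-off is one plausible reason for learning the threshold rather than fixing it by hand.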

In the education domain, we focus on the problem of assessing narratives generated by school-aged children, a task that is both expensive and time-consuming for teachers. We leverage large language models (LLMs) to learn the scoring patterns of teachers accurately, offering a reliable tool for automated narrative scoring. This approach reduces the subjectivity and resource requirements of manual scoring, providing a scalable and consistent alternative.

Experimental results across these methodologies demonstrate their effectiveness in improving annotation speed, data utilization, and model accuracy. This dissertation contributes to advancing document classification in low-resource settings, offering practical solutions for critical tasks in healthcare and education.

Details

1010268
Business indexing term
Title
Maximizing Learning Efficiency With Limited Labeled Data: Applications to Healthcare and Education
Author
Number of pages
107
Publication year
2025
Degree date
2025
School code
0225
Source
DAI-B 86/11(E), Dissertation Abstracts International
ISBN
9798315760146
Committee member
Wang, Yan; MacNeil, Stephen; Carnevale, Vincenzo
University/institution
Temple University
Department
Computer and Information Science
University location
United States -- Pennsylvania
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
31766014
ProQuest document ID
3213154982
Document URL
https://www.proquest.com/dissertations-theses/maximizing-learning-efficiency-with-limited/docview/3213154982/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic