
Abstract

Stigmatizing language in Electronic Health Records (EHRs), such as negative descriptors, misgendering, or expressions of disbelief, can perpetuate bias, erode patient trust, and reinforce healthcare disparities. Building on the growing intersection of health informatics and natural language processing (NLP), this dissertation integrates methodological innovation with fairness-oriented analysis across two complementary case studies to document, detect, and counter stigmatizing language in EHRs.

Case 1 introduces a multi-stage transfer learning (MSTL) framework that sequentially adapts transformer-based models through semantic, syntactic, and task-specific fine-tuning. Using datasets spanning hate speech, clinical phenotypes, and stigmatizing language (SL) corpora, the framework achieved an accuracy of 89.83% and an F1 score of 93.18, significantly outperforming both traditional baselines and large language models (e.g., GPT-4o). Statistical validation through Wilcoxon-Mann-Whitney tests with Bonferroni correction confirmed the robustness of the performance gains (p < .05). The MSTL-Longformer model demonstrated consistent accuracy across demographic subgroups, highlighting its capacity to detect subtle, context-dependent forms of stigma in long clinical narratives.
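The staged adaptation can be pictured as a loop over corpora in which each stage initializes from the previous checkpoint. Below is a minimal sketch assuming the HuggingFace Transformers Trainer API; the corpus names, label counts, hyperparameters, and the load_and_tokenize helper are illustrative placeholders, not the dissertation's actual pipeline.

```python
# A minimal sketch of domain-progressive (multi-stage) fine-tuning.
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

STAGES = [
    ("hate_speech_corpus", 2),             # stage 1: semantic adaptation
    ("clinical_phenotype_corpus", 2),      # stage 2: clinical-domain adaptation
    ("stigmatizing_language_corpus", 2),   # stage 3: target task
]

checkpoint = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

for corpus_name, num_labels in STAGES:
    # Each stage re-loads the previous stage's weights; the classifier head
    # is re-sized per task, hence ignore_mismatched_sizes=True.
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels, ignore_mismatched_sizes=True)
    train_dataset = load_and_tokenize(corpus_name, tokenizer)  # hypothetical helper
    args = TrainingArguments(output_dir=f"mstl_{corpus_name}",
                             num_train_epochs=3,
                             per_device_train_batch_size=4)
    Trainer(model=model, args=args, train_dataset=train_dataset).train()
    model.save_pretrained(f"mstl_{corpus_name}")     # next stage starts here
    tokenizer.save_pretrained(f"mstl_{corpus_name}")
    checkpoint = f"mstl_{corpus_name}"
```

The reported significance testing can likewise be reproduced in outline with SciPy; the per-seed scores below are placeholders, not the study's results.

```python
# Wilcoxon-Mann-Whitney tests with Bonferroni correction over repeated runs.
from scipy.stats import mannwhitneyu

mstl_f1 = [0.931, 0.928, 0.934, 0.930, 0.933]        # illustrative per-seed F1s
baseline_f1s = {"BERT":   [0.871, 0.869, 0.875, 0.872, 0.870],
                "GPT-4o": [0.902, 0.899, 0.905, 0.901, 0.903]}

alpha = 0.05 / len(baseline_f1s)                     # Bonferroni-adjusted threshold
for name, scores in baseline_f1s.items():
    stat, p = mannwhitneyu(mstl_f1, scores, alternative="greater")
    print(f"{name}: U={stat:.1f}, p={p:.4f}, significant={p < alpha}")
```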

Case 2 extends this framework to fairness auditing, baseline modeling, and interpretive contextualization for gender-expansive patient (GEP) documentation. Multivariable logistic-regression and odds-ratio analyses revealed that GEP and Black/African American patients had the highest adjusted odds of stigmatizing documentation, underscoring intersectional disparities in clinical narratives. Baseline models for SL detection among GEPs established comparative benchmarks for future studies, while comprehensive annotation guidelines were developed to standardize the identification of ambiguous expressions, family-attributed remarks, and misgendering.
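The adjusted-odds-ratio analysis can be sketched with statsmodels; the file name, column names, and reference categories below are hypothetical stand-ins for the study's variables, not its actual data.

```python
# A minimal sketch of a multivariable logistic regression with adjusted odds
# ratios, assuming a patient-level table with a binary stigmatizing-language flag.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

notes = pd.read_csv("notes_with_sl_flags.csv")   # hypothetical input file

# SL flag ~ demographics + covariates, with explicit reference categories
model = smf.logit(
    "sl_flag ~ C(gender_identity, Treatment('cisgender')) "
    "+ C(race, Treatment('White')) + age + C(insurance)",
    data=notes,
).fit()

# Exponentiated coefficients give adjusted odds ratios with 95% CIs
odds_ratios = pd.DataFrame({
    "aOR":     np.exp(model.params),
    "CI_low":  np.exp(model.conf_int()[0]),
    "CI_high": np.exp(model.conf_int()[1]),
})
print(odds_ratios)
```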

Together, these studies demonstrate that detecting linguistic bias in EHRs is both a computational and an ethical challenge. The proposed frameworks show that domain-progressive transfer learning can substantially improve model accuracy, while fairness-aware evaluation exposes structural inequities in documentation. By combining advanced NLP modeling with ethical inquiry, this dissertation contributes new methodological and conceptual tools for responsible AI in healthcare. The findings illustrate how linguistic equity can be operationalized through data-driven innovations, ensuring that both the language of medicine and the algorithms that process it promote inclusion, transparency, and justice in patient care.

Details

Title
Detecting and Understanding Stigmatizing Language in Electronic Health Records Using Natural Language Processing
Author
Xue, Liyang
Publication year
2026
Publisher
ProQuest Dissertations & Theses
ISBN
9798273327665
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
3294661553
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.