The increasing availability of digital health records offers an unprecedented opportunity to study the determinants of cancer outcomes. Nevertheless, methods to efficiently and accurately identify adverse events in large retrospective studies are limited. Natural language processing (NLP) is a potential solution for extracting valuable information from patient data, which is typically stored as unstructured text in isolated hospital systems. Given the success of large language models (LLMs) in natural language comprehension tasks, we explored generating a large labeled dataset of clinical notes with a generative LLM and prompt engineering, and evaluated the efficacy of using this labeled dataset to fine-tune encoder-based LLMs. We first established that the generative LLM LLaMA 70B produces accurate note-level predictions of adverse event (AE) occurrence by comparing its predictions against a small sample of clinical notes annotated by an oncologist. We then used LLaMA 70B to annotate a dataset of 7,345 patients (412,530 clinical notes) from the MSK-IMPACT dataset. The performance of ModernBERT and Clinical Longformer fine-tuned on this annotated dataset to predict AE occurrence was compared to their performance when fine-tuned on another version of the dataset annotated using clinical trial data. In this study, precision and recall scores of 0.80 are considered acceptable, as they reflect a suitable balance between accurate predictions and sufficient sensitivity.

To evaluate the LLaMA 70B-generated labels, we compared its predictions to our clinical trial data. On our training set of 5,875 patients, LLaMA 70B achieved a macro-averaged recall of 0.90, accuracy of 0.71, precision of 0.07, F1-score of 0.13, and specificity of 0.70; the evaluation metrics were similar on the test set. Our results show that LLaMA 70B's note-level predictions serve as better labels than our clinical trial note-level labels, as both ModernBERT and Clinical Longformer performed better when trained and tested on the LLaMA 70B labels. Although we tried smoothing the LLaMA 70B predictions and further prompt engineering to improve patient-level predictions against ground-truth patient-level clinical trial labels, these methods did not yield notable improvements.

We find that manual inspection of LLaMA 70B's note-level predictions by a medical expert is the best way to validate them, and that the most effective approach to creating a clinical notes dataset with high-quality labels is manual annotation of the notes by medical experts.
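As an illustration of the labeling step, the following is a minimal sketch of how note-level AE annotation with a generative LLM might be implemented. The prompt wording and the `generate` helper are hypothetical placeholders, not the exact pipeline used in this work.

```python
# Hypothetical sketch of note-level AE labeling with a generative LLM.
# `generate` stands in for whatever backend serves LLaMA 70B (e.g., a
# local inference server or a hosted API); the prompt text is illustrative.

PROMPT_TEMPLATE = (
    "You are an oncology expert. Read the clinical note below and answer "
    "with a single word, YES or NO: does the note document an adverse "
    "event experienced by the patient?\n\nNote:\n{note}\n\nAnswer:"
)

def generate(prompt: str) -> str:
    """Placeholder for a call to a LLaMA 70B inference endpoint."""
    raise NotImplementedError

def label_note(note_text: str) -> int:
    """Return 1 if the model flags an adverse event in the note, else 0."""
    answer = generate(PROMPT_TEMPLATE.format(note=note_text))
    return 1 if answer.strip().upper().startswith("YES") else 0
```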
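The fine-tuning stage can be sketched with the Hugging Face transformers library as below. The checkpoint names, hyperparameters, and the toy dataset are illustrative assumptions, not the study's exact configuration.

```python
# Minimal fine-tuning sketch; checkpoint and hyperparameters are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "answerdotai/ModernBERT-base"  # or "yikuan8/Clinical-Longformer"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)  # binary AE / no-AE classification

# Toy stand-in for the labeled note corpus: "label" is 1 when the note was
# flagged for an AE (by LLaMA 70B or by the clinical trial data).
train_ds = Dataset.from_dict({
    "text": ["Patient reports grade 2 nausea after cycle 3.",
             "Routine follow-up; no new complaints."],
    "label": [1, 0],
})

def tokenize(batch):
    # Truncate long notes to the encoder's maximum input length.
    return tokenizer(batch["text"], truncation=True)

train_ds = train_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ae-classifier",
                           per_device_train_batch_size=8,
                           num_train_epochs=3),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```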
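The reported evaluation metrics could be computed as follows. Since this summary states only that recall is macro-averaged, the use of positive-class precision and F1 in the sketch is an assumption.

```python
# Sketch of the label evaluation: y_true holds clinical-trial-derived
# note-level labels and y_pred holds LLaMA 70B predictions (0/1 each).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred):
    # With labels=[0, 1], ravel() returns tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_recall": recall_score(y_true, y_pred, average="macro"),
        "precision": precision_score(y_true, y_pred),  # positive (AE) class
        "f1": f1_score(y_true, y_pred),
        "specificity": tn / (tn + fp),                 # true-negative rate
    }

print(evaluate([0, 1, 0, 0], [1, 1, 0, 0]))
```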
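Finally, because the exact smoothing procedure for deriving patient-level predictions is not specified in this summary, the sketch below shows one plausible variant: a rolling majority vote over consecutive notes followed by an any-positive patient-level rule. The window size and decision threshold are assumptions.

```python
# One plausible smoothing scheme for rolling note-level predictions up to
# a patient-level AE flag; window size and threshold are assumptions.
import pandas as pd

def patient_level_prediction(notes: pd.DataFrame, window: int = 3) -> pd.Series:
    """notes has columns: patient_id, note_date, pred (0/1 per note).

    Each note's label is replaced by the majority vote over a rolling
    window of consecutive notes; a patient is then flagged as having an
    AE if any smoothed label is positive.
    """
    notes = notes.sort_values(["patient_id", "note_date"])
    smoothed = (notes.groupby("patient_id")["pred"]
                     .transform(lambda s: s.rolling(window, min_periods=1)
                                           .mean() >= 0.5))
    return smoothed.groupby(notes["patient_id"]).any().astype(int)
```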