The increasing availability of digital health records offers an unprecedented opportunity to study the determinants of cancer outcomes. Nevertheless, methods to efficiently and accurately identify adverse events in large retrospective studies are limited. Natural language processing (NLP) is a potential solution for extracting valuable information from patient data, which is typically stored as unstructured text in isolated hospital systems. Given the success of large language models (LLMs) in natural language comprehension tasks, we explored generating a large labeled dataset of clinical notes with a generative LLM and prompt engineering, and evaluated the efficacy of using this labeled dataset to fine-tune encoder-based LLMs. We first established that the generative LLM LLaMA 70B produces accurate note-level predictions of adverse event (AE) occurrence by comparing its predictions against a small sample of clinical notes annotated by an oncologist. We then used LLaMA 70B to annotate a dataset of 7,345 patients (412,530 clinical notes) from the MSK-IMPACT dataset. The performance of ModernBERT and Clinical Longformer fine-tuned on this annotated dataset to predict AE occurrence was compared to their performance when fine-tuned on another version of the dataset annotated using clinical trial data. In this study, precision and recall scores of 0.80 are considered acceptable, as they reflect a suitable balance between accurate predictions and sufficient sensitivity.

To evaluate the LLaMA 70B-generated labels, we compared its predictions to our clinical trial data. On our training set of 5,875 patients, LLaMA 70B achieved a macro-averaged recall of 0.90, accuracy of 0.71, precision of 0.07, F1-score of 0.13, and specificity of 0.70; the evaluation metrics were similar on the test set. Our results show that LLaMA 70B's note-level predictions serve as better labels than our clinical trial note-level labels, as both ModernBERT and Clinical Longformer performed better when trained and tested on the LLaMA 70B labels. Although we tried smoothing the LLaMA 70B predictions and further prompt engineering to improve patient-level predictions against ground-truth patient-level clinical trial labels, these methods did not yield notable improvements.

We find that manual inspection of LLaMA 70B's note-level predictions by a medical expert is the best way to validate them, and that the most effective approach to creating a clinical notes dataset with high-quality labels is manual annotation of the notes by medical experts.
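As an illustration of the labeling step, the following is a minimal sketch of how note-level AE annotation with a generative LLM might be implemented. The prompt wording and the `generate` helper are hypothetical placeholders, not the exact pipeline used in this work.

```python
# Hypothetical sketch of note-level AE labeling with a generative LLM.
# `generate` stands in for whatever backend serves LLaMA 70B (e.g., a
# local inference server or a hosted API); the prompt text is illustrative.

PROMPT_TEMPLATE = (
    "You are an oncology expert. Read the clinical note below and answer "
    "with a single word, YES or NO: does the note document an adverse "
    "event experienced by the patient?\n\nNote:\n{note}\n\nAnswer:"
)

def generate(prompt: str) -> str:
    """Placeholder for a call to a LLaMA 70B inference endpoint."""
    raise NotImplementedError

def label_note(note_text: str) -> int:
    """Return 1 if the model flags an adverse event in the note, else 0."""
    answer = generate(PROMPT_TEMPLATE.format(note=note_text))
    return 1 if answer.strip().upper().startswith("YES") else 0
```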
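The fine-tuning stage can be sketched with the Hugging Face transformers library as below. The checkpoint names, hyperparameters, and the toy dataset are illustrative assumptions, not the study's exact configuration.

```python
# Minimal fine-tuning sketch; checkpoint and hyperparameters are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "answerdotai/ModernBERT-base"  # or "yikuan8/Clinical-Longformer"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)  # binary AE / no-AE classification

# Toy stand-in for the labeled note corpus: "label" is 1 when the note was
# flagged for an AE (by LLaMA 70B or by the clinical trial data).
train_ds = Dataset.from_dict({
    "text": ["Patient reports grade 2 nausea after cycle 3.",
             "Routine follow-up; no new complaints."],
    "label": [1, 0],
})

def tokenize(batch):
    # Truncate long notes to the encoder's maximum input length.
    return tokenizer(batch["text"], truncation=True)

train_ds = train_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ae-classifier",
                           per_device_train_batch_size=8,
                           num_train_epochs=3),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```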
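The reported evaluation metrics could be computed as follows. Since this summary states only that recall is macro-averaged, the use of positive-class precision and F1 in the sketch is an assumption.

```python
# Sketch of the label evaluation: y_true holds clinical-trial-derived
# note-level labels and y_pred holds LLaMA 70B predictions (0/1 each).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred):
    # With labels=[0, 1], ravel() returns tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_recall": recall_score(y_true, y_pred, average="macro"),
        "precision": precision_score(y_true, y_pred),  # positive (AE) class
        "f1": f1_score(y_true, y_pred),
        "specificity": tn / (tn + fp),                 # true-negative rate
    }

print(evaluate([0, 1, 0, 0], [1, 1, 0, 0]))
```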
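Finally, because the exact smoothing procedure for deriving patient-level predictions is not specified in this summary, the sketch below shows one plausible variant: a rolling majority vote over consecutive notes followed by an any-positive patient-level rule. The window size and decision threshold are assumptions.

```python
# One plausible smoothing scheme for rolling note-level predictions up to
# a patient-level AE flag; window size and threshold are assumptions.
import pandas as pd

def patient_level_prediction(notes: pd.DataFrame, window: int = 3) -> pd.Series:
    """notes has columns: patient_id, note_date, pred (0/1 per note).

    Each note's label is replaced by the majority vote over a rolling
    window of consecutive notes; a patient is then flagged as having an
    AE if any smoothed label is positive.
    """
    notes = notes.sort_values(["patient_id", "note_date"])
    smoothed = (notes.groupby("patient_id")["pred"]
                     .transform(lambda s: s.rolling(window, min_periods=1)
                                           .mean() >= 0.5))
    return smoothed.groupby(notes["patient_id"]).any().astype(int)
```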