Leveraging large language models to mimic domain expert labeling in unstructured text-based electronic healthcare records in non-English languages

Abstract

Background

The integration of big data and artificial intelligence (AI) in healthcare, particularly through the analysis of electronic health records (EHR), presents significant opportunities for improving diagnostic accuracy and patient outcomes. However, processing and accurately labeling vast amounts of unstructured data remains a critical bottleneck that calls for efficient and reliable solutions. This study investigates the ability of domain-specific, fine-tuned large language models (LLMs) to classify unstructured EHR texts containing typographical errors through named entity recognition tasks, with the aim of improving the efficiency and reliability of supervised learning AI models in healthcare.

Methods

Turkish clinical notes from pediatric emergency room admissions at Hacettepe University İhsan Doğramacı Children’s Hospital from 2018 to 2023 were analyzed. The data were preprocessed with open-source Python libraries and categorized using a pretrained GPT-3 model, “text-davinci-003,” before and after fine-tuning with domain-specific data on respiratory tract infections (RTI). The model’s predictions were compared against ground truth labels established by pediatric specialists.
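The comparison of model predictions against specialist labels can be illustrated with a minimal Python sketch. It assumes binary labels (1 = RTI, 0 = not RTI). The function name `binary_metrics`, the `BinaryMetrics` container, and the toy label lists are hypothetical and are not taken from the study, and the abstract does not state which metrics beyond accuracy were computed.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred):
    """Compare predicted labels with ground-truth labels (1 = RTI, 0 = not RTI)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("label lists must not be empty")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)


# Toy data for illustration only; these are not study results.
specialist_labels = [1, 1, 0, 0, 1]
model_labels = [1, 0, 0, 0, 1]
metrics = binary_metrics(specialist_labels, model_labels)
print(metrics)
```

Running the same function once on the pretrained model's output and once on the fine-tuned model's output would give a like-for-like comparison on a shared set of labeled records.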

Results

Of 24,229 patient records classified as poorly labeled, 18,879 contained no typographical errors and were confirmed as RTI through filtering methods. In identifying RTI cases among the remaining records, the fine-tuned model achieved 99.88% accuracy, significantly outperforming the pretrained model’s 78.54% accuracy. The fine-tuned model also outperformed the pretrained model on every other evaluated metric.
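The split between filtered records and records left for the language model can be sketched as follows. This is a hedged illustration that assumes the filtering step was an exact match on canonical terms, which the abstract does not confirm. The terms in `FILTER_TERMS`, the function `route_notes`, and the sample notes are hypothetical. Only the two record counts come from the abstract.

```python
# Hypothetical filter terms, chosen only for illustration.
FILTER_TERMS = ("pnömoni", "bronşiolit")


def route_notes(notes, terms=FILTER_TERMS):
    """Split notes into exact-term matches and the remainder.

    Notes with a misspelled term fail the exact match and land in the
    remainder, which is the portion a language model would classify.
    """
    matched, remaining = [], []
    for note in notes:
        # casefold() is not locale-aware; Turkish dotted/dotless I would
        # need dedicated handling in a real pipeline.
        text = note.casefold()
        if any(term in text for term in terms):
            matched.append(note)
        else:
            remaining.append(note)
    return matched, remaining


# Made-up sample notes; the second and fourth contain typos.
notes = ["Pnömoni tanısı", "pnomoni şüphesi", "bronşiolit", "ateş, öksrük"]
matched, remaining = route_notes(notes)
print(len(matched), len(remaining))

# Counts reported in the abstract.
TOTAL_RECORDS = 24_229
FILTERED_RECORDS = 18_879
REMAINING_RECORDS = TOTAL_RECORDS - FILTERED_RECORDS
print(REMAINING_RECORDS)  # 5350
```

By this arithmetic, 5,350 records were not resolved by filtering.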

Conclusions

Fine-tuned LLMs can categorize unstructured EHR data with high accuracy, closely approximating the performance of domain experts. This approach significantly reduces the time and costs associated with manual data labeling, demonstrating the potential to streamline the processing of large-scale healthcare data for AI applications.

Details

Title

Leveraging large language models to mimic domain expert labeling in unstructured text-based electronic healthcare records in non-English languages

Author

Akbasli, Izzet Turkalp; Birbilen, Ahmet Ziya; Teksam, Ozlem

Pages

1-9

Section

Research

Publication year

2025

Publication date

2025

Publisher

Springer Nature B.V.

e-ISSN

1472-6947

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1186/s12911-025-02871-6

ProQuest document ID

3187545772

© 2025. This work is licensed under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
