Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study

Abstract

Background: Named entity recognition (NER) plays a vital role in extracting critical medical entities from health care records, facilitating applications such as clinical decision support and data mining. Developing robust NER models for low-resource languages, such as Estonian, remains a challenge due to the scarcity of annotated data and domain-specific pretrained models. Large language models (LLMs) have shown promise in understanding text from any language or domain.

Objective: This study addresses the development of medical NER models for low-resource languages, specifically Estonian. We propose a novel approach by generating synthetic health care data and using LLMs to annotate them. These synthetic data are then used to train a high-performing NER model, which is applied to real-world medical texts, preserving patient data privacy.

Methods: Our approach to overcoming the shortage of annotated Estonian health care texts involves a three-step pipeline: (1) synthetic health care data are generated using a GPT-2 model trained locally on Estonian medical records, (2) the synthetic data are annotated with LLMs, specifically GPT-3.5-Turbo and GPT-4, and (3) the annotated synthetic data are used to fine-tune an NER model, which is later tested on real-world medical data. This paper compares the performance of different prompts; assesses the impact of GPT-3.5-Turbo, GPT-4, and a local LLM; and explores the relationship between the amount of annotated synthetic data and model performance.
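
The hand-off between steps (2) and (3) can be illustrated with a minimal sketch. It assumes, hypothetically, that the LLM annotator returns a list of (entity text, label) pairs for each synthetic note and that the NER model is fine-tuned on token-level BIO tags; the abstract specifies neither, so the function names, the label names DRUG and PROCEDURE, and the regex tokenizer are illustrative assumptions. The example sentence is invented English text, not study data.

```python
import re

def tokenize(text):
    """Split text into (token, start, end) triples with a simple regex (an assumption)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def to_bio(text, annotations):
    """Convert annotator output into token-level BIO tags.

    annotations: list of (entity_text, label) pairs, a hypothetical output format.
    """
    tokens = tokenize(text)
    tags = ["O"] * len(tokens)
    for entity_text, label in annotations:
        # Look for every case-insensitive occurrence of the entity string.
        for m in re.finditer(re.escape(entity_text), text, flags=re.IGNORECASE):
            inside = [i for i, (_, s, e) in enumerate(tokens)
                      if s >= m.start() and e <= m.end()]
            # Skip matches that cover no whole token or overlap an earlier entity.
            if not inside or any(tags[i] != "O" for i in inside):
                continue
            tags[inside[0]] = f"B-{label}"
            for i in inside[1:]:
                tags[i] = f"I-{label}"
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]

# Invented example sentence and annotations, for illustration only.
example_text = "Patient received ibuprofen 400 mg after knee arthroscopy."
example_annotations = [("ibuprofen", "DRUG"), ("knee arthroscopy", "PROCEDURE")]
bio = to_bio(example_text, example_annotations)
for token, tag in bio:
    print(f"{token}\t{tag}")
```

Aligning by string search keeps the sketch independent of any particular LLM response format, at the cost of silently dropping annotations that do not appear verbatim in the text.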

Results: The proposed methodology demonstrates significant potential in extracting named entities from real-world medical texts. Our top-performing setup achieved an F₁-score of 0.69 for drug extraction and 0.38 for procedure extraction. These results indicate a strong performance in recognizing certain entity types while highlighting the complexity of extracting procedures.
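
For readers unfamiliar with the metric, the F₁-score is the harmonic mean of precision and recall. The sketch below computes it at the entity level under the assumption of exact matching on (document ID, entity text, label) triples; the abstract does not state which matching scheme the study used, and the gold and predicted sets here are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall, and F1 under exact matching (an assumption)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Invented entities as (document_id, entity_text, label) triples.
gold_entities = {
    (0, "ibuprofen", "DRUG"),
    (0, "paracetamol", "DRUG"),
    (1, "aspirin", "DRUG"),
}
predicted_entities = {
    (0, "ibuprofen", "DRUG"),
    (1, "aspirin", "DRUG"),
    (1, "warfarin", "DRUG"),
}
precision, recall, f1 = precision_recall_f1(gold_entities, predicted_entities)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of three predictions are correct and two of three gold entities are found, so precision, recall, and F₁ all equal 2/3.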

Conclusions: This paper demonstrates a successful approach to leveraging LLMs for training NER models using synthetic data, effectively preserving patient privacy. By avoiding reliance on human-annotated data, our method shows promise in developing models for low-resource languages, such as Estonian. Future work will focus on refining the synthetic data generation and expanding the method’s applicability to other domains and languages.

Details

Company / organization

Name:

OpenAI

NAICS:

541715

Identifier / keyword

natural language processing; named entity recognition; large language model; synthetic data; LLM; NLP; machine learning; artificial intelligence; NER; medical entity; Estonian; health care data; annotated data; data annotation; clinical decision support; data mining

Title

Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study

Author

Šuvalov, Hendrik; Lepson, Mihkel; Kukk, Veronika; Malk, Maria; Ilves, Neeme; Kuulmets, Hele-Andra; Kolde, Raivo

Publication title

Journal of Medical Internet Research; Toronto

Volume

First page

e66279

Publication year

2025

Publication date

2025

Section

Information Retrieval

Publisher

Gunther Eysenbach MD MPH, Associate Professor

Place of publication

Toronto

Country of publication

Canada

Publication subject

Medical Sciences--Computer Applications

e-ISSN

1438-8871

Source type

Scholarly Journal

Language of publication

English

Document type

Journal Article

Publication history

Online publication date

2025-03-18

Milestone dates

2024-09-09 (Preprint first published); 2024-09-09 (Submitted); 2024-12-19 (Revised version received); 2025-01-31 (Accepted); 2025-03-18 (Published)

DOI

https://doi.org/10.2196/66279

© 2025. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
