Abstract

Social determinants of health (SDoH) play a critical role in patient outcomes, yet their documentation is often missing or incomplete in the structured data of electronic health records (EHRs). Large language models (LLMs) could enable high-throughput extraction of SDoH from the EHR to support research and clinical care. However, class imbalance and data limitations present challenges for this sparsely documented yet critical information. Here, we investigated the optimal methods for using LLMs to extract six SDoH categories from narrative text in the EHR: employment, housing, transportation, parental status, relationship, and social support. The best-performing models were fine-tuned Flan-T5 XL for any SDoH mentions (macro-F1 0.71), and Flan-T5 XXL for adverse SDoH mentions (macro-F1 0.70). The effect of adding LLM-generated synthetic data to training varied across models and architectures, but it improved the performance of smaller Flan-T5 models (delta F1 +0.12 to +0.23). Our best fine-tuned models outperformed ChatGPT-family models in the zero- and few-shot settings, except GPT4 with 10-shot prompting for adverse SDoH. Fine-tuned models were less likely than ChatGPT to change their prediction when race/ethnicity and gender descriptors were added to the text, suggesting less algorithmic bias (p < 0.05). Our models identified 93.8% of patients with adverse SDoH, while ICD-10 codes captured 2.0%. These results demonstrate the potential of LLMs in improving real-world evidence on SDoH and assisting in identifying patients who could benefit from resource support.
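The abstract reports macro-F1, which averages per-category F1 scores without weighting by category frequency, so rare SDoH categories count as much as common ones. The following is a minimal sketch of that metric, not the authors' evaluation code. It assumes a simplified single-label setting, and the function name `macro_f1`, the label names, and the toy data are all hypothetical.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores (single-label setting assumed)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 8 sentences with no SDoH mention, 2 with a rare mention.
labels = ["none", "housing", "employment"]
y_true = ["none"] * 8 + ["housing", "employment"]
y_pred = ["none"] * 10  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels)
print(f"accuracy={accuracy:.2f} macro_f1={score:.2f}")
```

In this toy case the majority-class predictor reaches 80% accuracy but a much lower macro-F1, because it scores zero on both rare categories. This illustrates why an unweighted average is informative under the class imbalance the abstract describes.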

Details

Title

Large language models to identify social determinants of health in electronic health records

Author

Guevara, Marco¹; Chen, Shan¹; Thomas, Spencer²; Chaunzwa, Tafadzwa L.¹; Franco, Idalid³; Kann, Benjamin H.¹; Moningi, Shalini³; Qian, Jack M.¹; Goldstein, Madeleine⁴; Harper, Susan⁴; Aerts, Hugo J. W. L.⁵; Catalano, Paul J.⁶; Savova, Guergana K.⁷; Mak, Raymond H.¹; Bitterman, Danielle S.¹

¹ Mass General Brigham, Harvard Medical School, Artificial Intelligence in Medicine (AIM) Program, Boston, USA (GRID:grid.38142.3c) (ISNI:0000 0004 1936 754X); Brigham and Women’s Hospital/Dana-Farber Cancer Institute, Department of Radiation Oncology, Boston, USA (GRID:grid.62560.37) (ISNI:0000 0004 0378 8294)
² Mass General Brigham, Harvard Medical School, Artificial Intelligence in Medicine (AIM) Program, Boston, USA (GRID:grid.38142.3c) (ISNI:0000 0004 1936 754X); Brigham and Women’s Hospital/Dana-Farber Cancer Institute, Department of Radiation Oncology, Boston, USA (GRID:grid.62560.37) (ISNI:0000 0004 0378 8294); Boston Children’s Hospital, Harvard Medical School, Computational Health Informatics Program, Boston, USA (GRID:grid.38142.3c) (ISNI:0000 0004 1936 754X)
³ Brigham and Women’s Hospital/Dana-Farber Cancer Institute, Department of Radiation Oncology, Boston, USA (GRID:grid.62560.37) (ISNI:0000 0004 0378 8294)
⁴ Dana-Farber Cancer Institute, Adult Resource Office, Boston, USA (GRID:grid.65499.37) (ISNI:0000 0001 2106 9910)
⁵ Mass General Brigham, Harvard Medical School, Artificial Intelligence in Medicine (AIM) Program, Boston, USA (GRID:grid.38142.3c) (ISNI:0000 0004 1936 754X); Brigham and Women’s Hospital/Dana-Farber Cancer Institute, Department of Radiation Oncology, Boston, USA (GRID:grid.62560.37) (ISNI:0000 0004 0378 8294); Maastricht University, Radiology and Nuclear Medicine, GROW & CARIM, Maastricht, The Netherlands (GRID:grid.5012.6) (ISNI:0000 0001 0481 6099)
⁶ Dana-Farber Cancer Institute and Department of Biostatistics, Harvard T. H. Chan School of Public Health, Department of Data Science, Boston, USA (GRID:grid.65499.37) (ISNI:0000 0001 2106 9910)
⁷ Boston Children’s Hospital, Harvard Medical School, Computational Health Informatics Program, Boston, USA (GRID:grid.38142.3c) (ISNI:0000 0004 1936 754X)

Publication year

2024

Publication date

Dec 2024

Publisher

Nature Publishing Group

e-ISSN

23986352

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1038/s41746-023-00970-0

ProQuest document ID

2912906428

© The Author(s) 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
