
Abstract

Disclosure: J.H. Flory: None. A. Petrov: None. M. Tuttle: None.

Background: Large language models (LLMs) may have a role in providing clinical decision support. When a clinical question has a clear right answer, modern LLMs often provide it with impressive accuracy. However, for clinical questions where there is genuine uncertainty (i.e., equipoise) as to the best course of action (for example, whether to biopsy a thyroid nodule of intermediate size and risk characteristics), LLM performance is not straightforward to characterize.

Methods: GPT-4 was queried with 100 vignettes describing hypothetical patients with an incidentally discovered thyroid nodule. Patient characteristics (including demographics, comorbidities, and treatment preferences), nodule size, and nodule radiographic characteristics were varied at random. GPT-4 was asked to choose whether each nodule should be biopsied. Five query formats (‘prompts’) were tested: a ‘simple’ prompt; a ‘less-aggressive’ prompt that asked the model to minimize unnecessary biopsies; an ‘ATA’ prompt that asked it to follow American Thyroid Association guidelines; a ‘TI-RADS’ prompt that asked it to follow the Thyroid Imaging Reporting and Data System guidelines; and a ‘TI-RADS detailed’ prompt that also provided a 225-word precis of the TI-RADS guidelines.

Results: Overall rates of biopsy recommendation ranged from 67% for the ‘ATA’ prompt to 33% for the ‘less-aggressive’ prompt. Recommendations varied according to patient and nodule characteristics. For example, for nodules with TI-RADS score 4-6 (‘intermediate risk’) and size < 1 cm, GPT-4 never recommended biopsy; for size 1-1.5 cm, the ‘TI-RADS detailed’ prompt recommended biopsy in 13%, versus 20% for the ‘less-aggressive’ prompt and 33% for the other 3 prompts; for size 1.5-2.5 cm, the biopsy recommendation rate ranged from 9% for the ‘less-aggressive’ prompt to 63% for the ‘TI-RADS detailed’ prompt; for size > 2.5 cm, biopsy rates ranged from 31% for the ‘less-aggressive’ prompt to 96% for the ‘TI-RADS detailed’ prompt. Recommendations were further personalized by patient characteristics. For example, patients who were randomly assigned severe comorbidities such as end-stage heart failure were less likely to be recommended for biopsy (OR 0.21, 95% CI 0.05-0.91).

Conclusions: GPT-4 provided thyroid nodule management recommendations that frequently varied from clinical guidelines, even when a brief precis of a relevant guideline was provided to the model as a reference. In at least some cases, the variation may have reflected appropriate personalization of care (e.g., avoiding biopsy in patients with major comorbidities and limited life expectancy). These findings highlight the need for experts to make value judgments when interpreting and designing LLMs for clinical decision support. We recommend against reliance on LLM output until it is shown to align not just with clinical guidelines but also with patient and clinician preferences.
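
Editorial illustration: the abstract reports an odds ratio with a 95% confidence interval but does not state how it was estimated. The sketch below shows one common way such a figure can be computed, a Wald interval on the log odds ratio from a 2x2 table. The function name `odds_ratio_wald_ci` and the counts in the usage lines are hypothetical and are not taken from the study, which may well have used a different method (for example, logistic regression).

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.959964):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed, outcome present      b: exposed, outcome absent
    c: unexposed, outcome present    d: unexposed, outcome absent
    z: normal quantile (1.959964 gives a two-sided 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only (NOT data from the abstract):
# "exposed" = vignettes with a severe comorbidity, "outcome" = biopsy recommended.
estimate, lower, upper = odds_ratio_wald_ci(3, 12, 40, 45)
print(f"OR {estimate:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

An interval that lies entirely below 1, like the one reported in the abstract, indicates that the exposed group was less likely to receive the outcome at the chosen confidence level.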

Presentation: Monday, July 14, 2025

Details

1009240
Title
MON-391 The Large Language Model GPT-4 Compared to Guidelines on Thyroid Nodule Management Under Conditions of Clinical Uncertainty
Author
Flory, James H 1 ; Petrov, Aleksandr 2 ; Tuttle, Michael 3

1 MD, MSCE, Memorial Sloan Kettering Cancer Center, NY, NY, USA
2 Memorial Sloan Kettering Cancer Center, NY, NY, USA
3 Memorial Sloan Kettering, New York, NY, USA
Publication title
Volume
9
Issue
Supplement_1
Number of pages
3
Publication year
2025
Publication date
Oct-Nov 2025
Section
Abstract
Publisher
Oxford University Press
Place of publication
Oxford
Country of publication
United Kingdom
e-ISSN
24721972
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-10-22
First posting date
22 Oct 2025
ProQuest document ID
3264003516
Document URL
https://www.proquest.com/scholarly-journals/mon-391-large-language-model-gpt-4-compared/docview/3264003516/se-2?accountid=208611
Copyright
© The Author(s) 2025. Published by Oxford University Press on behalf of the Endocrine Society. This work is published under https://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-09
Database
ProQuest One Academic