
Abstract

Background: Large language models (LLMs) have flourished and gradually become an important research and application direction in the medical field. However, because medicine is highly specialized, complex, and specific, and therefore demands extremely high accuracy, controversy remains about whether LLMs can be used in the medical field. A growing number of studies have evaluated the performance of various LLMs in medicine, but their conclusions are inconsistent.

Objective: This study uses a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical research questions, providing high-level evidence for their future development and application in the medical field.

Methods: In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024. Studies on the accuracy of LLMs when answering clinical research questions were screened from the published reports and included. The systematic review and NMA compared the accuracy of different LLMs when answering clinical research questions, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The NMA was performed using Bayesian methods, and indirect comparisons between models were made and ranked. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher accuracy ranking for the corresponding LLM.
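For orientation, the SUCRA statistic is not defined in this report; under the conventional definition (assumed here, not stated by the authors), for a network of $a$ competing models the SUCRA of model $k$ is
$$\mathrm{SUCRA}_k = \frac{1}{a-1} \sum_{j=1}^{a-1} \mathrm{cum}_{kj},$$
where $\mathrm{cum}_{kj}$ is the cumulative probability that model $k$ ranks among the top $j$ positions, so a value near 1 means the model almost always ranks first and a value near 0 means it almost always ranks last.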

Results: The systematic review and NMA examined 168 articles encompassing 35,896 questions and 3063 clinical cases. Of the 168 studies, 40 (23.8%) were considered to have a low risk of bias, 128 (76.2%) a moderate risk, and none a high risk. ChatGPT-4o (SUCRA=0.9207) demonstrated the strongest performance in terms of accuracy on objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at answering open-ended questions. In terms of accuracy for the top 1 and top 3 diagnoses of clinical cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked the highest, while Claude 3 Opus (SUCRA=0.9672) performed best at the top 5 diagnosis. Gemini (SUCRA=0.9649) had the highest SUCRA value for accuracy in triage and classification.

Conclusions: Our study indicates that ChatGPT-4o has an advantage when answering objective questions. For open-ended questions, ChatGPT-4 may be more credible. Human experts are more accurate at the top 1 and top 3 diagnoses. Claude 3 Opus performs better at the top 5 diagnosis, while for triage and classification, Gemini is more advantageous. This analysis offers valuable insights for clinicians and medical practitioners, enabling them to leverage LLMs effectively for improved decision-making in learning, diagnosis, and management across clinical scenarios.

Trial Registration: PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245

Details

Title
Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis
Author
Wang, Ling; Li, Jinglin; Zhuang, Boyang; Huang, Shasha; Fang, Meilin; Wang, Cunze; Li, Wen; Zhang, Mohan; Gong, Shurong
Publication title
Journal of Medical Internet Research
Volume
27
First page
e64486
Publication year
2025
Publication date
2025
Section
Digital Health Reviews
Publisher
Gunther Eysenbach MD MPH, Associate Professor
Place of publication
Toronto
Country of publication
Canada
e-ISSN
1438-8871
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-04-30
Milestone dates
2024-07-18 (Preprint first published); 2024-07-18 (Submitted); 2025-02-04 (Revised version received); 2025-04-03 (Accepted); 2025-04-30 (Published)
First posting date
30 Apr 2025
ProQuest document ID
3222368409
Document URL
https://www.proquest.com/scholarly-journals/accuracy-large-language-models-when-answering/docview/3222368409/se-2?accountid=208611
Copyright
© 2025. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-07
Database
ProQuest One Academic