© 2025. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Background: Diagnosing rare diseases remains challenging due to their inherent complexity and limited physician knowledge. Large language models (LLMs) offer new potential to enhance diagnostic workflows.

Objective: This study aimed to evaluate the diagnostic accuracy of ChatGPT-4o and 4 open-source LLMs (qwen2.5:7b, Llama3.1:8b, qwen2.5:72b, and Llama3.1:70b) for rare diseases, assess the effect of language on diagnostic performance, and explore retrieval-augmented generation (RAG) and chain-of-thought (CoT) reasoning.

Methods: We extracted clinical manifestations of 121 rare diseases from China’s inaugural rare disease catalog. ChatGPT-4o generated a primary diagnosis and 5 differential diagnoses, while the 4 open-source LLMs were assessed in both English and Chinese contexts. The lowest-performing model was re-evaluated with RAG and CoT. Diagnostic accuracy was compared using the McNemar test. A survey evaluated 11 clinicians’ familiarity with rare diseases.
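
The McNemar test named above compares two paired sets of correct/incorrect outcomes on the same cases, using only the discordant pairs (cases that one condition got right and the other got wrong). The following is a minimal sketch of the chi-square form of that test, not the authors' analysis code. The function name `mcnemar_chi2`, the optional continuity correction, and the toy data are assumptions made for illustration; the abstract does not say whether a correction was applied.

```python
import math


def mcnemar_chi2(correct_a, correct_b, continuity=False):
    """McNemar chi-square test (1 degree of freedom) for paired binary outcomes.

    correct_a / correct_b: sequences of booleans, one entry per case, giving
    whether condition A / condition B produced the correct diagnosis.
    Returns (statistic, p_value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have the same length")

    # Discordant pairs: only these carry information about a difference.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, no evidence of a difference

    diff = abs(b - c) - (1 if continuity else 0)
    stat = max(diff, 0) ** 2 / (b + c)

    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical toy data (20 cases), not taken from the study:
# 8 both correct, 9 only A correct, 1 only B correct, 2 both wrong.
a = [True] * 8 + [True] * 9 + [False] * 1 + [False] * 2
b = [True] * 8 + [False] * 9 + [True] * 1 + [False] * 2
stat, p = mcnemar_chi2(a, b)
print(f"chi2(1) = {stat:.2f}, P = {p:.3f}")
```

Cases on which both conditions agree drop out of the statistic, which is why two models with similar overall accuracy can still differ significantly if their errors fall on different diseases.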

Results: ChatGPT-4o achieved the highest diagnostic accuracy (90.1%). Language effects varied across models: qwen2.5:7b showed comparable performance in Chinese (51.2%) and English (47.9%; χ²₁=0.32, P=.57), whereas Llama3.1:8b exhibited significantly higher English accuracy (67.8% vs 31.4%; χ²₁=40.20, P<.001). Among larger models, qwen2.5:72b maintained cross-lingual consistency (Chinese: 82.6% vs English: 83.5%; odds ratio [OR] 0.88, 95% CI 0.27-2.76, P=1.000), contrasting with Llama3.1:70b’s language-dependent variation (Chinese: 80.2% vs English: 90.1%; OR 0.29, 95% CI 0.08-0.83, P=.02). Cross-model comparisons revealed that Llama3.1:8b underperformed qwen2.5:7b in Chinese (χ²₁=13.22, P<.001) but surpassed it in English (χ²₁=13.92, P<.001). No significant differences were observed between qwen2.5:72b and Llama3.1:70b (English: OR 0.33, P=.08; Chinese: OR 1.5, 95% CI 0.48-5.12, P=.07); qwen2.5:72b matched ChatGPT-4o’s performance in both languages (English: OR 0.33, P=.08; Chinese: OR 0.44, P=.09); Llama3.1:70b mirrored ChatGPT-4o’s English accuracy (OR 1, P=1.000) but lagged in Chinese (OR 0.33, P=.02). RAG implementation enhanced qwen2.5:7b’s accuracy to 79.3% (χ²₁=31.11, P<.001) with 85.9% retrieval precision. The distilled model DeepSeek-R1:7b markedly underperformed (9.9% vs qwen2.5:7b; χ²₁=42.19, P<.001). Clinician surveys revealed significant knowledge gaps in rare disease management.
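
For paired comparisons reported as an odds ratio, one common approach is the conditional (matched-pairs) OR, the ratio of the two discordant counts, paired with an exact two-sided binomial P value. The sketch below assumes that approach; the abstract does not state how the ORs or their confidence intervals were computed, and the function name `exact_mcnemar` and the counts in the example are hypothetical.

```python
from math import comb


def exact_mcnemar(b, c):
    """Matched-pairs odds ratio and exact two-sided McNemar P value.

    b: number of cases correct under condition A only.
    c: number of cases correct under condition B only.
    Returns (odds_ratio, p_value).
    """
    n = b + c
    if n == 0:
        return float("nan"), 1.0  # no discordant pairs to compare

    # Conditional odds ratio for matched pairs: ratio of discordant counts.
    odds_ratio = b / c if c else float("inf")

    # Under the null hypothesis, b ~ Binomial(n, 0.5).
    # Two-sided P value: twice the smaller tail, capped at 1.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return odds_ratio, p_value


# Hypothetical discordant counts, chosen only for illustration.
odds_ratio, p = exact_mcnemar(5, 17)
print(f"OR = {odds_ratio:.2f}, P = {p:.3f}")
```

The exact form is usually preferred over the chi-square approximation when the number of discordant pairs is small, as can happen when both models are accurate on most cases.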

Conclusions: ChatGPT-4o demonstrated superior diagnostic performance for rare diseases. While Llama3.1:8b appears viable for local deployment in resource-constrained English-language diagnostic workflows, Chinese applications require larger models to achieve comparable diagnostic accuracy. Validating such models is all the more urgent given the release of open-source models like DeepSeek-R1, which may see rapid adoption without thorough validation. Successful clinical implementation of LLMs requires attention to 3 core elements: model parameterization, user language, and pretraining data. The integration of RAG significantly enhanced open-source LLM accuracy for rare disease diagnosis, although caution remains warranted for low-parameter reasoning models, which showed substantial performance limitations. We recommend that hospital IT departments and policymakers prioritize language relevance in model selection and consider integrating RAG with curated knowledge bases to enhance diagnostic utility in constrained settings, while exercising caution with low-parameter models.

Details

Title
Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China’s Rare Disease Catalog: Comparative Study
Author
Zhong, Wei; Liu, YiFan; Liu, Yan; Yang, Kai; Gao, HuiMin; Yan, HuiHui; Hao, WenJing; Yan, YouSheng; Yin, ChengHong
First page
e69929
Section
Artificial Intelligence
Publication year
2025
Publication date
2025
Publisher
Gunther Eysenbach MD MPH, Associate Professor
e-ISSN
1438-8871
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3222369608