Copyright © 2024, Ishida et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Introduction: This study evaluates the diagnostic performance of the latest large language models (LLMs), GPT-4o (OpenAI, San Francisco, CA, USA) and Claude 3 Opus (Anthropic, San Francisco, CA, USA), in determining causes of death from medical histories and postmortem CT findings.

Methods: We included 100 adult cases in which the cause of death was diagnosable on postmortem CT, using autopsy results as the gold standard. Medical histories and postmortem CT findings were compiled, and the clinical and imaging diagnoses of both the underlying and immediate causes of death, as well as personal information, were carefully removed from the data shown to the LLMs. Both GPT-4o and Claude 3 Opus generated the top three differential diagnoses for each of the underlying and immediate causes of death in response to three prompts: 1) medical history only; 2) postmortem CT findings only; and 3) both medical history and postmortem CT findings. The diagnostic performance of the two LLMs was compared using McNemar’s test.
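As an illustration of the paired comparison named in the Methods, the following is a minimal sketch of an exact (binomial) McNemar's test. The abstract does not state whether the exact or the chi-square variant was used, so the exact form is an assumption here. The function names and the per-case correctness lists in the usage note are hypothetical and are not taken from the study data.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count discordant pairs between two models scored on the same cases.

    correct_a, correct_b: equal-length sequences of booleans, one entry per case
    (True if that model's diagnosis matched the reference).
    Returns (b, c): b = A correct and B wrong, c = A wrong and B correct.
    """
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if not a and other)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis, each discordant pair is equally likely to
    favour either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With made-up inputs, `discordant_counts([True, True, False, True], [True, False, True, False])` gives the pair `(2, 1)`, which can then be passed to `mcnemar_exact`. Only cases on which the two models disagree influence the p-value; cases that both models get right or both get wrong drop out.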

Results: For the underlying cause of death, GPT-4o achieved primary diagnostic accuracy rates of 78%, 72%, and 78%, while Claude 3 Opus achieved 72%, 56%, and 75% for prompts 1, 2, and 3, respectively. Including any of the top three differential diagnoses, GPT-4o’s accuracy rates were 92%, 90%, and 92%, while Claude 3 Opus’s rates were 93%, 69%, and 93% for prompts 1, 2, and 3, respectively. For the immediate cause of death, GPT-4o’s primary diagnostic accuracy rates were 55%, 58%, and 62%, while Claude 3 Opus’s rates were 60%, 62%, and 63% for prompts 1, 2, and 3, respectively. For any of the top three differential diagnoses, GPT-4o’s accuracy rates were 88% for prompt 1 and 91% for prompts 2 and 3, whereas Claude 3 Opus’s rates were 92% for all three prompts. Significant differences between the models were observed for prompt 2 in diagnosing the underlying cause of death (p = 0.03 and <0.01 for the primary and top three differential diagnoses, respectively).

Conclusion: Both GPT-4o and Claude 3 Opus demonstrated relatively high performance in diagnosing both the underlying and immediate causes of death using medical histories and postmortem CT findings.

Details

Title
Diagnostic Performance of GPT-4o and Claude 3 Opus in Determining Causes of Death From Medical Histories and Postmortem CT Findings
Author
Ishida, Masanori; Gonoi, Wataru; Nyunoya, Keisuke; Abe, Hiroyuki; Shirota, Go; Okimoto, Naomasa; Fujimoto, Kotaro; Kurokawa, Mariko; Nakai, Motoki; Saito, Kazuhiro; Ushiku, Tetsuo; Abe, Osamu
University/institution
U.S. National Institutes of Health/National Library of Medicine
Publication year
2024
Publication date
2024
Publisher
Cureus Inc.
e-ISSN
2168-8184
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3111393607