Correspondence to Dr Renaud Duval, Department of Ophthalmology, University of Montreal, Montreal, Canada; [email protected]
WHAT IS ALREADY KNOWN ON THIS TOPIC
Clinicians are exploring the use of large language models (LLMs) like Generative Pre-trained Transformer (GPT) to improve diagnostic accuracy and clinical decision-making in medicine, notably in ophthalmology. Studies show that GPT-4 outperforms previous models on ophthalmology question banks, but its text generation method reveals limitations in critical thinking. Early research using ophthalmology case reports suggests high agreement between LLMs and experts, yet the application of LLMs to a large set of ophthalmology clinical challenges remains unexplored.
WHAT THIS STUDY ADDS
This study assesses GPT-4’s performance on ophthalmological cases featured in the Journal of the American Medical Association Ophthalmology Clinical Challenges section, showcasing its diagnostic and decision-making capabilities. It also evaluates the efficacy of various prompting strategies and positions GPT-4’s performance in relation to ophthalmology trainees.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
This study underscores the potential of LLMs within ophthalmology, suggesting a future where AI complements clinical expertise. By demonstrating that GPT-4 can achieve commendable performance in complex ophthalmology cases, this study may catalyse the discussion on integrating AI in clinical decision support systems and encourage policy frameworks that facilitate the responsible deployment of LLMs in patient care.
Introduction
Globally, clinicians and scientists alike are exploring how large language models (LLMs) could improve diagnostic precision and support clinical decision-making.1 LLMs, which represent fine-tuned foundation models trained on large datasets, can produce coherent text and demonstrate complex reasoning capabilities.2–5 Generative Pre-trained Transformer (GPT)-4 currently sets the industry standard among LLMs, showing considerable improvements over its predecessors in the medical domain.4 Notably, GPT-4’s diagnostic and clinical decision-making abilities seem to be enhanced as it continues to learn.5
In ophthalmology, our group has previously studied the performance of GPT in medical question answering. We have shown that GPT-4 can achieve an accuracy of 72.9% on large ophthalmology question banks, outperforming GPT-3.5 by 18.3%.6 7 Since our original work, numerous studies have corroborated our findings.6–11 We have also shown that GPT-4 performs better on recall questions than on those involving clinical decision-making.7 Thus, the ability of LLMs to...