
Abstract

Related Article: This is a corrected version. See the correction statement at https://formative.jmir.org/2025/1/e71664

Background: The rapid development of large language models (LLMs) such as OpenAI’s ChatGPT has significantly impacted medical research and education. These models have shown potential in fields ranging from radiological imaging interpretation to medical licensing examination assistance. Recently, LLMs have been enhanced with image recognition capabilities.

Objective: This study aims to critically examine the effectiveness of these LLMs in medical diagnostics and training by assessing their accuracy and utility in answering image-based questions from medical licensing examinations.

Methods: This study analyzed 1070 image-based multiple-choice questions from the AMBOSS learning platform, 605 in English and 465 in German. Customized prompts in both languages directed the models to interpret medical images and provide the most likely diagnosis. Student performance data were obtained from AMBOSS, including metrics such as the “student passed mean” and “majority vote.” Statistical analysis was conducted using Python (Python Software Foundation), with key libraries for data manipulation and visualization.
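
The abstract names Python as the analysis environment but does not give the implementation. As a minimal sketch, assuming scipy (its chi2_contingency function is a standard choice for this kind of test, though nothing in the abstract confirms the authors used it), the model-versus-model comparison reported below could be run as a chi-square test on a 2×2 contingency table:

```python
# Illustrative sketch only: the counts come from the abstract, but the
# code is an assumption about the analysis, not the authors' script.
from scipy.stats import chi2_contingency

# Rows: models; columns: [correct, incorrect] out of 1070 questions each.
table = [
    [609, 1070 - 609],  # GPT-4 1106 Vision Preview
    [477, 1070 - 477],  # Bard Gemini Pro
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.3g}")
# With scipy's default Yates continuity correction for 2x2 tables, this
# yields roughly chi2(1) = 32.1, P < .001, matching the reported value.
```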

Results: GPT-4 1106 Vision Preview (OpenAI) outperformed Bard Gemini Pro (Google), correctly answering 56.9% (609/1070) of questions compared to Bard’s 44.6% (477/1070), a statistically significant difference (χ²₁=32.1, P<.001). However, GPT-4 1106 left 16.1% (172/1070) of questions unanswered, significantly higher than Bard’s 4.1% (44/1070; χ²₁=83.1, P<.001). When considering only answered questions, GPT-4 1106’s accuracy increased to 67.8% (609/898), surpassing both Bard (477/1026, 46.5%; χ²₁=87.7, P<.001) and the student passed mean of 63% (674/1070, SE 1.48%; χ²₁=4.8, P=.03). Language-specific analysis revealed both models performed better in German than in English, with GPT-4 1106 showing greater accuracy in German (282/465, 60.6% vs 327/605, 54.1%; χ²₁=4.4, P=.04) and Bard Gemini Pro exhibiting a similar trend (255/465, 54.8% vs 222/605, 36.7%; χ²₁=34.3, P<.001). The student majority vote achieved an overall accuracy of 94.5% (1011/1070), significantly outperforming both artificial intelligence models (GPT-4 1106: χ²₁=408.5, P<.001; Bard Gemini Pro: χ²₁=626.6, P<.001).
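
The “student majority vote” metric scores, for each question, the option chosen most often by students against the correct answer. A minimal sketch of that aggregation is below; the record layout is a hypothetical illustration, as the AMBOSS export format is not described in the abstract:

```python
# Hypothetical aggregation of the "majority vote" metric; the record
# layout here is assumed for illustration, not AMBOSS's actual format.
from collections import Counter

# question ID -> (answers chosen by individual students, correct option)
records = {
    "q1": (["A", "A", "B", "A"], "A"),
    "q2": (["C", "D", "D"], "C"),
}

def majority_vote_accuracy(records):
    correct = 0
    for answers, truth in records.values():
        majority, _ = Counter(answers).most_common(1)[0]  # modal answer
        correct += majority == truth
    return correct / len(records)

print(f"majority-vote accuracy: {majority_vote_accuracy(records):.1%}")
```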

Conclusions: Our study shows that GPT-4 1106 Vision Preview and Bard Gemini Pro have potential in medical visual question-answering tasks and could serve as support tools for students. However, their performance varies with the language used, with both models performing better in German, and they still show limitations in responding to non-English content. The accuracy rates, particularly when compared with student responses, highlight the potential of these models in medical education, yet the need for further optimization and a better understanding of their limitations in diverse linguistic contexts remains critical.

Details

Title
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study
Publication title
JMIR Formative Research
Volume
8
First page
e57592
Number of pages
11
Publication year
2024
Publication date
2024
Section
Development and Evaluation of Research Methods, Instruments and Tools
Publisher
JMIR Publications
Place of publication
Toronto
Country of publication
Canada
e-ISSN
2561-326X
Source type
Scholarly Journal
Language of publication
English
Document type
Case Study, Journal Article
Publication history
Online publication date
2024-12-17
Milestone dates
2024-02-20 (Preprint first published); 2024-02-20 (Submitted); 2024-09-06 (Revised version received); 2024-09-09 (Accepted); 2024-12-17 (Published)
First posting date
17 Dec 2024
ProQuest document ID
3148807217
Document URL
https://www.proquest.com/scholarly-journals/evaluating-bard-gemini-pro-gpt-4-vision-against/docview/3148807217/se-2?accountid=208611
Copyright
© 2024. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-08-27
Database
ProQuest One Academic