
Abstract

Background: Artificial intelligence (AI) has become widely applied across many fields, including medical education. An LLM's answers, and their validity, depend on its training data and on how each model is optimized. The accuracy of large language models (LLMs) on basic medical examinations, and the factors related to that accuracy, have also been explored.

Objective: We evaluated factors associated with the accuracy of LLMs (GPT-3.5, GPT-4, Google Bard, and Microsoft Bing) in answering multiple-choice questions from basic medical science examinations.

Methods: We used questions that were closely aligned with the content and topic distribution of Thailand’s Step 1 National Medical Licensing Examination. Variables such as the difficulty index, discrimination index, and question characteristics were collected. These questions were then simultaneously input into ChatGPT (with GPT-3.5 and GPT-4), Microsoft Bing, and Google Bard, and their responses were recorded. The accuracy of these LLMs and the associated factors were analyzed using multivariable logistic regression. This analysis aimed to assess the effect of various factors on model accuracy, with results reported as odds ratios (ORs).
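The kind of analysis described above — a logistic regression where the outcome is whether a model answered an item correctly and a predictor such as the item's difficulty index is summarized as an odds ratio — can be sketched as follows. The synthetic data, the single-predictor setup, and all variable names are illustrative assumptions, not the study's actual data or full multivariable model.

```python
import math
import random

def fit_logistic(X, y, lr=0.1, epochs=3000):
    """Fit a logistic regression by gradient descent on the log-likelihood.

    Returns coefficients as a list, intercept first. A minimal sketch; a
    real analysis would use a statistics package with standard errors.
    """
    n_feat = len(X[0])
    w = [0.0] * (n_feat + 1)
    n = len(y)
    for _ in range(epochs):
        grad = [0.0] * (n_feat + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(correct)
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

random.seed(0)
# Hypothetical items: difficulty index = proportion of examinees answering
# correctly (0..1), so higher values mean easier items. We simulate an LLM
# that is more often correct on easier items.
difficulty = [random.random() for _ in range(300)]
correct = [
    1 if random.random() < 1.0 / (1.0 + math.exp(-(-1.5 + 3.0 * d))) else 0
    for d in difficulty
]

w = fit_logistic([[d] for d in difficulty], correct)
odds_ratio = math.exp(w[1])  # OR per one-unit increase in difficulty index
print(f"OR for difficulty index: {odds_ratio:.2f}")
```

An OR above 1 here means the odds of a correct answer rise as the difficulty index rises (i.e., on easier items), which is the direction of association the study reports for the difficulty index.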

Results: The study revealed that GPT-4 was the top-performing model, with an overall accuracy of 89.07% (95% CI 84.76%‐92.41%), significantly outperforming the others (P<.001). Microsoft Bing followed with an accuracy of 83.69% (95% CI 78.85%‐87.80%), GPT-3.5 at 67.02% (95% CI 61.20%‐72.48%), and Google Bard at 63.83% (95% CI 57.92%‐69.44%). The multivariable logistic regression analysis showed a correlation between question difficulty and model performance, with GPT-4 demonstrating the strongest association. Interestingly, no significant correlation was found between model accuracy and question length, negative wording, clinical scenarios, or the discrimination index for most models, except for Google Bard, which showed varying correlations.

Conclusions: The GPT-4 and Microsoft Bing models demonstrated accuracy comparable to each other and superior to that of GPT-3.5 and Google Bard in the domain of basic medical science. The accuracy of these models was significantly influenced by the item’s difficulty index, indicating that the LLMs are more accurate when answering easier questions. This suggests that the more accurate models, such as GPT-4 and Bing, can be valuable tools for understanding and learning basic medical science concepts.

Details

Title
Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study
Volume
11
First page
e58898
Publication year
2025
Publication date
2025
Section
Artificial Intelligence (AI) in Medical Education
Publisher
JMIR Publications
Place of publication
Toronto
Country of publication
Canada
e-ISSN
2369-3762
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-01-13
Milestone dates
2024-03-29 (Preprint first published); 2024-03-29 (Submitted); 2024-05-22 (Revised version received); 2024-12-04 (Accepted); 2025-01-13 (Published)
First posting date
13 Jan 2025
ProQuest document ID
3158513018
Document URL
https://www.proquest.com/scholarly-journals/factors-associated-with-accuracy-large-language/docview/3158513018/se-2?accountid=208611