Abstract
Objectives
This study aims to assess the accuracy of generative pre-trained transformer 4o (GPT-4o) in answering multiple-response questions from the European Diploma in Radiology (EDiR) examination, comparing its performance to that of human candidates.
Materials and methods
Results from 42 EDiR candidates across Europe were compared with those from 26 fourth-year medical students who answered exclusively using ChatGPT-4o in a prospective study (October 2024). The challenge consisted of 52 recall- or understanding-based EDiR multiple-response questions, all without visual inputs.
Results
GPT-4o achieved a mean score of 82.1 ± 3.0%, significantly outperforming the EDiR candidates, who scored 49.4 ± 10.5% (p < 0.0001). In particular, GPT-4o demonstrated higher true-positive rates while maintaining lower false-positive rates than EDiR candidates, with higher accuracy in all radiology subspecialties (p < 0.0001) except informatics (p = 0.20). There was near-perfect agreement among GPT-4o responses (κ = 0.872) and moderate agreement among EDiR participants (κ = 0.334). Exit surveys revealed that all participants used the copy-and-paste feature and that 73% submitted additional questions to clarify responses.
Conclusions
GPT-4o significantly outperformed human candidates on low-order, text-based EDiR multiple-response questions, demonstrating higher accuracy and reliability. These results highlight GPT-4o's potential for answering text-based radiology questions. Further research is needed to investigate its performance across different question formats and candidate populations to ensure broader applicability and reliability.
Critical relevance statement
GPT-4o significantly outperforms human candidates on factual, text-based EDiR questions, excelling especially at identifying correct responses and achieving higher accuracy than radiologists.
Key Points
On EDiR text-based questions, GPT-4o scored higher (82%) than EDiR candidates (49%).
Compared to radiologists, GPT-4o excelled in identifying correct responses.
GPT-4o responses demonstrated higher agreement (κ = 0.87) compared to EDiR candidates (κ = 0.33).