Assessing the Accuracy of Diagnostic Capabilities

1. Introduction

Artificial Intelligence (AI) is rapidly reshaping the medical landscape by redefining how information is synthesized, interpreted, and applied across clinical, administrative, and educational domains [1,2]. A significant innovation in AI is the development of large language models (LLMs), transformer-based architectures with strong contextual and semantic capabilities. Although they share a common underlying architecture, these models handle diverse, multi-modal data well, making them well-suited for biomedical and healthcare applications where varied data types are the norm [3]. Unlike traditional rule-based AI, LLMs support dynamic, multi-turn interactions (conversations that span multiple exchanges or dialogue turns) that mirror human dialogue [2].

Models such as GPT-4o, Gemini, Grok, and DeepSeek are trained on massive biomedical and clinical corpora, hold billions of parameters, and already support tasks like triage, chronic disease management, decision support, and patient education. They offer scalable, immediate, and personalized informational support [2,4]. Benchmarking on PromptBench reveals that commercial LLMs, including ChatGPT, Claude, and Gemini, outperform traditional AI systems (such as neural translators and symbolic inference engines) in contextual understanding, although they remain limited in formal logic and symbolic reasoning [5].

In medical education, LLMs like ChatGPT are being piloted in various domains, such as radiotherapy education, where they help to demystify complex technical concepts like dose distribution or radiation side effects [2]. Unlike human standardized patients (SPs), whose use is expensive and resource-intensive, LLMs can act as low-cost, highly adaptable standardized patients [6]. Additionally, GPT-4 has demonstrated performance close to that of human evaluators by identifying both factual mistakes and reasoning flaws in student work. By comparison, AI systems using rule-based machine learning showed difficulty interpreting ambiguous language, reinforcing the superior contextual understanding of LLMs [7].

While early results show that LLMs (like ChatGPT) can enhance learner engagement and knowledge retention in medical training, further refinement is needed to ensure reliability, transparency, and clinical trust. Despite these advantages, serious challenges remain. LLMs sometimes “hallucinate”, i.e., they produce verbose or misleading content and lack transparent reasoning, especially in high-stakes contexts [2,3,4]. Moreover, LLM performance in real-world clinical environments still lags behind that of trained physicians. This underperformance is often attributed to issues with interpretability, inconsistent reasoning, the risk of biased or outdated responses, and ethical concerns related to privacy and accountability [2]. While LLMs perform well on simpler tasks, they often falter on complex, ambiguous clinical problems, particularly when they lack fine-tuned, domain-specific data [8,9]. Even when tested on licensing exams, LLMs showed performance highly dependent on prompt quality and pretraining exposure [10].

Critically, diagnostic reasoning requires more than fact retrieval. Forming a clinical diagnosis involves analyzing patient data, applying medical knowledge, and testing diagnostic hypotheses—a process known as clinical diagnostic reasoning [11].

Recent investigations into LLMs for use in clinical diagnostics show potential in narrow specialties such as pediatrics [12], pulmonology [13], and radiology [14], as well as in rare disorders [15] and self-diagnosis contexts [4]. These studies demonstrate LLMs’ capacity to interpret complex data and propose credible differential diagnoses, particularly for rare cases. Moreover, comparative studies highlighted that chain-of-thought (CoT) prompts significantly enhanced diagnostic reasoning by structuring inputs in a way that emulates expert thinking, whereas standard prompts often produced overly verbose outputs that hinder decision-making effectiveness [16]. In clinical decision-making contexts in fields such as oncology and cardiology, LLMs produced richer and more personalized treatment suggestions than did database-driven tools, although they occasionally proposed unverified or hallucinated interventions, highlighting the need for clinical oversight [17].

Attempts to enhance LLMs through knowledge graphs have shown promise, particularly in zero-shot settings (no prior examples given to the LLMs), although limitations remain. A notable limitation lies in their reduced capacity to accurately capture all clinically relevant concepts, especially those that are indirect, nuanced, or context-dependent, yet critical for comprehensive diagnostic reasoning [11]. Their performance in complex or multilingual clinical environments remains uneven: LLMs excel in English but show gaps in low-resource languages [18]. Studies on post-operative patient support indicate that LLMs, while occasionally imprecise, offer superior clarity, empathy, and conversational quality compared to rigid rule-based assistants, suggesting that a hybrid approach may offer the best of both systems [19].

Importantly, real-world clinical scenarios are inherently complex and go beyond what multiple-choice medical examinations capture. Therefore, a comprehensive evaluation of LLMs must consider not only their ability to generate medically accurate language, but also their capacity to retrieve knowledge, apply it to context-specific problems, and perform diagnostic reasoning [20].

Despite growing interest, a systematic approach to assessing diagnostic reasoning across multiple LLMs is still lacking. Most studies have not provided consistent or scalable scoring frameworks to capture diagnostic capabilities. The current tendency is to evaluate responses with metrics based on n-gram overlap (like BLEU or ROUGE), or to count the attempts until the “correct” diagnosis was delivered. Such approaches oversimplify the multifactorial nature of clinical decision making [21,22].
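To illustrate why word-overlap metrics can oversimplify clinical answers, the following is a minimal sketch of a unigram recall score in the spirit of ROUGE-1. It is not the official ROUGE or BLEU implementation; the function name `unigram_recall` and the two example sentences are hypothetical.

```python
from collections import Counter


def unigram_recall(candidate: str, reference: str) -> float:
    """Share of reference words that also appear in the candidate.

    Counts are clipped, so a word repeated in the candidate is credited
    at most as often as it occurs in the reference. An empty reference
    raises ZeroDivisionError.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())


# Hypothetical answers: the candidate drops the negation, which reverses
# the clinical meaning, yet three of the four reference words still match.
reference = "no evidence of pneumonia"
candidate = "evidence of pneumonia"
score = unigram_recall(candidate, reference)  # 0.75
```

A score of 0.75 for an answer with the opposite clinical meaning shows how overlap alone can miss the correctness of the reasoning.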

The use of LLMs for diagnosis and self-diagnosis has already become a reality in medical practice, and LLM capabilities are evolving at a very rapid pace. In light of the emerging interactions between physicians and AI, it has become necessary to assess the ability of current systems to respond appropriately to accurately presented medical cases. In this context, our study aimed to conduct a comparative analysis of four LLMs (in their free versions) by presenting them with a series of clinical cases and evaluating their responses as if they had been provided by medical professionals.

2. Materials and Methods

The LLMs were assessed using complex medical case presentations, followed by structured questions designed to assess two key capabilities: medical knowledge recall (such as identifying likely diagnoses and interpreting test results) and clinical reasoning (giving reasons for diagnosis choices and proposing the next steps in patient management). The methodological steps are presented in Figure 1.

The clinical cases used in this study were randomly selected from our university’s internal case database, which is routinely used in problem-based learning (PBL) for medical students. Each case was inspired by real clinical encounters and then carefully adapted by experienced medical educators to maximize educational value. These cases were structured using a standardized, staged format, where clinical information was disclosed progressively to simulate real-time diagnostic reasoning. After each stage, a series of questions specifically related to the newly introduced information was posed, assessing the student’s ability to integrate and apply context-specific data. Cases were selected based on their focus on diagnostic pathways, including primary diagnoses, complications, and adverse reactions, and were classified by difficulty level and medical specialty. This incremental and structured approach ensured consistency in our study, while enabling meaningful evaluation of both medical knowledge recall and applied clinical reasoning across different LLMs.

The available data included the patient’s current issue and symptoms, general context, medical history, laboratory results, medical imaging results, and follow-up results. Each case comprised an introductory overview and a staged disclosure (6 to 10 additional contextual elements, presented gradually). At each stage, both the understanding of essential medical concepts and the ability to apply them were assessed through progressive questioning (4 to 10 questions). The complexity of each case increased progressively, as contextual data were added incrementally along with more clinical information (Figure 2).

For the AI assessment, the following generative LLMs were chosen: the ChatGPT-4o model (OpenAI, San Francisco, CA, USA), the Grok 3 model (xAI, San Francisco, CA, USA), the Gemini 2.0 Flash model (Google DeepMind, London, UK), and the DeepSeek V3 model (DeepSeek, Hangzhou, China), using their free public interfaces. To ensure comparable responses, each model received identical prompts in English. The LLMs’ responses to the questions at each stage were expected to integrate all information provided up to that point, including both general and stage-specific contextual data. Specifically, the application-oriented questions required reference not only to the most recently introduced context, but also to the cumulative contextual information presented throughout the case.

This approach was chosen to evaluate the responses of LLMs in several potential use-case scenarios: first, a physician verifying the accuracy of their clinical reasoning by presenting the case to the AI system; second, a patient assessing whether their case has been appropriately managed by the medical team; and third, a potential future scenario in which AI systems may independently perform complex diagnostic reasoning.

2.1. Prompting Strategy

For the discussion of medical cases, each of the LLMs was asked to assume the role of a medical student, analyze the given clinical scenario, and respond to several questions. Next, the introductory overview was provided, and the questions related to it were delivered one at a time. New contextual data were then introduced incrementally, each followed by its related questions. All responses were capped at 50 words. This strategy was applied repeatedly, according to the number of stages specific to each case. Each question was pre-classified as either a test of subject-specific medical knowledge or a test of the ability to apply that knowledge in medical reasoning.
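The staged protocol described above can be sketched as an ordered message sequence. This is only an illustration under stated assumptions: the study used the models' free public chat interfaces rather than code, and the names `Stage`, `ROLE_INSTRUCTION`, `build_prompt_sequence`, and `within_limit`, as well as the exact wording of the instructions, are hypothetical.

```python
from dataclasses import dataclass, field

WORD_LIMIT = 50  # response cap stated in the study

# Hypothetical wording: the paper states the role, not the exact text.
ROLE_INSTRUCTION = (
    "Assume the role of a medical student. Analyze the clinical scenario "
    "and answer each question within that context."
)


@dataclass
class Stage:
    """One disclosure step: a block of context and the questions tied to it."""
    context: str
    questions: list[str] = field(default_factory=list)


def build_prompt_sequence(stages: list[Stage]) -> list[str]:
    """Return the ordered messages for one case.

    The first stage holds the introductory overview; later stages hold
    the incrementally disclosed context.
    """
    messages = [ROLE_INSTRUCTION]
    for stage in stages:
        messages.append(stage.context)
        # Questions are sent one at a time, each carrying the word cap.
        for question in stage.questions:
            messages.append(f"{question} Answer in at most {WORD_LIMIT} words.")
    return messages


def within_limit(response: str, limit: int = WORD_LIMIT) -> bool:
    """Check a response against the cap, counting whitespace-separated words."""
    return len(response.split()) <= limit
```

Sending the same sequence to every model keeps the prompts identical, and `within_limit` offers a simple way to flag responses that exceed the cap.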

2.2. Assessment of LLM Responses

Two medical expert evaluators (experienced in problem-based learning and previously involved in using such clinical cases with medical students), who were blinded to the source of each set of answers, independently scored the responses on a 0–5 scale. For each question, the LLM response was evaluated and the performance scores were awarded based on the following criteria:

Question comprehension—the ability to correctly interpret the clinical query’s intent and scope;

Medical knowledge on the subject—the depth and accuracy of factual medical information presented by LLMs;

Understanding the medical context—the ability to appropriately apply medical knowledge to the specific case details;

Correctness—the accuracy of the final diagnosis and management recommendations;

Clarity in formulating the answer—organization, coherence, and readability of the response.
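One way to record a rating under the five criteria listed above is a small validated data structure. This is a sketch only; the class name `Rating`, the field names, and the criterion keys are assumptions chosen for illustration, not the format used by the evaluators.

```python
from dataclasses import dataclass

# Keys are hypothetical short names for the five criteria.
CRITERIA = (
    "question_comprehension",
    "medical_knowledge",
    "medical_context",
    "correctness",
    "clarity",
)


@dataclass(frozen=True)
class Rating:
    """Scores for one model response to one question."""
    model: str               # e.g. "DS" for DeepSeek
    question_id: int
    question_type: str       # "knowledge" or "reasoning"
    scores: dict[str, int]   # criterion name -> score on the 0-5 scale

    def __post_init__(self) -> None:
        if self.question_type not in ("knowledge", "reasoning"):
            raise ValueError(f"unknown question type: {self.question_type}")
        if set(self.scores) != set(CRITERIA):
            raise ValueError("scores must cover exactly the five criteria")
        for name, value in self.scores.items():
            if not 0 <= value <= 5:
                raise ValueError(f"{name} score {value} is outside 0-5")
```

Validating at construction time keeps out-of-range or incomplete ratings from entering the later statistical comparison.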

2.3. Data Analysis

Descriptive statistics for qualitative data were reported as counts and/or percentages. Given the ordinal nature of the scoring system (0–5 scale), non-parametric statistical methods were employed. Although the score distributions were skewed toward higher values (indicating generally strong LLM performance), means and standard deviations (±SD) were also computed to enhance interpretability in the graphical representations.
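The descriptive summary described above can be sketched with the Python standard library. The scores below are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which variant was computed.

```python
from statistics import mean, median, stdev


def summarize(scores: list[int]) -> dict[str, float]:
    """Summarize 0-5 ordinal scores for one model and one criterion."""
    return {
        "mean": mean(scores),
        "sd": stdev(scores),  # sample SD; an assumption, see lead-in
        "median": median(scores),
        "share_4_or_5": sum(s >= 4 for s in scores) / len(scores),
    }


# Hypothetical scores, skewed toward the maximum.
example = [5, 5, 5, 5, 4, 5, 3, 5, 5, 4]
summary = summarize(example)
# The median sits at the ceiling (5), while the mean (4.6) still
# reflects the few lower scores.
```

This illustrates why a mean ± SD plot can remain informative when a ceiling effect compresses the median and quartiles.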

Overall differences in LLM performance scores were assessed using the Kruskal–Wallis test. For post hoc pairwise comparisons, the Dwass–Steel–Critchlow–Fligner method was applied. Statistical significance was defined as p < 0.05.
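For readers who want to see the mechanics of the omnibus test, the following is a minimal standard-library sketch of the tie-corrected Kruskal–Wallis H statistic. The authors ran their analysis in Jamovi; this code is not their implementation. The function names are hypothetical, the p-value helper covers only the four-group case (3 degrees of freedom) through a closed form of the chi-square upper tail, and the Dwass–Steel–Critchlow–Fligner post hoc step is not shown.

```python
import math
from collections import Counter
from itertools import chain


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups: list[list[float]]) -> float:
    """Tie-corrected H statistic.

    Raises ZeroDivisionError if every pooled value is identical.
    """
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - ties / (n ** 3 - n))


def chi2_sf_df3(x: float) -> float:
    """Upper-tail chi-square probability for 3 degrees of freedom."""
    x = max(x, 0.0)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
```

With four groups of scores, `chi2_sf_df3(kruskal_wallis_h(groups))` gives the large-sample p-value; for real analyses a tested statistics package is preferable to this sketch.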

Data analysis was performed using Microsoft Excel (Office Professional Plus 2021) and the open-source statistical software Jamovi (version 2.3.28) based on R language [23,24].

3. Results

A total of six complex medical cases were selected, encompassing a range of specialties, patient trajectories, and comorbidities. Each case included a series of questions designed to assess both medical knowledge (122 questions in total) and clinical reasoning capabilities (106 questions in total) (Figure 3). The number of questions per case varied between 28 and 51.

All 228 questions were presented in sequence, along with their corresponding clinical contexts, to the four LLMs described in the Materials and Methods section. The responses received were evaluated by two medical experts, who established a consensus score for each individual answer. Overall, all LLMs received high scores for their responses, with the proportion of scores of 4 or 5 (good or very good responses) exceeding 80%.

Due to the scoring system and the distribution of scores being skewed toward the maximum, the data did not follow a normal distribution. The Kolmogorov–Smirnov test yielded a p-value of less than 0.001. For the comparison of LLMs, a non-parametric statistical test was used; however, the graphical representation employed mean and standard deviation for better interpretability, as box plots were affected by the ceiling effect in score distribution.

The comparative analysis followed three main directions: an overall comparison of the scores obtained, a comparison of scores for medical knowledge questions, and a comparison of scores for clinical reasoning questions among the four LLMs.

3.1. Overall Comparison

The expert-assigned scores of all LLMs were compared using a Kruskal–Wallis test for each of the five response evaluation criteria, to establish whether there were any statistically significant differences in performance scores (Figure 4).

Since statistically significant differences were found for all five evaluation criteria according to the Kruskal–Wallis test (p < 0.001), pairwise comparisons were made using the Dwass–Steel–Critchlow–Fligner method (Table 1).

Across all five evaluation criteria, DeepSeek consistently outperformed each of the other models—ChatGPT, Grok, and Gemini—with statistically significant differences in every comparison. As shown in Figure 4, DeepSeek achieved the highest overall scores in each case.

3.2. Comparison for Medical Knowledge

The LLMs’ response data was then split according to each type of question. To comprehensively assess diagnostic capabilities, the mean performance score for responses to questions regarding medical knowledge was computed according to the five criteria for each LLM, and the results are represented graphically (Figure 5).

Given the statistically significant differences observed across all five key dimensions of diagnostic capabilities for medical knowledge questions (Kruskal–Wallis test, p < 0.001), pairwise comparisons were performed using the Dwass–Steel–Critchlow–Fligner method (Table 2).

When focusing on questions related to medical knowledge, DeepSeek scored significantly higher than ChatGPT, Grok, and Gemini across all five evaluation criteria. As illustrated in Figure 5, DeepSeek consistently achieved the highest scores in each comparison, highlighting its superior performance in this area.

3.3. Comparison for Medical Reasoning

Next, to assess diagnostic capabilities of LLMs, the mean performance score for responses to clinical reasoning questions was calculated based on the five criteria for each LLM, and the results are represented graphically (Figure 6).

Given the statistically significant differences observed across all five key dimensions of diagnostic capabilities regarding medical reasoning questions (Kruskal–Wallis test, p < 0.001), pairwise comparisons were conducted using the Dwass–Steel–Critchlow–Fligner method (Table 3).

When focusing on questions related to medical reasoning, DeepSeek scored significantly higher than ChatGPT, Grok, and Gemini for four of the five evaluation criteria (all except question comprehension, for which no pairwise difference was significant). As illustrated in Figure 6, DeepSeek consistently achieved the highest scores in each comparison, highlighting its superior performance in this area.

4. Discussion

All four LLMs were chosen based on their user popularity and performance. The six randomly chosen cases varied in pathology and medical complexity, while their related questions evaluated medical knowledge and reasoning capabilities in roughly balanced proportions (Figure 3). We conducted a comparative analysis of four leading generative LLMs, and the following key observations highlighted their performance in diagnosing medical cases.

For each of the five diagnostic evaluation criteria, significant differences were observed in the mean performance scores of the LLMs (Figure 4). Pairwise comparisons revealed that DeepSeek achieved statistically significantly higher scores than the other three LLMs (p < 0.05) for all five evaluation criteria, while no significant differences were observed among ChatGPT, Grok, and Gemini (Table 1).

Each LLM demonstrated higher mean scores for medical knowledge questions compared to medical reasoning questions (Figure 5 and Figure 6). This outcome was anticipated, as retrieving factual information is generally less complex for software models to perform than is applying knowledge to clinical scenarios.

Focusing solely on the medical knowledge questions, significant differences in mean performance scores were observed among the LLMs for each of the five evaluation criteria (Figure 5). Among these criteria, “Question comprehension” consistently yielded the highest mean scores for each of the four LLMs. Pairwise comparisons indicated that DeepSeek achieved significantly higher scores than the other three LLMs (p < 0.05) across all five evaluation criteria, whereas no significant differences were found among ChatGPT, Grok, and Gemini (Table 2).

Focusing specifically on medical reasoning questions, “Question comprehension” consistently produced the highest mean scores for all four LLMs, without statistically significant differences between them. In contrast, for the remaining four evaluation criteria, significant differences in mean performance scores were found (Figure 6). Pairwise comparisons indicated that DeepSeek achieved significantly higher scores than the other three LLMs (p < 0.05) across the remaining four evaluation criteria, whereas no significant differences were found among ChatGPT, Grok, and Gemini (Table 3).

Assessing the diagnostic performance of large language models (LLMs) across diverse clinical scenarios has so far been approached using a range of evaluation metrics, including precision, accuracy, recall, and F1-score, to quantify diagnostic effectiveness. Additionally, some studies used both qualitative and quantitative approaches to analyze LLM outputs: qualitative assessments focused on response quality, consistency, and structure, while quantitative metrics, such as top-1, top-3, and top-5 accuracy, evaluated the model’s ability to generate correct diagnoses within its leading predictions [22]. According to Shan G et al., many studies published between 2023 and 2025 evaluated the diagnostic accuracy of several LLMs and found that it varied from 25% to 97.8% when compared to that of medical professionals, with a high risk of bias in the majority of studies [25]. Our study, by contrast, focused on the capacity of LLMs to answer correctly, in order to check whether they could be safely used as AI assistants in student medical training on diagnostic reasoning. Our analysis showed mean scores ranging from 4.25 to 4.99 (on a 0–5 scale), which we consider good performance for all LLMs, with DeepSeek performing best.
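The top-k accuracy metrics mentioned above can be sketched as follows. The function name `top_k_accuracy` and the placeholder diagnosis labels are hypothetical, and matching by exact string equality is a simplifying assumption; published studies may match diagnoses by expert judgment or coding systems instead.

```python
def top_k_accuracy(
    ranked_predictions: list[list[str]], truths: list[str], k: int
) -> float:
    """Fraction of cases whose true diagnosis is among the first k predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("one ranked prediction list is needed per case")
    if not truths:
        raise ValueError("at least one case is required")
    hits = sum(
        truth in predictions[:k]
        for predictions, truth in zip(ranked_predictions, truths)
    )
    return hits / len(truths)


# Placeholder labels for three hypothetical cases whose true diagnosis is "dx_a".
predictions = [
    ["dx_a", "dx_b", "dx_c"],  # correct at rank 1
    ["dx_b", "dx_a", "dx_c"],  # correct at rank 2
    ["dx_c", "dx_b", "dx_a"],  # correct at rank 3
]
truths = ["dx_a", "dx_a", "dx_a"]
top1 = top_k_accuracy(predictions, truths, 1)
top3 = top_k_accuracy(predictions, truths, 3)
```

Here `top1` is one third and `top3` is 1.0, which shows how the same outputs can look weak or perfect depending on k, unlike the per-response criterion scoring used in this study.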

The differences observed between LLMs are both objective and potentially subjective. Lower scores were awarded when key factors in case evaluation were overlooked, with systems often being drawn to the first abnormalities presented in the context. This behavior is possibly a consequence of the learning paradigm adopted by the LLMs.

Subjectively, we observed consistent stylistic differences among the models, despite the fact that no prompting directives were provided by the research team. ChatGPT, Grok, and Gemini generally produced narrative-style responses, whereas DeepSeek adopted a more structured format, typically beginning with a brief introductory statement, followed by bullet-pointed subcomponents. Our prompts did not specify whether the expected response should focus on medical knowledge or clinical reasoning, yet DeepSeek was the only LLM that consistently responded to both knowledge-based and reasoning questions, not only with the expected knowledge but also by explicitly connecting the information to the specific case data. In this regard, DeepSeek adhered to the initial instruction in the context-setting prompt to answer within that context, whereas the other systems tended to ignore it, unless the questions explicitly referenced content from the given context. DeepSeek’s superior performance may be attributed to its consistent formatting, stronger prompt adherence, and better contextual integration; further qualitative analysis of response quality is warranted in future research.

As Rider et al. demonstrated through a structured evaluation of six state-of-the-art models using 25 real-world primary immunodeficiency cases, LLMs vary substantially in both diagnostic accuracy and clinical reasoning. While models like GPT-4o reached diagnostic accuracies as high as 96.2%, even the top-performing systems occasionally produced incorrect information or failed to integrate context effectively. These observations underscore the need for more robust evaluation frameworks that assess diagnostic competence beyond simple correctness metrics [15].

Recent progress in large language models (LLMs) has shown that they can help with medical diagnoses, but using structured reasoning methods like chain-of-thought (CoT) is still not common [16]. Some researchers like Liu et al. have used expert-guided CoT in kidney disease diagnosis, but it required training the model specifically for that area. This study takes a different approach by using CoT without extra training (zero-shot approach), which makes the model think more like a doctor and explain its reasoning more clearly [26]. Unlike other studies that focused on tasks like coding for sepsis or summarizing clinical notes, this method directly connects key medical signs—such as high blood pressure or kidney structure description—to possible diagnoses. This better matches how doctors actually work and makes the model’s decisions easier to understand and more useful in real-life situations [16].

When comparing our results with those of Hager P et al., we observed a substantial increase in diagnostic accuracy and response capability in our study. This improvement can be attributed to the use of prefabricated scenarios with high-quality educational context, as opposed to the presentation of real cases, which may include variability and potentially incomplete descriptions [27]. Moreover, the evaluation criteria used in Hager P et al. were extremely rigid, in contrast to the manual interpretation of responses based on five criteria, as implemented in our study [27]. Additionally, our evaluation was performed on numerous successive responses for each case, rather than relying on a single final diagnosis.

Some studies analyzed the use of LLMs specifically in medical education. These models can support patient triage, clinical decision making, and knowledge assessment; however, the accuracy does not always ensure the correct resolution of clinical cases [28]. These findings align with the same potential developmental trajectory as our present research—namely, the integration of LLMs into the training of future physicians—and they similarly conduct a response-level analysis of diagnostic interaction with LLMs [28]. It is also worth highlighting, in this same direction, the approach of Brügge et al., who have taken the next step in the use of LLMs for medical training by attempting to integrate ChatGPT into the education of future medical students [29]—an approach that our work will likely follow in the near future.

Future research should adopt comprehensive, multi-dimensional evaluation tools that capture not only the accuracy, but also the coherence, justification, and adaptability of reasoning. Additionally, systematic comparisons across multiple LLMs using real medical data are essential to gather information on their safe, ethical, and effective deployment in healthcare settings.

4.1. Strengths

Regarding the strengths of this study, we would first like to emphasize that the use of highly standardized educational cases, which lack the inherent variability of real-world scenarios, was an important choice for assessing the diagnostic capabilities of LLMs that were not trained on real clinical data. Second, the use of public LLM interfaces made our study easily reproducible and verifiable. Third, the complexity of the cases employed, comprising a total of 228 questions applied across 4 LLMs, generated 912 responses, each of which was scored using five criteria, thereby enhancing the statistical power of the analysis and enabling a robust comparison of the results.

4.2. Limitations

Regarding the limitations, we aim to highlight the most significant ones. First, the evaluation was conducted using only six complex clinical cases which, although diverse in terms of pathology and structure, may not fully capture the breadth of real-world clinical scenarios. Second, all cases were drawn from a single institutional database, which may introduce potential biases in medical decision-making approaches. Third, the cases were educational in nature and specifically designed to provide all necessary information; due to their idealized format, they are somewhat removed from real-world clinical complexity.

The testing environment, designed with a structured, staged disclosure of information and a 50-word limit on responses, does not fully replicate how generative language models are typically used in real-world settings, where longer and more dynamic interactions are possible. While this format ensured consistency, it may have constrained the depth of model reasoning and response formulation. In addition, differences in model architecture may have led to varied interpretations of the “medical student” prompt. Future work should explore prompt refinement and evaluate responses across different clinical roles (e.g., patient, consultant) for greater consistency.

Although each model was prompted identically, prompting itself remains a source of potential bias, as different LLM architectures may interpret the same instructions differently. The use of the “medical student” role may not standardize behavior across models, limiting the comparability of their performance.

The evaluation of responses by two medical experts with experience in problem-based learning could also be a source of error, as subjective scoring may have influenced the results.

Finally, the findings reflect the performance of specific LLM versions (GPT-4o, Grok 3, Gemini 2.0 Flash, and DeepSeek V3) at the time of the study. Future updates or changes to model architectures may lead to different outcomes, affecting the generalizability and reproducibility of our findings.

5. Conclusions

The comparative analysis of the evaluated LLMs demonstrated that current large language models are capable of generating highly accurate diagnostic responses when presented with well-structured, educationally optimized clinical cases. Among the four models tested, DeepSeek consistently outperformed the others, achieving significantly higher scores across all evaluation criteria. It was particularly distinguished by its superior performance on questions assessing both medical knowledge and applied clinical reasoning, although the greatest differences were observed for knowledge-based tasks. While GPT-4o, Grok, and Gemini showed comparable performance levels, DeepSeek’s accuracy and contextual integration set it apart, suggesting its greater potential for supporting medical training and diagnostic decision making in controlled educational settings.

Author Contributions

Conceptualization, A.E.U.-C. and T.D.; methodology, A.E.U.-C. and D.-C.L.; validation, D.-C.L. and T.C.; formal analysis, A.E.U.-C.; investigation, C.D. and A.-G.D.; resources, T.C.; data curation, A.E.U.-C. and T.D.; writing—original draft preparation, A.E.U.-C.; writing—review and editing, D.-C.L., C.D., A.-G.D., T.C. and T.D.; visualization, A.E.U.-C.; supervision, T.D.; project administration, A.E.U.-C.; funding acquisition, A.E.U.-C. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets supporting the findings of this study are not publicly available due to institutional restrictions. The dataset used is the property of Iuliu Hațieganu University of Medicine and Pharmacy, and access is governed by internal policies and ethical guidelines.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 Flowchart of the methodological steps used in the study.

Figure 2 Conceptual clinical case stage/context complexity and structure.

Figure 3 Question type distribution for each clinical case.

Figure 4 LLMs comparison of mean performance scores (±SD) according to the five criteria used to assess diagnostic capabilities (CG = Chat GPT, GK = Grok, GE = Gemini, and DS = DeepSeek).

Figure 5 LLMs comparison of mean performance scores (±SD) for questions regarding medical knowledge, according to the five criteria (CG = Chat GPT, GK = Grok, GE = Gemini, and DS = DeepSeek).

Figure 6 LLMs comparison of mean performance scores (±SD) for questions regarding medical reasoning, according to the five criteria (CG = Chat GPT, GK = Grok, GE = Gemini, and DS = DeepSeek).

Table 1

Pairwise comparisons of LLMs performance scores according to the five criteria to assess diagnostic capabilities.

LLM Pair	Understanding the Question (p-Value *)	Medical Knowledge of the Subject (p-Value *)	Understanding the Medical Context (p-Value *)	Correctness (p-Value *)	Clarity in Formulating the Answer (p-Value *)
CG-DS	<0.001	<0.001	<0.001	<0.001	<0.001
CG-GE	0.991	0.688	0.858	0.893	0.872
CG-GK	0.955	0.923	0.973	0.973	0.959
DS-GE	<0.001	<0.001	<0.001	<0.001	<0.001
DS-GK	0.002	<0.001	<0.001	<0.001	<0.001
GE-GK	0.996	0.965	0.984	0.992	0.994

* Pairwise comparisons, bold values were statistically significant; CG = Chat GPT, GK = Grok, GE = Gemini, and DS = DeepSeek.

Table 2

Pairwise comparisons of LLM performance scores for questions regarding medical knowledge, according to the five criteria used to assess diagnostic capabilities.

LLM Pair	Understanding the Question (p-Value *)	Medical Knowledge of the Subject (p-Value *)	Understanding the Medical Context (p-Value *)	Correctness (p-Value *)	Clarity in Formulating the Answer (p-Value *)
CG-DS	0.003	<0.001	<0.001	<0.001	<0.001
CG-GE	0.996	0.893	0.893	0.975	0.945
CG-GK	0.969	0.987	0.981	0.957	0.985
DS-GE	0.002	<0.001	<0.001	<0.001	<0.001
DS-GK	0.009	<0.001	<0.001	<0.001	<0.001
GE-GK	0.910	0.716	0.683	0.785	0.800

* Pairwise comparisons; values shown in bold in the original (p < 0.05) were statistically significant; CG = ChatGPT, GK = Grok, GE = Gemini, and DS = DeepSeek.

Table 3

Pairwise comparisons of LLM performance scores for questions regarding medical reasoning, according to the five criteria used to assess diagnostic capabilities.

LLM Pair	Understanding the Question (p-Value *)	Medical Knowledge of the Subject (p-Value *)	Understanding the Medical Context (p-Value *)	Correctness (p-Value *)	Clarity in Formulating the Answer (p-Value *)
CG-DS	0.124	<0.001	<0.001	<0.001	<0.001
CG-GE	0.868	0.838	0.978	0.942	0.957
CG-GK	0.993	0.617	0.747	0.695	0.721
DS-GE	0.455	<0.001	<0.001	<0.001	<0.001
DS-GK	0.202	<0.001	<0.001	<0.001	<0.001
GE-GK	0.957	0.977	0.926	0.950	0.946

* Pairwise comparisons; values shown in bold in the original (p < 0.05) were statistically significant; CG = ChatGPT, GK = Grok, GE = Gemini, and DS = DeepSeek.

References

1. Weidener, L.; Fischer, M. Artificial intelligence in medicine: Cross-sectional study among medical students on application, education, and ethical aspects. JMIR Med. Educ.; 2024; 10, e51247. [DOI: https://dx.doi.org/10.2196/51247] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38180787]

2. Chow, J.C.L.; Li, K. Developing effective frameworks for large language model–based medical chatbots: Insights from radiotherapy education with ChatGPT. JMIR Cancer; 2025; 11, e66633. [DOI: https://dx.doi.org/10.2196/66633] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39965195]

3. Qiu, J.; Li, L.; Sun, J.; Peng, J.; Shi, P.; Zhang, R.; Dong, Y.; Lam, K.; Lo, F.P.W.; Xiao, B.; et al. Large AI models in health informatics: Applications, challenges, and the future. IEEE J. Biomed. Health Inform.; 2023; 27, pp. 6074-6089. [DOI: https://dx.doi.org/10.1109/JBHI.2023.3316750] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37738186]

4. Zhuang, S.; Zeng, Y.; Lin, S.; Chen, X.; Xin, Y.; Li, H.; Lin, Y.; Zhang, C.; Lin, Y. Evaluation of the ability of large language models to self-diagnose oral diseases. iScience; 2024; 27, 111495. [DOI: https://dx.doi.org/10.1016/j.isci.2024.111495]

5. Wang, S.; Ouyang, Q.; Wang, B. Comparative Evaluation of Commercial Large Language Models on PromptBench: An English and Chinese Perspective. Res. Sq.; 2024; preprint [DOI: https://dx.doi.org/10.21203/rs.3.rs-3987793/v1]

6. Wang, C.; Li, S.; Lin, N.; Zhang, X.; Han, Y.; Wang, X.; Liu, D.; Tan, X.; Pu, D.; Li, K.; et al. Application of large language models in medical training evaluation—Using ChatGPT as a standardized patient: Multimetric assessment. J. Med. Internet Res.; 2025; 27, e59435. [DOI: https://dx.doi.org/10.2196/59435]

7. Bewersdorff, A.; Seßler, K.; Baur, A.; Kasneci, E.; Nerdel, C. Assessing Student Errors in Experimentation Using Artificial Intelligence and Large Language Models: A Comparative Study with Human Raters. Comput. Educ. Artif. Intell.; 2023; 5, 100177. [DOI: https://dx.doi.org/10.1016/j.caeai.2023.100177]

8. Liu, Q.; Yang, R.; Gao, Q.; Liang, T.; Wang, X.; Li, S. A review of applying large language models in healthcare. IEEE Access; 2025; 13, pp. 6878-6892. [DOI: https://dx.doi.org/10.1109/ACCESS.2024.3524588]

9. Zada, T.; Tam, N.; Barnard, F.; Van Sittert, M.; Bhat, V.; Rambhatla, S. Medical misinformation in AI-assisted self-diagnosis: Development of a method (EvalPrompt) for analyzing large language models. JMIR Form. Res.; 2025; 9, e66207. [DOI: https://dx.doi.org/10.2196/66207]

10. Liu, M.; Okuhara, T.; Dai, Z.; Huang, W.; Gu, L.; Okada, H.; Furukawa, E.; Kiuchi, T. Evaluating the effectiveness of advanced large language models in medical knowledge: A comparative study using Japanese national medical examination. Int. J. Med. Inform.; 2025; 193, 105673. [DOI: https://dx.doi.org/10.1016/j.ijmedinf.2024.105673]

11. Gao, Y.; Li, R.; Croxford, E.; Caskey, J.; Patterson, B.W.; Churpek, M.; Miller, T.; Dligach, D.; Afshar, M. Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study. JMIR AI; 2025; 4, e58670. [DOI: https://dx.doi.org/10.2196/58670] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39993309]

12. Mansoor, M.; Ibrahim, A.F.; Grindem, D.; Baig, A. Large language models for pediatric differential diagnoses in rural health care: Multicenter retrospective cohort study comparing GPT-3 with pediatrician performance. JMIRx Med; 2025; 6, e65263. [DOI: https://dx.doi.org/10.2196/65263] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/40106452]

13. Song, M.; Wang, J.; Yu, Z.; Wang, J.; Yang, L.; Lu, Y.; Li, B.; Wang, X.; Wang, X.; Huang, Q.; et al. PneumoLLM: Harnessing the power of large language model for pneumoconiosis diagnosis. Med. Image Anal.; 2024; 97, 103248. [DOI: https://dx.doi.org/10.1016/j.media.2024.103248] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38941859]

14. Mahyoub, M.; Dougherty, K.; Shukla, A. Extracting pulmonary embolism diagnoses from radiology impressions using GPT-4o: Large language model evaluation study. JMIR Med. Inform.; 2025; 13, e67706. [DOI: https://dx.doi.org/10.2196/67706]

15. Rider, N.L.; Li, Y.; Chin, A.T.; DiGiacomo, D.V.; Dutmer, C.; Farmer, J.R.; Roberts, K.; Savova, G.; Ong, M.S. Evaluating large language model performance to support the diagnosis and management of patients with primary immune disorders. J. Allergy Clin. Immunol.; 2025; in press [DOI: https://dx.doi.org/10.1016/j.jaci.2025.02.004]

16. Zhang, W.; Wu, M.; Zhou, L.; Shao, M.; Wang, C.; Wang, Y. A sepsis diagnosis method based on Chain-of-Thought reasoning using large language models. Biocybern. Biomed. Eng.; 2025; 45, pp. 269-277. [DOI: https://dx.doi.org/10.1016/j.bbe.2025.04.002]

17. Wilhelm, T.I.; Roos, J.; Kaczmarczyk, R. Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study. J. Med. Internet Res.; 2023; 25, e49324. [DOI: https://dx.doi.org/10.2196/49324]

18. Shan, X.; Xu, Y.; Wang, Y.; Lin, Y.; Bao, Y. Cross-Cultural Implications of Large Language Models: An Extended Comparative Analysis. HCI International 2024—Late Breaking Papers; Coman, A.; Vasilache, S.; Fui-Hoon Nah, F.; Siau, K.L.; Wei, J.; Margetis, G., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15375, pp. 106-118. [DOI: https://dx.doi.org/10.1007/978-3-031-76806-4_8]

19. Borna, S.; Gomez-Cabello, C.A.; Pressman, S.M.; Haider, S.A.; Sehgal, A.; Leibovich, B.C.; Cole, D.; Forte, A.J. Comparative Analysis of Artificial Intelligence Virtual Assistant and Large Language Models in Post-Operative Care. Eur. J. Investig. Health Psychol. Educ.; 2024; 14, pp. 1413-1424. [DOI: https://dx.doi.org/10.3390/ejihpe14050093]

20. Wu, C.; Qiu, P.; Liu, J.; Gu, H.; Li, N.; Zhang, Y.; Wang, Y.; Xie, W. Towards evaluating and building versatile large language models for medicine. NPJ Digit. Med.; 2025; 8, 58. [DOI: https://dx.doi.org/10.1038/s41746-024-01390-4]

21. Li, S.; Tan, W.; Zhang, C.; Li, J.; Ren, H.; Guo, Y.; Jia, J.; Liu, Y.; Pan, X.; Guo, J.; et al. Taming large language models to implement diagnosis and evaluating the generation of LLMs at the semantic similarity level in acupuncture and moxibustion. Expert Syst. Appl.; 2025; 264, 125920. [DOI: https://dx.doi.org/10.1016/j.eswa.2024.125920]

22. Almubark, I. Exploring the impact of large language models on disease diagnosis. IEEE Access; 2025; 13, pp. 8225-8238. [DOI: https://dx.doi.org/10.1109/ACCESS.2025.3527025]

23. Jamovi; version 2.3; the jamovi project. [Computer Software] Sydney, Australia, 2022; Available online: https://www.jamovi.org (accessed on 2 May 2025).

24. R Core Team. R: A Language and Environment for Statistical Computing; version 4.1; [Computer software] R Core Team: Vienna, Austria, 2021; Available online: https://cran.r-project.org (accessed on 2 May 2025).

25. Shan, G.; Chen, X.; Wang, C.; Liu, L.; Gu, Y.; Jiang, H.; Shi, T. Comparing diagnostic accuracy of clinical professionals and large language models: Systematic review and meta-analysis. JMIR Med. Inform.; 2025; 13, e64963. [DOI: https://dx.doi.org/10.2196/64963] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/40279517]

26. Liu, J.; Wang, Y.; Du, J.; Zhou, J.T.; Liu, Z. MedCoT: Medical chain of thought via hierarchical expert. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Miami, FL, USA, 12–16 November 2024; [DOI: https://dx.doi.org/10.18653/v1/2024.emnlp-main.962]

27. Hager, P.; Jungmann, F.; Holland, R.; Bhagat, K.; Hubrecht, I.; Knauer, M.; Vielhauer, J.; Makowski, M.; Braren, R.; Kaissis, G.; et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med.; 2024; 30, pp. 2613-2622. [DOI: https://dx.doi.org/10.1038/s41591-024-03097-1] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38965432]

28. Safranek, C.W.; Sidamon-Eristoff, A.E.; Gilson, A.; Chartash, D. The Role of Large Language Models in Medical Education: Applications and Implications. JMIR Med. Educ.; 2023; 9, e50945. [DOI: https://dx.doi.org/10.2196/50945] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37578830]

29. Brügge, E.; Ricchizzi, S.; Arenbeck, M.; Keller, M.N.; Schur, L.; Stummer, W.; Holling, M.; Lu, M.H.; Darici, D. Large language models improve clinical decision making of medical students through patient simulation and structured feedback: A randomized controlled trial. BMC Med. Educ.; 2024; 24, 1391. [DOI: https://dx.doi.org/10.1186/s12909-024-06399-7]

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Background: In recent years, numerous artificial intelligence applications, especially generative large language models (LLMs), have emerged in the medical field. This study conducted a structured comparative analysis of four leading generative LLMs, namely ChatGPT-4o (OpenAI), Grok-3 (xAI), Gemini-2.0 Flash (Google), and DeepSeek-V3 (DeepSeek), to evaluate their diagnostic performance in clinical case scenarios. Methods: We assessed medical knowledge recall and clinical reasoning capabilities through staged, progressively complex cases, with responses graded by expert raters on a 0–5 scale. Results: All models performed better on knowledge-based questions than on reasoning tasks, highlighting ongoing limitations in contextual diagnostic synthesis. Overall, DeepSeek outperformed the other models, achieving significantly higher scores across all evaluation dimensions (p < 0.05), particularly on medical reasoning tasks. Conclusions: While these findings support the feasibility of using LLMs for medical training and decision support, the study emphasizes the need for improved interpretability, prompt optimization, and rigorous benchmarking to ensure clinical reliability. This structured, comparative approach contributes to ongoing efforts to establish standardized evaluation frameworks for integrating LLMs into diagnostic workflows.

Details

Title

Assessing the Accuracy of Diagnostic Capabilities of Large Language Models

Author

Urda-Cîmpean Andrada Elena¹; Daniel-Corneliu, Leucuța¹; Drugan, Cristina²; Alina-Gabriela, Duțu²; Tudor, Călinici¹; Drugan Tudor¹

¹ Department of Medical Informatics and Biostatistics, Iuliu Hațieganu University of Medicine and Pharmacy, 400349 Cluj-Napoca, Romania
² Department of Medical Biochemistry, Iuliu Hațieganu University of Medicine and Pharmacy, 400349 Cluj-Napoca, Romania

First page

1657

Publication year

2025

Publication date

2025

Publisher

MDPI AG

e-ISSN

2075-4418

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/diagnostics15131657

ProQuest document ID

3229142438
