Background
Large language models (LLMs) such as GPT-4o have the potential to transform clinical decision-making, patient education, and medical research. Despite impressive performance in generating patient-friendly educational materials and assisting in clinical documentation, concerns remain regarding the reliability, subtle errors, and biases that can undermine their use in high-stakes medical settings.
Methods
A multi-phase experimental design was employed to assess the performance of GPT-4o on the Chilean anesthesiology exam (CONACEM), which comprised 183 questions covering four cognitive domains—Understanding, Recall, Application, and Analysis—based on Bloom’s taxonomy. Thirty independent simulation runs were conducted with systematic variation of the model’s temperature parameter to gauge the balance between deterministic and creative responses. The generated responses underwent qualitative error analysis using a refined taxonomy that categorized errors such as “Unsupported Medical Claim,” “Hallucination of Information,” “Sticking with Wrong Diagnosis,” “Non-medical Factual Error,” “Incorrect Understanding of Task,” “Reasonable Response,” “Ignore Missing Information,” and “Incorrect or Vague Conclusion.” Two board-certified anesthesiologists performed independent annotations, with disagreements resolved by a third expert. Statistical evaluations—including one-way ANOVA, non-parametric tests, chi-square, and linear mixed-effects modeling—were used to compare performance across domains and analyze error frequency.
Results
GPT-4o achieved an overall accuracy of 83.69%. Performance varied significantly by cognitive domain, with the highest accuracy observed in the Understanding (90.10%) and Recall (84.38%) domains, and lower accuracy in Application (76.83%) and Analysis (76.54%). Among the 120 incorrect responses, unsupported medical claims were the most common error (40.69%), followed by vague or incorrect conclusions (22.07%). Co-occurrence analyses revealed that unsupported claims often appeared alongside imprecise conclusions, highlighting a trend of compounded errors particularly in tasks requiring complex reasoning. Inter-rater reliability for error annotation was robust, with a mean Cohen’s kappa of 0.73.
Conclusions
While GPT-4o exhibits strengths in factual recall and comprehension, its limitations in handling higher-order reasoning and diagnostic judgment are evident through frequent unsupported medical claims and vague conclusions. These findings underscore the need for improved domain-specific fine-tuning, enhanced error mitigation strategies, and integrated knowledge verification mechanisms prior to clinical deployment.
Background
The advent of large language models (LLMs) marks a transformative era in healthcare by offering novel means to support clinical decision-making, patient education, and medical research. Their ability to generate patient-friendly content and assist with complex documentation has led to their rapid adoption in various medical applications, despite concerns about reliability and the risk of misinformation [1,2,3,4].
Background literature indicates that LLMs such as OpenAI’s GPT series and other contemporary models (e.g., PaLM, Bard) have demonstrated impressive proficiency in standardized examinations and clinical tasks [1, 5]. Recent evaluations have also revealed that these models can help bridge gaps in medical expertise, although their performance varies based on the cognitive demands of the task [5, 6]. However, the scattered nature of previous studies across different countries and languages calls for a more centralized and systematic assessment.
The rationale for this study stems from the need to address these inconsistencies by establishing a rigorous evaluation framework that not only incorporates traditional performance metrics but also scrutinizes error patterns such as unsupported medical claims and vague conclusions [3, 7]. Moreover, integrating diverse data from global medical exams can enhance our understanding of both the strengths and limitations of current LLMs, ultimately guiding improvements in model reliability and safety for clinical applications [4].
This study aims to evaluate the performance of GPT-4o on the Chilean anesthesiology examination (CONACEM) while conducting a detailed error analysis using a refined taxonomy. Specifically, our objectives are to (1) measure overall model accuracy and assess performance variation across cognitive domains; (2) identify and categorize recurrent error types; and (3) derive insights to inform improvements in LLM deployment in high-stakes clinical settings. By benchmarking these capabilities within a real-world exam context, we intend to advance the safe integration of LLMs in healthcare [6, 7].
Methods
Ethical considerations
This study utilized publicly available exam content and de-identified human performance data. As no human subjects were directly involved and the data were anonymized, institutional review board approval and informed consent were not required. This manuscript adheres to the appropriate EQUATOR [8] guidelines. No clinical trial registration was applicable for this study.
Study design
Our investigation utilized a multi-phase experimental design that focused on both quantitative performance metrics and qualitative error patterns across various cognitive domains. The study included two main phases: a systematic evaluation of model performance through multiple controlled simulations and a thorough analysis of error patterns via expert annotation of incorrect responses.
Dataset characteristics and preparation
The study corpus consisted of 183 questions from the CONACEM anesthesiology exam, covering essential areas such as pediatric anesthesia, clinical pharmacology, cardiovascular and respiratory physiology, advanced life support, perioperative monitoring, critical care management, regional anesthesia techniques, and pain management protocols. Following Bloom’s taxonomy [9], the questions were organized into four cognitive domains: Understand (n = 68), Recall (n = 48), Apply (n = 40), and Analyze (n = 27).
To mitigate potential position bias [10], we implemented a systematic randomization protocol for answer choices while preserving the necessary logical sequence.
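For illustration, a minimal sketch of such an answer-choice randomization step is shown below. It assumes each question is stored as a dictionary with an "options" list and the index of the correct option; these field names, and the handling of position-dependent options (e.g., "all of the above"), are assumptions rather than the study's actual implementation.

```python
import random

def shuffle_options(question: dict, seed: int | None = None) -> dict:
    """Randomize answer-choice order while tracking the correct option.

    Assumes a question dict with an 'options' list and 'answer_idx'
    (index of the correct option); field names are illustrative.
    Options whose meaning depends on position (e.g., "all of the above")
    would need to be pinned in place, which is omitted here.
    """
    rng = random.Random(seed)
    order = list(range(len(question["options"])))
    rng.shuffle(order)
    shuffled = [question["options"][i] for i in order]
    return {
        **question,
        "options": shuffled,
        "answer_idx": order.index(question["answer_idx"]),
    }
```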
Model evaluation protocol
We conducted 30 independent simulation runs with GPT-4o [11], systematically varying the model’s temperature parameter across runs. The rationale for this temperature exploration was based on the hypothesis that different cognitive domains might benefit from varying levels of response determinism. We theorized that higher temperatures (approaching 1.0) could potentially enhance performance on complex analytical tasks by introducing greater variability and creative reasoning approaches, while lower temperatures (approaching 0.0) might optimize factual recall and straightforward knowledge retrieval through more deterministic responses. This systematic parameter variation enabled us to evaluate whether optimal temperature settings are domain-specific and to identify potential performance trade-offs between consistency and creativity in clinical reasoning tasks.
The detailed temperature variation provided insights into the balance between response determinism and diversity across different cognitive demands. A temperature of 0 indicates maximum determinism with consistent responses, while a temperature of 1 represents maximum creativity and diverse outputs. The intermediate values allowed us to map the performance landscape across this determinism-creativity spectrum and identify optimal parameter ranges for different types of clinical questions.
The evaluation utilized a zero-shot prompting approach [12] without any additional fine-tuning or domain adaptation. This method was specifically selected to simulate realistic scenarios where the model is assessed without prior domain-specific adaptation, reflecting conditions akin to its potential use in uncontrolled clinical environments. The zero-shot protocol involved presenting questions directly without exemplars or domain-specific context, ensuring a standardized prompt structure for all questions.
Response generation protocol
Each question was processed through the model’s API using a standardized prompt template. The template asked for the selection of the most appropriate answer choice, along with a detailed justification, confidence assessment, and identification of key factors in the decision-making process. This structured approach generated a total of 5,490 responses, creating a robust dataset for further analysis.
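A minimal sketch of such a generation loop is shown below, using the OpenAI Python client. The prompt wording, the evenly spaced temperature grid, and the question schema are hypothetical placeholders: the study's exact template and run configuration are not reproduced here.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt template; the study's exact wording is not reproduced here.
PROMPT_TEMPLATE = (
    "You are answering a board-style anesthesiology question.\n\n"
    "{stem}\n\nOptions:\n{options}\n\n"
    "Select the single most appropriate option, then provide a brief justification, "
    "a confidence assessment, and the key factors behind your choice."
)

def run_simulations(questions: list[dict], n_runs: int = 30) -> list[dict]:
    """Zero-shot evaluation: one pass over all questions per run, with the
    temperature varied systematically across runs (assumed evenly spaced grid)."""
    temperatures = np.linspace(0.0, 1.0, n_runs)
    results = []
    for run_id, temp in enumerate(temperatures):
        for q in questions:
            prompt = PROMPT_TEMPLATE.format(
                stem=q["stem"],
                options="\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(q["options"])),
            )
            resp = client.chat.completions.create(
                model="gpt-4o",
                temperature=float(temp),
                messages=[{"role": "user", "content": prompt}],  # zero-shot: no exemplars
            )
            results.append({
                "run": run_id,
                "temperature": float(temp),
                "question_id": q["id"],
                "raw_response": resp.choices[0].message.content,
            })
    return results
```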
Error analysis framework
Our error analysis framework was designed to characterize GPT-4o’s failure modes using a multi-dimensional taxonomy, incorporating both knowledge-based and reasoning-based perspectives. This refined taxonomy draws from prior work on LLMs in clinical contexts [13] and was adapted to the specific cognitive demands of the CONACEM anesthesiology exam.
The framework defines a total of 7 distinct error types and 2 non-error types, grouped into four major classes:
* Reasoning-based errors (e.g., persistence in an incorrect diagnosis, incorrect or vague conclusion);
* Knowledge-based errors (e.g., unsupported medical assertions, non-medical factual errors);
* Comprehension-related errors (e.g., misunderstanding the task, information hallucination); and
* Non-errors, which include cases where the model provided a reasonable but suboptimal answer or the response could not be categorized.
Each category captures a specific failure pattern, such as when the model adheres to an incorrect answer despite correct reasoning, misrepresents clinical facts, or invents information not present in the original question. The classification system was developed and validated by anesthesiology experts to ensure clinical relevance.
From the 1,370 incorrect responses generated across all simulations, we selected 120 for detailed annotation using stratified random sampling. This sample size achieved a minimum detectable effect size of 0.258, enabling detection of 15% differences between domains with 80% statistical power. The stratification ensured proportional representation across cognitive domains (Understand: n = 38, Recall: n = 27, Apply: n = 31, Analyze: n = 24) and temperature settings.
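The stratified sampling step could look roughly like the sketch below, assuming the pool of incorrect responses is held in a pandas DataFrame with a "domain" column; the additional stratification by temperature setting mentioned above is omitted for brevity.

```python
import pandas as pd

def stratified_sample(errors: pd.DataFrame, strata_sizes: dict, seed: int = 42) -> pd.DataFrame:
    """Draw a fixed number of incorrect responses per cognitive domain.

    `errors` is assumed to have a 'domain' column; `strata_sizes` maps a
    domain name to the number of responses to annotate.
    """
    parts = [
        errors[errors["domain"] == domain].sample(n=n, random_state=seed)
        for domain, n in strata_sizes.items()
    ]
    return pd.concat(parts).reset_index(drop=True)

# Reported strata sizes from this study:
# sample = stratified_sample(all_errors,
#                            {"Understand": 38, "Recall": 27, "Apply": 31, "Analyze": 24})
```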
All annotations were performed by trained domain experts using standardized criteria. The full list of error types and example annotations are available in the supplementary materials (see Supplementary File: Annotated Error Taxonomy and Supplementary Figures S1–S2), which illustrate how each label was applied and provide visual highlights of key failure patterns.
Annotation protocol
The annotation process involved two board-certified anesthesiologists, each with over five years of clinical experience, serving as primary annotators. A third expert anesthesiologist, with additional training in medical education, served as the adjudicator in cases of disagreement.
Prior to formal annotation, the team underwent a structured training and calibration phase using a pilot subset of 15 questions. This phase included guided discussions to refine the operational definitions of each error category and resolve edge cases, resulting in the creation of a detailed annotation guideline.
Annotations were conducted using the Labelbox [14] platform, following a structured protocol that included individual review, quality control checks, and adjudication. For each sampled error, annotators reviewed the original question, GPT-4o’s response and justification, and applied the appropriate error category (or categories) from the established taxonomy. Regular calibration sessions were held throughout the annotation period to maintain consistency, and inter-annotator agreement was periodically evaluated to ensure reliability.
Statistical analysis
Our statistical analysis employed a comprehensive framework to evaluate model performance and error patterns across cognitive domains. We implemented a significance threshold of α = 0.05 with Holm correction for multiple comparisons to control family-wise error rate. Performance differences across cognitive domains were examined using one-way ANOVA followed by Tukey’s HSD post-hoc tests for pairwise comparisons. Given marginal violations of normality assumptions detected through Shapiro-Wilk testing, we supplemented parametric analyses with non-parametric Kruskal-Wallis tests and Dunn-Holm post-hoc comparisons to ensure robustness of findings.
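Using the libraries listed later in this section, the domain comparison can be reproduced roughly as follows. The sketch assumes a long-format table with one per-run accuracy value per domain and illustrative column names; it is not the authors' exact analysis code.

```python
import pandas as pd
import scikit_posthocs as sp
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_domains(acc: pd.DataFrame):
    """Compare per-run accuracy across cognitive domains.

    `acc` is assumed to be long-format with columns 'domain' and 'accuracy'
    (one row per simulation run and domain).
    """
    groups = [g["accuracy"].to_numpy() for _, g in acc.groupby("domain")]

    # Parametric route: one-way ANOVA with Tukey HSD pairwise comparisons.
    anova = stats.f_oneway(*groups)
    tukey = pairwise_tukeyhsd(acc["accuracy"], acc["domain"])

    # Normality check (shown on pooled values for brevity).
    shapiro = stats.shapiro(acc["accuracy"])

    # Non-parametric robustness check: Kruskal-Wallis with Dunn-Holm post hoc tests.
    kruskal = stats.kruskal(*groups)
    dunn = sp.posthoc_dunn(acc, val_col="accuracy", group_col="domain", p_adjust="holm")

    return anova, tukey, shapiro, kruskal, dunn
```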
Effect sizes were comprehensively reported to quantify the magnitude of observed differences. For ANOVA analyses, we calculated η² for overall effect size, ω² for population-level estimates, and Cohen’s f for standardized effects. Chi-square analyses were accompanied by Cramér’s V to assess the strength of categorical associations. Linear mixed-effects models were employed to account for the hierarchical structure of our data, treating cognitive domain as a fixed effect and simulation run as a random intercept, thereby modeling intra-run correlations and providing more precise estimates of domain-specific effects.
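The mixed-effects specification described here corresponds roughly to the following sketch using statsmodels' random-intercept linear model; the study reports using pingouin for this analysis, so this is an equivalent illustration rather than the authors' code, and the column names are assumptions.

```python
import statsmodels.formula.api as smf

def fit_domain_mixed_model(acc):
    """Random-intercept linear model: cognitive domain as a fixed effect and
    simulation run as a random intercept. `acc` is assumed to be long-format
    with columns 'accuracy', 'domain', and 'run'."""
    model = smf.mixedlm("accuracy ~ C(domain)", data=acc, groups=acc["run"])
    return model.fit()
```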
Error pattern analysis utilized Phi coefficients to quantify binary co-occurrence relationships between error types, with significance testing adjusted for multiple comparisons. For categorical error distributions with sparse cells, we implemented chi-square tests with Monte Carlo simulation (10,000 iterations) to generate empirical p-values, ensuring valid inference even with low expected frequencies. Temperature parameter effects were evaluated using repeated measures ANOVA with Greenhouse-Geisser correction for sphericity violations, analyzing both global performance trends and domain-specific temperature-accuracy relationships.
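Both quantities can be computed with short helpers like those below. The phi coefficient for two binary indicators equals the Pearson correlation of the 0/1 vectors, and the Monte Carlo chi-square shown here is a simplified simulation under independence with the observed margins (all assumed nonzero); the study's exact resampling scheme is not specified, so this is an illustrative variant.

```python
import numpy as np
from scipy import stats

def phi_with_p(x, y):
    """Phi coefficient for two binary error indicators
    (equivalent to the Pearson correlation of 0/1 vectors), with its p-value."""
    return stats.pearsonr(np.asarray(x, float), np.asarray(y, float))

def monte_carlo_chi2(observed, n_sim=10_000, seed=0):
    """Empirical p-value for a contingency table with sparse cells.

    Simulates tables under independence (multinomial draws with the null cell
    probabilities implied by the observed margins) and compares each simulated
    chi-square statistic against the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2_obs = ((observed - expected) ** 2 / expected).sum()
    sims = rng.multinomial(int(n), (expected / n).ravel(), size=n_sim)
    chi2_sim = ((sims - expected.ravel()) ** 2 / expected.ravel()).sum(axis=1)
    return (np.sum(chi2_sim >= chi2_obs) + 1) / (n_sim + 1)
```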
Sample size adequacy was validated through post-hoc power analysis, calculating minimum detectable effect sizes and conducting bootstrap resampling (1,000 iterations) to estimate confidence intervals for error proportions. Inter-rater reliability for expert annotations was quantified using Cohen’s kappa with bootstrap-estimated 95% confidence intervals. All statistical analyses were performed using Python 3.10, employing pandas and numpy for data manipulation, scipy.stats and statsmodels for hypothesis testing, pingouin for mixed-effects modeling and repeated measures analysis, scikit-posthocs for non-parametric post-hoc tests, and scikit-learn for reliability metrics.
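A sketch of the bootstrap confidence intervals and the kappa estimate is given below, assuming per-response 0/1 error flags and two aligned annotator label vectors; variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def bootstrap_proportion_ci(flags, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the proportion of annotated responses
    carrying a given error label (`flags` is a 0/1 array)."""
    rng = np.random.default_rng(seed)
    flags = np.asarray(flags)
    props = [rng.choice(flags, size=flags.size, replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.percentile(props, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return flags.mean(), (lo, hi)

def bootstrap_kappa(rater1, rater2, n_boot=1000, seed=0):
    """Cohen's kappa between two annotators with a bootstrap 95% CI."""
    rng = np.random.default_rng(seed)
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    kappas = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(rater1), len(rater1))
        kappas.append(cohen_kappa_score(rater1[idx], rater2[idx]))
    return cohen_kappa_score(rater1, rater2), tuple(np.percentile(kappas, [2.5, 97.5]))
```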
Results
Overall model performance by cognitive domain
Across 30 independent simulation runs of GPT-4o on the CONACEM examination, the model achieved an overall mean accuracy of 4,595/5,490 (83.67%). However, performance varied significantly depending on the cognitive domain of the questions. The highest accuracy was observed in the Understand domain (1,838/2,040; 90.10%), followed by Recall (1,215/1,440; 84.38%), Apply (922/1,200; 76.83%), and Analyze (620/810; 76.54%).
A one-way ANOVA confirmed a significant effect of cognitive domain on model performance, F(3, 116) = 179.49, p < 0.001, with a large effect size (η² = 0.823; ω² = 0.817; Cohen’s f = 2.155). Assumptions of homogeneity of variances were met (p = 0.451), although normality was marginally violated (p = 0.00007), justifying the use of non-parametric tests. Post-hoc Tukey HSD tests revealed significant pairwise differences between all domains (p < 0.001), except between Apply and Analyze (p = 0.975). These results were corroborated by Kruskal-Wallis tests (H = 98.76, p < 0.001, η² = 0.805) and Dunn-Holm pairwise comparisons, which showed the same significance pattern. A linear mixed-effects model further confirmed significant fixed effects of cognitive domain (p < 0.001) while accounting for variability across simulation runs.
Table 1 presents the distribution of questions by domain, mean model accuracy for each domain, and the results of parametric and non-parametric post-hoc comparisons.
[Table 1 omitted: see PDF]
Inter-rater reliability
The inter-rater reliability for each error category was assessed using Cohen’s kappa coefficient. The overall mean kappa was 0.73 (95% CI: 0.69–0.83), reflecting substantial agreement between the raters. The highest level of agreement was observed for the category of unsupported medical claims (κ = 0.82), while more subjective categories, such as incorrect or vague conclusions, showed moderate agreement (κ = 0.52). Any disagreements between raters were adjudicated by a third expert following pre-established consensus guidelines. These findings support the reliability and reproducibility of the error taxonomy applied in this study.
Error distribution and frequency
Among our sample of 120 incorrect responses (randomly selected from a total of 1,370 errors across all simulations), the most common error was an unsupported medical claim (n = 59, representing 40.69% of the total 145 error instances identified across the 120 sampled responses), followed by incorrect or vague conclusion (n = 32, 22.07%) and persistence with an incorrect diagnosis (n = 22, 15.17%). Figure 1 displays the complete distribution of error types, emphasizing the predominance of knowledge-related inaccuracies such as unsupported claims and vague conclusions.
[Figure 1 omitted: see PDF]
Significance of error types across cognitive domains
To evaluate whether specific error types were disproportionately distributed across cognitive domains, we conducted chi-square tests with Holm correction to control for multiple comparisons. The results revealed that only one error type exhibited statistically significant differences at the domain level. Unsupported medical claims were significantly more prevalent in the Understanding domain, as indicated by a chi-square value of 17.47 (p = 0.0006, corrected p = 0.0045), with a medium effect size (Cramér’s V = 0.382). Although reasonable responses—defined as plausible but ultimately incorrect answers—showed a pattern of higher occurrence within the Application domain (χ² = 11.98, p = 0.0074), this difference did not reach statistical significance after correction for multiple comparisons (corrected p = 0.0521), despite showing a medium effect size (Cramér’s V = 0.316).
Table 2 summarizes the chi-square test results, corrected p-values, and effect sizes for each error type. In parallel, Fig. 2 visually presents the distribution of error types across cognitive domains.
[Table 2 omitted: see PDF]
[Figure 2 omitted: see PDF]
Error co-occurrence patterns
Phi coefficient analysis revealed significant associations between error types. Unsupported medical claims and incorrect or vague conclusions co-occurred in five instances (ϕ = 0.386, 95% CI: 0.207–0.565, p = 0.0002). Unsupported medical claims also co-occurred with sticking to the wrong diagnosis in three cases (ϕ = 0.315, 95% CI: 0.136–0.494, p = 0.0003). When unsupported medical claims occurred, there was a 10.2% probability of concurrent vague conclusions. Reasonable responses showed mutual exclusivity with other error types (all ϕ = 0, p > 0.05).
Figure 3 shows the full co-occurrence matrix with both raw counts and percentages, clearly highlighting these associations and their relative frequencies. The significance indicators and confidence intervals for all Phi coefficients offer a thorough view of error interdependencies.
[Figure 3 omitted: see PDF]
Sample size and power analysis
The annotated sample of 120 incorrect responses achieved a minimum detectable effect size of 0.258. Post-hoc power analysis showed 91% power for medium effects (Cramér’s V = 0.3) and 19% power for small effects (Cramér’s V = 0.1). Chi-square goodness-of-fit testing confirmed non-uniform error distribution (χ² = 54.72, p < 0.001). Bootstrap analysis (1,000 iterations) provided stable proportion estimates for major error categories: unsupported medical claims 40.69% (95% CI: 32.1%−49.8%), incorrect or vague conclusions 22.07% (95% CI: 15.2%−30.1%), and sticking with wrong diagnosis 15.17% (95% CI: 9.4%−22.5%).
Temperature parameter effects
Repeated measures ANOVA showed significant effects of temperature on accuracy (F(29, 5278) = 1.773, p = 0.0065, ηG² = 0.0017). Global accuracy reached its peak at temperature 0.23 (86.3%), with secondary peaks at 0.52 (85.2%) and 1.0 (85.8%). Performance ranged from 80.5% at temperature 0.1 to 86.3% at temperature 0.23.
Domain-specific analysis revealed distinct temperature-accuracy profiles. Understanding domain maintained over 85% accuracy across most temperatures, peaking at 94.1% (temperature 0.03). Recall domain achieved optimal performance of 87.5% at temperature 0.17. Application and Analysis domains showed greater volatility, with Application peaking at 82.5% (temperature 0.45) and Analysis at 82.5% (temperature 0.2), but both experienced drops below 70% at multiple temperature points (Fig. 4).
[Figure 4 omitted: see PDF]
Discussion
Principal findings and clinical context
This study assessed GPT-4o’s performance on the Chilean CONACEM anesthesiology board examination, highlighting notable domain-specific performance differences with important implications for clinical use. The model achieved an overall accuracy of 83.69%, performing best in Understanding (90.10%) and Recall (84.38%) compared to Application (76.83%) and Analysis (76.54%). Our structured error classification showed that unsupported medical claims were the most common error (40.69%), followed by vague or incorrect conclusions (22.07%).
The co-occurrence analysis revealed a significant link between unsupported claims and vague conclusions (ϕ = 0.386, p = 0.0002), indicating that initial reasoning mistakes may lead to compounded failures, especially in complex analytical tasks. This pattern matches recent findings from extensive clinical evaluations, where error propagation in LLMs has been recognized as a major safety issue [15].
When compared to human performance benchmarks from CONACEM examination cohorts (2014–2018), GPT-4o’s 83.69% accuracy places it at the top level of human achievement, similar to the best-performing human groups. This is especially notable because the model reached this performance without domain-specific training or the years of clinical experience that human candidates typically have. However, the model’s error rate of 16.33% falls within the range seen in qualified human candidates, emphasizing that similar performance still requires careful safety precautions.
Contextualizing model performance against human benchmarks
To provide necessary context for understanding the clinical significance of GPT-4o’s 83.69% accuracy rate, we examined historical data from human candidates taking the CONACEM anesthesiology exam. Analyzing examination cohorts from 2014 to 2018 (Table 3) highlights important benchmarks for comparison.
[Table 3 omitted: see PDF]
Human performance on the CONACEM examination varied notably across cohorts, with accuracy rates ranging from 60% to 83% (median: 67%). Specifically, the 2014–2016 cohorts demonstrated relatively consistent performance, with accuracy rates between 60% and 67%, while the 2017 cohorts showed a marked increase (81–83% accuracy). This upward trend likely reflects improved preparation strategies and greater familiarity with the exam format.
Compared to these human benchmarks, GPT-4o’s performance of 83.69% places it at the top of human achievement, similar to the best-performing human cohorts (2017 semester 2: 83% accuracy). This is especially important since the model achieved this without domain-specific training or the years of clinical experience that human candidates have.
For clinical deployment considerations, our analysis indicates that an accuracy threshold of 80% or higher is suitable for educational support applications, as this level reflects performance comparable to well-prepared human candidates. For direct clinical decision support, however, we recommend a higher threshold of 90% accuracy, especially for high-stakes decisions, with human oversight mandated regardless of the model’s performance.
The pass rates among human cohorts provide additional context: even in the top-performing cohorts, 15–20% of candidates did not meet certification standards. GPT-4o’s error rate of 16.33% is within this range, indicating that while the model performs similarly to qualified human candidates, it shows similar limitations that require careful consideration in clinical use.
Comparison with contemporary literature
Our findings align with recent large-scale assessments of LLMs in medical settings while expanding the evidence in several key areas. A recent Nature Medicine randomized controlled trial involving 92 practicing physicians showed that GPT-4 usage improved clinical management reasoning by 6.5% compared to traditional resources, with no significant performance difference between LLM-augmented physicians and LLM alone [16]. These results support our findings regarding GPT-4o’s clinical usefulness and highlight the need for structured evaluation frameworks.
Our error taxonomy expands on previous work by Roy et al. [13], who created detailed error categories for GPT-4 responses to USMLE questions. While their study identified “sticking with wrong diagnosis” and “incorrect or vague conclusion” as common error types, our anesthesiology-specific analysis showed that unsupported medical claims are the main error in specialty-specific contexts. This suggests that error patterns may depend on the domain, emphasizing the need for validation protocols tailored to each specialty.
Recent systematic reviews have found hallucination rates of 1.47% in clinical summaries, with 44% classified as major errors [17]. Our finding of 40.69% unsupported medical claims, although measured with different methodologies, reinforces concerns about accuracy in clinical reasoning tasks. The agreement of these findings across various evaluation frameworks strengthens the case for implementing strong fact-checking mechanisms in clinical LLM deployment.
Contemporary research also highlights significant performance gaps across medical specialties. Analysis of current literature shows that surgery receives the most research attention (28.2% of studies), while critical specialties like cardiology (1.9%) and emergency medicine (2.7%) are underrepresented [18]. Our anesthesiology-focused evaluation addresses this specialty-specific gap and offers a framework for similar assessments in other domains.
Clinical implementation framework and actionable recommendations
Specific clinical scenarios for safe GPT-4o deployment
Based on our performance analysis, GPT-4o demonstrates specific strengths that may prove suitable for certain clinical applications, though each would require careful evaluation in context. The model’s higher performance in the Understanding (90.10%) and Recall (84.38%) domains suggests potential utility for knowledge retrieval and educational applications, but deployment would require rigorous validation protocols.
Administrative and documentation support represents a potentially promising application area that warrants further investigation. GPT-4o’s factual recall abilities may prove useful for clinical coding assistance, initial report drafting, and procedure documentation support. However, these applications would require comprehensive validation studies to ensure accuracy and compliance with healthcare documentation standards before clinical implementation.
Patient education materials represent another area where the model may provide value, with appropriate safeguards. GPT-4o could potentially assist in creating initial educational content about anesthetic procedures and perioperative care, though all materials would require thorough clinical review and validation before patient use.
Training and simulation environments offer controlled settings for potential GPT-4o deployment that merit careful evaluation. The model could potentially serve as a knowledge resource for anesthesiology education, but would require systematic assessment of educational effectiveness and learning outcome impacts before integration into formal training programs.
Concrete safeguards for risk mitigation
The dominance of unsupported medical claims (40.69%) calls for specific safeguards to address this major failure point. Healthcare systems should introduce real-time fact-checking tools that verify clinical assertions against trusted medical databases and current guidelines. This approach should identify claims without supporting evidence and require physician confirmation before clinical application.
Multi-layered verification protocols are crucial safeguards for high-stakes applications. Primary verification should be through automated knowledge base comparison, secondary verification via peer review systems, and tertiary verification through attending physician approval for all clinical recommendations. This method addresses both the cascading error patterns we identified and the uncertainty inherent in LLM reasoning.
Temperature parameter optimization emerges as a critical technical safeguard. Our analysis shows that optimal performance depends on task-specific calibration, with Understanding tasks exhibiting robustness across temperature ranges, while Application and Analysis tasks are highly sensitive. Clinical deployment should include dynamic adjustment of parameters based on query classification, using conservative settings (0.0–0.1) for factual queries and carefully tuned parameters for complex reasoning tasks.
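As a purely illustrative sketch of such query-type-based parameter routing (the categories and temperature values below are assumptions, not settings validated in this study):

```python
# Illustrative only: query categories and temperature values are assumptions,
# not settings validated in this study.
TEMPERATURE_BY_QUERY_TYPE = {
    "factual_recall": 0.0,   # conservative, deterministic setting for lookup-style queries
    "comprehension": 0.1,
    "application": 0.2,      # reasoning-heavy queries would need prospectively
    "analysis": 0.2,         # validated, task-specific tuning before clinical use
}

def select_temperature(query_type: str) -> float:
    """Return a temperature for a classified clinical query, falling back to
    the most conservative setting when the query type is unrecognized."""
    return TEMPERATURE_BY_QUERY_TYPE.get(query_type, 0.0)
```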
Error monitoring and feedback systems are crucial parts of safe deployment. Healthcare organizations must set up continuous tracking systems that observe error trends, detect new failure modes, and enable quick action on safety issues. The American Medical Association’s 2024 principles require proof of benefit through ongoing assessment, which involves systematically gathering and analyzing error data.
Human oversight protocols for different clinical tasks
Evidence-based human oversight protocols must be stratified according to task complexity and patient safety risk. For low-risk tasks like literature searches and creating patient education materials, supervised review by qualified clinical staff offers sufficient oversight. These protocols should feature structured checklists to verify content accuracy, clinical suitability, and compliance with institutional standards.
Moderate-risk applications, such as clinical documentation support and preliminary diagnostic assistance, require stricter oversight protocols. Board-certified physicians must review all AI-generated content before it is added to patient records, paying close attention to diagnostic reasoning chains and treatment suggestions. The review process should include clear verification of clinical logic, fact-checking of medical statements, and evaluation of the appropriateness of recommendations.
High-risk applications involving direct patient care recommendations require the strictest oversight protocols. These applications should enforce mandatory dual physician review, with the primary review conducted by the attending physician and a secondary review by a designated AI oversight specialist. Real-time monitoring systems must oversee all high-risk AI interactions, with immediate escalation procedures for detected errors or safety issues.
The human-in-the-loop framework should incorporate both System 1 and System 2 thinking patterns, where AI assists intuitive pattern recognition while maintaining analytical oversight for complex reasoning tasks. This approach preserves physician autonomy while leveraging AI capabilities for improved clinical decision-making. Training programs must prepare clinicians to interact effectively with AI systems, recognize potential errors, and retain clinical judgment in AI-augmented environments.
Continuous education protocols represent critical components of effective oversight. Healthcare institutions must provide ongoing training for clinical staff on AI system capabilities, limitations, and appropriate use cases. This training should include recognition of common error patterns, strategies for effective AI interaction, and protocols for reporting safety concerns.
Generalizability beyond anesthesiology
Cross-specialty performance considerations
The error patterns seen in our anesthesiology review reveal broader limitations of LLMs that probably affect other medical fields as well. The high number of unsupported medical claims highlights a core challenge in verifying medical facts that goes beyond individual specialties. Still, the way these errors appear can differ greatly across various clinical areas, depending on the type of clinical reasoning involved.
Emergency medicine poses specific challenges for LLM deployment due to urgent decision-making needs and high patient safety stakes. The quick reasoning required in emergency situations may worsen the cascading error patterns identified, where unsupported initial claims cause further reasoning failures. Emergency applications would need even stricter real-time verification and streamlined oversight protocols.
Radiology and pathology may be better suited for LLM deployment because they focus on pattern recognition and systematic interpretation protocols. These specialties could benefit from GPT-4o’s strong performance in Understanding and Recall areas, while being less affected by the Application and Analysis limitations we noted. However, the visual interpretation aspects of these fields add extra complexity that isn’t reflected in our text-based assessment.
Internal medicine and family practice face unique challenges because of the wide range of knowledge needed and the complexity of multi-system clinical reasoning. The error patterns we identified may be especially problematic in these settings, where thorough differential diagnosis and treatment planning require advanced clinical reasoning across multiple domains.
Psychiatry and behavioral health have distinct factors to consider, as clinical reasoning often relies on subjective judgment and cultural awareness that may not translate well across the multilingual contexts where LLMs show performance drops. Our results on cross-linguistic performance challenges are especially relevant for mental health applications in diverse populations.
Adaptation framework for error taxonomy
The error taxonomy developed for anesthesiology can serve as a foundation for specialty-specific adaptations while maintaining core conceptual frameworks. The fundamental categories of unsupported medical claims, vague conclusions, and diagnostic persistence errors appear sufficiently generic to apply across medical specialties, though their relative frequency and clinical significance may vary.
Specialty-specific modifications should focus on domain-relevant error types while preserving the underlying classification structure. For instance, radiology applications might require additional categories for image interpretation errors and measurement accuracy failures. Emergency medicine might benefit from time-sensitive decision errors and resource allocation mistakes. Surgical specialties could incorporate procedural planning errors and anatomical reasoning failures.
The co-occurrence patterns we identified between different error types provide a template for understanding error propagation across specialties. The relationship between unsupported claims and vague conclusions likely represents a universal pattern in LLM reasoning that should be monitored across all clinical applications. However, specialty-specific error interactions may emerge that require targeted mitigation strategies.
Validation protocols for adapted taxonomies should include specialty-specific pilot studies with domain experts, similar to our anesthesiology approach. These validations should assess inter-rater reliability for new error categories, evaluate the clinical significance of identified error patterns, and establish appropriate error frequency thresholds for safe deployment.
Implementation of adapted taxonomies should maintain consistency with the core framework to enable cross-specialty comparison and meta-analysis of LLM performance patterns. Standardized reporting structures will facilitate the development of universal safety protocols while accommodating specialty-specific requirements.
Temperature parameter optimization and technical implementation
The temperature analysis revealed minimal practical impact on overall model performance, with accuracy variations remaining within a narrow range (80.5% to 86.3%) across the full parameter spectrum. While statistically significant effects were detected (F(29, 5278) = 1.773, p = 0.0065), the effect size was negligible (ηG² = 0.0017), suggesting that temperature parameter optimization may have limited clinical relevance for GPT-4o deployment.
This finding has important implications for clinical implementation, as it suggests that healthcare systems may not need to invest significant resources in complex parameter optimization protocols. The relative stability of performance across temperature settings indicates that GPT-4o maintains consistent reasoning capabilities regardless of the creativity-determinism balance, simplifying deployment requirements and reducing the technical complexity of clinical integration.
However, the domain-specific volatility observed in Application and Analysis tasks suggests that while global temperature effects are minimal, certain cognitive domains may still benefit from parameter optimization. This pattern warrants further investigation to determine whether the observed variations represent meaningful performance differences or statistical noise within acceptable performance ranges.
Limitations and future research directions
Study limitations and methodological considerations
Several limitations constrain the generalizability of our findings and highlight areas for future investigation. The retrospective human performance comparison, while providing valuable context, lacks the rigor of direct head-to-head evaluation under controlled conditions. Future studies should prioritize prospective comparisons between AI models and human physicians using identical questions under standardized time constraints and environmental conditions.
The error taxonomy, while developed with clinical expert input, requires validation across broader anesthesiology contexts and different cultural settings. The Chilean healthcare context may introduce specific factors that limit generalizability to other healthcare systems, particularly regarding clinical practice patterns, examination content, and educational approaches.
The sample size for error analysis, while statistically powered for medium effect sizes, may have limited sensitivity for detecting rare but clinically significant error patterns. Larger-scale evaluations would provide more precise estimates of error frequencies and enable detection of subtle but important failure modes.
The temperature parameter analysis, while novel in medical contexts, requires validation across different model versions and clinical applications. The optimal parameters identified for CONACEM questions may not generalize to real-world clinical scenarios with different complexity and time pressure characteristics.
Priority research areas
Cross-specialty transferability studies represent the most critical research gap for advancing clinical implementation. Systematic evaluation of LLM performance across medical specialties using standardized protocols would inform deployment priorities and resource allocation decisions. These studies should employ consistent error taxonomies and evaluation metrics to enable meaningful comparison across domains.
Long-term clinical outcome studies must move beyond examination performance to evaluate patient safety, clinical effectiveness, and healthcare quality metrics. Prospective cohort studies tracking patient outcomes in AI-augmented versus conventional care settings would provide essential evidence for deployment decisions. These studies should include diverse patient populations and healthcare settings to ensure broad applicability.
Multilingual evaluation frameworks require urgent development to address global healthcare equity concerns. Our findings regarding cross-linguistic performance limitations highlight the need for systematic evaluation of LLM performance in non-English languages and diverse cultural contexts. These evaluations should include culturally appropriate medical concepts and region-specific healthcare practices.
Human-AI collaboration optimization represents an emerging research priority with significant clinical implications. Studies should investigate optimal interaction patterns, training protocols for healthcare professionals, and interface design principles that maximize the benefits of AI augmentation while minimizing risks. This research should include cognitive load assessment and workflow efficiency analysis.
Real-world deployment studies in healthcare settings would provide essential evidence about implementation challenges, organizational factors, and long-term sustainability. These studies should evaluate different deployment strategies, governance structures, and training approaches to identify best practices for healthcare system implementation.
Emerging technological considerations
Multimodal AI integration represents a significant future direction that could address some limitations identified in our text-based evaluation. Systems that combine natural language processing with medical imaging, laboratory data, and physiological monitoring could provide more comprehensive clinical support while potentially reducing error rates through cross-modal validation.
Federated learning approaches may address privacy and data sharing concerns while enabling larger-scale training and evaluation studies. These approaches could facilitate collaboration across healthcare institutions and geographic regions while maintaining patient privacy protection. The development of federated learning protocols for medical AI represents a critical research priority.
Explainable AI development specifically for medical applications requires focused research attention. Current LLMs provide limited insight into their reasoning processes, creating challenges for clinical oversight and error detection. Research into interpretable medical AI could enhance clinical trust and improve error identification capabilities.
Integration with emerging regulatory frameworks requires ongoing research to ensure that technical development aligns with evolving safety and efficacy requirements. Collaboration between researchers, healthcare institutions, and regulatory bodies will be essential for developing evidence-based implementation guidelines.
Implications for healthcare policy and practice
The findings from this study have significant implications for healthcare policy development and clinical practice guidelines. The demonstrated performance of GPT-4o at levels comparable to qualified human candidates suggests that appropriately deployed AI systems could address healthcare workforce shortages and improve access to specialized knowledge. However, the identified error patterns and safety concerns underscore the need for careful regulatory oversight and implementation protocols.
Healthcare policy should address liability frameworks for AI-augmented clinical care, particularly regarding responsibility attribution when errors occur. The American Medical Association’s 2024 principles specify that developers of autonomous AI systems must accept liability for system failures, while physicians should not be liable when lacking knowledge of AI quality or safety concerns. These frameworks require further development and legal clarification [19].
Educational implications include the need for medical school and residency curricula to address AI literacy and human-AI collaboration skills. Healthcare professionals must understand AI capabilities and limitations to effectively utilize these tools while maintaining clinical judgment. Professional certification and continuing education requirements should incorporate AI competency assessment.
Economic considerations include the cost-effectiveness of AI implementation, infrastructure requirements, and resource allocation for training and oversight. Healthcare systems must balance the potential efficiency gains from AI deployment against the substantial investment required for safe implementation. Cost-benefit analyses should include long-term considerations about healthcare quality and patient safety outcomes.
International coordination and standardization represent critical policy priorities for ensuring safe global deployment of medical AI. The development of international standards for medical AI evaluation, safety protocols, and deployment guidelines would facilitate technology transfer while maintaining appropriate safety oversight.
Conclusion
This study highlights GPT-4o’s promising capabilities in factual recall and basic comprehension tasks within high-stakes anesthesiology examinations, achieving robust performance particularly in lower cognitive domains. However, significant limitations emerged in more complex analytical and application tasks, characterized primarily by unsupported medical claims and vague diagnostic conclusions. These findings emphasize the necessity for targeted improvements, such as enhanced domain-specific fine-tuning and integrated knowledge verification mechanisms, before GPT-4o can be reliably and safely deployed in clinical anesthesiology practice.
Data availability
The datasets generated and/or analyzed during this study are partially available. The raw performance data (accuracy scores and execution times) are available upon request; access requires a formal request outlining the intended use and adherence to data usage terms. This study did not involve human subjects or clinical trials; therefore, trial registration is not applicable.
References
1. Aydin S, Karabacak M, Vlachos V, Margetis K. Large language models in patient education: a scoping review of applications in medicine. Front Med. 2024;11:1477898.
2. Carchiolo V, Malgeri M. Trends, challenges, and applications of large language models in healthcare: a bibliometric and scoping review. Future Internet. 2025;17(2):76.
3. Mirzaei T, Amini L, Esmaeilzadeh P. Clinician voices on ethics of LLM integration in healthcare: a thematic analysis of ethical concerns and implications. BMC Med Inform Decis Mak. 2024;24(1):250.
4. Zong H, Wu R, Cha J, Wang J, Wu E, Li J, et al. Large language models in worldwide medical exams: platform development and comprehensive analysis. J Med Internet Res. 2024;26:e66114.
5. Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, et al. Testing and evaluation of health care applications of large language models. JAMA. 2025;333(4):319–28.
6. Wang D, Zhang S. Large language models in medical and healthcare fields: applications, advances, and challenges. Artif Intell Rev. 2024;57(11):299.
7. Brügge E, Ricchizzi S, Arenbeck M, Keller MN, Schur L, Stummer W, et al. Large language models improve clinical decision making of medical students through patient simulation and structured feedback: a randomized controlled trial. BMC Med Educ. 2024;24(1):1391.
8. Simera I, Moher D, Hirst A, Hoey J, Schulz KF, Altman DG. Transparent and accurate reporting increases reliability, utility, and impact of your research: reporting guidelines and the EQUATOR network. BMC Med. 2010;8:1–6.
9. Forehand M. Bloom’s taxonomy. Emerg Perspect Learn Teach Technol. 2010;41(4):47–56.
10. Blunch NJ. Position bias in multiple-choice questions. J Mark Res. 1984;21(2):216–20.
11. OpenAI. GPT-4o. 2024. Available at: https://openai.com/index/hello-gpt-4o/.
12. Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. Adv Neural Inf Process Syst. 2022;35:22199–213.
13. Roy S, Khatua A, Ghoochani F, Hadler U, Nejdl W, Ganguly N. Beyond accuracy: investigating error types in GPT-4 responses to USMLE questions. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024. pp. 1073–82.
14. Labelbox. The fastest way to build computer vision apps. Labelbox; 2025. Available at: https://www.labelbox.com.
15. Hager P, Jungmann F, Holland R, Bhagat K, Hubrecht I, Knauer M, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. 2024;30(9):2613–22.
16. Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78–80.
17. Asgari E, Montaña-Brown N, Dubois M, Khalil S, Balloch J, Yeung JA, et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. Npj Digit Med. 2025;8(1):274.
18. Shool S, Adimi S, Saboori Amleshi R, Bitaraf E, Golpira R, Tara M. A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med Inform Decis Mak. 2025;25(1):117.
19. American Medical Association. Augmented Intelligence Development, Deployment, and Use in Health Care. Chicago, IL: American Medical Association; p. 22. Report No.: 24-1228151:11/24. Available at: https://www.ama-assn.org/system/files/ama-ai-principles.pdf.