Background
Overcrowding in emergency departments (EDs) leads to delayed treatments, poor patient outcomes, and increased staff workloads. Artificial intelligence (AI) and machine learning (ML) have emerged as promising tools to optimize triage.
Objective
This systematic review evaluates AI/ML-driven triage and risk stratification models in EDs, focusing on predictive performance, key predictors, clinical and operational outcomes, and implementation challenges.
Methods
Following PRISMA 2020 guidelines, we systematically searched PubMed, CINAHL, Scopus, Web of Science, and IEEE Xplore for studies on AI/ML-driven ED triage published through January 2025. Two independent reviewers screened studies, extracted data, and assessed quality using PROBAST, with findings synthesized thematically.
Results
Twenty-six studies met inclusion criteria. ML-based triage models consistently outperformed traditional tools, often achieving AUCs > 0.80 for high-acuity outcomes (e.g., hospital admission, ICU transfer). Key predictors included vital signs, age, arrival mode, and disease-specific markers. Incorporating free-text data via natural language processing enhanced accuracy and sensitivity. Advanced ML techniques, such as gradient boosting and random forests, generally surpassed simpler models across diverse populations. Reported benefits included reduced ED overcrowding, improved resource allocation, fewer mis-triaged patients, and potential patient outcome improvements.
Conclusion
AI/ML-based triage models hold substantial promise in improving ED efficiency and patient outcomes. Prospective, multi-center trials with transparent reporting and seamless electronic health record integration are essential to confirm these benefits.
Implications for Clinical Practice
Integrating AI and ML into ED triage can enhance assessment accuracy and resource allocation. Early identification of high-risk patients supports better clinical decision-making and benefits downstream providers, including critical care and ICU nurses, by streamlining patient transitions and reducing overcrowding. Explainable AI models foster trust and enable informed decisions under pressure. To realize these benefits, healthcare organizations must invest in robust infrastructure, provide comprehensive training for all clinical staff, and implement ethical, standardized practices that support interdisciplinary collaboration between ED and ICU teams.
Emergency departments (EDs) play a pivotal role in the global healthcare system by providing immediate medical attention to individuals with acute and often life-threatening conditions [ 1]. Despite their critical importance, EDs worldwide continue to grapple with persistent overcrowding, lengthy patient waiting times, and resource strain—factors that collectively compromise the quality and timeliness of emergency care [ 1, 2]. This chronic state of overcrowding is particularly troubling because timely triage and intervention can significantly impact patient outcomes, such as mortality, length of hospital stay, and overall morbidity [ 1–3]. Consequently, effective and efficient triage mechanisms have become central to efforts aimed at optimizing ED operations.
Triage is the systematic process of rapidly categorizing patients based on the urgency of their condition to prioritize immediate care. Globally, emergency departments use rigorously validated systems—such as the U.S. five-level Emergency Severity Index (ESI), Europe's Manchester Triage System (MTS), as well as the Canadian Triage and Acuity Scale (CTAS) and the Australasian Triage Scale (ATS)—to swiftly stratify patient acuity. Despite these regional variations, all these systems share the goal of early identification of high-risk patients to ensure timely, effective intervention [ 4–7]. These systems classify patients into discrete categories corresponding to acuity, guiding resource allocation and the order in which patients are seen. However, conventional triage scales do not fully account for broader operational factors, such as fluctuating ED bed capacity or surge demands [ 8].
While the traditional systems provide a structured approach to patient assessment, their static nature and reliance on predefined categories often limit their ability to adapt dynamically to fluctuations in patient volume and to the evolving complexity of clinical presentations. In response to these challenges, recent advancements in artificial intelligence (AI) and machine learning (ML) have emerged as promising avenues to enhance triage processes [9]. By harnessing vast amounts of structured data (e.g., vitals, labs) and unstructured data (e.g., chief complaints, nursing notes), AI-driven models offer the potential to improve prediction accuracy and operational efficiency, thereby addressing some of the inherent limitations of traditional approaches [10–13].
Despite these promising developments, the evidence base remains fragmented due to heterogeneity in study quality, settings, and patient populations, underscoring the urgent need for a systematic synthesis of quantitative performance metrics and qualitative implementation insights. Decisions made during ED triage can directly influence ICU admission rates and patient outcomes, demonstrating a critical link between emergency care and the intensive care environment [14]. Timely identification of critically ill patients at triage enables urgent intervention and appropriate ICU admission, which is central to preventing deterioration and reducing mortality [15, 16]. Moreover, these triage decisions have downstream effects on critical care workflows, as appropriate ED triage triggers the mobilization of ICU resources and personnel, whereas mis-triage or capacity strain can compromise rapid response efforts and intensive care preparedness [17–19]. Our review provides critical care nurses with valuable insights into how AI-driven triage innovations can facilitate early ICU admissions, optimize resource allocation, and ultimately enhance outcomes for severely ill patients.
Aim
To critically evaluate and synthesize the current evidence on the validation and clinical impact of AI and ML–based models for triage and risk stratification in emergency departments—encompassing both general and disease-specific applications—with an emphasis on performance metrics, workflow feasibility, and patient outcomes.
Objectives
- To compare the accuracy and discrimination of AI/ML-based triage and risk stratification models with conventional triage methods (e.g., ESI, MTS) or scoring tools.
- To identify the most influential predictors (structured and unstructured data) used in AI/ML triage models across various patient populations.
- To evaluate the reported effects of AI/ML-driven triage tools on ED workflows (e.g., impact on decision-making efficiency, wait times, and resource allocation) and on key patient outcomes (e.g., hospital admission, ICU transfer, mortality).
- To assess challenges to implementation—such as interpretability, clinician acceptance, integration with electronic health records (EHRs), and cost considerations—and highlight gaps in current research.
This systematic review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines [20], ensuring transparency and replicability. Because the literature features both quantitative and mixed-methods studies, we used a thematic synthesis approach to capture performance metrics and broader contextual insights.
We employed the SPIDER (Sample, Phenomenon of Interest, Design, Evaluation, Research type) framework [21] to guide inclusion/exclusion decisions, aiming to capture a broad spectrum of research on AI/ML-based ED triage. However, because most included studies were quantitative (retrospective or prospective observational), the final synthesis is predominantly quantitative in nature. Table 1 outlines the SPIDER framework, providing a conceptual overview of the key elements—Sample, Phenomenon of Interest, Design, Evaluation, and Research type—that guided our study selection. In contrast, Table 2 provides a detailed breakdown of the specific inclusion and exclusion criteria derived from that framework. This dual presentation ensures transparency by clearly distinguishing between the overall methodological framework (Table 1) and the operationalized criteria used for screening studies (Table 2).
Table 1 outlines the key components of the SPIDER framework used to guide study selection in our systematic review. It summarizes the target population (Sample), the phenomenon under investigation, the study designs (including qualitative designs such as focus group discussions, semi-structured interviews, ethnographic studies, case studies, and phenomenological/grounded theory approaches), the evaluation metrics, and the research types.
Table 2 summarizes the criteria used for selecting studies for the review. The inclusion criteria specify the essential characteristics that a study must have, while the exclusion criteria outline the types of studies that were not considered, including those outside the ED context or that lack empirical data. These criteria ensure that only relevant studies contributing to the understanding of AI/ML applications in ED triage and their clinical impact are included.
Information sources and search strategy
A comprehensive literature search was executed across multiple electronic databases to ensure exhaustive retrieval of relevant studies from inception up to January 2025. The databases searched were PubMed, CINAHL, Scopus, Web of Science, and IEEE Xplore. Additionally, the reference lists of included studies and pertinent reviews were hand-searched to identify any additional relevant studies. The search strategy employed a combination of keywords and Medical Subject Headings (MeSH) terms related to AI and ML in ED triage, ensuring broad coverage of the relevant literature. The detailed search terms are outlined in Table 3.
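To illustrate how the concept blocks in Table 3 were combined, the short sketch below ORs the synonyms within each concept and ANDs the concept blocks together. This is a simplified, hypothetical assembly with abridged term lists; the actual database-specific syntax (field tags, MeSH explosion) differed by platform and is not reproduced here.

```python
# Minimal sketch: assembling a Boolean query from the concept blocks in Table 3.
# Term lists are abridged; database-specific syntax (field tags, MeSH explosion) is omitted.
concepts = {
    "Artificial Intelligence": ['"Artificial Intelligence"', '"machine learning"',
                                '"deep learning"', '"natural language processing"'],
    "Emergency Department": ['"Emergency Department"', '"Emergency Room"', '"ED"', '"ER"'],
    "Triage": ['"Triage"', '"Emergency Severity Index"', '"Manchester Triage System"'],
    "Risk Stratification": ['"Risk Stratification"', '"Predictive Modeling"', '"Risk Assessment"'],
}

def build_query(concept_blocks):
    """OR the synonyms within each concept, then AND the concept blocks together."""
    blocks = ["(" + " OR ".join(terms) + ")" for terms in concept_blocks.values()]
    return " AND ".join(blocks)

print(build_query(concepts))
```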
Selection process
The records identified through comprehensive database searches were imported into Rayyan, a specialized systematic review screening platform [22]. To ensure methodological rigor and minimize bias, two independent reviewers conducted the screening process. Any discrepancies were resolved through consensus discussions, ensuring that inclusion decisions were unbiased and based on predefined criteria. The initial screening involved evaluating titles and abstracts against the predefined inclusion and exclusion criteria. Subsequently, the full texts of potentially eligible studies were retrieved and meticulously assessed for final inclusion based on the established criteria. The study selection process is detailed in the PRISMA flow diagram (Fig. 1).
Data extraction and analysis
Data extraction was performed using a standardized form. This form was designed to capture comprehensive details from each study and included the following components: (1) Reference Details – including authors, publication year, and journal; (2) Research Design – specifying whether the study was retrospective, prospective, cross-sectional, etc.; (3) Study Aim and Objectives; (4) Population and Setting – detailing the sample size, geographical context, and specific ED characteristics; (5) AI/ML Model(s) – describing the type of algorithms used and how data (both structured and unstructured) were integrated; (6) Key Performance Metrics – such as area under the curve (AUC), accuracy, sensitivity, and specificity; (7) Key Findings and Implementation Details – including clinical impact and operational outcomes; and (8) Limitations – noting any biases or constraints reported by the study. Data extraction was independently performed by two reviewers, with any discrepancies initially resolved through discussion; when consensus could not be reached, a third reviewer was consulted to ensure rigorous resolution.
The synthesis of the extracted data was conducted using thematic synthesis [23], which allowed the identification of key themes and subthemes that address the review's aim and objectives, providing a comprehensive understanding of the current evidence landscape. We began by immersing ourselves in the extracted data, carefully reading and identifying meaningful ideas and concepts that were transformed into initial codes. These codes captured the fundamental elements of each study's findings and served as the building blocks for further analysis. Through an iterative and reflective process, we systematically examined these codes, grouping those with similar meanings into related subthemes. This collaborative and dynamic process—whereby disagreements were resolved through discussion—fostered the formation of coherent subthemes that naturally converged into broader analytical themes. These overarching themes directly address the central focus areas of our review, including the comparative performance of AI/ML-based triage models, the key drivers behind their effectiveness, the clinical and operational impacts, and the challenges encountered during their implementation. By interweaving these insights, our synthesis provides a comprehensive interpretation that aligns with our study's aims and objectives, ultimately informing recommendations for future research and system-wide adoption.
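As a schematic view of the standardized extraction form described above, the sketch below models its eight components as a simple record type; the field names and example values are our own illustrative placeholders, not the authors' actual template or data.

```python
# Illustrative sketch of the standardized data-extraction form described above.
# Field names mirror the eight listed components; the example values are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ExtractionRecord:
    reference: str                  # (1) authors, year, journal
    design: str                     # (2) retrospective, prospective, cross-sectional, ...
    aim: str                        # (3) study aim and objectives
    population_setting: str         # (4) sample size, country, ED characteristics
    models: List[str]               # (5) AI/ML algorithms and data integration
    performance: Dict[str, float]   # (6) e.g., AUC, accuracy, sensitivity, specificity
    key_findings: str               # (7) clinical impact and operational outcomes
    limitations: List[str] = field(default_factory=list)  # (8) reported biases/constraints

example = ExtractionRecord(
    reference="Example et al., 2023 (hypothetical)",
    design="retrospective cohort",
    aim="Predict hospital admission at ED triage",
    population_setting="Single academic ED, 50,000 adult visits",
    models=["gradient boosting", "logistic regression"],
    performance={"AUC": 0.84, "sensitivity": 0.79, "specificity": 0.76},
    key_findings="ML model outperformed the conventional triage scale",
    limitations=["single-center", "no external validation"],
)
```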
Risk of bias and quality of included studies
To systematically evaluate the methodological rigor and risk of bias of the included studies, the Prediction Model Risk of Bias Assessment Tool (PROBAST) was employed [24]. PROBAST is specifically designed for assessing the quality and applicability of studies developing, validating, or updating prediction models. It evaluates four critical domains: Participants, Predictors, Outcomes, and Analysis, focusing on potential sources of bias and the generalizability of findings.
Each included study was assessed using tailored signaling questions based on PROBAST guidelines to address the unique characteristics of AI/ML-based models used in ED triage. These questions evaluated factors such as the representativeness of the study population, the appropriateness and consistency of predictors, the objectivity and clinical relevance of outcomes, and the statistical rigor of model development and validation. The risk of bias for each domain was categorized as low, moderate, or high, with an overall rating assigned based on the synthesis of domain-level assessments.
The applicability of each study was also examined, focusing on whether the predictors, outcomes, and population align with real-world ED settings. Studies were further evaluated for key considerations relevant to AI/ML research, including the use of external validation datasets, handling of missing data, and efforts to ensure model interpretability. This systematic approach ensures a comprehensive assessment of the reliability and clinical relevance of the findings.
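As a minimal sketch of how the domain-level judgments can be rolled up into an overall rating, the snippet below applies a simple worst-domain rule (any high-risk domain yields an overall high rating; otherwise any moderate yields moderate). This is our own simplification for illustration, not a reproduction of the full PROBAST decision rules.

```python
# Minimal sketch: rolling up PROBAST domain ratings into an overall judgment using a
# simple worst-domain rule. A simplification for illustration, not the tool's full rules.
DOMAINS = ("participants", "predictors", "outcomes", "analysis")
ORDER = {"low": 0, "moderate": 1, "high": 2}

def overall_risk(domain_ratings: dict) -> str:
    """Return the worst (highest-risk) rating across the four PROBAST domains."""
    missing = [d for d in DOMAINS if d not in domain_ratings]
    if missing:
        raise ValueError(f"Missing ratings for: {missing}")
    return max((domain_ratings[d] for d in DOMAINS), key=ORDER.__getitem__)

# Example profile typical of the included studies (single-center, internally validated):
print(overall_risk({"participants": "moderate", "predictors": "low",
                    "outcomes": "moderate", "analysis": "moderate"}))  # -> "moderate"
```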
Results
Study selection and characteristics
The initial search yielded 334 records. After removing duplicates and screening titles/abstracts, 56 full-text articles were assessed for eligibility, and 26 met the final inclusion criteria (Fig. 1). The included studies span 2018 to late 2024, drawn from diverse geographical regions (North America, Asia, Europe, Africa, Middle East, South America). Designs were predominantly retrospective, with a few prospective or partially prospective validations. (See Appendix 1 for a summary of each included study.)
Evaluating the risk of bias and quality of included studies
The systematic evaluation of included studies using PROBAST reveals a predominantly moderate risk of bias across research focused on ED triage using ML and large language models (LLMs). This assessment spans four key domains: Participants, Predictors, Outcomes, and Analysis.
Participants
Most studies exhibited a moderate risk of bias in participant selection. Single-center designs were prevalent, limiting the generalizability of findings to broader or more diverse ED settings. Selection bias was introduced by excluding pediatric populations, patients with incomplete records, or those deemed low-risk. For instance, Colakca et al. [25] and Klang et al. [26] relied on single-institution datasets, reducing their applicability to heterogeneous populations.
Predictors
The predictors domain consistently demonstrated a low risk of bias across studies. Predictors included standardized clinical parameters, such as vital signs, demographics, and chief complaints, which are routinely collected during ED triage. Furthermore, advanced techniques like Natural Language Processing (NLP) were employed in several studies to analyze free-text chief complaints, ensuring relevance and reliability. The consistent use of these robust and readily available predictors across studies supports the applicability of the developed models in diverse ED settings.
Outcomes
The Outcomes domain showed a moderate risk of bias in most studies. While clinically meaningful outcomes—such as hospital admissions, ICU admissions, and mortality—were used, their definitions often depended on institution-specific practices and protocols. For example, disparate criteria for hospital admissions or critical care allocation likely introduced variability, limiting the consistency and comparability of outcomes. Furthermore, the retrospective nature of most studies compounded the risk, as historical triage decisions were used as proxies for clinical outcomes, potentially introducing subjectivity.
Analysis
In the Analysis domain, a moderate risk of bias was identified across studies. While most studies employed robust methodologies, including cross-validation, multiple algorithm comparisons, and comprehensive performance metrics (e.g., AUC, sensitivity, specificity), the lack of external validation was a pervasive limitation. For example, Lin et al. [27] conducted external validation to enhance generalizability, whereas Leonard et al. [28] and Ivanov et al. [29] relied solely on internal datasets. Additionally, overfitting risks were evident in studies utilizing complex ML models without adequate safeguards.
Overall risk of bias and applicability
Synthesizing the results across domains, the overall risk of bias for the included studies was predominantly moderate. Single-center designs, selection biases, and the lack of external validation were the main contributors. These limitations highlight the need for future studies to prioritize multi-center collaborations, standardized outcome definitions, and external validation to enhance the generalizability and clinical utility of AI/ML-based triage models. (See Appendix 2.)
Thematic synthesis
The comprehensive analysis of the included studies revealed four primary themes that encapsulate the integration and impact of AI and ML technologies in ED triage and risk stratification: Predictive Performance, Key Model Drivers, Clinical and Operational Impact, and Implementation Feasibility. These themes collectively illustrate the multifaceted benefits and challenges associated with deploying AI/ML-based triage tools in diverse ED settings. Table 4 provides a concise summary of the key findings from our review, categorized into these four overarching themes.
Theme 1: Predictive performance
Accuracy and discrimination across diverse models
Out of the 26 studies, 14 reported AUC values above 0.80 [26, 28, 30–41] for outcomes such as hospital admission or critical care need, often exceeding the performance of tools like ESI or the Quick Sequential Organ Failure Assessment (qSOFA). However, these AUCs reflect retrospective data and may not translate directly to prospective, real-time accuracy.
ChatGPT-based triage [ 25] recorded a Cohen’s Kappa of 0.828 for high-acuity categories and a 94.9 % accuracy for ESI-1 and ESI-2 patients, underscoring AI’s potential to match or exceed expert committees. Neural networks [ 31, 32, 37] and XGBoost [ 27, 39, 42] also demonstrated strong performance (AUCs often ranging from 0.75 to > 0.90), particularly when integrating vital signs with additional structured data. Studies focusing on pediatric ED settings [ 28, 30, 36] consistently noted improved predictions for hospital admission or critical illness, with some models surpassing conventional triage scales (e.g., Pediatric Korean Triage and Acuity Scale [pedKTAS]).
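For context on these discrimination metrics, the sketch below trains a gradient-boosting classifier on synthetic, triage-style data and reports the AUC on a held-out split. It is purely illustrative and does not reproduce any included study's model, features, or data.

```python
# Illustrative sketch (synthetic data): training a gradient-boosting classifier on
# triage-style features and reporting discrimination as AUC on a held-out set.
# It mirrors the type of evaluation the included studies report; it does not
# reproduce any specific study's pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for structured triage data (e.g., vital signs, age, arrival mode).
X, y = make_classification(n_samples=5000, n_features=12, n_informative=6,
                           weights=[0.85, 0.15], random_state=42)  # ~15% "admitted"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]          # predicted risk of the outcome
print(f"AUC on held-out set: {roc_auc_score(y_test, probs):.3f}")
```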
Comparison with conventional triage systems
Four studies showed that ML-based methods could outperform standard triage scales (ESI, MTS) [27, 29, 38, 41], but these comparisons often involved single-center datasets. Large-sample or multi-center validations generally confirmed the advantage of AI but noted potential overfitting and data quality issues.
ESI vs. ML: Raita et al. [ 41] showed that ML approaches yielded higher AUCs for predicting both critical care (0.84–0.85 vs. 0.74 for ESI) and hospitalization (0.78–0.80 vs. 0.69). Another example is Ivanov et al. [ 29], whose KATE model significantly outperformed triage nurses (accuracy 75.7 % vs. 59.8 %).
Disease-Specific Scores vs. ML: In Wu et al.’s chest pain study [ 38], an ML-driven LASSO model outperformed the HEART, GRACE, and TIMI scores (AUC = 0.953 vs. 0.735–0.754). Similarly, Lin et al. [ 27] showed that XGBoost surpassed qSOFA and SIRS in detecting sepsis.
Structured vs. unstructured data integration
Incorporating free-text chief complaints or triage narratives enhanced model sensitivity for critical conditions. However, LLM-based triage systems (e.g., ChatGPT [25]) remain in very early, exploratory stages: they demonstrated high accuracy in small or simulated samples but lack extensive prospective validation. Klang et al. [26] and Fernandes et al. [33] reported incremental performance gains when text data were combined with structured features (vital signs, lab results). Deep learning models that integrated free-text chief complaints [31] consistently showed increased sensitivity for critical illness detection.
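The sketch below illustrates one common way such hybrid models are assembled: free-text chief complaints are vectorized (TF-IDF here stands in for the NLP components the studies used) and concatenated with structured vitals in a single pipeline. The toy records, column names, and model choice are hypothetical.

```python
# Illustrative sketch: combining free-text chief complaints with structured vitals in
# one model. TF-IDF stands in for the NLP components; the toy data are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "chief_complaint": ["crushing chest pain radiating to arm", "ankle sprain after fall",
                        "sudden slurred speech and weakness", "mild sore throat",
                        "shortness of breath and fever", "medication refill request"],
    "heart_rate":  [118, 84, 96, 72, 112, 70],
    "systolic_bp": [92, 128, 165, 118, 101, 122],
    "age":         [67, 24, 74, 19, 58, 45],
    "admitted":    [1, 0, 1, 0, 1, 0],   # outcome label
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(ngram_range=(1, 2)), "chief_complaint"),   # unstructured
    ("nums", StandardScaler(), ["heart_rate", "systolic_bp", "age"]),   # structured
])
clf = Pipeline([("features", features),
                ("model", LogisticRegression(max_iter=1000))])
clf.fit(df.drop(columns="admitted"), df["admitted"])
print(clf.predict_proba(df.drop(columns="admitted"))[:, 1].round(2))  # predicted risks
```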
Theme 2: Key model drivers
Vital signs (heart rate, respiratory rate, blood pressure, temperature, oxygen saturation) were near-universal predictors. Demographics (especially age) and arrival mode (ambulance vs. walk-in) often emerged as robust features. In disease-specific contexts (sepsis, chest pain, TBI), lab markers (e.g., CRP, creatinine) or clinical scales (e.g., Glasgow Coma Scale) offered added predictive power. Textual features extracted via NLP also improved classification; for instance, keywords like "stroke" or "chest pain" in triage notes aided early recognition of critical cases.
Disease-specific predictors
Disease-specific predictors play a crucial role in enhancing the accuracy of ML models in ED triage. For sepsis, key predictors include C-reactive protein (CRP), sodium levels, and systolic blood pressure, which were identified as top-ranking features in the study by Lin et al. [27]. In the case of traumatic brain injury (TBI), mortality predictions relied heavily on the Glasgow Coma Scale (GCS), Injury Severity Scale, and systolic blood pressure thresholds, as highlighted by Hsu et al. [43]. Meanwhile, in patients presenting with syncope, comorbidities like heart failure and diabetes, along with age, were significant determinants of length-of-stay predictions, as demonstrated in the study by Lee et al. [35]. These findings underscore the importance of tailoring ML models to disease-specific contexts to optimize their predictive performance.
Textual and NLP inputs
Studies like Ivanov et al. [29], Fernandes et al. [33], and Klang et al. [26] demonstrated the importance of chief complaints or triage narratives. NLP-derived keywords (e.g., "stroke," "chest pain") improved early identification of high-risk conditions (e.g., NSICU admissions, cardiopulmonary arrest).
Theme 3: Clinical and operational impact
Effects on ED overcrowding and resource allocation
ML models aimed to reduce ED overcrowding by proactively identifying admission risk or critical illness, thereby expediting bed assignment and streamlining care pathways. Out of the 26 included studies, 9 studies [26, 28–30, 32, 35, 42, 44, 45] explicitly reported that early admission prediction or improved triage accuracy optimized workflow by enabling targeted resource allocation, reducing ED boarding, or decreasing overcrowding.
Decision-making efficiency and staff workload
AI-based triage solutions, including ChatGPT [25] and random forest ensemble models [30, 44], demonstrated the potential to reduce under- and over-triage. This improvement may alleviate the burden on triage nurses and enhance the accuracy of initial decision-making, particularly in high-volume departments.
Patient outcomes: mortality, ICU admissions, and length of stay
Numerous studies cited reduced mortality risks [41, 43], fewer life-threatening mis-triage events, and improved identification of critically ill children [34, 36]. Personalized approaches, like those modeling length of stay for syncope patients [32], underscored how ML can refine patient pathways and outcome predictions.
Theme 4: Implementation feasibility
Integration with EHRs and real-time systems
Although most studies were retrospective, a few [25, 40, 46] demonstrated partial prospective components or near-real-time validation. EHR integration remains a critical facilitator; however, variable data quality and missing values often limited immediate adoption [26, 29]. Limited infrastructure and data quality remain major barriers.
Interpretability and explainable AI
Several teams employed SHapley Additive exPlanations (SHAP) [40, 45] or rule-based decision trees to enhance interpretability [34, 43]. Such transparency fosters clinician trust. Studies using graph neural networks [47] or deep neural networks [31] noted that interpretability remains challenging and must be addressed before widespread clinical integration.
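As a minimal sketch of the SHAP-based interpretability reported by several teams, the snippet below fits a tree ensemble on synthetic triage-style data and ranks features by mean absolute SHAP value. It assumes the open-source `shap` package is installed and is not a reproduction of any included study's analysis.

```python
# Minimal sketch: SHAP values for a tree-based triage model on synthetic data,
# ranking features by mean |SHAP| (global importance). Assumes the open-source
# `shap` package is installed; not a reproduction of any included study.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "heart_rate":  rng.normal(90, 20, 1000),
    "resp_rate":   rng.normal(18, 5, 1000),
    "systolic_bp": rng.normal(120, 25, 1000),
    "age":         rng.integers(18, 95, 1000),
})
# Synthetic outcome loosely driven by tachycardia, hypotension, and age.
risk = 0.03 * (X["heart_rate"] - 90) - 0.02 * (X["systolic_bp"] - 120) + 0.02 * (X["age"] - 50)
y = (risk + rng.normal(0, 1, 1000) > 0.5).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)           # exact, fast SHAP for tree ensembles
shap_values = explainer.shap_values(X)          # one value per sample per feature
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False))  # which features drive predicted risk
```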
Clinician acceptance and workflow adaptations
The studies did not include formal cost analysis. Implementation cost, ongoing model maintenance, and potential for AI-driven biases were only mentioned qualitatively. Clinician surveys suggest willingness to adopt AI if models demonstrate clear, interpretable benefits and integrate seamlessly into existing workflows.
ChatGPT-based triage and LLM-driven domestic triage solutions [25, 46] highlighted the potential for AI to assist both clinical and home-based decision-making. However, concerns about clinical reliability and the need to adapt human-centric triage scales for AI remain [25]. Triage nurses and physicians are more likely to embrace these tools if they seamlessly fit existing workflows and demonstrate clear benefits in workload management, patient safety, and cost-effectiveness [29, 40, 44, 48–50].
Discussion
This discussion synthesizes findings from 26 studies organized into four themes: predictive performance, key model drivers, clinical and operational impact, and implementation feasibility. Our review found that AI/ML-based triage models frequently report high predictive accuracy (often with AUCs above 0.80), suggesting improved identification of high-acuity patients and more efficient resource allocation compared to traditional systems. However, the predominance of retrospective, single-site studies highlights challenges in generalizability and underscores the need for prospective, multi-center validations and standardized protocols. Importantly, implementing AI/ML-based triage in the ED has significant implications for critical care nursing, as early recognition of critically ill patients streamlines ICU admissions and facilitates prompt intervention, ultimately improving survival and preventing deterioration [14, 51, 52]. By linking emergency triage acuity to critical care resource allocation and workflows, our findings emphasize the critical role of interdisciplinary collaboration between ED and ICU teams—including critical care nurses—in ensuring timely and effective care for severely ill patients.
AI/ML-based triage models consistently outperform conventional systems, including the widely used Emergency Severity Index (ESI) and clinical risk scores such as qSOFA and SIRS, which are commonly employed as benchmarks for patient risk stratification in the ED. However, these findings are preliminary and require confirmation through prospective, multi-center studies to ensure their generalizability and to address the risk of overfitting inherent in single-dataset optimizations. This review highlights that models employing gradient boosting and random forest algorithms achieved AUC values exceeding 0.80, often surpassing traditional scoring tools. For example, Raita et al. [41] demonstrated that ML models achieved AUCs of 0.86 for critical care prediction compared to 0.74 for ESI, a finding echoed by multiple studies that consistently report superior discrimination for hospital and ICU admissions [53–56].
Among the 26 included studies, none reported positive predictive value (PPV) or similar measures to quantify the proportion of patients identified by AI tools who ultimately benefit from timely intervention. For instance, Colakca et al. [ 25] reported a high accuracy of 94.9 % for high-acuity patients (ESI 1–2) and a Cohen’s Kappa of 0.828, yet no PPV data were provided. Similarly, studies by Choi et al. [ 42] and Hwang & Lee [ 30] focused exclusively on AUC and discrimination metrics (e.g., AUCs ranging from 0.718 to 0.991) without reporting PPV. This complete omission is critical because, without PPV, it remains unclear how many patients flagged as high-risk truly require urgent intervention versus representing false positives, which could lead to overtriage and unnecessary strain on ED resources.
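To make concrete what reporting PPV would add, the sketch below derives PPV, sensitivity, and specificity at a chosen operating threshold alongside AUC, using fabricated risk scores. The same AUC can coexist with very different PPVs depending on prevalence and threshold, which is precisely the information the included studies omitted.

```python
# Illustrative sketch: why AUC alone is not enough. Given predicted risks and true
# outcomes (fabricated here), a chosen operating threshold yields PPV, sensitivity,
# and specificity -- the threshold-dependent metrics the included studies did not report.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.15, 2000)                        # ~15% truly high-acuity
# Fabricated risk scores: higher on average for true positives.
y_score = np.clip(rng.normal(0.25 + 0.35 * y_true, 0.15), 0, 1)

threshold = 0.5                                             # example operating point
y_flag = (y_score >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_flag).ravel()

print(f"AUC:         {roc_auc_score(y_true, y_score):.2f}")
print(f"PPV:         {tp / (tp + fp):.2f}   (flagged patients who are truly high-acuity)")
print(f"Sensitivity: {tp / (tp + fn):.2f}")
print(f"Specificity: {tn / (tn + fp):.2f}")
```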
Specific applications of AI, such as LASSO regression for chest pain management [ 38], demonstrated an AUC of 0.953, outperforming HEART, GRACE, and TIMI scores (AUC 0.735–0.754). Similarly, Lin et al. [ 27] reported that XGBoost models surpassed qSOFA and SIRS in detecting sepsis. These findings mirror those of Wang et al. [ 57] who reported enhanced predictive accuracy for ML models in sepsis detection, underscoring the scalability and adaptability of AI tools across diverse conditions. Nevertheless, these studies also highlight that AI algorithms optimized for single institutions may overfit local data, emphasizing the necessity for multi-center and prospective validation to truly establish clinical superiority.
It is important to acknowledge that while AI/ML algorithms optimized for individual hospitals may perform well within those specific settings, their deployment across diverse sites introduces several challenges. Different hospitals often operate with varying patient demographics, data collection practices, clinical workflows, and IT infrastructures. This variability can lead to significant quality assurance issues, as there would be no uniform standard for performance evaluation or operational integration [ 58]. Moreover, the adoption of site-specific algorithms would necessitate the development of standardized guidelines and protocols, as well as comprehensive education and training programs for clinical staff, to ensure that these tools are used effectively and safely. In order to mitigate these challenges, future research should focus on developing adaptive, multi-center validated AI/ML models or centralized frameworks that can be calibrated to local conditions while maintaining consistent performance [ 58]. Such an approach will be essential to harmonize the benefits of AI-driven triage across healthcare systems and to ensure system-wide reliability and equity in patient care.
The integration of unstructured data, such as free-text chief complaints, via NLP has significantly enhanced the performance of AI models. Studies identified in this review, such as Fernandes et al. [33] and Klang et al. [26], illustrate that incorporating free-text chief complaints increases sensitivity for detecting critical conditions such as cardiopulmonary arrest. However, this review critically notes that the quality and consistency of unstructured inputs remain problematic due to data completeness issues inherent in retrospective designs. As noted by Moreno-Sánchez et al. [45], standardizing data collection and integrating NLP seamlessly into EHRs is critical for widespread adoption. Evaluating the accuracy of AI-containing digital triage systems presents unique epistemological, ontological, and methodological challenges. Epistemologically, there is an inherent difficulty in defining 'accuracy' in the context of digital triage, where multiple dimensions—such as sensitivity, specificity, and decision-support utility—must be considered simultaneously [59–61]. Ontologically, the representation of patient data and clinical outcomes in these systems is complex, as digital triage algorithms often integrate heterogeneous data sources (e.g., structured vital signs and unstructured text), which complicates the definition of a 'ground truth' for clinical decision-making [62, 63].
Across studies, vital signs (e.g., heart rate, respiratory rate, blood pressure) consistently emerged as the most influential predictors in AI models. This finding aligns with the work of Aldhoayan et al. [ 64] who identified vital signs as primary drivers of ML performance in acute care. Disease-specific predictors, such as Glasgow Coma Scale (GCS) for traumatic brain injury [ 43] and C-reactive protein (CRP) for sepsis [ 27], further demonstrate the adaptability of ML tools to specialized clinical contexts. This convergence on key predictors reinforces the argument that while AI models can replicate and extend traditional triage factors, they also benefit from integrating richer contextual data, particularly from unstructured sources.
Comparatively, Liu et al. [ 40] found that incorporating demographic variables, such as age and mode of arrival, alongside clinical markers improved model discrimination for ED triage. This overlap underscores the universality of certain predictors while highlighting the need for context-specific tailoring of AI algorithms.
AI/ML models have shown potential to alleviate ED overcrowding by enabling early identification of high-risk patients. Leonard et al. [28] and Hong et al. [39] reported that accurate predictions of admission risk allowed for targeted resource allocation, reducing wait times and optimizing workflows. In addition, AI models significantly reduce under- and over-triage rates, thereby enhancing clinician efficiency. For instance, Ivanov et al.'s KATE model improved triage accuracy by 27 % compared to nurse-led assessments [29]. As noted above, the adoption of site-specific algorithms would necessitate standardized guidelines and protocols, as well as comprehensive education and training programs for clinical staff, to ensure that these tools are used effectively and safely [65, 66]. Despite these promising findings, the absence of comprehensive cost-effectiveness analyses and prospective outcome data (e.g., on mortality or complication rates) means that the operational impact of these systems remains largely theoretical.
Strengths
This systematic review employed a broad and robust approach by searching multiple major databases—PubMed, CINAHL, Scopus, Web of Science, and IEEE Xplore—and by hand-searching reference lists, thereby reducing the likelihood of missing important studies. The use of the SPIDER framework facilitated an inclusive yet structured process, accommodating both quantitative and qualitative research designs relevant to AI/ML-based triage. In addition, risk of bias was evaluated using the PROBAST tool, specifically tailored to prediction model research, ensuring a methodologically sound assessment. By integrating thematic (narrative) synthesis, the review captured both quantitative performance metrics such as area under the curve (AUC) and accuracy, as well as qualitative insights related to clinician acceptance, workflow patterns, and ethical considerations. The range of studies included also provided a global perspective, drawing from a variety of health systems and geographic contexts. Collectively, these features strengthen the review's ability to present an interdisciplinary understanding of AI/ML implementation in emergency department triage.
Limitations
Despite the breadth of the search strategy, most of the included research comprised single-center, retrospective studies with moderate overall risk of bias, thus limiting the generalizability of the findings. In our review, although studies originated from multiple regions, 9 of 26 studies (35 %) were conducted in the United States and 5 of 26 (19 %) in China, indicating that more than half of the evidence is derived from these two countries. This geographic concentration may limit the generalizability of the findings due to significant differences in healthcare organization, ED infrastructure, and population demographics. These disparities highlight the need for additional research in a broader array of healthcare settings to confirm the applicability of AI/ML models across diverse environments.
The challenges associated with translating algorithms developed in one institution to other settings include variability in patient populations, differences in clinical practices, and disparate levels of technological infrastructure. This raises concerns regarding the consistency of algorithm performance and underscores the need for multi-center prospective studies. Moreover, system-level issues such as quality assurance, guideline standardization, and the education of healthcare staff are paramount for the successful integration of AI/ML tools into routine clinical practice. Furthermore, the scarcity of cost-effectiveness analyses hampered the ability to draw robust conclusions regarding the financial feasibility of adopting AI/ML tools in routine triage processes. Although some studies proposed notable improvements in patient outcomes or reductions in overcrowding, the lack of prospective, real-time evaluations makes it difficult to confirm causality. In addition, there was substantial heterogeneity in outcome definitions, care pathways, and institutional protocols, complicating direct comparisons and potentially reducing external validity. Negative or neutral findings were largely underrepresented, which may reflect publication bias. While a subset of studies incorporated large language models, their clinical utility and integration remain largely exploratory, necessitating cautious interpretation until more extensive, real-world data become available. One of the limitations of this review is that only English-language publications were included. This criterion may introduce language bias and potentially exclude relevant studies published in other languages, thereby limiting the comprehensiveness and generalizability of our findings. Future reviews should consider including non-English publications to provide a more inclusive perspective on AI/ML-based triage systems.
Recommendations and implications
While the promise of AI/ML in ED triage is clear, we must temper optimism with the lessons learned from past health IT implementations, especially with EHRs. Historically, even highly accurate models have struggled when integrated into the complex, human-driven clinical environment. Thus, innovative algorithms alone will not transform triage; their success hinges on careful, systematic integration into clinical workflows. Importantly, these triage decisions directly affect ICU admissions and subsequent critical care delivery, underscoring the relevance of our findings for critical care nurses who manage patient transitions and resource allocation in intensive care units (ICUs).
EHR integration – necessary but not sufficient
Integrating AI triage models into existing EHR systems is a vital step toward real-time clinical decision support. However, this integration alone will not resolve all challenges. Many ED datasets suffer from variable data quality, completeness, and standardization. For example, several studies noted that missing or unstructured data hindered model performance. Therefore, future efforts must emphasize robust data exchange protocols and uniform data standards—such as adopting common ontologies and consistent variable definitions—to ensure that even well-designed AI models deliver reliable results across different hospitals.
Addressing publication bias
A notable gap in the literature is the absence of reporting on positive predictive value (PPV) or related metrics that inform how many patients flagged as high-risk actually benefit from urgent intervention. None of the 26 studies provided these measures, which are essential for assessing the real-world clinical utility of these tools. The selective publication of studies with positive findings likely overestimates the efficacy of AI models. To mitigate this bias, future reviews should extend searches to include preprint servers and trial registries (e.g., ClinicalTrials.gov) and encourage the publication of negative or null findings. Transparent reporting of all outcomes will offer a more balanced view of the technology's true impact.
Real-world implementation challenges
Beyond integration and publication bias, several practical challenges must be addressed for AI-based triage systems to succeed:
- Generalizability: Many algorithms were developed and validated in single-hospital settings. Future research must include prospective, multi-center trials that test these models in various ED environments—academic and community hospitals across different regions and patient demographics.
- Clinician Trust and Acceptance: Critical care nurses and other frontline clinicians are pivotal to the success of triage systems. These providers need to understand and trust AI recommendations before adopting them. Incorporating explainable AI techniques, such as SHapley Additive exPlanations (SHAP), can elucidate which factors (e.g., vital signs and chief complaints) drive a "high-risk" prediction, thereby fostering trust. Engaging critical care nurses early in the design and implementation process will ensure that the AI tools align with clinical workflows and decision-making processes.
- Cost-Effectiveness: Although improved triage accuracy is promising, economic evaluations are largely missing. Hospitals and payers require robust cost-effectiveness analyses that demonstrate tangible operational benefits—such as reduced ED wait times or lower unnecessary admissions—relative to the implementation costs of the AI system. Without such evidence, even highly accurate models may struggle to gain traction in resource-constrained environments.
- Hybrid Data Approaches: Effective models must integrate structured data (e.g., vital signs, demographics) with unstructured data (e.g., chief complaints, clinical notes). However, achieving seamless integration requires overcoming significant challenges in data harmonization and standardization.
Based on our findings, we propose the following key strategies:
- Prospective, Multi-Center Validation: Future studies must conduct rigorous prospective trials across multiple ED sites, adhering to guidelines such as CONSORT-AI and SPIRIT-AI, to validate model performance in varied real-world conditions.
- Comprehensive Performance Reporting: Researchers should expand reporting beyond AUC, sensitivity, and specificity to include clinically actionable metrics such as PPV, false positive rates, and real-time workflow impact. This will clarify how often AI models correctly identify high-risk patients without triggering unnecessary alerts.
- Transparency and Explainability: Utilizing explainable AI techniques will help demystify “black box” models and build clinician trust. Detailed reporting of model development using standards like TRIPOD-AI is essential for reproducibility and ongoing improvement.
- Equity and Bias Mitigation: Ensure that training datasets represent diverse populations. Future research must evaluate and report model performance across subgroups to detect and mitigate any disparities (a subgroup audit is sketched after this list).
- Post-Deployment Monitoring: Continuous monitoring, including regular performance audits and clinician feedback, is necessary to promptly identify and correct any degradation in model accuracy or emergent biases. This iterative process should be complemented by ongoing economic evaluations.
- Rigorous Economic Evaluations: Future studies must assess the cost-effectiveness of AI triage systems by comparing operational benefits (e.g., reduced wait times, optimized resource allocation) against the costs of software, integration, and training.
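As flagged in the equity point above, the sketch below shows a simple subgroup performance audit: discrimination is computed separately for each demographic group so that gaps can be detected before and after deployment. The groups, outcomes, and scores are synthetic; in practice they would come from the deployed model's logs.

```python
# Minimal sketch of a subgroup performance audit: compute discrimination (AUC)
# separately per demographic group to surface disparities. Data are synthetic;
# in practice, groups, outcomes, and model scores come from the deployed system.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n = 3000
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=n),   # e.g., age band or sex
    "y_true": rng.binomial(1, 0.12, size=n),        # true high-acuity outcome
})
# Synthetic scores, deliberately weaker for group "C" to show what an audit flags.
signal = np.where(df["group"].eq("C"), 0.4, 1.5)
noise = rng.normal(0, 1, n)
df["y_score"] = 1 / (1 + np.exp(-(signal * (2 * df["y_true"] - 1) + noise)))

per_group_auc = {g: roc_auc_score(d["y_true"], d["y_score"])
                 for g, d in df.groupby("group")}
print(per_group_auc)   # a large gap between groups flags a potential equity problem
```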
As ED triage directly influences ICU admissions, the integration of AI models has significant implications for critical care nursing. Improved triage accuracy can streamline the transition of critically ill patients to the ICU, reduce delays in intervention, and ultimately enhance patient outcomes. Engaging critical care nurses in the design and implementation process is essential to ensure that AI tools support their clinical workflows, reduce workload, and foster interdisciplinary collaboration.
Achieving the transformative potential of AI/ML in ED triage requires a balanced approach that addresses technical, operational, and ethical challenges. By pursuing prospective, multi-center validations, comprehensive performance reporting, data standardization, and continuous post-deployment monitoring, we can develop sustainable AI triage solutions. These strategies will not only optimize patient outcomes and resource allocation but also ensure that critical care nurses and other frontline providers are equipped to safely and effectively integrate these tools into everyday practice. Ultimately, targeted investments in technology, research, and workforce training can revolutionize emergency care delivery on a global scale, setting a new standard for quality, efficiency, and equity in healthcare.
Conclusion
AI and ML–based triage models show promise in retrospective comparisons with conventional triage, often reporting superior predictive metrics for critical outcomes. By incorporating structured and unstructured data, these tools could potentially reduce overcrowding, optimize resource allocation, and improve patient outcomes. However, the current evidence base lacks robust prospective validations and multi-center trials, making it premature to assert definitive impacts on mortality or ED throughput in real-world practice.
Future research should prioritize systematic, prospective evaluation of these models, including cost and implementation feasibility analyses. Collaborative efforts among clinicians, data scientists, and policymakers are vital to address ethical issues, ensure transparency, and standardize data collection protocols. With these steps, AI-driven triage solutions may gradually advance from experimental applications to an integral component of next-generation emergency care worldwide.
CRediT authorship contribution statement
Rabie Adel El Arab: Writing – original draft, Methodology, Investigation, Funding acquisition, Formal analysis, Conceptualization. Omayma Abdulaziz Al Moosa: Writing – review & editing, Supervision, Software, Methodology, Investigation, Formal analysis, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors wish to thank Joel Somerville and Fuad Abuadas for their invaluable contributions as third reviewers during the review process. Their support, provided exclusively in a technical capacity, is gratefully acknowledged.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.iccn.2025.104058.
Table 1. Key components of the SPIDER framework used to guide study selection.
| Component | Details |
| --- | --- |
| Sample | Patients presenting to the ED (adult or pediatric), including disease-specific populations (e.g., sepsis, TBI). |
| Phenomenon of Interest | AI/ML model development, validation, and outcomes for ED triage/risk stratification. |
| Design | Quantitative, qualitative, and mixed-methods studies. |
| Evaluation | Performance metrics (e.g., AUC, sensitivity, specificity), clinical/operational outcomes (e.g., admission rate, ICU transfer, time to treatment), and feasibility (e.g., EHR integration, interpretability, clinician acceptance). |
| Research Type | Empirical studies (RCTs, cohort, cross-sectional, retrospective/prospective observational). |
Table 2. Inclusion and exclusion criteria.
| Inclusion | Exclusion |
| --- | --- |
| Studies involving patients presenting to the emergency department, including adults and children, across various clinical conditions. | Studies focusing on non-ED settings or other healthcare domains without specific relevance to ED triage. |
| Research evaluating the development, validation, or clinical impact of AI and ML–based triage and risk stratification models. | Studies not reporting on predictive performance, key model drivers, clinical impact, or implementation feasibility. |
| Quantitative, qualitative, and mixed-methods empirical studies examining AI/ML models for ED triage and risk stratification. | Reviews and other publications lacking empirical data. |
| Publications available in English. | Publications in languages other than English. |
Table 3. Search terms: keywords and MeSH terms by concept.
| Concept | Keywords and MeSH Terms |
| --- | --- |
| Artificial Intelligence | “Artificial Intelligence” OR “AI” OR “machine learning” OR “deep learning” OR “natural language processing” OR “computer vision” OR “predictive modeling” OR “data mining” |
| Emergency Department | “Emergency Department” OR “Emergency Room” OR “ED” OR “ER” |
| Triage | “Triage” OR “Emergency Severity Index” OR “ESI” OR “Manchester Triage System” OR “pedKTAS” OR “NTAS” |
| Risk Stratification | “Risk Stratification” OR “Predictive Modeling” OR “Risk Assessment” |
| Clinical Outcomes | “Hospital Admission” OR “ICU Admission” OR “Mortality” OR “Length of Stay” OR “Critical Care” OR “Cardiopulmonary Arrest” |
| Data Types | “Vital Signs” OR “Laboratory Findings” OR “Clinical History” OR “Chief Complaints” OR “Free-text Data” OR “Electronic Health Records” |
Table 4. Summary of key findings by theme.
| Theme | Key Findings |
| --- | --- |
| Predictive Performance | AI/ML models consistently outperformed traditional triage systems, achieving high AUCs (>0.80) for predicting hospital admissions, ICU transfers, and specific conditions like sepsis and bacteremia. Integration of NLP with structured data enhanced model sensitivity and specificity. |
| Key Model Drivers | Vital signs, patient age, and mode of arrival were universal predictors. Laboratory markers (CRP, creatinine, WBC count) and disease-specific indicators (GCS for TBI, inflammatory markers for Crohn's) significantly enhanced risk stratification. NLP-derived textual data further improved predictions. |
| Clinical and Operational Impact | AI/ML-based triage tools reduced ED overcrowding by enabling efficient bed allocation and resource use, decreased patient wait times, and mitigated under- and over-triage. Enhanced decision-making efficiency and reduced workload for ED staff were also notable, alongside improved patient outcomes. |
| Implementation Feasibility | Challenges include limited generalizability due to single-center and retrospective study designs, complexity of AI models affecting interpretability, and integration hurdles with existing EHR systems. Explainability frameworks like SHAP and prospective multi-center validations are essential for overcoming these barriers and achieving widespread clinical adoption. |