Abstract
Objectives
This systematic review aims to evaluate the effectiveness of automated methods using artificial intelligence (AI) in conducting systematic reviews, with a focus on both performance and resource utilization compared to human reviewers.
Study Design and Setting
This systematic review and meta-analysis protocol follows the Cochrane Methodology protocol and review guidance. We searched five bibliographic databases to identify potential studies published in English from 2005 onward. Two independent reviewers will screen the titles and abstracts, followed by a full-text review of the included articles. Any discrepancies will be resolved through discussion and, if necessary, referral to a third reviewer. The risk of bias (RoB) in included studies will be assessed at the outcome level using the revised Cochrane risk-of-bias tool for randomized trials and the RoB In Non-randomized Studies - of Interventions (ROBINS-I) tool for non-randomized studies. Where appropriate, we plan to conduct meta-analysis using random-effects models to obtain pooled estimates. We will explore the sources of heterogeneity and conduct sensitivity analyses based on prespecified characteristics. Where meta-analysis is not feasible, a narrative synthesis will be performed.
Results
We will present the results of this review, focusing on performance and resource utilization metrics.
Conclusion
This systematic review will evaluate the effectiveness of automated methods, especially AI tools in systematic reviews, aiming to synthesize current evidence on their performance, resource utilization, and impact on review quality. The findings will inform evidence-based recommendations for systematic review authors and developers on implementing automation tools to optimize review efficiency while maintaining methodological rigor. In addition, we will identify key research gaps to guide future development of AI-assisted systematic review methods.
Plain Language Summary
A systematic review is a thorough and organized summary of all relevant studies on a specific topic. These reviews are important for gathering evidence to guide health care decisions, but they often take a lot of time and effort. Recently, tools using artificial intelligence (AI) have been developed to speed up this process. We will conduct a systematic review to see how well these AI tools perform compared to human reviewers. We will examine studies published since 2005 that have used AI to conduct systematic reviews. We will assess how well AI tools find the right information, how much time and work they save, and how easy and reliable they are for users. This study aims to help researchers choose the best AI tools to make systematic reviews faster and more efficient without losing quality.
What is new?
Key findings
- • We will review how well AI tools work in systematic reviews compared with human efforts at every stage of the process. Our goal is to see whether these tools can make the review process faster and more accurate.
What this study adds to what was known?
- • Systematic reviews are essential for synthesizing evidence but are often labor-intensive and time-consuming.
- • Recent improvements in AI have created new automation tools that make processes easier. However, how well these tools work can vary a lot because of differences in their computer programs, the data they were trained on, and their specific goals.
What is the implication and what should change now?
- • As AI tools become more common in our everyday lives, it is important to evaluate how well they work and what their possible drawbacks are when used for systematic reviews.
1.1 Description of the issue
Systematic reviews synthesize empirical evidence meeting predetermined eligibility criteria to produce reliable findings to inform decision-making [ 1]. High-quality systematic reviews guide health policy and clinical decisions. The number of systematic reviews has increased significantly in recent years. Nearly 80 systematic reviews were published daily, according to data from 2000 to 2019 [ 2], with substantial growth still observed between 2021 and 2022 [ 3].
The systematic review process is labor-intensive and time-consuming, requiring collaboration among topic and methodology experts. The growing volume of health literature increases the time and effort needed to search and screen citations. Review authors often need to screen thousands of records, processing an average of two abstracts per minute for simple topics. At this rate, screening 5000 abstracts demands roughly 40 hours of continuous work from a skilled reviewer, and more complex subjects lengthen this timeframe [ 4]. On average, completing and publishing a systematic review takes 67.3 weeks [ 5]. The cost of each systematic review is significant due to the labor-intensive processes involved [ 6].
Automation methods have emerged to address these challenges, offering the potential to conduct systematic reviews more efficiently. Various studies, including those registered in the Study Within A Review (SWAR) Repository Store [ 7], investigate the utility of automated methods in systematic reviews. SWARs are a resource-efficient way to investigate approaches to improving the efficiency of planning, conducting, and sharing reviews [ 8]. SWARs have, for example, focused on using artificial intelligence (AI) to aid abstract screening [ 9], search strategy building [ 10], and data extraction [ 11]. These advancements demonstrate the growing interest in AI and its potential to transform systematic reviews.
1.2 Description of the methods being investigated
Over the past decade, machine learning (ML) techniques, which automate the learning processes of algorithms by training on data, have grown tremendously in data-driven AI [ 12]. These rapid advancements have catalyzed the development of AI-powered tools for systematic review automation, prompting increased collaboration and standardization efforts within the research community.
A key milestone in this evolution was the establishment of the International Collaboration for the Automation of Systematic Reviews (ICASR) in 2015 [ 13]. Since then, ICASR has held annual meetings with themes ranging from interoperability and information extraction to the user experience and the practical application of these technologies. Papers documenting discussions from most of these yearly meetings have been published, providing a valuable record of progress in the field. Researchers have presented advancements in natural language processing (NLP), text mining [ 14], and ML [ 15], highlighting their potential to enhance the efficiency and accuracy of systematic reviews.
1.3 How these methods might work
AI-based automation in systematic reviews employs specialized algorithms to enhance efficiency across multiple review stages. These algorithms can reduce the manual workload of time-intensive tasks while maintaining methodological rigor. The key applications of AI methods in the systematic review workflow are as follows.
1.3.1 Searching for the literature
Literature searching automation encompasses tools for query expansion, study design filters, and ML-based relevance assessment [ 16]. For example, randomized controlled trial (RCT) taggers can identify and classify RCTs from PubMed [ 17]. Thalia (Text mining for Highlighting, Aggregating and Linking Information in Articles) is a semantic search engine that automatically discovers relevant articles in the biomedical domain [ 18]. Algorithms like Automatic Query Expansion, which expands original queries with other terms that better capture the actual user intent, are used to adapt search strategies for different databases, with the aim of improving recall and precision [ 19].
1.3.2 Screening
Screening studies to determine whether they meet prespecified eligibility criteria is essential. Supervised ML algorithms are used to classify abstracts as “include” or “exclude”, using labeled data to train the model. Ranking algorithms then order titles and abstracts by relevance. However, all the abovementioned systems remain reliant on human reviewers, maintaining a “human-in-the-loop” framework, which integrates human interaction into the process. Examples of software in this area include Rayyan [ 20], DistillerSR [ 21], ResearchScreener [ 22], Abstrackr [ 23], RobotAnalyst [ 24], and Sciome Workbench for Interactive computer-Facilitated Text-mining (SWIFT)-Review [ 25]. These AI tools have the potential to mitigate the resource demands of dual screening by acting as a second reviewer during the screening phase, potentially offering a practical balance that maintains transparency and reliability within a limited budget and timeframe [ 26].
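To illustrate the classify-and-rank approach described above (not the implementation used by Rayyan, Abstrackr, or any other named tool), the following minimal Python sketch trains a simple text classifier on a handful of labeled abstracts and ranks unscreened citations by predicted relevance; it assumes scikit-learn is installed, and the abstracts, labels, and variable names are invented for illustration.

```python
# Minimal sketch of supervised "classify and rank" abstract screening.
# Toy data for illustration; real tools train on thousands of labeled citations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_abstracts = [
    "Randomized trial of drug A versus placebo in adults with hypertension.",
    "Cohort study of dietary salt intake and blood pressure.",
    "Review of museum conservation techniques for oil paintings.",
    "Survey of veterinary anaesthesia practices in horses.",
]
labels = [1, 1, 0, 0]  # 1 = include, 0 = exclude (human-assigned)

unlabeled_abstracts = [
    "Pragmatic trial comparing two antihypertensive drugs in primary care.",
    "Analysis of medieval manuscript binding materials.",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(labeled_abstracts)
clf = LogisticRegression().fit(X_train, labels)

# Rank unscreened citations by predicted probability of relevance;
# a human reviewer still verifies each decision ("human in the loop").
scores = clf.predict_proba(vectorizer.transform(unlabeled_abstracts))[:, 1]
for score, abstract in sorted(zip(scores, unlabeled_abstracts), reverse=True):
    print(f"{score:.2f}  {abstract}")
```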
1.3.3 Data extraction
Data collection in systematic reviews aims to extract relevant information from published studies for synthesis. The evolution of data extraction methods in this field can be categorized into three main phases. The earliest and most commonly used approaches were non-machine-learning methods, including rule-based systems and Application Programming Interfaces [ 27]. These laid the foundation for automated data extraction.
As technology advanced, researchers adopted traditional ML classifiers such as Naïve Bayes and Support Vector Machines, offering improved automation capabilities and more sophisticated data extraction.
Since 2020, there has been a significant shift toward deep learning models, marking the third phase in this evolution. The use of neural network models, especially transformer-based architectures, has risen dramatically. Bidirectional Encoder Representations from Transformers (BERT) and Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA) have demonstrated particularly strong data extraction performance [ 28].
An example of these advanced tools is RobotReviewer, which exemplifies the potential of modern AI in systematic reviews. This system can automatically retrieve population, intervention, comparator, and outcome information, as well as bias information (based on the Cochrane Risk of Bias [RoB] tool), showcasing the increasing sophistication and specificity of AI-driven data extraction in systematic reviews [ 29].
1.3.4 Risk of bias (RoB) assessment
RoB assessment determines whether methodological decisions might introduce bias into the study results. Manual RoB assessment typically requires two trained reviewers and can take 10 to 60 minutes per study [ 30]. More complex RoB tasks may take longer. Tools like RobotReviewer leverage NLP and ML models trained on clinical trial reports to automatically generate RoB results, classifying studies into low or unclear/high RoB categories for key questions from the Cochrane RoB tool [ 31, 32].
1.3.5 Multiple stages
AI tools can be applied at various stages of systematic reviews. For example, large language models (LLMs) such as GPT-4 by OpenAI or Claude 2 by Anthropic have broad applications in multiple stages of systematic reviews, from literature selection [ 33] to data extraction [ 34] and RoB assessment [ 35]. These tools take input text structured as instructions (ie, “prompts”) and demonstrate significant capabilities in understanding and generating human-like text, processing data with minimal supervision.
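For illustration only, the snippet below sketches how a screening task might be structured as a prompt; the eligibility criteria, citation text, and wording are hypothetical and not drawn from any cited study, and the resulting prompt would be submitted to whichever LLM is under evaluation.

```python
# Hypothetical example of structuring a screening task as an LLM "prompt".
# The eligibility criteria and citation are invented for illustration.
criteria = (
    "Include randomized trials evaluating AI tools for citation screening "
    "in health-related systematic reviews; exclude editorials and animal studies."
)
citation = (
    "Title: Semi-automated screening with a machine-learning classifier. "
    "Abstract: We randomized 4,000 citations to AI-assisted or manual screening..."
)
prompt = (
    "You are assisting with systematic review screening.\n"
    f"Eligibility criteria: {criteria}\n"
    f"Citation: {citation}\n"
    "Answer with INCLUDE, EXCLUDE, or UNCLEAR, followed by a one-sentence reason."
)
print(prompt)  # this text would then be submitted to the LLM under evaluation
```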
1.4 Why our systematic review is important
While systematic review automation tools demonstrate potential for reducing workload and improving consistency, their effectiveness varies based on algorithmic design, training data quality, and intended use case [ 36]. Critical evaluation of these tools' performance and reliability across different review contexts remains essential. Comparative evaluations are needed to identify the most effective tools for specific review contexts and stages. Although there is an abundance of small-scale evaluations, few studies directly compare AI with human performance. Therefore, the key contribution of this study is to focus on evaluation results that can directly guide and improve practice, moving beyond theoretical potential to practical implementation.
The available AI tools in systematic reviews have limitations [ 27, 36]. Many tools are not user-friendly, often lack formal validation, and face licensing challenges, including restricted access to source code. In addition, support is often inadequate, especially in open-source software where the absence of a revenue stream limits funding for support. Effective use of AI tools in systematic reviews also depends on familiarity with computer science and technical considerations, such as integrating multiple steps into larger systems and ensuring the tool's capacity to handle large datasets [ 37]. Current guidance from the Cochrane Rapid Reviews Methods Group advises that automation software should assist human reviewers while ensuring that human judgment remains integral to the process [ 38].
Most researchers have focused on automating individual components of the systematic review process, such as literature retrieval [ 39], literature screening [ 22, 40], data extraction [ 27], and RoB assessment [ 32, 41]. While some reviews have evaluated automation tools more broadly — such as Yao et al's assessment of AI tools in cancer reviews [ 42] and Blaizot et al's narrative review in health sciences [ 43] — no comprehensive evaluation exists of AI-based automation effectiveness across the entire systematic review workflow.
This review addresses this gap by evaluating AI-based automation tools across all systematic review stages, aiming to guide implementation decisions and future development priorities.
2 Objectives
- To evaluate the performance of AI-based automated methods compared to manual approaches across systematic review stages.
- To assess the resource utilization of AI-based automated methods compared to manual approaches across systematic review stages.
3 Methods
This protocol has been developed in line with the Cochrane protocol and review guidance [ 44] and reported as per the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P) reporting standards [ 45]. This protocol has been registered in the International Prospective Register of Systematic Reviews (PROSPERO) [CRD420250636484].
3.1 Criteria for considering studies for this review
3.1.1 Type of studies
We define “AI-based automated methods” as those using specialized software that incorporates a range of AI algorithms, such as ML, NLP, and LLMs. Specifically, we will consider studies that describe any AI-based automated methods applied to any stage of conducting a systematic review, including, but not limited to, literature searching, screening, data extraction, RoB assessment, synthesis, grading the certainty of evidence, and reporting. In subsequent sections, we use the term “AI tools” to refer to “AI-based automated methods”.
Our review focuses on prospective designs, where the technology was evaluated in an ongoing context during the review process. Prospective designs offer more insights into how AI technologies might work when implemented in real-world settings. They provide a more realistic representation of the challenges and benefits of AI integration in actual systematic review workflows.
This review will include study designs as specified by the Cochrane Effective Practice and Organization of Care (EPOC) Group [ 46]. The eligible EPOC study designs for evaluating health system interventions, which ensure methodological rigor in comparing AI and human performance, are as follows:
- RCTs, including cluster RCTs (where clusters, such as citations needing screening, are randomly allocated to either the AI or human group) and cross-over RCTs (where the systematic review process, such as citation screening, is randomly assigned to either the manual or the AI-assisted method and then switched) [ 46].
- Controlled before-after studies involve implementing outcome assessments both before and after using the intervention (such as RoB assessment) to compare the results between the AI and human groups [ 46].
- Non-randomized trials include quasi-experimental studies where allocation to AI or human groups is not strictly randomized but follows a systematic, prespecified approach [ 46].
- Interrupted time series (ITS) studies will be included if they use observations before and after a clearly defined intervention point, such as the release of ChatGPT. Eligible ITS studies must include at least three data points both before and after the intervention [ 46].
We acknowledge that this may exclude some potentially informative post hoc simulation studies. However, maintaining EPOC design criteria ensures methodological consistency and robustness of our findings. Future research may consider expanding to include well-designed simulation studies to complement findings from these primary study designs.
3.1.2 Type of data
Eligible studies must report the effectiveness of AI tools in human health care–related systematic reviews, including performance or resource utilization metrics.
3.1.3 Type of methods
This review will compare AI vs conventional human-only methods in systematic review tasks. We define the following three levels of AI implementation:
- AI-only: Processes predominantly conducted by AI with minimal or no human oversight
- AI-assisted: Processes combining AI and human input (eg, AI screening with human verification)
- Human-only (comparator): Conventional systematic review methods without AI support
We will examine the following head-to-head comparisons:
- AI-only methods vs human-only methods
- AI-assisted methods vs human-only methods
- Different AI method groups compared to human-only methods
- a. NLP tools
- b. ML tools
- c. Text mining tools
- d. Combination-technique groups
Comparisons by specific systematic review tasks:
- a. Literature searching
- b. Screening
- c. Data extraction
- d. RoB assessment
- e. Synthesis
- f. Grading the certainty of evidence
- g. Reporting
AI tools (the intervention) will be categorized into the following four groups based on the main characteristics of the tools used:
- • NLP group: Tools that enable computers to understand and generate human language, including syntactic analysis, semantic analysis, discourse analysis, sentiment analysis, machine translation, information extraction, and text generation. LLMs [ 47], such as ChatGPT by OpenAI, are significant components of this group.
- • ML group: Tools that autonomously learn by analyzing data. This encompasses supervised, unsupervised, reinforcement, and deep learning [ 48]. Examples include Abstrackr [ 23] and RobotSearch [ 49].
- • Text mining group: Tools that discover knowledge and structure in unstructured data (ie, text). Most studies employed text mining to assist with screening in systematic reviews. Two main approaches are used to exclude items [ 50]: one involves using a classifier that makes explicit inclusion or exclusion decisions, while the other uses a ranking or prioritization system based on a threshold or specific criteria. Evidence for Policy and Practice Information and Co-ordinating (EPPI)-Reviewer [ 51] is noted as a supportive tool for text analysis processes. Meanwhile, PubReMiner [ 10] is a text-mining word frequency tool that can be used to build a search strategy.
- • Combination group: Tools that use several AI techniques without a dominant one. For example, Rayyan uses technologies like NLP, ML, and text mining [ 52], while SWIFT-Review [ 25] combines text mining and ML, which can be useful during the scoping and problem formulation stages of a systematic review.
3.1.4 Type of outcome measures
We will evaluate the following primary and secondary outcome measures comparing the performance of AI vs human methods (see Appendix A for detailed definitions and calculations).
3.1.4.1 Primary outcomes
These are based on data from a 2 × 2 contingency table defined as follows: true positive (TP) = elements correctly identified as relevant; false positive (FP) = elements incorrectly identified as relevant; true negative (TN) = elements correctly identified as irrelevant; and false negative (FN) = elements incorrectly identified as irrelevant. An illustrative calculation of these metrics is provided after the list below.
- Accuracy - the proportion of all correct data (both relevant data correctly identified and irrelevant data correctly excluded) out of all data assessed: (TP + TN)/(TP + TN + FP + FN)
- Sensitivity (recall or TP rate) - the proportion of correctly identified relevant data out of the total observed relevant data: TP/(TP + FN)
- Precision (positive predictive value) - the proportion of correctly identified relevant data out of all identified relevant data: TP/(TP + FP)
- Specificity - the proportion of irrelevant data correctly excluded from all observed irrelevant data: TN/(TN + FP)
- F1 score - the harmonic mean of precision and sensitivity: 2 × (precision × sensitivity)/(precision + sensitivity)
- Missed rate - the proportion of relevant data that failed to be identified out of all observed relevant data: FN/(TP + FN)
- Error rate - the proportion of incorrect data (both relevant data failed to be identified and irrelevant data failed to be excluded) out of all data assessed: (FP + FN)/(TP + TN + FP + FN)
- Error classification - We will categorize errors using the following taxonomy adapted from Gartlehner et al [ 34].
- • Major error: could lead to erroneous conclusions if uncorrected.
- • Minor error: less severe than a major error but still influences the quality of the existing data, for instance, small calculation or rounding errors that do not significantly impact the overall utility of the data.
- • Fake data: fictitious data generated by AI or humans.
- • Missed or omitted data: data found in full text but omitted by AI or humans.
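As referenced above, the following sketch illustrates how the primary outcome metrics would be computed from a 2 × 2 contingency table; the counts are hypothetical and serve only to show the calculations.

```python
# Illustrative calculation of the primary outcome metrics from a 2 x 2
# contingency table. The counts below are hypothetical screening results.
tp, fp, tn, fn = 450, 120, 4300, 50

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)              # recall / true-positive rate
precision   = tp / (tp + fp)              # positive predictive value
specificity = tn / (tn + fp)
f1_score    = 2 * precision * sensitivity / (precision + sensitivity)
missed_rate = fn / (tp + fn)              # 1 - sensitivity
error_rate  = (fp + fn) / (tp + tn + fp + fn)

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, "
      f"precision={precision:.3f}, specificity={specificity:.3f}, "
      f"F1={f1_score:.3f}, missed={missed_rate:.3f}, error={error_rate:.3f}")
```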
3.1.4.2 Secondary outcomes
Resource utilization measures:
- Person-time required - total time spent on the systematic review.
- Workload savings - the reduction in manual tasks across various systematic review stages, for example, the number of citations screened, the amount of data extracted, or the effort required for RoB assessment.
- Time savings - the estimated time saved by using AI tools.
3.2.1 Electronic searches
We used the Peer Review of Electronic Search Strategies 2015 guideline statement [ 53] to develop a search strategy in collaboration with an information specialist, structured as recommended by Haddaway et al [ 54] (see Appendix B). Our research question has been divided into a three-concept search strategy: “key steps of systematic review” AND “artificial intelligence” AND “EPOC study design”. We will search for potentially eligible studies in five bibliographic databases: Medline (Ovid), Embase (Elsevier), Web of Science (Core Collection), the Cochrane Central Register of Controlled Trials, and the Association for Computing Machinery Digital Library. Evidence indicates a significant increase in AI tools for systematic reviews, particularly from 2005, when ML-based algorithms were evaluated [ 27, 50, 55]. Hence, we will limit our search to studies published in English from 2005 onward.
3.2.2 Searching other resources
We will search for gray literature using the same search terms through the ProQuest Dissertations and Theses Citation Index via the Web of Science. Backward and forward citation searching of relevant articles will be done to ensure no potential articles are missed. In addition, we will check the eligibility of each article using the Digital Evidence Synthesis Tool Evaluations available in the EPPI Visualiser database. This digital database is designed to develop detailed recommendations on priority areas for applying digital evidence synthesis tools in the climate and health domain [ 56]. Currently, 228 records are stored on the site. The search was conducted on January 14, 2025.
We will contact the authors of the included studies to ask if they are aware of any other papers we have not identified in our search.
3.3 Data collection and analysis
3.3.1 Selection of studies
The screening process will consist of two stages, both conducted by two independent reviewers. In the title and abstract screening stage, citations will be judged as “include”, “exclude”, or “unclear” (needs further information). Citations judged as either “include” or “unclear” will progress to the full-text screening stage. During full-text review, reviewers will document detailed reasons for exclusion. Any disagreements between independent reviewers will be discussed and resolved, with a third reviewer consulted if needed. Both screening stages will be pilot-tested before full implementation.
We will use Covidence (Covidence systematic review software, Veritas Health Innovation, Melbourne, Australia; available at www.covidence.org) for screening [ 57]. The PRISMA flow diagram will illustrate the search and screening results and the reasons for exclusion at the full-text screening stage.
3.3.2 Data extraction and management
Two reviewers will independently extract data from each included study (including studies reported across multiple publications). Discrepancies will be discussed and, if unresolved, referred to a third reviewer. Data extraction will be recorded in purposefully designed, pilot-tested data collection forms.
The following information will be extracted from each included study:
- • General information about the study (eg, author, year of publication, origin/country, research question, study design, study size, intervention and comparator, the specific AI tools employed, the underlying AI methods or algorithms, study findings, and study limitations reported by the original authors).
- • Performance metrics: 2 × 2 contingency table data (TP, TN, FP, and FN) to calculate the aforementioned outcome measures, the reference standard, the classification criteria or threshold for defining data as relevant, and the error type.
- • Resource utilization metrics: person-time spent, workload savings, and time savings, including any reported mean differences (MDs) and SDs.
3.3.3 Assessment of risk of bias in included studies
Two reviewers will independently assess the RoB at the outcome level for all included studies. Any disagreements will be resolved by consulting a third reviewer.
The revised Cochrane risk-of-bias tool for randomized trials (RoB 2) [ 58] will be used to assess the RoB of included RCTs. This tool assesses five domains for each specific outcome, as outlined in the Cochrane Handbook for Systematic Reviews of Interventions: bias arising from the randomization process, bias due to deviations from intended interventions, bias due to missing outcome data, bias in the measurement of the outcome, and bias in the selection of the reported result. Responses to signaling questions (Yes, Probably yes, Probably no, No, No information) will lead to different RoB judgements (Low, High, Some concerns).
For non-randomized studies, we will use the RoB In Non-randomized Studies - of Interventions (ROBINS-I) tool to assess bias across seven domains for each outcome: bias due to confounding, bias in the selection of participants into the study, bias in classification of interventions, bias due to deviations from intended interventions, bias due to missing data, bias in the measurement of outcomes, and bias in the selection of the reported result [ 59]. Each domain will be judged as Low, Moderate, Serious, or Critical RoB, or No information.
3.3.4 Measure of the effect of methods
For effect measures related to dichotomous data, refer to the primary outcome measures listed in section 3.1.4. These are based on data from a 2 × 2 contingency table defined as TP, FP, TN, and FN.
For continuous data (person-time required, workload savings, time savings), the effect measure will be the MD or standardized mean difference (SMD).
3.3.5 Unit of analysis issues
Unit-of-analysis errors arise when the statistical analysis does not align with the level at which data were collected or randomized, potentially leading to artificially precise or erroneous estimates of the intervention's true effect. This error can occur across various study designs:
- • Cluster designs: Errors might arise from ignoring cluster effects in the analysis, for instance, when reviewers are allocated into groups but the analysis treats each reviewer as an independent individual without accounting for cluster effects. Similarly, clustering may occur in scenarios where citations are screened or records extracted, and shared factors such as review topics, complexity, or other common characteristics are not considered.
- • Crossover or repeated measures designs: Errors might involve confounding due to carryover effects if randomization and adjustment are inadequate. For example, in scenarios where reviewers are allocated to both AI and human groups, the sequence and washout time might influence the effects of the intervention. Differences could reflect reviewer preferences, learning, or fatigue effects rather than the intervention.
We will address unit-of-analysis errors following the Cochrane Handbook guidance [ 60].
For cluster-randomized trials, we will evaluate whether the original analysis appropriately accounts for the cluster design. Where possible, we will extract cluster-adjusted effect measures for the meta-analysis. Where studies fail to report cluster-adjusted measures, we will apply corrections by reducing the trial size to its “effective sample size” using the “design effect” [ 60]. The design effect is calculated as 1 + (M – 1) × ICC, where M is the average cluster size and ICC is the intracluster correlation coefficient, a measure of the similarity of observations within clusters. For dichotomous outcomes, both participant numbers and event counts will be adjusted, while for continuous outcomes, only the sample size will be reduced. An alternative approach is to inflate the standard error of the effect measure by multiplying it by the square root of the design effect [ 60]. The choice of approach will depend on the availability of information such as the ICC. Furthermore, the impact of excluding these studies on the robustness of the main results will be assessed in a sensitivity analysis.
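To make the adjustment concrete, the sketch below applies the design-effect correction described above to hypothetical cluster-trial numbers; the ICC, average cluster size, and counts are assumed for demonstration only.

```python
# Illustrative design-effect adjustment for a cluster-randomized trial
# that did not report cluster-adjusted estimates. All numbers are hypothetical.

def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Design effect = 1 + (M - 1) * ICC."""
    return 1 + (avg_cluster_size - 1) * icc

def effective_size(n: float, deff: float) -> float:
    """Reduce a raw count to its effective (design-effect-adjusted) size."""
    return n / deff

icc = 0.05          # assumed intracluster correlation coefficient
m = 25              # assumed average cluster size (eg, citations per reviewer)
deff = design_effect(m, icc)

n_events, n_total = 180, 500   # hypothetical dichotomous outcome counts
adj_events = effective_size(n_events, deff)
adj_total = effective_size(n_total, deff)
print(f"design effect = {deff:.2f}; "
      f"adjusted events/total = {adj_events:.0f}/{adj_total:.0f}")
```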
For crossover trials, we will assess and account for potential carry-over or period effects, and where possible, adjust for these (eg, by using data from the first period only if carry-over is thought to be a problem). If adjustment is not feasible, we will consider excluding studies from the meta-analysis.
In multiarm studies, to avoid the unit-of-analysis error caused by double counting, we will identify relevant intervention and control groups. All relevant intervention groups will be combined into a single intervention group, and all relevant comparator groups will be combined into a single comparator group. This approach will allow a single pair-wise comparison for meta-analysis. Irrelevant intervention or control groups will be excluded from the analysis.
3.3.6 Dealing with missing data
If key data, such as the 2 × 2 contingency data, are missing, we will attempt to contact the corresponding authors. The extent of missing data will be reported, and its impact on the certainty of the evidence and any potential biases will be discussed.
3.3.7 Assessment of reporting biases
To investigate potential reporting biases, we will follow the guidance in Chapter 13 of the Cochrane Handbook [ 61].
We will consider the following approaches:
- Characteristic assessment: We will examine the characteristics of included studies and consider factors that might lead to non-reporting of results.
- Assessment of selective non-reporting: We will compare study protocols and registry entries (where available) with published reports to identify any discrepancies in reported outcomes.
- Graphical methods: If we have 10 or more studies in a meta-analysis, we will consider using funnel plots [ 62]. We will use contour-enhanced funnel plots to help differentiate asymmetry due to reporting biases from other causes of asymmetry.
- Statistical methods: In meta-analyses with at least 10 studies, we will consider statistical tests for funnel plot asymmetry. For continuous outcomes, we will use the Egger test [ 63]. We will interpret these results cautiously, recognizing their limitations.
We will carefully differentiate between small-study effects identified by funnel plots and reporting biases. If we suspect reporting biases, we will conduct sensitivity analyses to evaluate the robustness of the meta-analysis conclusions, considering the potential impact of missing studies or results.
We acknowledge that in the field of AI tools for systematic reviews, factors such as rapid technological advancement, varying study sizes, and differing methodological approaches may contribute to asymmetry in results that are unrelated to reporting biases. We will consider these factors in our interpretation.
To assess the influence of small-study effects, where applicable, we will compare fixed-effect and random-effects estimates of the intervention effect when between-study heterogeneity exists (I² > 0). Similar estimates between methods would suggest minimal influence of small-study effects on the intervention effect estimate. If the random-effects estimate differs markedly, particularly shifting toward null or extreme effect sizes, we will investigate potential explanations including reporting biases.
3.3.8 Data synthesis
Studies that are sufficiently similar in terms of AI tools tested, outcome measures, and reference standards will be identified and grouped for analysis. We will calculate the effect measures for dichotomous data, such as sensitivity and specificity, along with 95% CIs using data from the 2 × 2 contingency tables. For continuous outcomes, the SMD will be calculated if the effect measure is reported on different scales across studies; otherwise, the MD will be used if it is on the same scale. We will visually assess the data by plotting the estimates on forest plots or in receiver operating characteristic (ROC) space.
Where appropriate, meta-analysis will be conducted in accordance with the Cochrane Handbooks for Systematic Reviews of Diagnostic Test Accuracy [ 64] and of Interventions [ 61].
We will fit random-effects models using the inverse-variance method to obtain pooled estimates of the outcome measures along with 95% CIs. For sensitivity and specificity, summary estimates with 95% CIs will be derived using hierarchical random-effects models. These models, in addition to pooling estimates of sensitivity and specificity, will also account for the correlation between them. For studies applying the same classification threshold for “relevant data”, bivariate models will be fitted. For studies reporting different thresholds, the hierarchical summary receiver operating characteristic (SROC) model will be fitted and the SROC curve generated. If there are too few studies, hierarchical models will be simplified as per Takwoingi et al [ 65].
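As a minimal sketch of inverse-variance random-effects pooling (using the DerSimonian-Laird estimator of tau-squared, one common choice that this protocol does not prescribe), the example below pools hypothetical study-level estimates; in practice, established meta-analysis software would be used, and the bivariate and hierarchical SROC models described above require more specialized routines.

```python
import math

# Minimal DerSimonian-Laird random-effects, inverse-variance pooling.
# Effect estimates (eg, log odds ratios or mean differences) and their
# standard errors are hypothetical.
effects = [0.42, 0.55, 0.30, 0.61]
ses     = [0.10, 0.15, 0.12, 0.20]

w_fixed = [1 / se**2 for se in ses]                      # fixed-effect weights
pooled_fixed = sum(w * y for w, y in zip(w_fixed, effects)) / sum(w_fixed)

# Cochran's Q and the DerSimonian-Laird estimate of tau-squared
q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(w_fixed, effects))
df = len(effects) - 1
c = sum(w_fixed) - sum(w**2 for w in w_fixed) / sum(w_fixed)
tau2 = max(0.0, (q - df) / c)

w_rand = [1 / (se**2 + tau2) for se in ses]              # random-effects weights
pooled = sum(w * y for w, y in zip(w_rand, effects)) / sum(w_rand)
se_pooled = math.sqrt(1 / sum(w_rand))
ci_low, ci_high = pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled

print(f"pooled estimate = {pooled:.3f} "
      f"(95% CI {ci_low:.3f} to {ci_high:.3f}), tau^2 = {tau2:.3f}")
```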
Given the F1 score's nature as a harmonic mean, which presents unique challenges for meta-analysis, we will instead summarize it descriptively as part of a narrative synthesis. Other performance metrics that are not suitable for meta-analysis will also be included in this descriptive analysis.
If meta-analysis is not feasible either due to excessive heterogeneity or insufficient data, then narrative synthesis will be performed as per the Synthesis Without Meta-analysis (SWiM) guidelines [ 66]. Following the SWiM guidelines, studies will be grouped by AI tool type, review stage, and outcome measures. Results will be presented using structured tables and, where appropriate, forest plots without pooled estimates.
3.3.9 Subgroup analysis and investigation of heterogeneity
Given the diverse nature of AI tools in systematic reviews, we anticipate that heterogeneity will be a key feature of our analysis.
We will assess heterogeneity using multiple approaches:
- Visual inspection of the forest plot of included studies, including the distribution of point estimates and the overlap and width of CIs.
- To quantify the extent of heterogeneity, we will calculate the I² statistic and use the following thresholds to interpret it [ 67] (see the illustrative calculation below).
- • 75%–100%: considerable statistical heterogeneity;
- • 50%–90%: substantial heterogeneity;
- • 30%–60%: moderate heterogeneity;
- • 0%–40%: might not be important.
From the random-effects meta-analysis, statistical heterogeneity will be measured using the between-study variance measure, tau-squared [ 68].
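For illustration, the sketch below derives I² from Cochran's Q and maps it to the rough interpretive labels listed above; the Q value and number of studies are hypothetical, and because the Cochrane ranges overlap, such labels guide rather than replace judgment.

```python
# Illustrative I^2 calculation and rough interpretation. Q and the number of
# studies are hypothetical; the overlapping Cochrane ranges mean the labels
# are a guide, not a rule.
def i_squared(q: float, n_studies: int) -> float:
    """I^2 = max(0, (Q - df) / Q) * 100, with df = number of studies - 1."""
    df = n_studies - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

def rough_label(i2: float) -> str:
    if i2 >= 75:
        return "considerable"
    if i2 >= 50:
        return "substantial"
    if i2 >= 30:
        return "moderate"
    return "might not be important"

i2 = i_squared(q=24.0, n_studies=8)   # hypothetical Q from a random-effects model
print(f"I^2 = {i2:.1f}% ({rough_label(i2)} heterogeneity)")
```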
We plan to examine the predetermined characteristics of the included studies through subgroup analysis. The following are our a priori subgroups:
- • Type of review (eg, intervention reviews, diagnostic test accuracy reviews, prognostic reviews).
- • Complexity of the review topic.
- • Volume of literature being reviewed within the host review.
- • Disciplinary areas within health care: While our focus is on human health-related areas, we will consider subgroups based on broad disciplinary categories. Our interpretation of heterogeneity statistics will consider this context.
- • Different processes of systematic reviews (eg, literature searching, screening, data extraction, and RoB assessment).
- • Different AI groups (eg, ML group, NLP group, text mining group, and combination group).
- • Different study designs (eg, randomized studies and nonrandomized studies).
We acknowledge that some of these subgroups, particularly those related to disciplinary areas, may have limitations in terms of meaningful differences, so we will undertake subgroup analyses only if their findings are likely to be useful in informing decisions. We will interpret the results of these subgroup analyses cautiously and consider their practical significance in the context of AI tools in systematic reviews.
3.3.10 Sensitivity analysis
We will conduct sensitivity analyses to evaluate how excluding the following types of studies impacts the robustness of the results from the analysis:
- • Studies with a high RoB.
- • Studies that did not account for clustering or carryover effect.
We will assess the certainty of the evidence for our primary outcomes using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach for intervention studies [ 69] (a body of evidence from randomized trials begins with a high certainty rating).
The GRADE approach considers five domains: study limitations (RoB), imprecision, inconsistency, indirectness, and publication bias. We will create summary of findings (SoF) tables according to different systematic review stages (eg, literature searching, screening, and data extraction).
We will present the following information for each outcome:
- • The estimated effect and its 95% CI
- • The number of studies and participants contributing to the outcome
- • A GRADE rating of the overall certainty of the evidence
- • An explanation of the GRADE rating
Evidence certainty will be graded as “high,” “moderate,” “low,” or “very low” by outcome. All judgements other than “high” certainty will be explained in the footnotes.
We will use the GRADEpro Guideline Development Tool software to create the SoF tables.
4 Discussion
We propose a protocol for a systematic review to assess the effectiveness of AI-based automated methods in conducting systematic reviews. To our knowledge, no systematic review has been conducted on this topic. This review aims to provide a comprehensive overview of AI tools in systematic reviews, assess their performance and resource utilization compared to conventional human-mediated manual methods, identify potential areas for improvement and future research directions, and inform decisions about the reliability and applicability of AI tools in systematic reviews.
The application of AI tools is expected to become a major trend. Understanding the latest and most reliable technologies will help researchers improve the quality and transparency of their studies by reducing error and bias.
This review has several strengths, including a comprehensive search strategy covering multiple databases and gray literature and a rigorous methodology following Cochrane guidelines.
However, this review may face limitations due to restricting included studies to English-language publications. In addition, resource and access restrictions may prevent us from covering all types of automation tools, and significant heterogeneity among included studies may reduce the generalizability of our findings. Other potential limitations include the rapid evolution of AI technologies, which may outpace the review process; variability in the quality and reporting of primary studies; and potential bias in the evaluation of AI tools, as studies may be more likely to report positive results.
The findings will inform best practices for integrating both AI-only and AI-assisted technologies into the systematic review process, potentially improving the efficiency and quality of evidence synthesis in health-care research.
Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of this work, the authors used GPT-4 by OpenAI to aid in building a search strategy and checking the grammar. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.
CRediT authorship contribution statement
Xuenan Pang: Conceptualization, Data curation, Methodology, Writing – original draft, Writing – review & editing. KM Saif-Ur-Rahman: Conceptualization, Methodology, Supervision, Writing – review & editing. Sarah Berhane: Methodology, Writing – review & editing. Xiaomei Yao: Writing – review & editing. Kavita Kothari: Methodology, Writing – review & editing. Petek Eylül Taneri: Writing – review & editing. James Thomas: Conceptualization, Methodology, Writing – review & editing. Declan Devane: Conceptualization, Methodology, Project administration, Supervision, Writing – review & editing.
Declaration of competing interest
There are no competing interests for any author.
Acknowledgments
The authors are grateful for the financial support provided for Xuenan Pang's PhD studies by the Chinese Scholarship Council, China and the College of Medicine, Nursing, and Health Sciences at the University of Galway, Ireland. Additionally, the authors acknowledge funding support for the publication cost from the OVPRI Academic Publication Support as part of the University of Galway's commitment to Research Excellence.
Supplementary data
Appendix A and B
Supplementary data to this article can be found online at https://doi.org/10.1016/j.jclinepi.2025.111738.
©2025. The Authors