ABSTRACT
Assessment of the risk of engaging in a violent radicalization/extremism trajectory has evolved quickly in the last 10 years. Guided by what has been achieved in psychology and criminology, scholars from the field of preventing violent extremism (PVE) have tried to import key lessons from violence risk assessment and management, while bearing in mind the idiosyncrasies of their particular field. However, risk tools that have been developed in the PVE space are relatively recent, and questions remain as to their level of psychometric validation. That is, do these tools consistently and accurately assess the risk of violent extremist acting out? To answer this question, we systematically reviewed evidence on the reliability and validity of violent extremism risk tools. The main objective of this review was to gather, critically appraise, and synthesize evidence regarding the appropriateness and utility of such tools, as validated with specific populations and contexts. Searches covered studies published up to December 31, 2021. They were performed in English and German across 17 databases, 45 repositories, Google, other literature reviews on violent extremism risk assessment, and references of included studies. Studies in all languages were eligible for inclusion in the review. We included studies with primary data resulting from the quantitative examination of the reliability and validity of tools used to assess the risk of violent extremism. Only tools usable by practitioners and intended to assess an individual's risk were eligible. We did not impose any restrictions on study design, type, method, or population. We followed standard methodological procedures outlined by the Campbell Collaboration for data extraction and analysis. Risk of bias was assessed using a modified version of the COSMIN checklist, and data were synthesized through meta-analysis when possible. Otherwise, narrative synthesis was used to aggregate the results. Among the 10,859 records found, 19 manuscripts comprising 20 eligible studies were included in the review. These studies focused on the Terrorist Radicalization Assessment Protocol (TRAP-18), the Extremism Risk Guidance Factors (ERG22+), the Multi-Level Guidelines (MLG-V2), the Identifying Vulnerable People guidance (IVP guidance), and the Violent Extremism Risk Assessment (VERA)—all structured professional judgment tools—as well as Der Screener—Islamismus, an actuarial scale. Studies mostly involved adult male participants susceptible to violent extremism (total N = 1106; mean sample size per study = 58.21, SD = 55.14). The types of extremist ideologies endorsed by participants varied, and the same was true for ethnicity and country/continent of provenance. Encouraging results were found concerning the inter-rater agreement of scales in research contexts (kappas between 0.76 and 0.93), but one of the two studies that examined it in a field setting obtained disappointing results (kappas ranging from 0.47 to 0.80). Content validity studies indicated that PVE risk tools adequately cover the risk factors and offending processes of individuals who go on to commit extremist violence. Construct validity analyses were few and far between, with results indicating that empirical divisions of scales did not match their conceptual divisions. The internal consistency of subscales was lackluster (Cronbach's alphas between 0.19 and 0.85), whereas full scales demonstrated acceptable internal consistency when assessed (0.80 for the ERG22+ and 0.64 for the IVP guidance).
Only one study examined convergent validity, and it revealed a lack of convergence, primarily due to particularities of the scale under study (the MLG-V2). Discriminant validity analyses were exploratory in nature, but suggested that PVE risk tools might not be ideology-specific and may apply to both group and lone actors. Finally, although the TRAP-18 showed a relatively strong postdictive effect size (pooled r = 0.62 [0.35–0.77], p < 0.001), the results were highly heterogeneous (I² = 86%), and all studies used retrospective designs, meaning the outcome was already known at the time of assessment. As such, no included study evaluated true predictive validity (i.e., the ability to forecast future violent extremist outcomes based on prospective risk assessment). This represents a significant evidence gap. Threats to validity were substantial: (a) many studies were case studies or had very small samples, (b) nearly all samples were constituted through the triangulation of publicly available data, and (c) convenience outcome measures were often used. Although having imperfect data is better than having no data, the current state of empirical validation precludes the recommendation of one tool over another for specific populations and contexts, and calls for higher-quality validation studies for PVE risk assessment tools. Nevertheless, these tools constitute useful checklists of relevant risk and protective factors that could be taken into account by evaluators who wish to assess the risk of violent extremism and identify intervention targets.
Plain Language Summary
Risk assessment tools for violent extremism may be useful in some contexts; however, there is a need for higher quality evaluation studies.
The Review in Brief
This systematic review looks at the validation of tools used to assess the risk of violent extremism. It finds that none of these tools currently meet the standards expected in the field of correctional psychology.
What Is This Review About?
Risk assessment helps people in the justice system make decisions about supervision, early release, and who should get certain services. These tools have been used in prisons and probation systems around the world since the 1980s. But in the area of preventing violent extremism (PVE), the tools are newer, and it's unclear if they are accurate and consistent. This review brings together studies that tested whether these tools work as intended. We looked at how well they measure what they claim to measure (validity) and how consistently they give the same results (reliability).
What Is the Aim of This Review?
This Campbell systematic review examines the reliability and validity of risk tools for the assessment of violent extremism. It is based on 20 studies that tested these tools and looked at how strong the research evidence is.
What Are the Main Findings of This Review?
The 20 studies looked at six tools: the Terrorist Radicalization Assessment Protocol (TRAP-18), the Extremism Risk Guidance (ERG22+), the Multi-Level Guidelines (MLG-V2), the Identifying Vulnerable People guidance (IVP guidance), the Violent Extremism Risk Assessment (VERA), and Der Screener—Islamismus. Studies mostly comprised adult men who adhered to various extremist ideologies (far right, Islamist, nationalist, incel, etc.) and came from multiple countries and continents.
Many studies had major limitations. Some had very small samples or used publicly available data (like news articles, biographies, or databases). Many also used convenience outcome measures, like whether an attack was stopped or not. Importantly, none of the studies used a “prospective” design—meaning that none tested whether the tools could actually predict future violence based on an assessment done before anything happened. Instead, all the studies looked backward in time, after outcomes were already known. These are called postdictive studies.
There were, however, some positive results. In research settings, different experts often gave similar scores when using the same tool (inter-rater agreement), though this was not always true in real-world settings. Studies on content validity found that most tools include risk factors linked to extremist violence. Discriminant validity results suggested that these tools might work for both individuals and groups, and for different types of ideologies. But while some tools showed strong results in postdictive validity studies, there was a lot of variation across studies, and none truly tested predictive validity.
What Do the Findings of This Review Mean?
Right now, we cannot say that one tool is better than another. These tools should not be used as the only source of information to make important security decisions. However, they can still help professionals think about relevant risk and protective factors and plan support or interventions. More high-quality research is needed to test how well these tools work in real-world situations.
How Up-to-Date Is This Review?
The review includes studies published up to December 31, 2021.
Introduction
Background
The Problem, Condition, or Issue
Assessment of the risk of engaging in or desisting from a violent radicalization trajectory has evolved quickly in the last 10 years. “Standing on the shoulders of giants” (Logan and Lloyd 2019) of what has been achieved in psychology and criminology over the last 50 years, scholars from the field of PVE have tried to import key lessons from violence risk assessment and management while taking into account the idiosyncrasies of their particular field (e.g., Borum 2015).
The advantages of having reliable and valid tools to anticipate and mitigate the risk of violent extremism cannot be overstated. Law enforcement and intelligence professionals must assess persons of concern before they become involved in planning or executing attacks (threat assessment), and the criminal justice system must determine when and under what circumstances inmates may be released (risk assessment; Borum 2015). Mental health and psychosocial professionals are now often called upon to perform violent extremism risk assessments in the same way they would for suicidal or homicidal risks among their clients (Borum 2015; Logan and Lloyd 2019). While PVE risk tools are often positioned as useful for both disruption and rehabilitation purposes (e.g., Meloy 2017), in the broader field of criminology, instruments such as the Level of Service/Case Management Inventory (LS/CMI; Andrews et al. 2004) and the Static-99R (Helmus et al. 2022) are usually reserved for tertiary prevention spaces—that is, to assess release viability and conditions, and to guide prevention and rehabilitative interventions (Bonta and Andrews 2024; Mullen 2000).
In the field of correctional psychology, rehabilitative approaches emphasize the relevance of risk assessments for correctional intervention and service delivery or, in other words, risk reduction via treatment, capacity building, and social reinsertion (Brouillette-Alarie and Lussier 2018). Initially dominated by a “nothing works” mindset (Martinson 1974), the fields of psychology and criminology funded research on the determinants of effective correctional programming, leading to the identification of the risk-need-responsivity principles1 and the development of interventions that are able to reduce the risk of recidivism among judicialized persons (Bonta and Andrews 2024). Meta-analyses have shown that correctional interventions not based on risk-need-responsivity principles are generally ineffective and can sometimes lead to iatrogenic effects (Bonta and Andrews 2024; Hanson et al. 2009). By contrast, interventions that respect all three principles can achieve effect sizes comparable to those of psychological and medical interventions (Bonta and Andrews 2024; Marshall and McGuire 2003). The risk-need-responsivity principles of effective correctional intervention are rooted in reliable and valid assessments of the risk posed by judicialized individuals—particularly in the sources or causes of that risk (i.e., individuals' criminogenic needs). Even though the lessons of correctional psychology cannot be imported “as is” in the field of PVE, they nevertheless highlight the importance of effective risk assessment tools to help structure prevention and intervention efforts.
Unfortunately, as of now, there are no gold standards in the risk assessment of violent extremism. Multiple authorities in the field are critical of the viability of risk assessment, especially when it is of the actuarial type2 (Borum 2015; Corner and Taylor 2023; Monahan 2012; Sarma 2017). Five obstacles are commonly mentioned: (1) empirical research on risk and protective factors of violent extremism is not sufficiently developed to provide a sound empirical basis for what to include and not include in tools; (2) research on risk factors comes from commonalities between individuals who have committed terrorist attacks, but appropriate validation would require that these characteristics be relatively absent from control groups that have not committed such attacks; (3) the low base rate of recidivism among violent extremist offenders complicates predictive validity analyses and inflates the risk of false positives; (4) violent radicalization trajectories can lead to multiple outcomes (i.e., radicalization of ideas, joining and participating in extremist group activities, committing acts of violence, etc.), and the same predictors might not apply equally to all outcomes; and (5) risk of violent extremism might not be cumulative, in contrast to risk of general violence or criminal recidivism (Borum 2015; Conley 2019; Monahan 2012; RTI International 2018; Sarma 2017; van der Heide et al. 2019).
To overcome these limitations, scholars have advocated for the development, validation, and use of structured professional judgment (SPJ) protocols over actuarial scales to assess the risk of violent extremism (Borum 2015; Monahan 2012). At present, most publicly known tools that are designed to assess the risk of violent extremism and that are used operationally by practitioners consist of SPJ protocols (Scarcella et al. 2016): the ERG22+ (National Offender Management Service 2011); the IVP guidance (Egan et al. 2016); the MLG-V2 (Cook et al. 2013, 2015); the TRAP-18 (Meloy 2017); and the VERA-2R (Pressman 2009; Pressman et al. 2016; Pressman and Flockton 2012).
Although these tools are grounded in sound conceptual frameworks, concerns have long been raised about their empirical validation. A key reference in this regard is the systematic review by Scarcella et al. (2016), which reflected the state of the field at the time by highlighting the absence of predictive validity analyses—arguably the most important criterion for evaluating risk assessment tools. While some studies compared the scores of attackers and non-attackers (e.g., Meloy and Gill 2016), they often relied on convenience samples and retrospective coding of open-source information rather than prospective designs with real-world participants. Scarcella et al. (2016) also noted that several widely used tools, including the VERA, had only been evaluated in terms of face/content validity and interrater agreement.
That being said, research in the field of PVE is evolving quickly, making assumptions that were considered “true” 10 years ago not so clear-cut today. For example, most of the authors critical of violent extremism risk assessment cited the lack of empirical research on risk and protective factors and the differential predictive validity of factors depending on the outcome of interest. On that point, a recent meta-analysis of risk and protective factors of violent radicalization/extremism (Wolfowicz et al. 2020) included aggregated effect sizes separated by the following outcomes: radical attitudes, intention to act, and violent extremist behaviors (e.g., attacks). Results indicated not only that similar risk and protective factors predicted both attitudes and behaviors, but also that sociodemographic characteristics had less explanatory power than the psychological and personality-related factors commonly found in violence risk assessment. This finding contradicts many assumptions held in the field, namely that risk factors for general violence alone may be insufficient or irrelevant in cases of violent extremism. Similarly, the assumption that risk of violent extremism is not cumulative has been challenged by studies that found risk and protective factors had incremental validity in the prediction of violent extremist attacks towards persons (Jensen and LaFree 2016). Hence, risk tools for violent extremism that were created 10 years ago might rely on evidence that is now being challenged, as data increasingly accumulate on types of violent radicalization processes beyond Islamist extremism. Similarly, the level of validation of some risk tools, as reviewed by Scarcella et al. (2016), might have significantly evolved since these authors published their systematic review. This rapid evolution of scientific knowledge about tools to assess and monitor factors relevant to the risk of violent extremism warrants the production of a new systematic review, which will examine whether currently available tools are sufficiently reliable and valid to recommend their use, depending on the context, type of case assessed, and practitioner conducting the assessment.
Why It Is Important to Do the Review
The current systematic review will be of major relevance for PVE practitioners receptive to the idea of using risk assessment tools but unsure of which to choose and for which context. Surveys of practitioners working in the field have noted that many are open to using tools but uncertain whether the available ones are adapted to their sector, setting, or clients (Hassan et al. 2020; Madriaza et al. 2018). Furthermore, the lack of validation of most tools, as well as their significant monetary costs, has led some teams to create internally developed scales or to rely purely on professional judgment. Knowing the potential pitfalls of assessing risk using unstructured clinical judgment (Dawes et al. 1989; Grove et al. 2000; Viljoen et al. 2025), it would be wise to answer practitioners' questions concerning risk tools for violent extremism. Our systematic review could also enable assessors to avoid the potential iatrogenic effects associated with the use of tools that are not fit for purpose for certain populations or contexts.
The costs associated with misevaluating risk are numerous. Risk overestimation can lead to more surveillance, stigmatization, unjustified repressive practices, longer than necessary sentences, and waste of funds on interventions that are not only unnecessary but also potentially harmful (Bonta and Andrews 2024; Brouillette-Alarie and Lussier 2018). Risk underestimation, in turn, can result in premature releases and new victims (Douglas et al. 2017; Gendreau et al. 1996; Hanson 2009). Even though it is unrealistic to assume that each recidivism case could have been prevented with better assessment or decision-making, it is important for clinicians, practitioners, and evaluators to be able to attest that their decisions were based on empirically validated procedures and high ethical standards in risk assessment.
Potential risk overestimation is especially important in the context of violent extremism because base rates are so low compared to other types of outcomes such as criminal recidivism (Borum 2015; Sarma 2017). This makes prediction especially challenging, as statistical models usually underperform when base rates are very low. Therefore, an investigation of the potential false positives of violent extremism risk tools is paramount.
In sum, the current systematic review could have implications for practitioners and decision-makers in the field of PVE, as well as for public safety and judicialized persons. Evidence permitting, it will advise practitioners and decision-makers concerning which tools to use, which tools to avoid, and in which contexts. It will also potentially ease clinicians' concerns regarding risk tools or, conversely, raise their vigilance towards tools that are problematic. In both cases, the endeavor should contribute to better assessments of risk in the field of PVE.
How This Review Might Inform or Supplement What Is Already Known in This Area
Searches for existing relevant systematic reviews and meta-analyses were conducted on Google. Five systematic reviews (Desmarais et al. 2017; Gill et al. 2021; Lösel et al. 2018; Misiak et al. 2019; Vergani et al. 2020) and two meta-analyses (Emmelkamp et al. 2020; Wolfowicz et al. 2020) on risk and protective factors for violent extremism were found. Although these systematic reviews and meta-analyses were of relevance when examining the face and content validity of risk tools, they were beyond the scope of our systematic review, which focuses on tools rather than individual factors.
Two systematic reviews on tools that assess the risk of violent extremism were found: Scarcella et al. (2016) and Clesle et al. (2025). The first explored the level of psychometric validation of instruments that identify risk factors of terrorism, extremism, radicalization, authoritarianism, and fundamentalism (Scarcella et al. 2016). The authors found four instruments that specifically assess violent extremism risk (e.g., VERA) and 17 research measures/instruments that assess attitudes related to violent extremism (e.g., Religious Fundamentalism Scale [RFS], Right-Wing Authoritarianism Scale [RWA]; Altemeyer and Hunsberger 1992). The authors concluded that the empirical validation of most scales, especially those designed specifically to assess the risk of violent extremism, was in its infancy. Even though they concluded that more validation was needed, they did not provide recommendations for practitioners who could be looking to use violent extremism risk assessment tools. Moreover, some risk tools (e.g., MLG) were omitted, while others may have been published since then. Considering the fast pace with which the field is evolving (i.e., many studies on violent extremism risk tools have been published in the last 9 years), proposing recommendations based solely on evidence collected in 2016 may no longer be current or reliable. Finally, Scarcella et al. (2016) only briefly discussed criterion-validity findings (e.g., predictive, concurrent, and convergent validity)—arguably the most important validation criteria for professionals looking for guidelines on which tools to use and with whom.
Clesle et al. (2025) subsequently published a systematic review of violent extremism risk assessment tools at roughly the same time as the present study, reflecting the growing recognition of the need to synthesize evidence in this field. Their review offers a valuable overview of available instruments and highlights several conceptual and practical challenges faced by evaluators in the PVE space. While highly relevant, their synthesis focuses on summarizing the available information associated with each tool individually, including instruments for which no psychometric validation studies were available, explaining why the two reviews do not entirely overlap in the set of tools and studies covered. The present Campbell review, in contrast, applied stricter inclusion criteria, retaining only studies that reported empirical data on the psychometric properties of tools. In addition, their review did not include (or declare) a formal risk of bias assessment or conduct quantitative pooling of results. The present review applies standardized risk of bias procedures and reports meta-analytic estimates where possible, while adopting a cross-cutting approach that synthesizes psychometric properties across tools and validation domains.
Finally, although it is neither a systematic review nor a meta-analysis, the contribution of Logan and Lloyd (2019) must also be noted. In their paper, the authors first contextualize the tasks of assessing, understanding, and managing risk in the field of PVE based on the progress made in the fields of correctional psychology and criminology. Then, they map the risk assessment tools used by PVE practitioners, describe their intended use, and relate studies attesting to their validation. Finally, they suggest guidance on how to ethically conduct risk assessment with individuals on violent radicalization trajectories, appraise the quality of existing evidence concerning the validation of available risk tools, and plan future evaluations of risk tools in the field. A similar effort can be found in the Extremism Risk Assessment Directory (Lloyd 2019), which presents concise, practitioner-oriented summaries of multiple tools, including their intended use, structure, and supporting evidence. However, like most summaries of violent extremism risk tools (e.g., Conley 2019; RTI International 2018; van der Heide et al. 2019), the methods used to search for relevant literature or collect and analyze data were not systematic. Therefore, the current systematic review aims to improve on their important work by structuring data collection and analysis, as well as ensuring that the literature search is up to date.
Objectives
The main objective of this project was to gather, critically appraise, and synthesize evidence about the psychometric properties of tools used to assess the risk of violent extremism. The specific questions of our systematic review were as follows:
1. What are the tools used to assess the risk of violent extremism?
2. What is the reliability of these tools?
3. What is the validity of these tools?
4. Based on the tools' psychometric properties, as validated with specific populations and in specific contexts, are they fit for purpose for such populations and contexts? In other words, what are the advantages and disadvantages associated with the use of these tools for public safety, practitioners, and individuals on a trajectory towards violent extremism?
Methodology
This study followed a protocol approved by the Campbell Collaboration (Hassan et al. 2022).
Criteria for Considering Studies in This Review
The inclusion and exclusion criteria that were used to identify eligible studies of risk tools for violent extremism can be found in Table 1.
Table 1 Inclusion and exclusion criteria.
| Included | Excluded |
| --- | --- |
| Risk tools designed to assess the risk of violent extremism | Risk tools from other fields intended for other purposes |
| Tools operationally usable by clinicians for cases involving individuals | |
| Primary data, including indirect primary data (e.g., triangulation of publicly available data) | Secondary data (i.e., literature reviews, systematic reviews, meta-analyses—references were checked, however) |
| Quantitative studies | |
| Comprises data on the types of reliability and validity eligible in this systematic review (see the Outcomes section) | Does not comprise data on eligible types of reliability and validity (see the Outcomes section) |
We included studies with primary data resulting from the quantitative examination of the reliability and validity of tools used to assess the risk of violent extremism. We initially planned to include qualitative research designs, but an overview of initial search results indicated that very few or none were available in the context of risk tool validation. Furthermore, only tools usable by practitioners aiming to assess the risk of individuals were eligible. This means that the following were ineligible: (a) tools designed to assess the risk of a building being targeted; (b) tools that operate at a group level only; (c) tools that require access to large databases to function; (d) scales developed for theory validation only; and (e) self-report questionnaires assessing constructs related to violent extremism.
Beyond limiting ourselves to studies comprising primary quantitative data (including the quantitative sections of mixed-method studies) about the reliability and validity of eligible risk tools, we did not impose any restrictions on study design, type, method, or date (up to December 31, 2021); the state of the literature is such that further restrictions could have left only a small number of studies, which would not give a clear picture of what is being done in the field. We excluded manuscripts where the authors reflected on a tool, as such reflections do not comprise primary empirical data. We did, nevertheless, take note of the authors' main conclusions, should they prove relevant for our discussion. The same was done for qualitative studies.
To reduce “publication bias” (Tanguy et al. 2014), our review included not only articles published in peer-reviewed journals but also gray literature found by searching the Web. Our review excluded systematic and literature reviews, as these constitute secondary data. We did, however, harvest the references of such reviews to ensure that our search strategy found all the relevant studies.
Types of Studies
Eligible studies were categorized into the following groups (in decreasing order of methodological robustness):
1. Prospective data on reliability and validity using samples recruited in clinical and/or prison settings;
2. Retrospective data on reliability and validity using samples recruited in clinical and/or prison settings; or
3. Retrospective data on reliability and validity using samples built by compiling publicly available information about individuals involved in violent extremism (e.g., already existing terrorist databases or datasets put together by compiling journal articles about terrorist cases).
Population and Context
To be eligible for this review, studies needed to be about assessment tools designed to assess the risk of violent extremism. Violent extremism is here defined as the endorsement or use of violence in support of political, ideological, or religious causes (Neumann 2013; Schmid 2013, 2014). While violent extremism is sometimes conflated with radicalization, we distinguish between the two insofar as radicalization refers to a process—often nonlinear and individualized—through which a person may or may not come to justify or support violence (Borum 2011; Neumann 2013; Schmid 2013, 2014). In contrast, violent extremism refers to the potential outcome of that process, manifested through violent behavior (including terrorism; Schmid 2012) or explicit support for its use.
In the context of this systematic review, all included risk tool validation studies used outcomes involving actual or intended acts of extremist violence (e.g., planning, attempts, convictions), rather than the mere presence of radical or extremist attitudes. As such, the validation focus was clearly behavioral. Throughout the review, we aim to use the term “violent extremism” to refer to this behavioral outcome, and reserve “radicalization” for instances where we explicitly discuss processual or attitudinal dimensions. That said, we acknowledge the broader lack of definitional clarity in the field, a challenge repeatedly noted in the literature (Bartlett and Miller 2012; Neumann 2013; Schmid 2013, 2014). The terms violent extremism and radicalization are often used interchangeably in both academic and policy literatures, with terminological choices sometimes shaped more by political context or institutional trends than by analytic precision.
To be eligible, studies needed to comprise samples of individuals evaluated on a PVE risk tool because of their potential or actual involvement in extremist violence. Assessments could be conducted prospectively or retrospectively, by either practitioners or researchers. In the context of our systematic review, “risk” refers to the presence of multiple risk factors and/or the absence of protective factors, as defined by the tool. Participants across all levels of the risk spectrum were eligible, as this variation is necessary to adequately test predictive validity. Since most PVE tools follow an SPJ model, risk could be determined either through summative scores (often used in research) or through professional judgments guided by the tool's framework.
PVE risk tools can be used in tertiary prevention settings, that is, with individuals who have already committed acts of extremist violence or been involved with extremist groups and are now in the process of disengagement and reintegration. In such contexts, the relevant outcome would shift from the occurrence of extremist violence to the risk of recidivism, either in the form of further extremist violence or broader criminal reoffending. However, our review found no validation studies in which this type of outcome was examined using a prospective design. As such, while this use of PVE tools is conceptually possible and practically relevant, it was not reflected in the empirical evidence base available at the time of this review.
Finally, if a study only comprised practitioners or stakeholders rather than clients (e.g., a study asking evaluators about their experience using a tool), it was eligible on the condition that the tool was designed to assess the risk of violent extremism. Other than that, no other exclusion criteria concerning participants and contexts were applied.
Outcomes
In this review, the main outcome of interest was the level of psychometric evidence for each tool. Psychometric evidence was divided into the following dimensions: reliability (interrater agreement, internal consistency) and validity (content, convergent, discriminant, predictive, construct). Readers seeking to familiarize themselves with psychometric principles and their application to risk tool validation can consult Furr (2021) and Hanson (2021), respectively.
Reliability: Degree to which the measure of a construct is consistent or dependable.
1. Interrater agreement: Measure of consistency between two or more independent raters (e.g., assessors) of the same construct.
2. Internal consistency: The extent to which items within a scale are correlated with one another, indicating that they reflect the same underlying construct.
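For reference, the textbook formulations of these two reliability indices are given below (a minimal summary; the included studies also reported variants such as weighted kappa, Fleiss' kappa, Gwet's AC1, Krippendorff's alpha, and intraclass correlations):

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right)$$

where $p_o$ is the observed proportion of agreement between raters, $p_e$ the proportion of agreement expected by chance, $k$ the number of items, $\sigma_i^2$ the variance of item $i$, and $\sigma_X^2$ the variance of total scores.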
Validity: Extent to which a measure adequately represents the underlying construct it is supposed to measure.
1. Content: Whether the operationalization of a construct adequately covers its content.
2. Construct: Construct validity can be construed as an overarching type of validity that is defined by the extent to which scores on the instrument are indicative of the theoretical construct. In the context of this systematic review, construct validity will mainly encompass examinations of the latent structure of a tool as obtained by, for example, factor analyses.
3. Convergent: Closeness with which a measure relates to (or converges on) other measures of the same or closely related constructs.
4. Discriminant: Degree to which a measure does not overlap with (i.e., discriminates from) other constructs it is not supposed to measure. In the context of this systematic review, studies of discriminant validity primarily involved testing group differences in risk scores, such as comparisons between individuals adhering to different ideological motivations or between lone and group-based actors.
5. Predictive: Degree to which a measure successfully predicts a future outcome that it is theoretically expected to predict. Note: Several studies in this review used retrospective designs, where the outcome was already known at the time of assessment. While such designs can provide preliminary indications of predictive potential, they do not constitute true predictive validity tests because they cannot demonstrate the capacity of a tool to forecast future outcomes. In line with terminology used by authors, we refer to these as “postdictive” validity studies.
Search Methods for Identification of Studies
In consultation with a library science expert, we developed a search strategy aimed at targeting an array of bibliographic databases and gray literature resources. The bibliographic search was conducted in three phases. The first one was done at the end of 2020 and included studies published up to November 2020. At the time, we did not use proximity operators and included self-report questionnaires such as the RWA or the Activism and Radicalism Intention Scales (ARIS; Moskalenko and McCauley 2009).
Second, following suggestions made by Campbell Collaboration editors and reviewers, modifications were integrated into the search strategy. Proximity operators were added, some keywords were slightly modified, and self-report questionnaires were excluded from the review. Most importantly, the search period was extended to include papers published until December 31, 2021.
Third, during the second half of 2022, Public Safety Canada asked our team to translate the search strategy into German to make use of databases specific to the German context. Thus, two researchers from Germany were recruited to our team, translated the search strategy, compiled a list of known German tools on which there might be evaluation research, and conducted official and gray literature searches. The German literature search used the same parameters as the second phase of our literature search, albeit in German and with risk tools that are specific to the German context. The end date remained December 31, 2021.
Electronic Searches
We conducted searches in a variety of bibliographic databases, both subject-specific databases and general multidisciplinary databases. While the searches employed standard Boolean logic, they were tailored to the features of each database, making use of available controlled vocabulary and employing proximity operators where possible. The databases used are as follows:
Academic Search Complete (EBSCO).
Criminal Justice Abstracts (EBSCO).
Education Source (EBSCO).
Education Resources Information Center (ERIC; EBSCO).
Krimdok.
Medline (EBSCO).
National Criminal Justice Reference Service (NCJRS; ProQuest).
ProQuest Central (ProQuest).
ProQuest Dissertations and Theses Global (ProQuest).
PsycINFO (Ovid).
Sociological Abstracts (ProQuest).
Web of Science platform's core collection:
- Science Citation Index Expanded.
- Social Sciences Citation Index.
- Arts & Humanities Citation Index.
- Conference Proceedings Citation Index – Science.
- Conference Proceedings Citation Index – Social Science & Humanities.
- Emerging Sources Citation Index.
Our search strategy is divided into three blocks. The first block is a proximity search for terms representing “risk” and “assessment,” which needed to be a maximum of three words apart. The second block comprises “extremism/radicalization” and accompanying synonyms. The third block comprises the names of risk tools relevant to the field of PVE that were found in preliminary searches. To be captured by the search, a manuscript needed to satisfy (BLOCK 1 AND BLOCK 2) OR BLOCK 3.
Searches were conducted in English and German, but no languages were excluded from the results. The search fields included title, abstract, keywords, and subject/indexing. Endnote was used to store the search results. The full search record can be found in Appendix A, along with the detailed logs of all official literature searches (listed by search phase).
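Purely as an illustration of this three-block logic (the exact strings, controlled vocabulary, and database-specific syntax are documented in Appendix A), a query of the following general form would satisfy the (BLOCK 1 AND BLOCK 2) OR BLOCK 3 rule in an EBSCO-style interface, where N3 requests terms within three words of each other; the keywords shown are placeholders rather than the strategy actually used:

```text
BLOCK 1: (risk OR threat) N3 (assess* OR evaluat* OR screen*)
BLOCK 2: extremis* OR radicali* OR "violent radicalization"
BLOCK 3: "TRAP-18" OR "ERG22+" OR "VERA-2R" OR "Multi-Level Guidelines"
         OR "Identifying Vulnerable People"

Final query: (BLOCK 1 AND BLOCK 2) OR BLOCK 3
```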
Searching Other Resources
The gray literature search strategy was divided into three parts. Part 1 consisted of inputting the following search string in Google: “risk assess AND (radicalisation OR radicalization OR extremism) filetype:pdf.” Google was preferred over Google Scholar to ensure a better coverage of manuscripts not published in indexed scientific journals. Search results were scoured until five pages of irrelevant results came up (with 10 results per page). We also conducted Google searches using the names of each violent extremism risk tool, including German tools. An example of a search string for the ERG22+ was: “extremism risk guidance filetype:pdf.”
Part 2 involved searching webpages, research repositories, and non-indexed journals relevant to violent radicalization/extremism. These sources were examined using variants of the Google search strings, including those containing the names of PVE risk tools. The list of webpages, repositories, and journals was based on another Campbell Collaboration systematic review led by members of this team and on suggestions made by our German colleagues. The list is as follows:
Alliance for Peacebuilding—P/CVE Digest Resource Library.
American Bar Association—Rule of Law Initiative.
Brennan Center for Justice.
Campbell Collaboration.
Center for Evidence Based Crime Policy.
Center for Strategic and International Studies.
Center on International Cooperation.
CLEEN Foundation.
Comité interministériel de prévention de la délinquance et de la radicalization.
Council of Europe—Counter-terrorism.
COWI.
Defense.
Department of Homeland Security.
Educate Against Hate.
Geneva Center for Security Policy.
Georgetown Security Studies Review.
Global Center on Cooperative Security.
Global Counterterrorism Forum.
GOV.UK.
Hedayah.
Institute for Security Studies.
Institute for Strategic Dialog.
International Center for Counter-Terrorism.
International Center for the Study of Radicalization.
International Crisis Group.
Journal for Deradicalization.
KrimLit (German).
KrimPub (German).
Ministère de l'Intérieur.
Ministry of Foreign Affairs of Denmark.
National Consortium for the Study of Terrorism and Responses to Terrorism.
Organization for Security and Co-operation in Europe.
PSYNDEX (German).
Public Safety Canada.
Publications Office of the European Union.
RAND—Homeland Security and Public Safety.
Royal United Services Institute.
Search for Common Ground.
Social Science Open Access Repository.
UNESCO—Preventing Violent Extremism.
United Nations Development Program.
United Nations Office on Drugs and Crime—Terrorism Prevention Branch.
United States Agency for International Development.
Violent Extremism Evaluation Measurement Framework.
Violence Prevention Network.
Part 3 consisted of thoroughly searching the reference sections of literature and systematic reviews of risk assessment tools in the PVE space (Lloyd 2019; Logan and Lloyd 2019; RTI International 2018; Scarcella et al. 2016; van der Heide et al. 2019), as well as the references of all included studies in this systematic review.
The gray literature searches of Phase 1 were conducted in February 2021, those of Phase 2 in November 2023, and those of Phase 3 in February 2023. The eligibility end date for both official and gray literature was December 31, 2021. We did not contact authors or organizations to obtain studies beyond those identified in the aforementioned searches, nor did we contact them to obtain missing or supplementary data.
Data Collection and Analysis
Selection of Studies
Selecting admissible studies was performed by three research assistants who independently screened the abstracts of manuscripts identified by the systematic search. Each manuscript was rated as “no,” “maybe,” or “yes” by at least two research assistants, according to a screening tool based on our inclusion/exclusion criteria (see Hassan et al. 2022). Disagreements (i.e., one research assistant having a different response than the others) were dealt with in a meeting with lead researchers until consensus was reached.
Training was provided to research assistants before the start of the selection process. After the first 400 articles (8.6% of the records identified in the November 2020 search) were coded, the “initial” interrater agreement was computed using Fleiss' (1971) kappa and reached 0.468 (SE = 0.029). Because that result was problematic, another round of training was provided to research assistants before computing the “final” interrater agreement on another set of articles (the Web of Science results, which amounted to 18.5% of the November 2020 search records). The interrater agreement of that round reached a Fleiss's kappa of 0.746 (SE = 0.020), which indicates moderate agreement (Shrout 1998).
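As an illustration of how this type of agreement coefficient can be computed, the following minimal sketch derives Fleiss' kappa from hypothetical “no/maybe/yes” screening decisions (not our actual data); the use of the statsmodels implementation is an assumption for illustration, not the software used in the review:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical screening decisions for six abstracts by three raters
# (0 = "no", 1 = "maybe", 2 = "yes"); the real screening rounds covered
# several hundred to several thousand records.
ratings = np.array([
    [0, 0, 0],
    [2, 2, 1],
    [1, 1, 1],
    [0, 1, 0],
    [2, 2, 2],
    [0, 0, 1],
])

# Convert the subjects-by-raters matrix into a subjects-by-categories
# count table, then compute Fleiss' kappa on that table.
counts, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")
```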
All studies rated as “maybe” or “yes” after disagreements were resolved were selected for full-text screening. Full-text screening was also conducted independently by two research assistants. During this stage, assistants confirmed that the studies met all the inclusion/exclusion criteria and comprised data relevant to our outcomes of interest—something that is not easy to determine at the initial screening stage. Each full-text study marked for exclusion was also screened by lead researchers to ensure that no relevant data was left out of the systematic review.
Data Extraction and Management
Each retained study was processed in an Excel coding sheet, based on a coding manual (see Appendix B), in which three research assistants extracted the following information:
1. Document ID, title, authors, year of publication, and place of publication.
2. Relevant risk tools studied in the paper.
3. Data source (meetings with participants, private institutional records, triangulation of publicly available data).
4. Study design (cross-sectional vs. longitudinal, retrospective vs. prospective).
5. Sample characteristics (N, gender, age, country, ethno-racial group, education, employment status, religious/ideological affiliation).
6. Outcomes:
   a. Reliability (interrater agreement, internal consistency).
   b. Validity (content, construct, convergent, discriminant, predictive/postdictive).
7. Recommendations of authors (concerning the tool, for practitioners, for future research).
8. Limitations mentioned by the authors.
9. Coder ID and coding date.
Data extraction was conducted by a single research assistant per study. However, all coding sheets were thoroughly reviewed and verified by lead researchers before being entered into the summary of evidence tables, ensuring consistency and accuracy. If a study assessed the psychometric properties of multiple eligible tools, it was listed in multiple rows (once per tool). If relevant data could not fit in the predetermined structure of the coding sheet, it was added in an “other relevant info” column.
Assessment of Risk of Bias in Included Studies
Risk of bias was assessed through a modified and shortened version of the COSMIN Risk of Bias checklist (Mokkink et al. 2018; Prinsen et al. 2018; Terwee et al. 2018). Because the COSMIN Risk of Bias checklist was created for the medical field, trying to use it “as is” in an emergent field such as that of PVE would make the tool unfit for purpose (i.e., too many items would be rated N/A due to lack of information). Therefore, the lead researchers, who possess substantial expertise in psychometry and quantitative research, conducted a comprehensive review of the full COSMIN Risk of Bias checklist. Through a series of meetings, they systematically evaluated each item in relation to the types of study designs typically found in the validation literature on violent extremism risk assessment tools. Items deemed systematically inapplicable—such as those requiring experimental manipulation, responsiveness testing, or large, randomly sampled cohorts—were excluded from the adapted version. The objective was to retain only those COSMIN items capable of meaningfully differentiating between stronger and weaker evidence, given the methodological constraints and empirical realities of the PVE field. The final checklist emphasized the clarity of the tool's intended use (target population, construct being measured, context of use), followed by general methodological features (sample size adequacy, appropriateness of the study design, suitability of data analysis methods). It then included items related to each type of reliability and validity assessed in this systematic review. The modified version of the COSMIN tool can be found in Appendix C.
The risk of bias checklist used in this systematic review was scored by research assistants under the supervision of lead researchers. Training in psychometry and quantitative research was provided to research assistants beforehand. Research assistants were encouraged to ask for help with any item that might prove difficult to score. When this happened, the lead researchers made sure to review the risk of bias checklist of the study in question.
We did not exclude studies based on their methodological design a priori because one of the main goals of this systematic review was to critically appraise and provide guidance on the level of validation of risk tools in the PVE field. If less robust designs (e.g., retrospective rather than prospective) were to be excluded from this review, it would present a picture of the literature that is artificially optimistic.
Assessment of Certainty of the Evidence
Certainty of the evidence was assessed through both the modified COSMIN Risk of Bias checklist and the overall robustness of the methodological design. Robustness was rated on a 3-point scale inspired by Helmus and Babchishin (2017) and the types of studies we anticipated to find in our search: (1) below average (retrospective data obtained in convenience samples), (2) average (retrospective data obtained in non-convenience samples), or (3) robust (prospective data obtained in non-convenience samples). Conclusions of studies were discussed according to the strength of the evidence presented. COSMIN scores for all eligible studies can be found in Appendix C. Assessment of the certainty of the evidence was rated by the lead researchers.
Measures of Psychometric Qualities
This review focused on synthesizing psychometric indicators of reliability and validity. For predictive/postdictive validity, studies reported a range of effect size metrics, including Cohen's d, area under the ROC curve (AUC), and raw group differences (means and standard errors) between individuals who did and did not engage in violent extremist behavior. To allow for comparability and meta-analysis, we transformed all predictive/postdictive validity estimates into point-biserial correlations (r) because they offer interpretability, comparability across study designs, and compatibility with the Comprehensive Meta-Analysis (CMA) software used in this review. Effect size conversions followed Rice and Harris's (2005) guidelines for recidivism studies. When only group means and standard errors were available, we first calculated Cohen's d and then converted it to r.
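A minimal sketch of these conversions is shown below; the helper names are ours, and the d-to-r step defaults to roughly equal group sizes, whereas Rice and Harris (2005) also provide base-rate-adjusted variants that use the observed proportion of cases:

```python
import math
from scipy.stats import norm

def auc_to_d(auc: float) -> float:
    """AUC to Cohen's d via the probit transformation: d = sqrt(2) * z(AUC)."""
    return math.sqrt(2) * norm.ppf(auc)

def d_to_r(d: float, p: float = 0.5) -> float:
    """Cohen's d to a point-biserial correlation; p is the proportion of
    cases in one group (p = 0.5 yields the familiar r = d / sqrt(d^2 + 4))."""
    return d / math.sqrt(d**2 + 1.0 / (p * (1.0 - p)))

def means_to_d(m1: float, m0: float, sd1: float, sd0: float,
               n1: int, n0: int) -> float:
    """Cohen's d from group means and standard deviations (pooled SD).
    When only standard errors are reported, SD = SE * sqrt(n)."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n0 - 1) * sd0**2) / (n1 + n0 - 2))
    return (m1 - m0) / pooled_sd

# Example: a reported AUC of 0.75 corresponds to d of about 0.95 and r of about 0.43.
d = auc_to_d(0.75)
print(round(d, 2), round(d_to_r(d), 2))
```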
We intended to meta-analyze interrater agreement following Sun's (2011) guidelines for Cohen's kappa, but the lack of standard error/percent of agreement reporting in kappa studies made this analysis impossible. No other psychometric properties were meta-analyzable in this review.
Unit of Analysis Issues and Determination of Independent Findings
In this review, the unit of analysis was a single predictive/postdictive validity estimate per study—specifically, the association between the total score of the risk assessment tool and the outcome of interest. Most estimates were derived from postdictive analyses (i.e., outcomes had occurred before assessment). We did not identify studies with multiple effects for an outcome, so no statistical adjustments (e.g., robust variance estimation) were necessary.
To maintain statistical independence, we excluded estimates based on dimensional or subscale scores, even when available, as including them alongside total scores would have introduced multiple effect sizes from the same sample. Although we would have considered subgroup-specific estimates (e.g., by gender or ideology), none were reported—likely due to limited sample sizes in the original studies.
We also remained attentive to potential secondary reports of the same study. However, no overlapping reports were identified among the included studies. Had such secondary reports been found, we would have retained all versions for coding and documentation while selecting a single effect size per sample for synthesis based on completeness and methodological rigor.
Dealing With Missing Data
We did not contact study authors to obtain missing or supplementary data, as outlined in our protocol and justified based on feasibility and resource constraints. When critical information (e.g., effect sizes, standard errors, or sample sizes) was missing and could not be derived from available statistics (e.g., means, group comparisons, or p-values), the study was excluded from quantitative synthesis but retained in the narrative synthesis when relevant. No imputation procedures were applied.
Assessment of Heterogeneity
Statistical heterogeneity among studies included in the meta-analysis was assessed using the Q-test, the I² statistic, and Tau-squared (τ²). I² values of 25%, 50%, and 75% were interpreted as indicative of low, moderate, and high heterogeneity, respectively, following the guidelines of Higgins (2003). Tau-squared provided an estimate of the between-study variance in effect sizes and was included to complement I² by quantifying absolute heterogeneity. All heterogeneity statistics were computed using random-effects models.
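For reference, these statistics correspond to the usual formulas under the DerSimonian-Laird (method-of-moments) approach (a minimal summary; CMA's implementation may differ in detail):

$$Q = \sum_{i=1}^{k} w_i\left(\hat{\theta}_i - \bar{\theta}\right)^2, \qquad I^2 = \max\left(0, \frac{Q - (k-1)}{Q}\right) \times 100\%, \qquad \hat{\tau}^2 = \max\left(0, \frac{Q - (k-1)}{\sum w_i - \sum w_i^2 / \sum w_i}\right)$$

where $\hat{\theta}_i$ is the effect size of study $i$ (in this review, a correlation, typically analyzed on the Fisher-z scale), $w_i = 1/v_i$ its inverse-variance weight, $\bar{\theta}$ the fixed-effect pooled estimate, and $k$ the number of studies.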
Assessment of Reporting Biases
Due to the limited number of studies included in the meta-analysis, we did not conduct formal statistical tests of publication bias, such as funnel plots or Egger's regression (Sterne et al. 2011), as these methods are underpowered and unreliable with fewer than 10 studies (Ioannidis and Trikalinos 2007). However, we qualitatively considered the potential for reporting biases by examining author affiliations/study funding and the presence of selectively reported outcomes.
Data Synthesis
All quantitative syntheses were conducted using the CMA Version 4 software (Borenstein et al. 2022a). The analyses relied on each study's effect size (expressed as a point-biserial correlation, r) and its corresponding sample size (N), which allowed for the calculation of standard errors and confidence intervals. Meta-analyses were conducted only when two or more independent studies were available for the same tool and psychometric property.
A random-effects model was used throughout, reflecting the assumption that effect sizes may vary across studies due to differences in design, populations, tools, and implementation contexts. This model accounts for both within- and between-study variance and is recommended when methodological and contextual heterogeneity is expected (Borenstein et al. 2022b). For each meta-analysis, 95% confidence intervals, standard errors, and heterogeneity statistics were computed. Forest plots were produced to visually display the results, including the individual study estimates and the pooled effect size.
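The sketch below illustrates this random-effects computation under common assumptions (correlations pooled on the Fisher-z scale with inverse-variance weights and a DerSimonian-Laird estimate of τ²); the function name and estimator choices are ours and are illustrative rather than a reproduction of the CMA analyses reported here:

```python
import numpy as np

def pool_random_effects(r, n):
    """Random-effects pooling of point-biserial correlations.

    r: per-study correlations; n: per-study sample sizes.
    Correlations are pooled on the Fisher-z scale (variance 1/(n - 3))
    with a DerSimonian-Laird estimate of tau^2, then back-transformed.
    Returns the pooled r, its 95% CI, I^2 (%), and tau^2.
    """
    r, n = np.asarray(r, float), np.asarray(n, float)
    z = np.arctanh(r)                      # Fisher's z transformation
    v = 1.0 / (n - 3.0)                    # within-study variance of z
    w = 1.0 / v                            # fixed-effect weights
    z_fixed = np.sum(w * z) / np.sum(w)
    q = np.sum(w * (z - z_fixed) ** 2)     # Cochran's Q
    df = len(r) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)          # DerSimonian-Laird tau^2
    w_star = 1.0 / (v + tau2)              # random-effects weights
    z_pooled = np.sum(w_star * z) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    ci = np.tanh([z_pooled - 1.96 * se, z_pooled + 1.96 * se])
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return np.tanh(z_pooled), ci, i2, tau2

# Hypothetical example with three studies (not the reviewed data).
pooled_r, ci, i2, tau2 = pool_random_effects([0.45, 0.70, 0.62], [58, 80, 44])
print(round(pooled_r, 2), np.round(ci, 2), round(i2, 1), round(tau2, 3))
```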
In cases where studies reported perfect predictive/postdictive accuracy (e.g., AUC = 1.00), a continuity correction was applied to allow inclusion in the meta-analysis. This involved slightly reducing the effect size (e.g., to AUC = 0.98) to avoid zero variance, which can produce infinite weights in random-effects models (Sweeting et al. 2004). If the effect size metric was not already a correlation, the corrected value was then translated into a point-biserial correlation using the procedures outlined by Rice and Harris (2005).
Subgroup Analysis and Investigation of Heterogeneity
Our original plan was to use meta-regression to investigate potential sources of heterogeneity, such as client ideology or evaluation setting. However, due to the limited number of studies, these analyses could not be conducted. As recommended by Borenstein (2019), meta-regression generally requires at least 10 studies to yield reliable results—a threshold we did not meet.
Sensitivity Analysis
We conducted a sensitivity analysis using CMA's “one study removed” procedure to detect and examine possible outliers. This method assesses whether the overall effect size is unduly influenced by any single study (Borenstein et al. 2022b). It requires at least three studies and was therefore applied only to the meta-analysis of the TRAP-18's postdictive validity. We inspected the results to determine whether the removal of any single study altered the statistical significance or substantially changed the magnitude of the pooled effect.
Narrative Synthesis
When meta-analysis was not feasible—typically due to only one eligible study per tool or incomplete statistical reporting—a narrative synthesis was conducted. In these cases, results were described qualitatively, highlighting reported effect sizes, methodological strengths or limitations, and relevance to the review's objectives. This approach allowed us to include valuable information from studies that could not be aggregated quantitatively.
Treatment of Qualitative Evidence
This review did not include qualitative studies, in accordance with editorial guidance and the scope defined in the registered protocol. The focus was limited to studies reporting quantitative data on the psychometric properties of PVE risk assessment tools. As such, no qualitative evidence was synthesized or appraised.
Results
Results of the Search
The three phases of the search strategy led to the identification of 10,859 records. Of these, 10,748 were excluded following title and abstract screening. Of the 111 documents selected for full-text screening, one could not be retrieved using available library resources, and 91 were excluded for not meeting one or more of the inclusion criteria established in our protocol. This led to a final selection of 19 documents comprising 20 studies, as the Hart et al. (2017) report contained two eligible studies (see Figure 1). All but one of these eligible documents were written in English; the remaining one was written in German.
[IMAGE OMITTED. SEE PDF]
Included Studies
The list of retained studies, along with their general characteristics, can be found in Table 2. These studies were published between 2013 and 2023. Manuscripts with a final publication date of 2022 or 2023 were in advance online publication status in December 2021 and, thus, eligible.
Table 2 Summary of the evidence.
| Tools | Study | Data source | Sample (Adults) | Outcomes (Reliability) | Outcomes (Validity) | Postdictive validity criterion | Statistical analyses | Design robustness^a |
| Der Screener—Islamismus | Böckler et al. (2017) | Official records | N = 31 (Islamist VE) | Interrater (N = 8 practitioners) | Content^b | — | Gwet's AC1, number of items present per case, user satisfaction (Likert) | Average |
| ERG22+ | Powis et al. (2019) | Official records | N = 50 (various ideologies but mostly Islamist VE) | Interrater (N = 35; 2 researchers, 33 practitioners) | — | — | Percent agreement, Cohen's weighted kappa, Fleiss' kappa, intraclass correlation | Average |
| ERG22+ | Powis et al. (2021) | Interviews + official records | N = 171 (Islamist VE) | Internal consistency | Construct | — | Exploratory factor analysis, multidimensional scaling analysis, Cronbach's alpha | Average |
| IVP guidance | Egan et al. (2016) | Publicly available data | N = 182 (various ideologies) | Internal consistency, interrater (N = 2) | Discriminant | — | Cronbach's alpha, Cohen's kappa, ANOVA + post hoc, AUC, sensitivity/specificity | Below average |
| MLG-V2 | Hart et al. (2017) – Study 1 | Publicly available data | N = 5 (various ideologies)^c | Interrater (N = 4) | Convergent | — | Intraclass correlation, Pearson correlation | Below average |
| MLG-V2 | Hart et al. (2017) – Study 2 | N/A | N/A | — | Convergent (N = 3 raters) | — | Degree of overlap between MLG-V2 and VERA 2 items | N/A |
| TRAP-18 | Böckler et al. (2020) | Publicly available data | N = 80 (Islamist VE) | — | Postdictive | Committing extremist violence | Chi2, t-test, ANOVA + post hoc, AUC | Below average |
| TRAP-18 | Brugh et al. (2023) | Publicly available data | N = 77 (Islamist VE) | Interrater (N unspecified) | Content, discriminant | — | Krippendorff's alpha, percent of present/absent/missing items, Chi2, t-test | Below average |
| TRAP-18 | Challacombe and Lucas (2019) | Publicly available data | N = 58 (far right) | Interrater (N = 2) | Postdictive | Committing extremist violence | Cohen's kappa, Chi2, Cohen's d, logistic regression | Below average |
| TRAP-18 | Collins and Clark (2021) | Publicly available data | N = 1 (incel) | — | Content | — | Number of items present | Below average |
| TRAP-18 | Dmitrieva and Meloy (2022) | Publicly available data | N = 1 (far right) | — | Content | — | Number of items present | Below average |
| TRAP-18 | Erlandsson and Reid Meloy (2018) | Publicly available data | N = 1 (far right) | — | Content | — | Number of items present | Below average |
| TRAP-18 | Fernández García-Andrade et al. (2019) | Interviews + official records | N = 44 (homeless persons with severe mental illness) | — | Postdictive | Committing extremist violence | Chi2, t-test/Mann–Whitney's U, AUC | Average |
| TRAP-18 | Goodwill and Meloy (2019) | Publicly available data | N = 56 (various ideologies)^d | — | Construct, postdictive | Committing extremist violence | Multidimensional scaling analysis, centroid analysis, t-test | Below average |
| TRAP-18 | Kupper and Meloy (2021) | Publicly available data | N = 30 (various ideologies) | — | Content | — | Percent of present items, Chi2 | Below average |
| TRAP-18 | Meloy and Gill (2016) | Publicly available data | N = 111 (various ideologies) | — | Discriminant, postdictive | Attack thwarted versus carried out | Percent of present items, Chi2 | Below average |
| TRAP-18 | Meloy et al. (2015) | Publicly available data | N = 22 (various ideologies) | Interrater (N = 2) | Content, discriminant | — | Cohen's kappa, percent of present items, Chi2 | Below average |
| TRAP-18 | Meloy et al. (2019) | Publicly available data | N = 56 (various ideologies)^d | — | Postdictive | Committing extremist violence | Chi2, Odds ratio | Below average |
| TRAP-18 | Meloy et al. (2021) | Publicly available data | N = 125 (various ideologies) | — | Content (time sequencing) | — | Proximity coefficients | Below average |
| VERA | Beardsley and Beech (2013) | Publicly available data | N = 5 (various ideologies)^c | Interrater (N = 2) | Content | — | Cohen's kappa, number of items present per case | Below average |
All of the studied tools but one were SPJ tools, with the sole actuarial tool being Der Screener—Islamismus (Böckler et al. 2017). There were 13 studies on the TRAP-18, two studies about the ERG22+, two studies on the MLG-V2, one study on the IVP guidance, one study on the VERA, and one study on Der Screener—Islamismus. The tools varied in length: the TRAP-18 includes 18 items, the ERG22+ includes 22, the MLG-V2 includes 16, the IVP guidance includes 16, the VERA includes 28, and Der Screener—Islamismus includes 13.
The mean number of participants in studies was 58.21 (SD = 55.14), with a total of 1106 participants across all studies. The largest sample size was 182 (Egan et al. 2016), and the lowest one was 1, as found in three case studies of the TRAP-18 (Collins and Clark 2021; Dmitrieva and Meloy 2022; Erlandsson and Reid Meloy 2018). All studies focused on adult individuals susceptible to violent extremism, with no samples of children or adolescents. Seven studies included a small percentage (2%–14%) of female participants (Böckler et al. 2020; Challacombe and Lucas 2019; Egan et al. 2016; Goodwill and Meloy 2019; Meloy et al. 2015, 2019, 2021), but the rest comprised only men.
The ideologies represented across studies were surprisingly varied. In the case of the TRAP-18, for example, there were approximately as many studies focused on religiously inspired violent radicalization as there were on far-right extremism. However, an emphasis on jihadist extremism remains evident in the validation of certain tools—most notably the ERG22+ and Der Screener—Islamismus. In the case of the latter, this is to be expected: Der Screener—Islamismus is the only ideology-specific tool documented in this systematic review, and its validation study inevitably involved individuals with jihadist backgrounds (Böckler et al. 2017). Finally, one study was about homeless persons with severe mental illness at risk of extremist violence (Fernández García-Andrade et al. 2019).
All but four studies formed their sample through the triangulation of publicly available data, which has multiple implications regarding the risk of bias and certainty of the evidence. Only the two ERG22+ studies (Powis et al. 2019, 2021), one TRAP-18 study (Fernández García-Andrade et al. 2019), and the Der Screener—Islamismus study (Böckler et al. 2017) used official records or interview data. Furthermore, none of the studies used prospective research designs. According to the criteria we outlined in the Certainty of the Evidence section, this means that no studies met the threshold for what we pre-defined as a robust methodological design. Even though we anticipated such studies to be the exception rather than the rule, finding none in our review was notable and has numerous ramifications for PVE risk assessment practice, policy, and future research.
Excluded Studies
A total of 91 full-text studies were excluded during the review process, based on predefined eligibility criteria. Reasons for exclusion are summarized in Figure 1 and detailed in Appendix D, which lists all excluded studies along with the primary reason for exclusion.
Four observations stemmed from these exclusions. First, many tools that were included in Block 3 of our search strategy did not have eligible quantitative studies attesting to their psychometric validation. Some of the evidence about these tools is known to be obscured from the public (i.e., internal publications; for an overview, see Lloyd 2019). In other cases, the tool was simply never assessed empirically. Second, a considerable portion of the excluded literature focused on self-report scales such as the ARIS, RWA, or RFS, which aim to measure attitudes toward radicalization, authoritarianism, and fundamentalism, respectively. These studies often featured advanced psychometric work, including factor analyses and tests of convergent and cultural validity, and the scales were commonly used in explanatory models testing correlates and predictors of violent extremism (e.g., Bhui et al. 2014). In fact, the psychometric literature on these research-oriented scales may be more developed than that for some of the applied risk tools included in this review. However, since these instruments are not intended for use by practitioners conducting individual risk assessments, they were excluded.
Third, many excluded studies were tool presentation papers (e.g., Pressman 2016)—often falling into the nonempirical or secondary-data categories—and nearly matched the number of included studies in this review. Such papers are notably documented in Clesle et al.'s (2025) systematic review. This reflects a broader trend in the PVE space, where the quantity of theoretical frameworks and systematic reviews often rivals the number of original empirical studies. Similar trends have been observed in research on PVE programming, where reviews frequently highlight the lack of robust primary data and call for stronger empirical foundations rather than additional syntheses drawing from the same limited evidence base (e.g., Brouillette-Alarie et al. 2022, 2025).
Fourth, we believe that most studies were excluded because they fell outside the scope of the review rather than due to overly strict methodological standards. We adopted a deliberately inclusive approach for empirical studies—even those with notable design limitations—to capture the breadth of available evidence. Thus, the exclusions presented here do not distort the overall state of knowledge of PVE risk tools, but rather reflect the field's current limitations in terms of validated, practitioner-oriented instruments.
Risk of Bias in the Included Studies
All included studies (k = 20) were appraised for risk of bias using a modified version of the COSMIN checklist (see Appendix C). This framework covered key aspects such as clarity in tool description, adequacy of sample size, and the appropriateness of study design and analytical methods. Because most tools (aside from the TRAP-18) were represented by only one or two studies, comparisons of COSMIN scores across tools were not considered meaningful. Nor were domain-level scores calculated, as most domains contained only one or two items, and many items were rated as not applicable (i.e., missing), resulting in domain scores that would have been nearly indistinguishable from individual item ratings. Nevertheless, we provide descriptive statistics (i.e., success rates) for each COSMIN item in Table 3, offering insight into common methodological limitations and areas for improvement in future psychometric evaluations of PVE risk assessment tools.
Table 3 Descriptive statistics of COSMIN checklist items and total score.
| Domains/Items | Valid N | Percent of studies succeeding on the item, or M (SD) |
| Risk tool presentation | ||
| (1) Is a clear description provided of the construct assessed by the tool? | 20 | 100.0% |
| (2) Is a clear description provided of the target population for which the tool was developed? | 20 | 95.0% |
| (3) Is a clear description provided of the tool's context of use? | 20 | 100.0% |
| Data analysis (general) | ||
| (4) Was an appropriate approach used to analyze the data? | 20 | 95.0% |
| (5) Was the sample size appropriate? | 20 | 50.0% |
| (6) Were the design and statistical methodology of the study free of any significant flaws? | 20 | 30.0% |
| If there were inter-rater reliability analyses | ||
| (7) For dichotomous/nominal/ordinal scores: Was kappa calculated? | 8 | 87.5% |
| (8) For ordinal scores: Was a weighted kappa calculated? | 3 | 66.7% |
| If there were internal consistency analyses | ||
| (9) Was an internal consistency statistic calculated for each unidimensional scale or subscale separately? | 2 | 50.0% |
| (10) For continuous scores: Was Cronbach's alpha or omega calculated? | 1 | 100.0% |
| (11) For dichotomous scores: Was Cronbach's alpha or KR-20 calculated? | 1 | 100.0% |
| If there were face/content validity analyses | ||
| (12) Was each item tested in an appropriate number of participants? | 4 | 50.0% |
| (13) Was an appropriate method used to ask participants about the relevance of each item? | 2 | 100.0% |
| (14) Was an appropriate method used to ask participants about the tool's comprehensiveness? | 1 | 100.0% |
| (15) Was an appropriate method used to ask participants about the comprehensibility of the tool's instructions and items? | 1 | 100.0% |
| If there were convergent validity analyses | ||
| (16) Is it clear what the comparator instrument(s) measure(s)? | 2 | 100.0% |
| (17) Was the statistical method appropriate for the hypotheses to be tested? | 2 | 50.0% |
| If there were comparisons between groups (except cross-cultural validity) | ||
| (18) Was an adequate description provided of important characteristics of the subgroups? | 8 | 75.0% |
| (19) Was the statistical method appropriate for the hypotheses to be tested? | 8 | 100.0% |
| If there were concurrent/predictive validity analyses | ||
| (20) For continuous scores: Were correlations or the area under the receiver operating curve calculated? | 7 | 71.4% |
| (21) For dichotomous scores: Were sensitivity and specificity determined? | 5 | 60.0% |
| If there were factor analyses | ||
| (22) Was an exploratory or confirmatory factor analysis performed? | 1 | 100.0% |
| (23) Was the sample size appropriate for factor analysis? | 1 | 100.0% |
| If there were cross-cultural (external) validity analyses | ||
| (24) Were the samples similar for relevant characteristics except for the group variable? | 1 | 100.0% |
| COSMIN total score (/100) | 20 | 78.85 (14.67) |
Items most consistently rated as “Yes” were those relating to the clarity of the construct assessed (Item 1), the identification of the target population (Item 2), and the definition of the tool's intended context of use (Item 3). These conceptual components were reported clearly in the majority of studies. In contrast, several items were frequently failed. The most commonly failed item was Item 6 (design/statistical flaws), which was rated “No” in 14 out of 20 studies (70%). Item 5 (sample size adequacy) was also problematic, failing in 10 studies (50%), underscoring a widespread issue with underpowered designs.
Low COSMIN total scores were most often attributable to small sample sizes. Sample size considerations appear across multiple COSMIN domains—including construct validity, reliability, and internal consistency—such that studies with limited samples often incurred repeated penalties. Detailed information about the risk of bias is discussed in each outcome section below, so that it may be contextualized by type of psychometric validation, for which appropriate versus inappropriate methods and designs differ.
Synthesis of Results
Reliability
Interrater Agreement
Results concerning interrater agreement can be found in Table 4. Meta-analysis could not be performed because it requires both the kappa value and the standard error/percent of agreement (Sun 2011). In the context of this systematic review, only the TRAP-18 had interrater agreement values coming from multiple studies, and those failed to provide standard errors or percentages of agreement.
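For context, a kappa can only be weighted in a meta-analysis if its sampling variance can be recovered. One common large-sample approximation, shown below for illustration only, requires the observed agreement, the chance-expected agreement, and the number of rated cases, quantities that were not consistently reported in the included studies.

```python
from math import sqrt

def kappa_se_approx(p_o, p_e, n):
    """Common large-sample approximation of the standard error of Cohen's
    kappa. Requires the observed agreement p_o, the chance-expected
    agreement p_e, and the number of rated cases n. Without p_o (or an
    equivalent percent agreement), this variance cannot be recovered, and
    the kappa cannot be weighted in a meta-analysis.
    """
    return sqrt(p_o * (1.0 - p_o) / (n * (1.0 - p_e) ** 2))
```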
Table 4 Summary of the evidence concerning interrater agreement.
| Tools | Study | Context | Results^a | Effect size qualification^b |
| Der Screener—Islamismus | Böckler et al. (2017) | Eight practitioners with varying levels of experience blind-coded six anonymized case summaries involving jihadist extremists from German-speaking countries. | Mean Gwet's AC1 of 0.80 (can be interpreted in the same way as a Cohen's kappa). | Moderate |
| ERG22+ | Powis et al. (2019) | Thirty-five raters (2 researchers and 33 trained practitioners) coded cases drawn from the sample of 50. | Agreement among trained practitioners was unsatisfactory, with coefficients falling below 0.60 (see text). | |
| IVP guidance | Egan et al. (2016) | Two raters blind-coded 30 of the 182 cases (16%). | Mean Cohen's kappa of 0.80. | Moderate |
| MLG-V2 | Hart et al. (2017) – Study 1 | Two raters coded the five cases. | Median intraclass correlation of 0.95. | Substantial |
| TRAP-18 | Brugh et al. (2023) | Raters coded 20 of the 77 cases (26%). | Krippendorff's alpha of 0.95. | Substantial |
| TRAP-18 | Challacombe and Lucas (2019) | Two raters (the lead author and a graduate student) coded the 58 cases. | Mean Cohen's kappa of 0.76. | Moderate |
| TRAP-18 | Meloy et al. (2015) | Two raters expert in the field coded the 22 cases. | Mean Cohen's kappa of 0.90. | Substantial |
| VERA | Beardsley and Beech (2013) | Two raters coded the five cases (the second rater was trained by the first). | Cohen's kappas were of 0.76 or greater. | Moderate to substantial |
Interrater agreement was in the moderate to substantial range when coding was done in a research context by authors, research assistants, or experts in the field. Two studies assessed interrater agreement in conditions approximating routine use: Böckler et al. (2017) and Powis et al. (2019). In the latter, practitioners trained on the ERG22+ showed unsatisfactory agreement (below 0.60), raising concerns about its reliability in practice. By contrast, Böckler et al. (2017) found moderate (0.80) interrater agreement for Der Screener—Islamismus among users with varying levels of training and expertise. Therefore, while most studies report good interrater reliability for PVE risk tools, uncertainty remains as to whether these results generalize to routine, non-research settings.
Internal Consistency
This type of reliability was scarcely studied in violent extremism risk tools, even though most of them comprise items organized into dimensions, which usually warrant such examinations. No meta-analytic synthesis was conducted, as only two studies on two different tools reported internal consistency statistics. Powis et al. (2021) conducted a multidimensional scaling analysis of the ERG22+ and reported the internal consistency of the extracted subscales, as well as that of the original subscales and the full tool. The Cronbach's alpha for the full scale was 0.80, and those of the original or extracted subscales varied between 0.19 and 0.85. For the IVP guidance, Cronbach's alpha was computed for the whole scale using all participants as well as subtypes of extremist groups (Egan et al. 2016). The alpha for all participants was 0.64 and varied substantially between groups (α = 0.32 to 0.84). It was especially low for individuals involved in animal rights groups (α = 0.32) or individuals who committed school shootings (α = 0.38). Authors generally agree that alphas below 0.70 are problematic (Tavakol and Dennick 2011). As such, the internal consistency of the IVP guidance could be considered lackluster; the same is true for many ERG22+ subscales. However, we have to keep in mind that risk tools, by design, do not operate like psychometric scales (Helmus and Babchishin 2017). Because risk tools try to minimize the redundancy of and variance shared by their items, they should not be held to the same internal consistency standards as personality scales such as the NEO-PI (Costa and McCrae 2008).
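For reference, Cronbach's alpha compares the sum of the item variances to the variance of the total score. The generic function below illustrates the computation; it is not the code used by the cited authors.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items score matrix.

    items : 2D array of shape (n_respondents, n_items), each column holding
            scores on one item (dichotomous or continuous).
    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_variances / total_variance)
```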
Validity
Content
In the context of our systematic review, content validity studies most often corresponded to single-case studies (N = 1) of known terrorists to which a risk tool was retrospectively applied to assess item fit and relevance. Three such studies were available on the TRAP-18: Collins and Clark (2021), Dmitrieva and Meloy (2022), and Erlandsson and Reid Meloy (2018). All three studies found that the TRAP-18 was fit for purpose, as most of its items (distal and proximal behaviors) were endorsed by individuals who either committed or tried to commit a terrorist attack. Specifically, 14 of the 18 TRAP-18 items were endorsed in an Incel case study (Collins and Clark 2021), 12 of 18 in a far-right case (Dmitrieva and Meloy 2022), and 15 of 18 in a case involving a mass school shooting perpetrator with far-right views and severe mental health issues (Erlandsson and Reid Meloy 2018).
Other types of content validity studies were available on the TRAP-18. In Meloy et al. (2015), the content validity of the scale was tested in a sample of 22 individuals who committed terrorist acts motivated by various types of extremist ideas. According to the authors, the TRAP-18 demonstrated good fit in real-world data, with more than half of participants scoring positively on 13 out of 18 items (72%) on the scale.
Another avenue to study the content validity of violent extremism risk tools was employed by Kupper and Meloy (2021), who verified if the written or spoken manifestos of lone actors who planned or committed attacks contained words or sentences that could enable the scoring of TRAP-18 items. The authors found that 17 of the 18 items (94%) could be scored positively or negatively according to the content of the manifestos. On average, manifestos comprised 4.5 of the possible 8 proximal warning behaviors and 3.8 of the possible 10 distal characteristics.
Brugh et al. (2023) also verified the extent to which TRAP-18 items could be scored using publicly available information alone. The percentage of missing values varied by item (range = 3.9%–87%), and more than half of the items (11/18) were more often missing than scored. On average, half (9/18) of the TRAP-18 items were unscored per case. Distal characteristics were deemed easier to score than proximal warning behaviors. When comparing the proportion of present and absent items while discarding unscored ones, more items were present than absent, leading the authors to think that, on average, TRAP-18 items are relevant to lone-actor terrorists. However, they concluded that triangulation of publicly available data was insufficient to score the TRAP-18, which casts doubt on most TRAP-18 validation studies, as all but one were scored with such data.
Finally, Meloy et al. (2021) tested if the division of TRAP-18 items into distal and proximal characteristics was confirmed by a time sequence analysis relying on the open-source data of 125 lone-actor terrorists who mounted attacks in Europe or North America. The authors found that when looking at the trajectories of these individuals, most distal characteristics preceded the proximal warning behaviors that culminated in an attack. Therefore, the division of TRAP-18 items into what equates to static/stable and acute risk factors (e.g., Douglas and Skeem 2005; Hanson and Harris 2000) was validated by that study.
One content validity study on the VERA was conducted by Beardsley and Beech (2013), who examined whether items from the scale would be found in individuals who committed terrorist attacks. Using information from Google on five highly mediatized cases, the authors concluded that the VERA was fit for purpose, as items representing multiple risk dimensions, particularly extremist attitudes, were present, while those from the protective dimension were relatively absent. Specifically, on average across the five cases, 85% of attitudinal items were present, followed by 66% of contextual items, 63% of demographic items, 40% of historical items, and only 27% of protective factor items. The authors furthermore noted that VERA items applied to both lone actors and members of offline extremist groups.
A similar approach was used by Böckler et al. (2017) to evaluate Der Screener—Islamismus. The authors applied the 13-item tool to 31 anonymized case summaries of individuals convicted of jihadist activity, including completed attacks, attempted plots, and participation in terrorist organizations. On average, the tool identified nearly seven present risk indicators per case (M = 6.84, SD = 2.67), representing 53% of the total items. All 31 cases presented at least two flagged items. In addition, the authors conducted a face validity assessment through a workshop involving 20 frontline practitioners from education, justice, and police sectors. Participants rated the tool favorably on a 5-point Likert scale (1 = very satisfied, 5 = very dissatisfied), with overall satisfaction averaging 1.6 (SD = 0.60), perceived clarity at 1.3 (SD = 0.47), and practical utility at 1.35 (SD = 0.75), further supporting the content relevance and operational usefulness of the tool.
Because they are mostly limited to descriptive statistics, content validity studies are not well suited to testing null versus alternative hypotheses and cannot answer whether individuals with higher scores on risk tools experience more negative outcomes than those with lower scores. In addition, content validity analyses were often based on very few individuals—in some studies, only one. Nevertheless, authors evaluating risk tools in the PVE space found that these tools demonstrated good content validity.
Construct
Two studies explored the construct validity of PVE risk tools: one for the ERG22+ (Powis et al. 2021) and one for the TRAP-18 (Goodwill and Meloy 2019). Based on a sample of individuals adhering to Islamist extremist ideas, Powis et al. (2021) conducted a principal component analysis and multidimensional scaling analysis of ERG22+ items, which both suggested the presence of multiple dimensions in the scale. Principal component analysis suggested the presence of seven dimensions accounting for 64% of the explained variance, while multidimensional scaling analysis suggested the presence of five. For both analyses, the choice of parameters was sound and well-documented. However, neither the principal component analysis nor the multidimensional scaling analysis led to a factor structure that mirrored that of the conceptual division of the scale. Furthermore, the fit indices for the multidimensional scaling solution fell slightly short of the predefined threshold, with a coefficient of alienation (CoA) of 0.23—just above the acceptable cut-off of 0.20 specified in the methods section. Thus, construct validity was not established for the ERG22+.
In Goodwill and Meloy (2019), multidimensional scaling was used to verify if TRAP-18 items under the distal characteristics dimension would cluster together and separately from items under the proximal warning behaviors dimension. The analysis was documented appropriately, and the fit of the solution was deemed good (Stress-1 = 0.098, S-Stress = 1.69%). Proximal warning behaviors mostly clustered together, but not distal characteristics, which sometimes clustered with proximal warning behaviors and sometimes did not cluster at all. Therefore, construct validity as it relates to the division between distal and proximal behaviors was partly established. It was, however, validated in a time-sequence analysis of TRAP-18 indicators (Meloy et al. 2021).
Convergent
Two studies—those in Hart et al. (2017)—assessed the convergent validity of violent extremism risk tools. In the first study, the authors tested the convergence between the MLG-V2 and, respectively, the Historical Clinical Risk Management-20 Version 3 (HCR-20V3; Douglas et al. 2013) and the VERA. Convergent validity between the total scores of the MLG-V2 and the HCR-20V3 was not supported (r = −0.45). However, the MLG-V2 "individual" subscale showed positive correlations of varying magnitude with the components of the HCR-20V3: r = 0.37 with the total score, 0.55 with historical items, 0.16 with clinical items, and 0.22 with risk management items. Due to the small sample size (n = 5; same sample as Beardsley and Beech 2013), none of these correlations reached significance. Group-related dimensions of the MLG-V2 correlated negatively with HCR-20V3 scores (r = −0.68 to −0.10), a lack of covariation that the authors anticipated given that the HCR-20V3 focuses exclusively on individual-level factors and does not account for group influences.
With respect to the VERA, a substantial correlation was observed between its total score and that of the MLG-V2 (r = 0.69). Domain-level scores between the two tools also showed considerable overlap, with correlations ranging from −0.74 to 0.92. The strongest associations were found between the “contextual” subscale of the VERA and the group-related dimensions of the MLG-V2: r = 0.90 with “individual-in-group,” r = 0.90 with “group,” and r = 0.92 with “group-in-society.” These three correlations were the only ones to reach statistical significance in the analysis.
In the second study, Hart et al. (2017) asked three researchers to evaluate the degree of item overlap between the MLG-V2 and VERA 2. Most of the VERA 2 items overlapped with those of the MLG-V2, but the opposite was not true. Only two of the four MLG-V2 dimensions—“individual” and “individual-in-group”—showed substantial overlap with VERA-2 content. In contrast, the “group” and “group-in-society” dimensions of the MLG-V2 were largely unrepresented in the VERA-2.
In sum, available studies do not attest to the convergent validity of the MLG-V2. This is partly due to the small sample size of Hart et al. (2017), which makes the evidence base relatively anecdotal, but also to the fact that the MLG-V2 integrates risk factors pertaining to both the individual and their group. Because most other PVE risk tools focus purely on the individual, convergence with them is limited by design. Tools that are more similar in nature (e.g., the ERG22+ and the TRAP-18) may be found to correlate more strongly in the future, but no such studies were identified in our review.
Discriminant
Discriminant validity analyses conducted for violent extremism risk tools manifested in whether scores on such tools could discriminate between groups based on (a) type of violent extremist ideology, (b) type of attack, (c) country/continent, and (d) individuals working alone versus in an autonomous cell. Studies comparing whether radicalized individuals would go on to commit violence or not were classified under predictive/postdictive validity.
As to the type of ideology, Egan et al. (2016) found that IVP guidance scores were significantly lower for animal rights activists (M = 11.8, SD = 2.9) and school shooters (M = 14.9, SD = 5.3) than for Irish Republicans (M = 23.2, SD = 9.5), Islamists (M = 21.4, SD = 7.4), and right-wing extremists (M = 24.7, SD = 6.4), who did not differ among themselves (F[4, 173] = 15.5, p < 0.001). Combined with the low internal consistency of the IVP guidance for animal rights activists and school shooters, the authors concluded that the scale may not be fit for purpose for these two groups. Meloy and Gill (2016) investigated whether TRAP-18 scores significantly differed across right-wing, Islamic, and single-issue extremists. They found that even though some items (e.g., personal grievance and moral outrage) were more frequently found in proponents of certain extremist ideologies, average TRAP-18 scores did not differ between the three groups (average scores ranged between 9.5 and 9.9).
Concerning the type of attack, Egan et al. (2016) tested whether the IVP guidance was more sensitive in identifying a particular violent outcome compared to others (causing injury, killing, and bombing). AUC greatly varied depending on the group under study and outcome, with the authors reaching no substantial conclusion other than the scale being potentially less sensitive, no matter the outcome, for animal rights activists and school shooters (AUC between 0.49 and 0.67).
With regard to geographic differences, Brugh et al. (2023) found that the TRAP-18 may be less well suited for assessing lone-actor cases from Europe compared to those from the United States. European cases had significantly more missing items (M = 10.4, SD = 3.5) than U.S. cases (M = 7.3, SD = 2.6) (t[70] = 2.56, p = 0.013). They also had fewer positively scored items (M = 5.7, SD = 2.6) compared to their U.S. counterparts (M = 8.8, SD = 3.1) (t[70] = 2.13, p = 0.037). As a result, fewer European cases were recommended for active monitoring by the tool (p = 0.034, Fisher's exact test), despite all individuals having been involved in attacks and/or affiliated with terrorist groups.
Two studies evaluated whether violent extremism risk tools showed differences in assessing individuals who acted alone or in a group. Egan et al. (2016) found no significant differences in IVP guidance scores between individuals who worked alone and those who worked in a group (AUC = 0.40, n.s.). In turn, Meloy et al. (2015) found nearly no differences between the TRAP-18 scores of lone actors and individuals who were part of an autonomous cell. Only 1 of the 18 items showed significant differences: a history of criminal violence, which was more common in autonomous cell extremists (p = 0.005, φ = 0.70, Fisher's exact test).
Discriminant validity analyses collected here do not tell us much about PVE risk tools. These analyses were mostly exploratory in nature and did not specify in advance which group should theoretically obtain lower or higher scores on the scales. In that sense, they do not constitute proper discriminant validity analyses and provide very little information on the nomological network of constructs evaluated in violent extremism risk tools. They nevertheless suggest that some tools, such as the IVP guidance, might not appropriately capture the risk dynamics of some groups (animal rights activists and school shooters) and that there might not be major differences in item fit between lone-actor terrorists and those who are members of a group. Therefore, even though scales such as the TRAP-18 were developed specifically for lone-actor extremists, future research may reveal that the scale is also applicable to violent extremists who are part of a group.
Predictive/Postdictive
Predictive validity is often considered the litmus test for risk tool validation (Helmus and Babchishin 2017). However, as of now, only the TRAP-18 seems to benefit from such studies, and these have substantial methodological flaws, namely, reliance on open-source data and use of retrospective (postdictive) rather than prospective designs. In total, six postdictive validity studies were identified, and none provided evidence of true predictive validity.
Two studies evaluated whether scores on TRAP-18 items were significantly different for radicalized individuals who did or did not go on to commit violence. Meloy and Gill (2016) compared individuals whose attacks were thwarted to individuals whose attacks were carried out. Not only did this comparison conflate intention to act with outside elements such as the effectiveness of police investigations, but results indicated that among the five items that differed significantly in prevalence, two were in the opposite direction (higher in the thwarted-attack group). In Meloy et al. (2019), attackers were compared to non-attackers. In that study, seven items were significantly more frequent in attackers, and two were more frequent in non-attackers. Neither study disclosed whether total TRAP-18 scores differed between the two groups. Although the TRAP-18 is an SPJ tool and does not include a formal total score, SPJ tools can nonetheless be evaluated using the number of indicators present (see de Vogel et al. [2004] and Hanson et al. [2009] for examples). This omission limits the extent to which Meloy and Gill (2016) and Meloy et al. (2019) can be interpreted as providing postdictive validity evidence in the conventional psychometric sense.
Four studies tested whether TRAP-18 total scores postdicted violence among samples of radicalized individuals at risk of committing extremist violence (Böckler et al. 2020; Challacombe and Lucas 2019; Fernández García-Andrade et al. 2019; Goodwill and Meloy 2019). Their reported effect sizes can be found in Table 5. Among the four studies, one reported a perfect effect size (AUC of 1.00; Fernández García-Andrade et al. 2019). This implies that in that study, the two participants who committed extremist violence had higher TRAP-18 scores than all the other 42 participants in the study. To enable inclusion in the meta-analysis, we applied a continuity correction, adjusting the AUC to 0.98. This adjusted value was then converted to a point-biserial correlation of r = 0.82 following Rice and Harris's (2005) guidelines. Once all effect sizes were converted to point-biserial correlations, a meta-analysis of the TRAP-18's postdictive validity was conducted, as shown in Table 6 and Figure 2.
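For transparency, the arithmetic behind this correction and conversion can be written out. Assuming the normal-model AUC-to-d relationship and the common equal-group-size d-to-r formula (our reading of the procedure, not a computation reported by the study authors), the corrected value works out as follows, consistent with the 0.824 shown in Table 5:

```latex
d = \sqrt{2}\,\Phi^{-1}(\mathrm{AUC}) = \sqrt{2}\,\Phi^{-1}(0.98) \approx 2.90,
\qquad
r_{pb} = \frac{d}{\sqrt{d^{2} + 4}} = \frac{2.90}{\sqrt{2.90^{2} + 4}} \approx 0.82
```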
Table 5 Summary of the evidence concerning the postdictive validity of the TRAP-18.
| Study | N | AUC | Cohen's d | Point-biserial r |
| Böckler et al. (2020) | 80 | 0.880 | — | 0.639 |
| Challacombe and Lucas (2019) | 58 | — | 1.700 | 0.648 |
| Fernández García-Andrade et al. (2019) | 44 | 1.000/0.980^a | — | 1.000/0.824^a |
| Goodwill and Meloy (2019) | 56 | — | 0.426^b | 0.208 |
Table 6 Meta-analysis of the postdictive validity of the TRAP-18 (random-effect model).
| | N | r [95% CI] | Z | p | Q | I² | τ² |
| Study | |||||||
| Böckler et al. (2020) | 80 | 0.639 [0.488–0.753] | 6.638 | 0.000 | — | — | — |
| Challacombe and Lucas (2019) | 58 | 0.648 [0.468–0.776] | 5.724 | 0.000 | — | — | — |
| Fernández García-Andrade et al. (2019) | 44 | 0.824 [0.698–0.901] | 7.486 | 0.000 | — | — | — |
| Goodwill and Meloy (2019) | 56 | 0.208 [−0.058–0.446] | 1.537 | 0.124 | — | — | — |
| Studies pooled | 238 | 0.619 [0.349–0.794] | 3.953 | 0.000 | 21.211 (df = 3, p = 0.000) | 86% | Z = 0.115 (r ≈ 0.107) |
| Sensitivity analysis | |||||||
| Böckler et al. (2020) removed | 158 | 0.613 [0.178–0.848] | 2.620 | 0.009 | 21.927 (df = 2, p = 0.000) | 91% | Z = 0.202 (r ≈ 0.199) |
| Challacombe and Lucas (2019) removed | 180 | 0.610 [0.200–0.838] | 2.748 | 0.006 | 21.907 (df = 2, p = 0.001) | 91% | Z = 0.181 (r ≈ 0.179) |
| Fernández García-Andrade et al. (2019) removed | 194 | 0.526 [0.261–0.732] | 3.281 | 0.001 | 11.522 (df = 2, p = 0.003) | 83% | Z = 0.078 (r ≈ 0.078) |
| Goodwill and Meloy (2019) removed | 182 | 0.708 [0.564–0.810] | 7.090 | 0.000 | 5.171 (df = 2, p = 0.075) | 61% | Z = 0.028 (r ≈ 0.028) |
[IMAGE OMITTED. SEE PDF]
The postdictive validity of the TRAP-18 for extremist violence varied considerably across studies, with individual effect sizes ranging from r = 0.208 to 0.824. The pooled correlation across the four included studies was large (r = 0.619, 95% CI = 0.349–0.794) and statistically significant. However, the heterogeneity was substantial (I² = 86%), suggesting that the variation in observed effect sizes likely reflects true differences between studies rather than sampling error alone. Sensitivity analyses revealed that the overall estimate was unstable: removing the weakest study (Goodwill and Meloy 2019) resulted in a markedly stronger pooled effect (r = 0.708, p < 0.001), whereas removing any of the other studies reduced the pooled correlation (to between r = 0.526 and 0.613), although it remained statistically significant in each case. These findings point to the promise of the TRAP-18 as a potential tool to distinguish between individuals who do and do not engage in extremist violence—but they also raise serious concerns about its consistency across samples.
To contextualize these findings, it is helpful to compare the TRAP-18 to established risk assessment tools in adjacent fields. For example, the Static-99R—a widely used instrument for assessing sexual recidivism risk—has demonstrated a pooled AUC of 0.69 (r ≈ 0.38) in a large-scale meta-analysis (Helmus et al. 2022). Similarly, the Violence Risk Appraisal Guide (VRAG; Harris et al. 1993) has shown an AUC of 0.74 (r ≈ 0.48), and the HCR-20, commonly used to assess violence risk in psychiatric settings, has reached an AUC of 0.70 (r ≈ 0.40) according to Singh et al. (2011). In the criminological field, the LS/CMI—a widely used general recidivism risk tool—showed an average correlation of approximately 0.30 in a meta-analysis by Olver et al. (2014). Against this backdrop, the TRAP-18's pooled correlation of r = 0.62 appears unusually high. While this may reflect the tool's potential to discriminate between individuals who did and did not engage in extremist violence, it also raises concerns about possible sampling bias or model overfitting, particularly given the retrospective nature and modest sample sizes of the included studies. Moreover, no study to date has tested the TRAP-18's predictive validity using prospective methods, which are known to result in more conservative estimates compared to retrospective designs (Hanson et al. 2009). As such, the current evidence base, though encouraging, remains preliminary and should be interpreted with appropriate caution.
Discussion
Summary of Main Results
The objectives of the current systematic review were to synthesize evidence about the psychometric validation of risk tools available in the PVE space. Encouraging results were found concerning the inter-rater agreement of scales in research contexts, but one of the two studies that examined it in a routine/field setting obtained disappointing results. Content validity studies were mostly positive, indicating that PVE risk tools adequately cover the risk factors and offending processes of individuals who go on to commit extremist violence. Construct validity analyses were few and far between, with results indicating that empirical divisions of scales did not match their conceptual divisions. The internal consistency of subscales was lackluster, while that of full scales was acceptable. Only one study examined convergent validity, and it revealed a lack of convergence, primarily due to particularities of the scale under study (the MLG-V2). Discriminant validity analyses were exploratory in nature rather than true tests of null versus alternative hypotheses, but suggested that most PVE risk tools might not be ideology-specific and may apply to both lone and group actors. Finally, even though encouraging results were found regarding the predictive validity of scales—arguably the most important validation criterion—effect sizes varied substantially and were based on research designs that cannot truly test predictive validity, hence the use of the term “postdictive validity” by the authors. Such data was only available on the TRAP-18.
Overall Completeness and Applicability of Evidence
The 20 studies on PVE risk tool validation included in this review focused on the following tools: the TRAP-18, ERG22+, MLG-V2, VERA, IVP guidance, and Der Screener—Islamismus. Apart from the TRAP-18, no risk tool had more than two validation studies comprising primary data. Furthermore, many risk tools beyond those did not benefit from any publicly available validation study. The Extremism Monitoring Instrument (EMI-20; Schmid 2014), IAT8, Islamic Radicalization (IR-46; Elzinga et al. 2010), Radar (Barrelle 2015), RADAR-iTE (Sadowski et al. 2021), Radicalization Risk Assessment in Prisons (RRAP; Esgalhado et al. 2018), Référentiel des indicateurs de basculement dans la radicalization (Comité interministériel de prévention de la délinquance 2016), and Vulnerability Assessment Framework (HM Government 2012) were all tools that were included in our search strategy, but for which no eligible validation study was found. The lack of psychometric validation data for many of these tools was also reported by Clesle et al. (2025) and Lloyd (2019).
This suggests that some widely used and publicized violent extremism risk tools rely on evidence that is either nonexistent, very slim, or not published by governments and organizations. The issue of data not being made available to researchers has been known for a long time in the PVE space (Ayres 2021; Sageman 2014) and continues to be a hindrance to the confidence that practitioners can have in violent extremism risk tools. In comparison, research on criminological risk scales dates back to the start of the 20th century, with Burgess' (1928) seminal paper on the predictors of parole success. Meta-analyses on criminal recidivism risk scales commonly find dozens of eligible papers (e.g., 128 for the LS/CMI [Olver et al. 2014] and 56 for the Static-99R [Helmus et al. 2022]). While data on perpetrators of extremist violence is sensitive, so is data on adjudicated individuals and perpetrators of sexual violence, which has nevertheless been used for more than 50 years to anchor best-practice guidelines for risk assessment, case management, and treatment of individuals involved in criminal careers.
Although several postdictive validity studies reported AUC values, which offer insight into a tool's overall performance, they provide only limited information about false positives and false negatives—a critical concern in low base rate contexts like PVE. Notably, only one study (Egan et al. 2016) reported sensitivity and specificity alongside AUC. However, this was not a postdictive design but rather a discriminant validity analysis comparing subtypes of attackers. As a result, we lack direct evidence on one of the most pressing concerns in the application of PVE risk tools: their potential to generate high false positive rates when applied in operational settings.
Another factor related to the completeness of the evidence base is that samples comprised nearly exclusively male participants. Some samples did comprise women, but these represented 2% to 5% of the total sample (Böckler et al. 2020; Egan et al. 2016; Goodwill and Meloy 2019; Meloy et al. 2015, 2019, 2021). One exception was present in Challacombe and Lucas (2019), where women constituted 14% of the sample. While this underrepresentation possibly reflects the lower involvement of women in violent extremist acts compared to men, it nonetheless limits the ability to draw gender-specific conclusions. As such, it is unlikely that current risk assessment research in the PVE field enables a meaningful understanding of the dynamics of risk in women.
Finally, none of the studies examined the predictive or postdictive validity of summary risk judgments (e.g., low/moderate/high) produced by practitioners using SPJ tools—judgments that differ from the additive scoring approaches used in current validation studies. This lack of predictive or postdictive validation for structured summary judgments represents a notable blind spot in the validation of PVE risk tools. It is particularly striking given that nearly all such tools follow an SPJ model. If developers of these instruments wish to empirically demonstrate the added value of SPJ frameworks over actuarial approaches, they should test whether professional judgments made after tool administration predict violent extremist outcomes—or whether they improve upon predictions based on summed item scores alone. Designs of this nature are well established in the broader field of violence risk assessment (Campbell et al. 2009; Singh et al. 2011) and could be readily adapted for PVE risk tool validation. In the absence of such evidence, it remains difficult to ascertain whether these tools meaningfully support decision-making in operational or clinical contexts, underscoring the limited ecological validity of currently available validation studies.
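To illustrate one way such a test could be run, the sketch below compares a logistic model of the outcome on the summed item score alone against a model that also includes the practitioner's summary judgment, using a likelihood ratio test. All variable names, the coding of the summary judgment, and the analytic choices are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def summary_judgment_increment(sum_scores, summary_judgments, outcomes):
    """Likelihood-ratio test of whether a structured summary judgment
    (e.g., coded 0 = low, 1 = moderate, 2 = high) adds predictive value
    beyond the summed item score. All inputs are equal-length 1D arrays;
    `outcomes` is coded 0/1 (no violent extremist outcome / outcome).
    """
    y = np.asarray(outcomes, dtype=float)
    x_base = sm.add_constant(np.asarray(sum_scores, dtype=float))
    x_full = sm.add_constant(
        np.column_stack([sum_scores, summary_judgments]).astype(float)
    )

    base = sm.Logit(y, x_base).fit(disp=0)
    full = sm.Logit(y, x_full).fit(disp=0)

    lr = 2.0 * (full.llf - base.llf)   # likelihood-ratio statistic
    p = chi2.sf(lr, df=1)              # one additional parameter
    return lr, p
```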
Quality of the Evidence
Risk of bias was omnipresent in risk tool validation studies and included multiple forms of reporting bias, as defined by Campbell Collaboration guidelines (Aloe et al. 2024). First, all studies but one on the TRAP-18 and all studies on the MLG-V2, IVP guidance, and VERA were based on publicly available data. This means that for these studies, information on participants resulted from public terrorist databases, newspaper articles, manifestos, Google searches, and autobiographies. While open-source data has notable strengths—such as accessibility, transparency, and relevance for understanding how radicalized individuals present in public discourse—it also comes with well-documented limitations. Scholars have emphasized that such data may be biased, influenced by government agendas, limited by the effectiveness of journalistic investigations, or, in the case of manifestos and autobiographies, susceptible to social desirability bias (Ayres 2021; Sageman 2014; Spaaij and Hamm 2015). Additionally, as emphasized by Brugh et al. (2023), using open-source data entails a lot of missing information, which makes tools such as the TRAP-18 impractical to score.
Importantly, in the context of violent extremism risk tool validation, the reliance on open-source data often constrains how outcomes are chosen, which may compromise their relevance for actual risk prediction and contribute to selective outcome reporting. For example, in Egan et al. (2016), because the database did not include non-attackers, the reported AUCs distinguished between different types of attackers (e.g., bombers vs. shooters) rather than between attackers and non-attackers—a comparison that, while potentially interesting for descriptive profiling, cannot establish predictive/postdictive validity. In turn, Meloy and Gill (2016) compared individuals whose attacks were thwarted to those who completed them—an outcome that conflates intent to act with factors such as planning, conscientiousness, and the effectiveness of the police investigation. Some postdictive validity studies, however, did compare radicalized individuals who acted out to those who did not (Böckler et al. 2020; Challacombe and Lucas 2019; Fernández García-Andrade et al. 2019; Goodwill and Meloy 2019).
The second main source of bias was the reliance on retrospective rather than prospective designs. Given that predictive validity is the litmus test of risk tool validation and that it can only be truly tested via prospective designs (Mathes and Pieper 2019), it was surprising that not a single prospective study was found. The lack of prospective designs was inextricably linked to the use of publicly available data, which by nature cannot be prospective. Using retrospective data risks confusing temporal sequences and opens the door to reverse causation and recall biases. Hindsight or outcome bias (Henriksen and Kaplan 2003) is also a concern: researchers retrospectively scoring risk tools based on known case outcomes (e.g., a completed attack) may unintentionally assign higher risk scores to those individuals, inflating apparent postdictive validity. That said, the absence of prospective designs in the included studies must also be interpreted within the broader context of PVE research. The low base rate of violent extremist recidivism, ethical constraints around prediction in non-correctional populations, and restricted access to institutional samples make prospective designs exceptionally difficult to implement (Borum 2015; Hodwitz 2021; Sarma 2017; Silke and Morrison 2020). As such, while the lack of prospective research weakens the overall certainty of evidence, it may reflect practical limitations rather than methodological neglect.
Third, very small samples were a threat to validity. Five studies relied on samples of five or fewer individuals preselected by authors (Beardsley and Beech 2013; Collins and Clark 2021; Dmitrieva and Meloy 2022; Erlandsson and Reid Meloy 2018; Hart et al. 2017). They were mostly descriptive, did not enable a test of the null hypothesis, and were likely limited in terms of external validity.
The trifecta of using small sample sizes, retrospective open-source data, and convenience outcomes—along with the associated biases—represents a real threat to generalizability. According to our initial guidelines, no reviewed studies could be rated as methodologically robust. Three were rated as average (Fernández García-Andrade et al. 2019; Powis et al. 2019, 2021), as they did not rely on open-source data, and the rest (k = 17) were rated as below average. It is not a given that the findings gathered in the current systematic review, particularly in terms of predictive/postdictive validity, will be replicated once more methodologically rigorous designs are employed. Interestingly, these flaws did not manifest in COSMIN scores, which seemed to assess mostly for sample size considerations and the appropriateness of the analytical strategy. Studies that obtained lower COSMIN scores were mostly case studies or studies with very few participants.
In addition to these structural limitations, author affiliation, study funding, and potential conflicts of interest represent a final source of reporting bias. Although such involvement is common in early-stage validation research and often reflects specialized expertise with the instrument, it also raises concerns about potential inflation of predictive/postdictive validity estimates. Fazel et al. (2022) found that studies conducted by tool developers tend to report more favorable predictive performance than independent evaluations. For instance, the TRAP-18 is a commercially available instrument marketed by Multi-Health Systems, and its lead developer, who is an author on most validation studies, may receive royalties. While such arrangements are not uncommon in psychological assessment, they underscore the need for independent replication to reduce the risk of bias and enhance confidence in the generalizability of findings. We do not imply any ill intent or misconduct—on the contrary, most authors in this review were transparent about their methodological constraints. However, in accordance with Campbell Collaboration guidelines, this potential conflict of interest must be acknowledged. Independent replications remain essential to corroborate the utility and validity of these tools—and we are confident that such studies will emerge in due course.
Another core limitation of the current evidence base is its lack of ecological validity. Nearly all included studies relied on open-source data or clinical vignettes, rather than real-world assessments conducted by practitioners. These designs often involve limited and/or missing data, which may underrepresent key psychosocial and contextual factors relevant to risk. Moreover, they do not reflect the informational environments in which practitioners operate—environments that vary substantially depending on the intervention setting (RTI International 2018). In secondary prevention contexts, for instance, practitioners may lack access to police or institutional records (and may deliberately avoid relying on them), but are typically better positioned to assess psychological functioning, motivation, and protective factors through direct engagement. In tertiary or correctional settings, by contrast, access to institutional and criminal history data is generally more robust. Yet to date, no prospective validation studies of PVE risk tools have captured this variation—unlike what has been done in the broader criminological risk assessment literature (e.g., Hanson et al. 2007). In short, while it remains unclear how more ecologically valid conditions will impact estimates of reliability and validity, it is evident that current research does not reflect real-world assessment contexts—posing a key limitation for interpreting the practical utility of these tools.
Despite the substantial threats to validity and potential reporting biases, it is paramount to put things in perspective: Imperfect data is preferable to no data—especially if it is presented as such. In most of the reviewed papers, authors were very cognizant of the limitations of the evidence and drew their conclusions accordingly. We noted very few limitations that the authors had not already mentioned. As such, we did not feel that authors and/or risk tool developers were minimizing data quality issues while presenting their tool as the gold standard. This also manifested in the quality of analytical strategies, which generally upheld standards in risk tool psychometric testing and validation (Brouillette-Alarie et al. 2016; Hanson 2021; Helmus and Babchishin 2017). Flaws were mostly present in research designs and sample constitutions. Researchers and developers of PVE risk tools can only do so much with the data they are provided and authorized to access. It is easy for systematic review authors to report that methodological designs are not strong enough; it is much harder for researchers, stakeholders, and practitioners to assemble the conditions to make better research happen. In that respect, we acknowledge the numerous hurdles that shift researchers towards open-source data.
Potential Biases in the Review Process
The current systematic review is not without limitations. First, the December 31, 2021, end date for inclusion means that some papers that would otherwise have been eligible were not included in the review. We have documented some of these papers, which are listed in Table 7 (note that this list is not intended to be exhaustive). Although we did not integrate them into our analyses, we did go through each of these papers and can conclude that their results do not impact our main conclusion: To date, there is no prospective study attesting to the predictive validity of any violent extremism risk tool. Some of these papers are, however, methodologically sophisticated and would compare favorably, in terms of sample constitution and methods, to those included in this systematic review. These papers also provide new evidence on the VERA-2R, which was until recently quite slim.
Table 7 Recently published papers on violent extremism risk tools.
| Study | Tools | Data source | Sample | Reliability | Validity |
| --- | --- | --- | --- | --- | --- |
| Challacombe and Patrick (2023) | TRAP-18 | Publicly available data | N = 101 (Capitol insurrectionists) | Interrater | Postdictive |
| Cherney and Belton (2024) | VERA-2R | Publicly available data | N = 50 (various ideologies) | Interrater | Content, postdictive |
| Corner and Pyszora (2022) | TRAP-18 | Focus groups + interviews | N = 58 (PVE experts) | N/A | Content |
| Corner and Taylor (2023) | Radar, VERA-2R | Publicly available data + official records | N = 30 (university students, PVE experts, trained assessors) using the tools on 60 anonymized vignettes | Interrater | Content, discriminant^a, postdictive |
| Duits and Kempes (2023) | VERA-2R | Official records | N = 2 raters (researchers) using the tool on 30 extremist cases | Interrater | N/A |
| Elliott et al. (2023) | ERG22+ | Official records | N = 310 (various ideologies but mostly Islamist VE) | Internal consistency | Construct, content |
| Tassin and Allely (2022) | TRAP-18 | Publicly available data | N = 1 (Islamist VE) | N/A | Content |
Second, because of unreleased PVE risk tool evaluations, we may only have a truncated picture of the reliability, validity, and usefulness of such tools. Systematic reviews are limited by the accessibility of scientific evidence, and evidence kept under wraps will necessarily have evaded the scrutiny of the current review.
Third, another limitation may result from the variability introduced by each rater. We attempted to address this by measuring and monitoring inter-rater agreement rates, updating training as required, and reaching consensus when raters had divergent ratings or made different selections. However, inter-rater reliability remained in the moderate range, suggesting room for improvement.
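For readers less familiar with agreement statistics, Cohen's kappa is the conventional index for monitoring this kind of categorical agreement between two raters (e.g., include/exclude screening decisions). The short sketch below is an illustration only, using made-up ratings rather than our actual screening data, and shows how chance-corrected agreement is computed.

```python
# Illustration only: Cohen's kappa for two raters' include/exclude screening
# decisions. The ratings are made up and do not come from the present review.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n          # raw agreement
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)     # agreement expected by chance
    return (observed - expected) / (1 - expected)

rater_1 = ["include", "exclude", "exclude", "include", "exclude", "exclude", "include", "exclude"]
rater_2 = ["include", "exclude", "include", "include", "exclude", "exclude", "exclude", "exclude"]
print(f"Cohen's kappa = {cohens_kappa(rater_1, rater_2):.2f}")  # about 0.47 for these toy ratings
```

Raw percentage agreement can look reassuring even when much of it would be expected by chance; kappa corrects for this, which is why it is the benchmark behind the "moderate" range referred to above.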
Finally, because of the limited methodological quality of PVE risk tool validation research, we were not able to complete objective 4, which aimed to provide tool recommendations for specific audiences and contexts. The current state of evidence precludes the recommendation of any tool over another—much less specific ones for specific settings. Indeed, the heterogeneity of validation techniques and outcomes precluded us from running moderator analyses, as too few studies were available on the same tools to enable meta-regression data synthesis. It is noteworthy, however, that tools were much more ideology-agnostic than anticipated. Apart from Der Screener—Islamismus, surveyed tools were designed to apply to radicalized individuals no matter their ideology, and discriminant validity (or equity) evaluations suggested these tools did not provide arbitrarily higher scores for specific ideological affiliations (Corner and Taylor 2023; Meloy and Gill 2016).
Authors' Conclusions
Implications for Research
The findings of this review, despite the aforementioned limitations, allow us to offer informed recommendations for future work. With regard to research, first and foremost, the issue of data being withheld by governments and organizations is well-documented in the field (Cubitt and Wolbers 2022; RTI International 2018). Some tools rely on validation data only present in internal reports (e.g., the IR-46; Lloyd 2019). In other cases, data that exists is neither made available to researchers for independent evaluation nor published by government-affiliated researchers in publicly available organization reports or peer-reviewed scientific journals. For example, many governments and security agencies are known to use the VERA-2R to structure surveillance, case management, and resource allocation in both pre- and post-crime contexts (Duits et al. 2023). However, no studies have yet been published demonstrating whether these screenings are predictive of future extremist violence or intentions to act, despite the fact that operational use implies that relevant data may exist. Two recent postdictive studies (Cherney and Belton 2024; Corner and Taylor 2023) signal growing interest in addressing this gap, but predictive evidence remains elusive. In parallel, existing ERG22+ data could feasibly support predictive validity analyses if the criminal records of assessed individuals were collected to identify subsequent extremism-related offenses (or their absence).
However, low base rates would likely be an issue for such studies. While base rates of recidivism for violent extremism (approximately 3%; Hodwitz 2021; Silke and Morrison 2020) are not dramatically lower than those observed in other fields like sexual violence prevention (5%–10%; Lussier et al. 2023), the key issue is the substantially smaller sample sizes typically available in violent extremism research. As a result, the absolute number of individuals who reoffend is often so low that group comparisons or predictive analyses become severely underpowered, regardless of the statistical technique employed. Although effect size measures such as AUC are relatively robust to low base rates (Fawcett 2006; Swets 1988), they still require a sufficient number of positive outcome cases to yield stable and interpretable estimates. This constraint highlights the need for outcome measures that extend beyond terrorism-related recidivism. Future PVE risk tool validation studies could consider multidimensional outcomes that encompass extremist attitudes, intentions, or lower-level behaviors. Such outcomes are conceptually relevant and would likely occur at a higher frequency, increasing the number of analyzable cases and improving statistical power. For example, tertiary prevention evaluations have often used proxy measures of re-radicalization due to the frequent unavailability of recidivism data (Hassan et al. 2021; Demant et al. 2009; van der Heide and Schuurman 2018). While such alternative outcomes are not recommended for operational use, they may provide meaningful insights for research purposes.
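To make the power problem concrete, the following minimal simulation is our own illustration, not drawn from any reviewed study. It assumes a 3% base rate and a hypothetical risk scale on which reoffenders score, on average, 0.8 standard deviations higher than non-reoffenders (a true AUC of roughly 0.71), and shows how the spread of AUC estimates widens as sample size shrinks.

```python
# Illustrative simulation (not from the reviewed studies): instability of AUC
# estimates when a low base rate (3%) is combined with small samples.
import numpy as np

rng = np.random.default_rng(42)

def auc(scores_pos, scores_neg):
    """Mann-Whitney estimate of AUC: P(positive case scores higher); ties count 0.5."""
    diffs = scores_pos[:, None] - scores_neg[None, :]
    return np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

def simulate(n, base_rate=0.03, effect_size=0.8, reps=2000):
    """Draw `reps` samples of size `n` and return the AUC estimate from each."""
    estimates = []
    for _ in range(reps):
        reoffended = rng.random(n) < base_rate       # outcome: who reoffends
        if reoffended.sum() == 0:                    # AUC is undefined without any positive case
            continue
        scores = rng.normal(0.0, 1.0, n)             # hypothetical risk scale scores
        scores[reoffended] += effect_size            # reoffenders score higher on average
        estimates.append(auc(scores[reoffended], scores[~reoffended]))
    return np.array(estimates)

for n in (100, 500, 2000):
    est = simulate(n)
    print(f"n = {n:4d}: expected positives = {0.03 * n:5.1f}, "
          f"median AUC = {np.median(est):.2f}, "
          f"90% of estimates in [{np.percentile(est, 5):.2f}, {np.percentile(est, 95):.2f}]")
```

With n = 100, only about three positive cases are expected, and the interval containing 90% of the simulated AUC estimates spans much of the plausible range; with n = 2000, estimates cluster tightly around the true value. This is the sense in which analyses become underpowered regardless of the statistical technique employed, and why outcome measures that occur at higher frequencies would improve statistical power.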
Beyond sharing currently undisclosed data with researchers, there is also a need in the PVE space to develop and implement new, robust research initiatives to assess the reliability, validity, and relevance of risk assessment tools. Despite the numerous articles published since Scarcella et al.'s (2016) systematic review, the robustness of research designs has not improved substantially, which suggests that new initiatives are needed rather than the reuse of the same open-source data currently in circulation. Enabling new research efforts in the field will likely require multisectoral collaboration. Such initiatives need to be funded by decision-makers who see value in obtaining accurate data about PVE risk tools (including changing their practices if results are not in the anticipated direction). They also need to be designed and implemented by researchers with expertise in risk tool evaluation and prospective designs. Crucially, future research initiatives need buy-in from practitioners, as they will be the ones using the tools and collecting the data. If practitioners are neither well trained nor receptive to using such tools (because of time constraints or personal reservations), they are unlikely to collect reliable data on them, which will curtail both reliability and validity. Quality of training was found to be one of the most important moderators of predictive validity for sexual violence scales, with higher-quality training consistently associated with higher predictive validity (Helmus et al. 2022). Finally, some degree of high-level provincial and federal coordination would also be helpful to ensure that data collected by different organizations and provinces are comparable. This would allow data to be pooled into larger samples, which is likely necessary for discriminative analyses, considering the low base rate of violent extremism reoffending.
In sum, the current systematic review highlights in no uncertain terms the need for higher-quality validation data as it pertains to PVE risk assessment tools. Non-convenience samples and prospective predictive validity designs are urgently needed. Even though some authors have questioned the viability of such analyses (e.g., Sarma 2017), quality research from the fields of sexual violence, general criminality, and violence has proven that such analyses are possible, even with relatively low base rates (e.g., Helmus et al. 2022; Olver et al. 2014).
Finally, we do not believe that further reviews of the available PVE risk assessment literature are needed, as literature reviews on PVE risk tools have multiplied in recent years and have mostly reached the same conclusion about the lack of quality validation data (Corner and Taylor 2023; Clesle et al. 2025; Cubitt and Wolbers 2022; Risk Management Authority 2021; RTI International 2018; Scarcella et al. 2016; van der Heide et al. 2019). If forced to choose between funding new research initiatives and funding new literature syntheses, we would therefore encourage the former, at least until the quality of validation data improves sufficiently to warrant a new investigation of the available literature.
Implications for Practice
The current state of validation data raises concerns regarding the legal admissibility of violent extremism risk tools in judicial proceedings. As an illustrative example, in the United States, the Frye v. United States (1923) and Daubert v. Merrell Dow Pharmaceuticals Inc. (1993) standards outline criteria to determine the admissibility of expert testimony and the tools used to support it. These include: (1) empirical testing of the tool's reliability and validity; (2) a known error rate or misclassification potential; (3) peer-reviewed publication of supporting evidence; and (4) general acceptance within the relevant scientific community (Faigman et al. 2014, 2016; Helmus et al. 2022; Neal et al. 2019). While the TRAP-18 is the only tool for which postdictive validity data were readily available, it may not meet the first two criteria due to methodological limitations and possible bias in the validation studies. This example from U.S. law highlights broader issues that would likely raise admissibility concerns in other jurisdictions as well.
In many legal systems, decisions related to liberty, supervision, or sentencing require the use of rigorously validated tools. At present, the evidence base supporting PVE risk tools may be insufficient to provide courts with confidence that these instruments consistently produce valid assessments. Moreover, consensus regarding the functionality of PVE risk tools is lacking within the field, as reflected in critical commentary from numerous scholars (e.g., Borum 2015; Monahan 2012; Sarma 2017). Although some critiques may hold these tools to idealized standards that even well-established instruments in other fields may not meet, the absence of high-quality evidence remains a legitimate concern—particularly when such tools inform decisions that significantly impact individuals' rights and liberties. Legal actors and policymakers should therefore exercise caution when relying on PVE risk tools in high-stakes contexts.
The lack of predictive validity data does not mean, however, that PVE risk tools should be discarded entirely. The last 100 years of research in criminology have unequivocally demonstrated that structured assessments of risk outperform unstructured assessments (i.e., those relying on pure clinical judgment without support from any tool; Hanson 2009). In that sense, currently available risk tools constitute lists of (mostly) relevant risk and protective factors for violent extremism that can be used by practitioners to structure clinical reflections, assess risk, and plan interventions. Assessing each case by considering all relevant risk and protective factors while discarding irrelevant ones has historically been the recipe for success in risk assessment (Hanson 2009).
To that end, it might be worthwhile for risk tool developers to revisit the risk and protective factors (items) present in PVE risk tools, as the scientific literature has significantly evolved since the design and release of most of the tools for which validation studies were documented (Corner and Taylor 2023). Many authors critical of the viability of risk assessment in the field rightly mentioned the lack of scientific literature on risk and protective factors for violent extremism at the time risk tools were developed (Borum 2015; Monahan 2012; Sarma 2017). Now that, years later, the first meta-analyses and systematic reviews have come out (Desmarais et al. 2017; Emmelkamp et al. 2020; Gill et al. 2021; Lösel et al. 2018; Misiak et al. 2019; Vergani et al. 2020; Wolfowicz et al. 2020), it may be time to review existing tools with an updated evidence base. Regarding potential updates, many authors have mentioned that ideological and political factors may have been overemphasized compared to risk factors for crime and violence (Wolfowicz et al. 2020). The relative weight of these dimensions could thus be revisited in currently used assessment scales.
Agreements and Disagreements With Other Studies and Reviews
The conclusions of the present Campbell review are broadly consistent with those of two previous systematic reviews on violent extremism risk assessment tools (Clesle et al. 2025; Scarcella et al. 2016), despite important differences in scope, methods, inclusion criteria, and publication timing. For reference, of the 20 studies included in our review, 3 were also covered in Scarcella et al. (2016) and 10 in Clesle et al. (2025). Although most of the validation studies in this field were published within the last 8–9 years, Scarcella et al. had already concluded in 2016 that "based on the quality reporting and on the psychometric properties (or the lack thereof), there is no substantial evidence that would enable the authors to recommend one instrument over another" (p. 17). Nearly a decade later, our conclusion remains, unfortunately, quite similar.
All three systematic reviews converge on a central finding: the evidence base concerning the psychometric properties of violent extremism risk tools remains insufficient to support confident, empirically informed guidelines for tool selection and deployment. Although a growing number of such tools are being used by frontline practitioners, data on their validity remains sparse (Clesle et al. 2025). Even the TRAP-18—supported by what appears to be the most active research community—still lacks essential validation, particularly in the form of prospective predictive validity analyses. Some methodologically rigorous studies have begun to emerge for the ERG22+ and VERA-2R, occasionally using official records instead of publicly available data; however, robust predictive or postdictive validity analyses are still missing. Moreover, recent work has raised concerns about the psychometric properties of the VERA-2R (Cherney and Belton 2024; Corner and Taylor 2023).
That all three systematic reviews—despite differences in inclusion criteria, methodology, and intended audiences—arrive at similarly cautious conclusions underscores the persistent challenges of conducting rigorous validation research in this field. Collectively, these reviews portray a maturing field still in search of a solid empirical footing, and underscore the urgent need for better-designed, prospective, and transparent studies to support sound clinical and operational decision-making in PVE. Encouragingly, the increasing volume and methodological sophistication of recent studies suggest that the field is moving in the right direction, laying the groundwork for a stronger and more evidence-informed future.
Author Contributions
Sébastien Brouillette-Alarie and Ghayda Hassan: co-leads of the systematic review who were involved at all stages of the process (conceptualization of the review, production of the search strategy, coordination of the search, selection, and data extraction phases, analysis of the data, and redaction of the manuscript).
Wynnpaul Varela, Emmanuel Danis, and Deniz Kilinc: research assistants who were involved in the selection of articles, data extraction, and analysis of the data. Wynnpaul Varela also acted as our English writing expert. Deniz Kilinc completed his studies in 2020 and, thus, was mostly involved in the early stages of the review.
Sarah Ousman, Pablo Madriaza, and Eugene Borokhovski: co-authors involved in the conceptualization of the review and manuscript writing. Sarah Ousman substantially helped to develop and test the search strategy.
Inga L. Pauls and Robert Pelzer: responsible for coordinating and conducting the German literature search, as well as selecting and coding these articles. They also contributed to drafting the manuscript.
David Pickup: library science expert who reviewed the search syntax, conducted the official literature searches, indexed the results, and produced documents for the article selection phase.
Acknowledgments
The authors of this study would like to thank the staff of the Campbell Collaboration for their support throughout the systematic review process, especially Vivian Welch, Liz Eggins, and other members of the Crime and Justice Group.
Conflicts of Interest
The authors declare no conflicts of interest.
Transparent Peer Review
The peer review history for this article is available at .
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Plans for Updating the Review
At present, there are no plans to update this review, as the allocated funding has been fully expended. However, should the evidence base on violent extremism risk assessment tools expand significantly and warrant a new synthesis, the lead researchers would likely seek funding to support an update. All data extraction and coding files have been securely archived and can support future updating efforts if resources become available.
Differences Between Protocol and Review
Deviations from the initial protocol include conducting a literature search specifically in German (requested by Public Safety Canada; see the Bibliographic Search Phases section) and the inability to conduct more advanced meta-analytic techniques (e.g., meta-regression) due to the limited number of studies available for quantitative data synthesis (see the Data Synthesis section). Additionally, the Australian Criminology Database (CINCH, Informit), which had been listed in the protocol, was not searched due to a miscommunication between our team and the Campbell editorial group. This omission has been noted, and the database will be included in any future update of the review, should one be undertaken.
Note that the presentation of the methodology slightly diverges from the protocol to improve readability and clarity, especially in the Search Methods for Identification of Studies section. However, this has no practical implications for the way the search was conducted.
Sources of Support
This project was funded by the Community Resilience Fund of Public Safety Canada, in collaboration with the Crime and Justice Group of the Campbell Collaboration. The views expressed herein do not necessarily reflect those of Public Safety Canada or the Campbell Collaboration.