Introduction
Interpretations of therapeutic superiority in phase III oncology trials rely on statistical significance1. Statistical significance is defined by a P value from the survival model at or below a pre-specified threshold (most often 0.05)2,3. Although using statistical significance to define superiority is essentially universal in phase III oncology trials, the robustness of this practice has been increasingly questioned4,5. P is a continuous statistic, yet statistical significance treats P as a categorical variable. For example, when significance is defined as P < 0.05, statistical significance assigns diametrically opposed interpretations to P values of 0.049 and 0.051, even though the information provided by the two values is nearly identical6–10. Further compounding the issue, P is known to vary randomly between different samples6–10. Thus, many have voiced concerns that the current practice of using P value thresholds may not reliably establish whether an oncology treatment offers superior treatment effects compared with control4.
The fragility index (FI) has been proposed as a method to study and quantify the robustness of statistical significance11,12. The FI is canonically defined as the minimum number of individual patient outcomes that must change to alter the interpretation of the trial according to the statistical significance definition. For trials studying binary endpoints, this canonical definition of FI is appropriate. However, oncology trials most commonly investigate survival outcomes, which have a binary component (e.g., alive or dead) and a continuous component (i.e., time from randomization to event or censoring). To better model fragility in oncology, the survival-inferred fragility index (SIFI) has been proposed to specifically address the issues associated with survival endpoints13,14. SIFI entails an iterative re-assignment of patients between arms until the significance interpretation is changed, leveraging the principles that randomization itself naturally produces small variations in the distribution of patients and that crossover between arms is commonplace. While SIFI still carries important assumptions because treatment assignment is the focus of perturbation, FI for survival data entails far less realistic assumptions because it alters a patient’s survival status itself (e.g., from alive to dead). Consistent with this, the median FI of phase III oncology trials was estimated as only 2 patients in the study by Del Paggio and Tannock, whereas the median SIFI was estimated more conservatively as 5 patients or 13 patients in studies by Bomze et al. and Olsen et al., respectively13,15,16.
However, data from individual patients are required to estimate SIFI, as opposed to the simple 2 × 2 table used to calculate FI. In oncology, individual patient-level data are rarely shared, which has been a major limitation for understanding fragility in oncology trials at scale17. In this study, we sought to estimate SIFI among 230 phase III oncology trials using manually reconstructed individual patient-level data. The purpose of the present study was to provide an updated, comprehensive characterization of the robustness of statistical significance specific to phase III oncology trials to better understand whether alternative approaches to clinical trial design and interpretation are needed. We also investigated multiple methods of evaluating SIFI, compared the effects of different survival models on SIFI, and associated trial-level characteristics with SIFI.
Results
Characteristics of eligible trials
We included 230 trials, which were published from 2005 to 2020 and enrolled 184,752 patients in aggregate (Table S1). Most trials evaluated surrogate primary endpoints (PEPs) (n = 140, 61%). The PEP was positive (i.e., the experimental arm was interpreted as demonstrating superiority relative to the control arm) in 120 trials (52%), and led to Food and Drug Administration (FDA) approval of the experimental therapy in 82 trials (36%).
Estimations of survival-inferred fragility
SIFI was first estimated by iteratively re-assigning the best survivors between arms until the statistical significance interpretation was altered. For trials with a positive PEP, the best survivors were re-assigned from the experimental arm to the control arm, whereas for trials with a negative PEP, the best survivors were re-assigned from the control arm to the experimental arm. SIFIB, therefore, was defined as the minimum number of re-assigned patients (i.e., the best survivors) needed to change the trial’s statistical significance interpretation, either from positive to negative or from negative to positive. Across all trials, the median SIFIB was 8 patients (IQR, 4–19 patients), and the distribution of SIFIB is shown in Fig. 1A. In other words, in half of the trials, re-assigning 8 or fewer of the longest survivors to the other randomized arm would have changed the PEP interpretation. Furthermore, because of the heterogeneity in the number of patients enrolled in each included trial, we also normalized SIFIB as a percentage of enrolled patients to improve cross-trial comparisons. As a percentage of enrolled patients, the median SIFIB was 1.4% (IQR, 0.7%–3%) (Fig. 1B).
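To make the iterative re-assignment procedure concrete, the sketch below is a minimal Python illustration. It substitutes a plain two-sample log-rank test for the trial-specific Cox models used in the study, and the names `logrank_p` and `sifi_best` are our own for illustration, not the published code (which is written in R and included in the Supplement).

```python
import math

def logrank_p(data):
    """Two-sided P from a two-sample log-rank test (chi-square, 1 df).
    data: iterable of (time, event, group) with event in {0, 1}, group in {0, 1}."""
    event_times = sorted({t for t, e, g in data if e == 1})
    o_minus_e, var = 0.0, 0.0
    for t in event_times:
        n = sum(1 for ti, e, g in data if ti >= t)                  # at risk
        n1 = sum(1 for ti, e, g in data if ti >= t and g == 1)      # at risk, arm 1
        d = sum(1 for ti, e, g in data if ti == t and e == 1)       # events at t
        d1 = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        if n < 2:
            continue
        o_minus_e += d1 - d * n1 / n
        var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    # survival function of chi-square with 1 df: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(o_minus_e ** 2 / var / 2))

def sifi_best(data, alpha=0.05):
    """Count how many best (longest-surviving) experimental patients (group 1)
    must be re-assigned to control (group 0) before significance is lost
    (the positive-trial case described above)."""
    rows = [list(r) for r in data]
    flipped = 0
    while logrank_p(rows) < alpha:
        exp_rows = [r for r in rows if r[2] == 1]
        if not exp_rows:
            break
        max(exp_rows, key=lambda r: r[0])[2] = 0   # re-assign the best survivor
        flipped += 1
    return flipped
```

On a toy positive trial with clearly separated arms, `sifi_best` returns the number of flips needed before the log-rank P rises above the trial's threshold; the negative-trial case would flip control patients into the experimental arm instead.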
Fig. 1
The survival-inferred fragility index (SIFI) for the primary endpoints of phase III oncology trials according to SIFIB, SIFIW, and SIFIM.
A The absolute numbers of patients and B the percentages of the number of patients studied in the primary endpoint analyses are shown. Bars represent medians and interquartile ranges.
Focusing SIFI estimations on the best surviving patients, while preferable to FI, may still entail more assumptions than re-assigning other types of patients between treatment arms. Thus, we performed sensitivity assessments using two other methods of estimating SIFI: SIFIW, in which the worst (shortest) surviving patients were re-assigned, and SIFIM, in which the median surviving patients were re-assigned. The median SIFIW and SIFIM were 11 patients (IQR, 6–20 patients) and 17 patients (IQR, 9–30 patients), respectively (Fig. 1A). Expressed as proportions of patients included in the PEP analyses, the median SIFIW and SIFIM percentages were 1.9% (IQR, 0.9%–3.8%) and 3% (IQR, 1.3%–6.0%), respectively (Fig. 1B). Correlations between the SIFI methods were strong, suggesting that SIFIB represents a reasonable approach to measuring survival-inferred fragility (Fig. 2).
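The three SIFI variants differ only in which patient is selected for re-assignment at each iteration. A minimal sketch of that selection step (the function name `pick_for_flip` is our own, not from the published code):

```python
def pick_for_flip(arm, method):
    """Choose which patient in an arm to re-assign at each SIFI iteration.
    arm: list of (time, event) outcomes; method: 'best', 'worst', or 'median'."""
    ranked = sorted(arm, key=lambda p: p[0])   # rank patients by survival time
    if method == "best":       # SIFI_B: the longest survivor
        return ranked[-1]
    if method == "worst":      # SIFI_W: the shortest survivor
        return ranked[0]
    return ranked[len(ranked) // 2]            # SIFI_M: the median survivor
```

Because only the selection rule changes, the strong concordance observed across SIFIB, SIFIW, and SIFIM suggests the overall iterative framework, rather than the choice of patient, drives the fragility estimate.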
Fig. 2
Different methods of calculating survival-inferred fragility index (SIFI) percentages provide highly concordant results.
Spearman’s rank correlation coefficient is shown. In each panel, the solid line is the best-fit univariable regression and the dashed line is the line of identity. A SIFIW vs SIFIB; B SIFIM vs SIFIB; C SIFIM vs SIFIW.
Associations of trial-level covariates with severe survival-inferred fragility
While understanding the absolute number and proportion of patients needed to alter the significance interpretation is important, defining severe and non-severe survival-inferred fragility at the trial level is useful for cross-trial comparisons and for understanding the types of trials at greatest risk of fragile conclusions. Thus, based on previous work, we defined severe survival-inferred fragility as SIFI ≤ 1% of the study enrollment15. By this definition, we observed severe survival-inferred fragility in 87 trials (38%) by the SIFIB method, 68 trials (30%) by the SIFIW method, and 49 trials (21%) by the SIFIM method.
Compared with trials with surrogate PEPs, trials with overall survival PEPs were associated with greater odds of severe survival-inferred fragility by the SIFIB method (OR, 2.33; 95% CI, 1.35–4.06; P = 0.003). This association held when comparing SIFIB values directly (median 1.0% vs 1.8%; P < 0.0001) (Fig. 3A). In general, trials with surrogate PEPs were more likely to claim superiority than trials with overall survival PEPs (64% vs 30%; P < 0.0001). The greater event rates and power typical of surrogate PEPs may thus underlie the observed differences in SIFI and severe survival-inferred fragility.
Fig. 3
Trials claiming superiority, studying surrogate endpoints, and receiving FDA approval tend to have larger survival-inferred fragility indices (SIFI).
The association of fragility, as measured by SIFIB, with trial-level features: A primary endpoint type; B statistical significance; C subsequent FDA approval. P by Mann–Whitney U test. Bars represent medians and interquartile ranges.
P itself was closely associated with SIFI (−log10(P): ρ, 0.25; P = 0.0002) and with severe survival-inferred fragility (−log10(P): OR, 0.73; 95% CI, 0.63–0.83; P < 0.0001) (Fig. S1). The association with severe survival-inferred fragility persisted after conditioning on the type of PEP (adjusted OR, 0.75; 95% CI, 0.64–0.86; P = 0.0001). This result was also consistent across SIFI methodologies (Fig. S1). Interestingly, trials claiming superiority appeared to have greater SIFI values (i.e., were less fragile) than trials that did not claim superiority (median 1.9% vs 1.1%; P = 0.0002) (Fig. 3B). This finding strengthens the argument that trials without significant differences (i.e., those with high P values) are better interpreted as inconclusive rather than as showing equivalent effects18. However, despite this association, there was only a weak signal that categorical statistical significance interpretations were associated with severe survival-inferred fragility (OR, 0.62; 95% CI, 0.36–1.06; P = 0.08). This association became inconclusive after adjusting for the potential confounding effects of PEP type on trial outcome (adjusted OR, 0.78; 95% CI, 0.44–1.37; P = 0.38). Thus, while P values themselves appear to convey information about SIFI and severe SIFI, statistical significance interpretations fail to reliably communicate the underlying robustness and strength of the findings.
On the other hand, the FDA approval process did appear to differentiate trials by survival-inferred fragility. The median SIFIB percentage of trials that did or did not lead to FDA approval was 3.1% vs 1.1%, respectively (P < 0.0001) (Fig. 3C). Trials with FDA approval were associated with lower odds of severe SIFIB (OR, 0.47; 95% CI, 0.26–0.83; P = 0.01). This association persisted after adjustment for PEP type (adjusted OR, 0.53; 95% CI, 0.29–0.96; P = 0.04). On sensitivity analysis, this finding persisted using the SIFIW methodology (adjusted OR, 0.52; 95% CI, 0.27–0.96; P = 0.04), although the effect was weaker using the SIFIM methodology (adjusted OR, 0.59; 95% CI, 0.28–1.17; P = 0.14).
Regarding other characteristics, more recently published trials exhibited lower odds of severe fragility (OR, 0.92; 95% CI, 0.84–0.999; P = 0.049), implying that the findings of more recent trials may be more robust. Other trial-level characteristics were not associated with severe fragility, e.g., enrollment (per 100 patients enrolled: OR, 1.02; 95% CI, 0.98–1.05; P = 0.33), industry sponsorship (OR, 1.33; 95% CI, 0.59–3.22; P = 0.51), and immunotherapy study type (OR, 1.16; 95% CI, 0.41–3.15; P = 0.77).
Sensitivity analyses of survival-inferred fragility based on the survival model
Cox regression is the most common survival model used in oncology trials, and we therefore based estimations of SIFI on statistical significance defined by Cox regression. However, Cox regression has multiple advantages and disadvantages compared with other survival models19,20. To explore the effects of the underlying survival model on SIFI, we calculated SIFI using three alternative models: restricted mean survival time (RMST), MaxCombo2, and MaxCombo3. RMST does not rely on the proportional hazards assumption, while the MaxCombo approaches differentially weight the survival curve to account for late separation of the curves or diminishing relative effects over time. The effects of these different survival models on the SIFI estimates are shown in Fig. 4. SIFIB values were highest (i.e., least fragile) when estimated by the RMST model (P < 0.0001 vs Cox) and lowest (i.e., most fragile) when estimated by the MaxCombo approaches (P < 0.0001 and P = 0.03 for MaxCombo2 and MaxCombo3, respectively, vs Cox) (Fig. 4A). The inverse was found for SIFIW values, for which MaxCombo2 and MaxCombo3 were least fragile and RMST was most fragile versus Cox (Fig. 4B). Taken together, particularly when treatment effects change over time (i.e., the proportional hazards assumption does not hold) and the Cox model is less suitable, the choice of the underlying survival model appears to have a large impact on fragility and robustness.
Fig. 4
The survival-inferred fragility index (SIFI) estimates are influenced by the underlying survival model.
A SIFIB and B SIFIW estimated using Cox regression, restricted mean survival time (RMST), MaxCombo2 (MC2), and MaxCombo3 (MC3).
Discussion
In this study, we estimated the survival-inferred fragility of phase III oncology trial results by leveraging reconstruction techniques to assemble a uniquely comprehensive dataset of 184,752 individual patient outcomes from 230 trials. We found evidence of survival-inferred fragility in most trials, with a median SIFI count of 8 patients representing 1.4% of the study enrollment. Our results provide novel, quantitative insights into the problems of using statistical significance to interpret the results of phase III oncology trials as superior or not superior. These findings suggest that alternative approaches to trial interpretation, which do not rely solely on statistical significance, are urgently needed to reliably identify robust treatment effects.
Severe survival-inferred fragility was related to the initial P value as a continuous parameter, but much less related to the statistical significance interpretation. In other words, both “positive” and “negative” trials appear susceptible to fragility, and a small series of changes to individual outcomes may readily reverse the trial interpretation in either direction. This finding reinforces the established concept that P is a continuous parameter; the information provided by P values of 0.049 and 0.051 is essentially identical21. However, our empirical observation is quite concerning because it implies that the extent of both type I and type II errors is underestimated in phase III oncology trials, although our data suggest this underestimation may be decreasing over time22. While trials that obtained FDA approval did seem to exhibit less survival-inferred fragility on average, suggesting the importance of the downstream regulatory approval process, survival-inferred fragility still appeared fairly problematic even for this subset, as the median SIFI value was 3%.
Thus, alternative approaches to clinical trial interpretation beyond statistical significance are urgently needed. Ultimately, the regulatory process, as is perhaps implicit in our dataset, appears most strongly poised to effect such change23. The European Society for Medical Oncology and the American Society of Clinical Oncology have proposed that evaluations of treatment effect superiority should include assessments of effect size in addition to statistical significance24–26. Effect sizes describe the relative magnitude of benefit to patients and add important clinical meaning to the interpretation of survival data20. Because P values do not convey effect sizes (even though they are often perceived in this way), treatment effects with very low P values (e.g., P < 0.001) can be associated with only marginal effects7,18. Quantifying the probability of achieving clinically meaningful effect sizes in phase III oncology trials is a highly valuable strategy for trial interpretation and complements statistical significance22,27. The ongoing challenges of trial interpretation include generalizability, transportability, and caveats of trial design; extrapolation of data to patient care must also account for the characteristics and needs of individual patients28,29.
In our study, different approaches to estimating SIFI reveal the strengths and weaknesses of the general concept of survival-inferred fragility. We found strong concordance between SIFI estimates obtained by flipping the best survivors, the worst (shortest) survivors, and the median survivors. However, survival-inferred fragility appears to be markedly influenced by the choice of the underlying survival model. RMST analyses, which compare the area under the survival curve, appeared less fragile when the longest-term survivors were flipped between arms, consistent with the expectation that late-occurring drops exert less influence on area-under-the-curve calculations. Conversely, MaxCombo models, which upweight late separation of the curves or late diminishing treatment effects, appeared more fragile when flipping the best survivors. As expected, we found the inverse when focusing SIFI on the earliest aspects of the survival curve in the SIFIW models. Thus, the underlying data from an individual trial and the underlying survival model can considerably influence the results of survival-inferred fragility estimations, and each should be considered as offering distinct information on the behavior of the underlying survival data.
Caution is warranted when interpreting the present study for several reasons. First, this study’s findings are subject to the limitations of reconstructed survival data30. Although trial-specific significance levels were used to mimic the conditions of the original studies, all reconstructed regressions were univariable. In contrast, most phase III oncology trials use multivariable regressions for the PEP because multivariable regressions have greater power and efficiency than univariable regressions31,32. Thus, the SIFI modeling approach may overestimate fragility for trials that were initially statistically significant, because lower P values, which reflect more information than higher P values, may be obtained when strongly prognostic covariates are included in the PEP regression model32. To reduce this risk of bias, our analysis excluded studies in which the HR based on reconstructed data differed from the original HR by more than 0.1 on the logarithmic scale. Like any single summary measure, SIFI has limitations, and we do not suggest that it can provide a stand-alone alternative to P for making decisions regarding treatment efficacy or robustness15. There is no consensus among researchers regarding the definitions of SIFI and severe survival-inferred fragility, so we used multiple approaches to examine SIFI. While altering treatment arms to estimate survival-inferred fragility may be more conservative than changing patient survival status, as with FI, SIFI is inherently an exercise in modeling and is limited by assumptions because, as is most often the case, patient outcomes are heavily influenced by treatment assignment. Thus, retrospective modeling that re-assigns patients to different arms, even with their same outcomes, does not truly model the counterfactual.
Nonetheless, despite these important limitations, we do propose that SIFI is a relatively intuitive metric for shedding light on the instability of clinical trial results, which we argue deserves considerably more attention among oncologists. Moreover, although studying SIFI is ultimately in silico, one of the key strengths of this study is that actual patient outcomes were used to estimate fragility as compared to a pure-simulation study. Thus, our findings here have direct implications for phase III oncology trials, which frequently influence standard of care clinical practices.
Building on a growing literature conveying the limitations of statistical significance criteria, this study provides new, quantitative, and easily understandable insights into survival-inferred fragility present in many late-phase trials. To improve the reliability of the evidence generated in oncology trials, alternative strategies for interpreting clinical trials beyond statistical significance are urgently needed.
Methods
Inclusion criteria and patient-level data reconstruction
We performed a meta-epidemiological analysis of phase III oncology trials identified from ClinicalTrials.gov in February 2020 with no date limitations. This report conforms to the modified Preferred Reporting Items for Systematic Reviews and Meta-Analyses for meta-epidemiological studies33. Only 2-arm superiority trials were included in this study. PEPs that were not published, were not time-to-event, or lacked a Kaplan–Meier plot with a number-at-risk table were excluded. Screening with these criteria yielded 337 trials evaluable for the reconstruction of individual patients’ survival data. Methods of manual survival-data reconstruction using WebPlotDigitizer (Austin, TX) have been reported previously22,30. The reconstruction quality was defined as the absolute value of the natural logarithm of HRrecon/HRreported, where HRrecon was the point estimate of the HR for the reconstructed data determined by Cox regression, and HRreported was the point estimate of the HR reported in the trial publication. Reconstructions in which this value was greater than 0.1 were excluded. Trials with proportional hazards violations in the PEP were also excluded, because the Cox proportional hazards model, used for the SIFI estimation, is unreliable in this setting34. After application of these criteria, 230 trials were eligible for inclusion in the study and further analysis, as reported previously22.
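The reconstruction-quality screen described above reduces to a simple log-scale comparison of hazard ratios. A minimal sketch (the function name `acceptable_reconstruction` is our own, not from the published code):

```python
import math

def acceptable_reconstruction(hr_recon, hr_reported, tol=0.1):
    """Keep a reconstruction only when |ln(HR_recon / HR_reported)| <= tol,
    i.e., the reconstructed and reported hazard ratios agree to within
    0.1 on the logarithmic scale."""
    return abs(math.log(hr_recon / hr_reported)) <= tol
```

For example, a reconstructed HR of 0.75 against a reported HR of 0.80 passes (|ln 0.9375| ≈ 0.065), whereas 0.60 against 0.80 fails (|ln 0.75| ≈ 0.288).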
Covariate definitions
Trial-level features were recorded in a standardized database. Enrollment was defined as the number of patients in the PEP analysis. Surrogate endpoints were defined as disease-related endpoints attempting to represent overall survival, consistent with other publications35. United States FDA approvals were evaluated as reported previously36. Institutional review board approval was not needed because the data were publicly available.
Estimation of survival-inferred fragility index
The SIFI of each trial was estimated using R v4.4.2 (Vienna, Austria) with previously published methods, and the code is included in the Supplement15. SIFI values were estimated by counting the number of patients who needed to be iteratively flipped between treatment arms to change the original statistical significance interpretation of the PEP Cox proportional hazards regression (i.e., from positive to negative or vice versa). Statistical significance was defined uniquely for each trial as the threshold set by the trial. Flipping the best (longest) survivors from the experimental arm to the control arm to calculate SIFIB was used for the main analysis. Sensitivity analyses were performed by flipping the worst (shortest) survivors from the control arm to the experimental arm (SIFIW) and flipping the median survivors from the control arm to the experimental arm (SIFIM). After the SIFI count was calculated, the index was normalized as a percentage of the total number of participants evaluated in the PEP analysis. Based on prior work, we defined severe fragility as a SIFI less than or equal to 1% of the study enrollment15.
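The normalization and severe-fragility classification described above amount to one small computation. A minimal sketch (the function name `classify_fragility` is our own, not from the published R code):

```python
def classify_fragility(sifi_count, n_pep, threshold_pct=1.0):
    """Normalize a raw SIFI count to a percentage of the PEP analysis
    population and flag severe fragility (SIFI <= 1% of enrollment)."""
    pct = 100.0 * sifi_count / n_pep
    return pct, pct <= threshold_pct
```

For instance, a SIFI of 8 in a 571-patient PEP analysis is about 1.4% (not severe), whereas a SIFI of 4 in a 400-patient analysis is exactly 1.0% (severe under the ≤ 1% definition).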
To explore the effects of the survival model on SIFI, SIFI as conventionally estimated by Cox proportional hazards regression was compared to SIFI estimated by three alternative survival models: RMST, MaxCombo2, and MaxCombo3. RMST models estimate the difference in mean survival between treatment arms by integrating the area under the survival curve up to the truncation time τ, defined here as the earlier of the last observed events from either arm34. MaxCombo2 combines the standard log-rank statistic with a Fleming–Harrington weighted log-rank statistic that emphasizes late separation of the curves, and MaxCombo3 adds a further weighting to MaxCombo2 to account for diminishing treatment effects37.
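The RMST for one arm is the area under its Kaplan–Meier curve up to τ. A minimal self-contained sketch (the function name `km_rmst` is our own; it assumes the standard convention that deaths precede censorings at tied times):

```python
from collections import Counter

def km_rmst(times, events, tau):
    """Restricted mean survival time: area under the Kaplan-Meier curve
    up to the truncation time tau. events: 1 = death, 0 = censored."""
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)             # deaths and censorings leave the risk set
    surv, area, prev, at_risk = 1.0, 0.0, 0.0, len(times)
    for t in sorted(leaving):
        if t > tau:
            break
        area += surv * (t - prev)        # the KM curve is flat on [prev, t)
        prev = t
        if t in deaths:
            surv *= 1.0 - deaths[t] / at_risk
        at_risk -= leaving[t]
    return area + surv * (tau - prev)    # remaining flat segment up to tau
```

The between-arm treatment effect is then the difference `km_rmst(exp_times, exp_events, tau) - km_rmst(ctrl_times, ctrl_events, tau)`, evaluated at a common τ for both arms.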
Statistical analysis
Continuous variables were summarized by medians and IQRs. Correlations between the SIFI and the initial P value were estimated using Spearman’s rank correlation coefficient. The chi-square test was used to test associations between categorical variables. The associations between trial-level features and the SIFI were evaluated using the Mann–Whitney U test. Binary logistic regressions tested the associations between trial-level features and severe fragility to estimate odds ratios (ORs). Two-sided P values with 95% CIs were calculated using SAS v9.4 (Cary, NC), and significance was defined as P < 0.05. Plots were created using Prism v10 (La Jolla, CA).
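Spearman’s rank correlation, used above for the SIFI-versus-P analyses, is simply the Pearson correlation of the two variables’ ranks. A minimal sketch with mid-rank handling for ties (the function name `spearman_rho` is our own; the study used SAS):

```python
def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation computed on mid-ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1                        # extend over a run of tied values
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2 + 1  # tied values share the mid-rank
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Perfectly monotone increasing data yield ρ = 1 and monotone decreasing data yield ρ = −1, regardless of the underlying scale, which is why the rank-based statistic suits skewed quantities such as SIFI and −log10(P).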
Acknowledgements
We thank Laura Russell, scientific editor at The University of Texas MD Anderson Cancer Center’s Research Medical Library, for editing the manuscript. This work was supported in part by the National Institutes of Health/National Cancer Institute through MD Anderson’s Cancer Center Support Grant P30CA016672. P.M. and E.L. are recipients of the Andrew Sabin Family Foundation Fellowship.
Author contributions
Conception and design: Alexander Sherry, Yufei Liu, Pavlos Msaouel, Timothy Lin, Jonathan Ofer, David Bomze, Zachary McCaw, Tomer Meirson, Ethan Ludmir. Collection and assembly of data: Alexander Sherry, Timothy Lin, Alex Koong, Christine Lin, Joseph Abi Jaoude, Roshal Patel, Ramez Kouzy, Molly El-Alam, Avital Miller, Ethan Ludmir. Financial support: Ethan Ludmir. Software and code: Alexander Sherry, Yufei Liu, Timothy Lin, Mohannad Owiwi, Jonathan Ofer, David Bomze, Tomer Meirson, Ethan Ludmir. Data analysis and interpretation: all authors. Manuscript writing: ADS wrote the first draft. All authors revised the paper for critical intellectual content. Final approval of manuscript: all authors. Accountable for all aspects of work: all authors.
Data availability
Deidentified reconstructed data for individual patients are available at the online repository Figshare and accessible via: https://figshare.com/articles/dataset/Reconstructed_survival_data_from_Phase_3_oncology_trials/26103268.
Code availability
Code for this study is included in the supplement.
Competing interests
A.S. reports honoraria from Sermo. P.M. reports honoraria for scientific advisory board membership for Mirati Therapeutics, Bristol-Myers Squibb, and Exelixis; consulting fees from Axiom Healthcare; non-branded educational programs supported by DAVA Oncology, Exelixis, and Pfizer; leadership or fiduciary roles as a Medical Steering Committee Member for the Kidney Cancer Association and as a Kidney Cancer Scientific Advisory Board Member for KCCure; and research funding from Regeneron Pharmaceuticals, Takeda, Bristol-Myers Squibb, Mirati Therapeutics, and Gateway for Cancer Research (all unrelated to this manuscript’s content). Z.M. reports employment at Insitro (unrelated to this manuscript’s content). T.M. reports consulting fees from Purple Biotech. No other authors report any conflicts of interest.
Supplementary information
The online version contains supplementary material available at https://doi.org/10.1038/s41698-025-01024-2.
References
1. Lin, TA; Sherry, AD; Ludmir, EB. Challenges, complexities, and considerations in the design and interpretation of late-phase oncology trials. Semin Radiat. Oncol.; 2023; 33, pp. 429-437. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37684072][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10917127]
2. Concato, J; Hartigan, JA. P values: from suggestion to superstition. J. Investig. Med.; 2016; 64, pp. 1166-1171. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27489256][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5099183]
3. Wasserstein, RL; Lazar, NA. The ASA statement on p-values: context, process, and purpose. Am. Stat.; 2016; 70, pp. 129-133.
4. Amrhein, V; Greenland, S; McShane, B. Scientists rise up against statistical significance. Nature; 2019; 567, pp. 305-307. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30894741]
5. Benjamin, DJ et al. Redefine statistical significance. Nat. Hum. Behav.; 2018; 2, pp. 6-10. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30980045]
6. Goodman, SN. Toward evidence-based medical statistics. 1: the P value fallacy. Ann. Intern. Med.; 1999; 130, pp. 995-1004. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/10383371]
7. Greenland, S et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur. J. Epidemiol.; 2016; 31, pp. 337-350. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27209009][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877414]
8. Greenland, S. Divergence versus decision P-values: A distinction worth making in theory and keeping in practice: or, how divergence P-values measure evidence even when decision P-values do not. Scand. J. Stat.; 2023; 50, pp. 54-88.
9. Greenland, S. Invited Commentary: The need for cognitive science in methodology. Am. J. Epidemiol.; 2017; 186, pp. 639-645. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28938712]
10. Boos, DD; Stefanski, LA. P-value precision and reproducibility. Am. Stat.; 2011; 65, pp. 213-221. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22690019]
11. Walsh, M et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a fragility index. J. Clin. Epidemiol.; 2014; 67, pp. 622-628. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24508144]
12. Johnson, K. W., Rappaport, E., Shameer, K., Glicksberg, B. S. & Dudley, J. T. fragilityindex: an R package for statistical fragility estimates in biomedicine. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/562264v1 (2019).
13. Del Paggio, JC; Tannock, IF. The fragility of phase 3 trials supporting FDA-approved anticancer medicines: a retrospective analysis. Lancet Oncol.; 2019; 20, pp. 1065-1069. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31296490]
14. Bomze, D; Meirson, T. A critique of the fragility index. Lancet Oncol.; 2019; 20, e551. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31578993]
15. Bomze, D et al. Survival-inferred fragility index of phase 3 clinical trials evaluating immune checkpoint inhibitors. JAMA Netw. Open; 2020; 3, e2017675. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33095247][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7584930]
16. Olsen, HE et al. Statistical fragility of findings from randomized phase 3 trials in pediatric oncology. Cancer Med.; 2024; 13, e70356. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39676273][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11646933]
17. Naudet, F et al. Data sharing and reanalysis of randomized controlled trials in leading biomedical journals with a full data sharing policy: survey of studies published in The BMJ and PLOS Medicine. BMJ; 2018; 360, k400. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29440066][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5809812]
18. Msaouel, P; Lee, J; Thall, PF. Interpreting randomized controlled trials. Cancers; 2023; 15, 4674. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37835368][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10571666]
19. McCaw, Z. R., Kim, D. H. & Wei, L. J. Pitfall in the design and analysis of comparative oncology trials with a time-to-event endpoint and recommendations. JNCI Cancer Spectr.; 2022; 6, https://doi.org/10.1093/jncics/pkac007.
20. McCaw, ZR et al. Choosing clinically interpretable summary measures and robust analytic procedures for quantifying the treatment difference in comparative clinical studies. Stat. Med.; 2021; 40, pp. 6235-6242. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34783094][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8687139]
21. van Zwet, E et al. A new look at P values for randomized clinical trials. NEJM Evid.; 2024; 3, EVIDoa2300003. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38320512]
22. Sherry, AD et al. Towards treatment effect interpretability: a Bayesian re-analysis of 194,129 patient outcomes across 230 oncology trials. Preprint at medRxiv; 2024; https://doi.org/10.1101/2024.07.23.24310891.
23. Wasserstein, RL; Schirm, AL; Lazar, NA. Moving to a world beyond “p < 0.05”. Am. Stat.; 2019; 73, pp. 1-19.
24. Del Paggio, JC et al. Delivery of meaningful cancer care: a retrospective cohort study assessing cost and benefit with the ASCO and ESMO frameworks. Lancet Oncol.; 2017; 18, pp. 887-894. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28583794]
25. Cherny, NI et al. ESMO-Magnitude of Clinical Benefit Scale version 1.1. Ann. Oncol.; 2017; 28, pp. 2340-2366. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28945867]
26. Ellis, LM et al. American Society of Clinical Oncology perspective: raising the bar for clinical trials by defining clinically meaningful outcomes. J. Clin. Oncol.; 2014; 32, pp. 1277-1280. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24638016]
27. Sherry, AD et al. Evidenced-based prior for estimating the treatment effect of phase III randomized trials in oncology. JCO Precis. Oncol.; 2024; 8, e2400363. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39348660]
28. Msaouel, P; Lee, J; Karam, JA; Thall, PF. A causal framework for making individualized treatment decisions in oncology. Cancers; 2022; 14, 3923. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36010916][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9406391]
29. Msaouel, P; Lee, J; Thall, PF. Making patient-specific treatment decisions using prognostic variables and utilities of clinical outcomes. Cancers; 2021; 13, 2741. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34205968][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8198909]
30. Guyot, P; Ades, AE; Ouwens, MJ; Welton, NJ. Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Med. Res. Methodol.; 2012; 12, 9. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22297116][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3313891]
31. Sherry, AD et al. Increasing power in phase III oncology trials with multivariable regression: an empirical assessment of 535 primary end point analyses. JCO Clin. Cancer Inf.; 2024; 8, e2400102.
32. Senn, S. Seven myths of randomisation in clinical trials. Stat. Med.; 2013; 32, pp. 1439-1450. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23255195]
33. Murad, MH; Wang, Z. Guidelines for reporting meta-epidemiological methodology research. Evid. Based Med.; 2017; 22, 139. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28701372][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5537553]
34. Lin, TA et al. Proportional hazards violations in phase III cancer clinical trials: a potential source of trial misinterpretation. Clin. Cancer Res.; 2024; 30, pp. 4791-4799. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39133081][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11479825]
35. Chen, EY; Haslam, A; Prasad, V. FDA acceptance of surrogate end points for cancer drug approval: 1992-2019. JAMA Intern. Med.; 2020; 180, pp. 912-914. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32338703][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7186918]
36. Abi Jaoude, J et al. Food and Drug Administration approvals in phase 3 cancer clinical trials. BMC Cancer; 2021; 21, 695. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34118915][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8196526]
37. Mukhopadhyay, P et al. Log-rank test vs MaxCombo and difference in restricted mean survival time tests for comparing survival under nonproportional hazards in immuno-oncology trials: a systematic review and meta-analysis. JAMA Oncol.; 2022; 8, pp. 1294-1300. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35862037][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9305601]
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
In phase III oncology trials, superiority is defined by statistical significance using P thresholds. However, this approach has been criticized because P is continuous. Here, we reconstruct patient-level data for 230 phase III oncology trials to model the robustness of statistical significance by estimating the survival-inferred fragility index (SIFI), defined as the smallest number of patients changing arms that alters the statistical significance interpretation. The median SIFI was 8 patients (IQR 4–19), representing 1.4% of enrollments (IQR 0.7%–3%). As a continuous statistic, P—but not the significance interpretation—was correlated with SIFI. Moreover, overall survival endpoints were more fragile than surrogate endpoints. Taken together, while phase III oncology trials are intended to robustly inform patient care, shifting the assignment of a few patients is often sufficient to upend the statistical significance interpretation. This vulnerability underscores the need for more robust strategies to identify superiority in oncology.
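The SIFI computation described in the abstract can be sketched in code. The following is a simplified illustration, not the authors' published algorithm: it greedily reassigns the longest-surviving experimental-arm patient to the control arm, recomputing a two-sample log-rank test after each move, until the P value crosses the significance threshold. The `logrank_p` helper, the greedy move rule, and the synthetic trial data are all assumptions made for this sketch.

```python
import numpy as np
from scipy.stats import chi2

def logrank_p(time, event, group):
    """Two-sided P value of a two-sample log-rank test (group coded 0/1)."""
    time, event, group = (np.asarray(a) for a in (time, event, group))
    o1 = e1 = v = 0.0
    for t in np.unique(time[event == 1]):      # each distinct event time
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        dying = (time == t) & (event == 1)
        d, d1 = dying.sum(), (dying & (group == 1)).sum()
        o1 += d1                               # observed events, arm 1
        e1 += d * n1 / n                       # expected events under H0
        if n > 1:                              # hypergeometric variance
            v += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return chi2.sf((o1 - e1) ** 2 / v, df=1)

def sifi(time, event, group, alpha=0.05):
    """Greedy SIFI sketch: move the longest-surviving experimental-arm
    patient to control until P >= alpha; return the number of moves."""
    time = np.asarray(time)
    g = np.asarray(group).copy()
    moves = 0
    while logrank_p(time, event, g) < alpha:
        exp_arm = np.flatnonzero(g == 1)
        if exp_arm.size == 0:
            return None                        # significance never lost
        g[exp_arm[np.argmax(time[exp_arm])]] = 0
        moves += 1
    return moves

# Hypothetical trial: 80 patients per arm, hazard ratio 0.5, no censoring
rng = np.random.default_rng(0)
time = np.concatenate([rng.exponential(1.0, 80), rng.exponential(2.0, 80)])
event = np.ones(160, dtype=int)
group = np.repeat([0, 1], 80)
print(logrank_p(time, event, group), sifi(time, event, group))
```

With a clearly positive synthetic trial like this one, the sketch returns a small number of reassignments, mirroring the paper's point that a handful of patients switching arms can overturn statistical significance.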
Details
1 The University of Texas MD Anderson Cancer Center, Division of Radiation Oncology, Department of Radiation Oncology, Houston, USA (GRID:grid.240145.6) (ISNI:0000 0001 2291 4776); Mayo Clinic, Department of Radiation Oncology, Rochester, USA (GRID:grid.66875.3a) (ISNI:0000 0004 0459 167X)
2 City of Hope National Medical Center, Department of Radiation Oncology, Duarte, USA (GRID:grid.410425.6) (ISNI:0000 0004 0421 8357)
3 The University of Texas MD Anderson Cancer Center, Division of Cancer Medicine, Department of Genitourinary Medical Oncology, Houston, USA (GRID:grid.240145.6) (ISNI:0000 0001 2291 4776); The University of Texas MD Anderson Cancer Center, Department of Translational Molecular Pathology, Houston, USA (GRID:grid.240145.6) (ISNI:0000 0001 2291 4776)
4 The University of Texas MD Anderson Cancer Center, Division of Radiation Oncology, Department of Radiation Oncology, Houston, USA (GRID:grid.240145.6) (ISNI:0000 0001 2291 4776)
5 Stanford University, Department of Radiation Oncology, Stanford, USA (GRID:grid.168010.e) (ISNI:0000 0004 1936 8956)
6 Memorial Sloan-Kettering Cancer Center, Department of Radiation Oncology, New York, USA (GRID:grid.51462.34) (ISNI:0000 0001 2171 9952)
7 Jerusalem Mental Health Center, Eitanim Psychiatric Hospital, Jerusalem, Israel (GRID:grid.416889.a) (ISNI:0000 0004 0559 7707)
8 Rabin Medical Center-Beilinson Hospital, Davidoff Cancer Center, Petah Tikva, Israel (GRID:grid.413156.4) (ISNI:0000 0004 0575 344X)
9 Tel Aviv University, Faculty of Medicine, Tel Aviv, Israel (GRID:grid.12136.37) (ISNI:0000 0004 1937 0546)
10 Insitro, South San Francisco, USA (GRID:grid.12136.37) (ISNI:0000 0005 0975 8106); University of North Carolina at Chapel Hill, Department of Biostatistics, Chapel Hill, USA (GRID:grid.10698.36) (ISNI:0000 0001 2248 3208)
11 The University of Texas MD Anderson Cancer Center, Division of Radiation Oncology, Department of Gastrointestinal Radiation Oncology, Houston, USA (GRID:grid.240145.6) (ISNI:0000 0001 2291 4776); The University of Texas MD Anderson Cancer Center, Department of Biostatistics, Houston, USA (GRID:grid.240145.6) (ISNI:0000 0001 2291 4776)