Capability and accuracy of usual statistical analyses in a real-world setting using a federated approach

Introduction

In the classical approach to clinical trials, data from healthcare providers are transferred to and stored in a centralized environment for statistical analysis (Fig 1). Because of legal, ethical or informed consent restrictions that protect patient privacy, individual-level data cannot always be shared easily and promptly between organizations in a central environment [1]. With increasing regulation of the sharing of sensitive data, such as the General Data Protection Regulation (GDPR) in Europe [2] and similar regulations in other regions [3], healthcare organizations tend to retain ownership of their data when collaborating, keeping control over access to the data and over its value.

[Figure omitted. See PDF.]

Data sharing and collaboration have been significantly improved with new technologies, and the Federated Analysis (FA) platform is one of the solutions to decentralize data [4]. FA enables the generation of statistical analyses without data transfer agreements between healthcare organizations. Individual-level data remains under the control of the provider. Only computational queries and corresponding aggregated results are transferred in an FA environment, which is respectful of data privacy and property of each data source, in accordance with regulations and informed consent requirements (Fig 1).

The field of FA is new and developing rapidly: research is abundant, and a growing number of ready-to-use patterns and tools are available. In the landscape of existing FA tools, by its maturity and the strength of its community, data aggregation through anonymous summary-statistics from harmonized individual-level databases (DataSHIELD) [5] is recognized as a potential solution for privacy-preserving analysis.

This project aims to provide learnings and insights for the future use of DataSHIELD by assessing the capability and accuracy of common statistical analyses performed with it in a real-world setting, compared with the classical centralized approach. The practical use of DataSHIELD within a privacy-preserving federated environment allows this evaluation, and comparing the results generated through DataSHIELD with those generated through the centralized approach supports a preliminary review of their accuracy. The objective was to determine whether an FA approach using DataSHIELD can maintain the level of results of a classical centralized analysis in a real-world setting.

Material

DataSHIELD is a software framework for secure bioscience collaboration that enables statistical analysis of individual-level data from multiple healthcare providers without transfer of data. DataSHIELD takes the analysis to the data by sending analysis requests from a central machine to multiple data-holding machines storing the harmonized data to be co-analysed simultaneously. The individual-level data remains within each healthcare provider and only non-disclosive summary data is shared with the data analyst.

DataSHIELD is an R-based programming tool that provides functions to assist the statistical analyses from data preparation to analysis [6]:

* Data preparation functions: creation or modification of variables needed for the analysis (coercing, variable manipulation or creation).

* Administrative functions: support and information setting functions.

* Data analysis functions: creation of statistical output and generalized linear modeling (data structure queries, summary statistics, matrices, tables, survival analysis, distribution generating, modeling).

* Data presentation functions: creation of graphical output.

In case the base functionality package does not cover specific data derivation or analysis needs at the time of the analysis, DataSHIELD offers the flexibility to define custom functions or to use and update existing custom functions provided by the DataSHIELD community [7]. However, community packages are supplied without warranty, and it is the responsibility of the user to ensure that sufficient data disclosure controls are implemented by the function.

Project data and ethics statement

This research was based on data from a French real-world retrospective study collected from patients’ electronic medical records. The raw data came from a non-interventional study describing the epidemiology and the therapeutic management of 315 patients treated for early breast cancer at 57 sites in France. Data extraction started in May 2019 and ended in September 2019. Under French regulations, this study belonged to the category of healthcare research involving secondary use and analysis of data. This retrospective study complied with reference methodology 004 (MR-004) of the French “Commission nationale de l’informatique et des libertés–CNIL” (National Commission on Informatics and Liberty). The cohort study was approved by the Institutional Review Board/Independent Ethics Committee on May 03, 2018, prior to study set-up. All patients received written information before any trial-related activities were carried out. In addition, the analyses of this research were performed on an anonymised synthetic version of the data; no author had access to information that could identify individual participants during or after data collection.

Project infrastructure set-up

The raw database was composed of seven different datasets: demographic characteristics, disease diagnosis, surgery, pathological complete response (pCR), adjuvant treatments, patient follow-up and information on patients lost to follow-up.

Based on this synthetic raw database, a virtual federated network of three healthcare organizations was generated by randomly splitting the data into three distinct local databases. Each database represented a healthcare organization of a different size (N1 = 157, N2 = 94 and N3 = 64 patients, respectively). This splitting process ensured that each database contained a unique subset of the patients while keeping the same data structure. These three databases were then stored on a data platform integrating DataSHIELD functionality (Fig 2). This environment is a practical implementation of an FA setup in a multi-organization context while preserving data privacy (no individual data transfer and storage).
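The random split described above can be illustrated with a minimal sketch. It is written in plain Python rather than in the R code used for the project, and the patient identifiers, the seed and the function name are hypothetical; only the three site sizes come from the text.

```python
import random


def split_patients(patient_ids, sizes, seed=0):
    """Randomly partition patient_ids into disjoint groups of the given sizes."""
    patient_ids = list(patient_ids)
    if sum(sizes) != len(patient_ids):
        raise ValueError("sizes must add up to the number of patients")
    random.Random(seed).shuffle(patient_ids)  # reproducible shuffle
    groups, start = [], 0
    for size in sizes:
        groups.append(patient_ids[start:start + size])
        start += size
    return groups


# 315 hypothetical patient identifiers, split into the site sizes from the text
sites = split_patients(range(1, 316), sizes=[157, 94, 64], seed=2022)
```

Each patient ends up in exactly one virtual organization, which corresponds to a horizontal partition: same variables everywhere, different patients per site.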

[Figure omitted. See PDF.]

In order to assess the existing capabilities of DataSHIELD, the programming intent of this project was to use built-in distributed R functions from community packages, available at the start of the project in 2022, in preference to writing custom functions. The privacy guardrails, including disclosure control filters [8], were set at the default level in each local data repository server. These filters are key in the DataSHIELD infrastructure to protect privacy and ensure that sensitive data is not disclosed during statistical analysis. R version 4.2.2 and DataSHIELD version 6.2 were used.

Statistical methods

Eleven derivations were performed on both the centralized and federated data to facilitate further statistical outputs and modeling. The following data transformations were performed:

Class factors:

* Age group in years (<40, [40–49], [50–59], [60–69], ≥70).

* BMI group in kg/m² (<25, [25–29], ≥30).

* T class and N class.

Duration:

* Follow-up duration (in years).

* Time from diagnosis to surgery (in months).

* Time from diagnosis to progression (in years).

* Age at adjuvant treatment initiation of Herceptin (in years).

* Treatment duration (in months).

* Time from surgery to adjuvant treatment initiation of Herceptin (in days).

* Time to event/censor for survival analysis (in days).

Binary outcomes (multilevel conditions):

* pCR results (Yes, No).

* Indicator of event/censored for survival analysis.
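As an illustration of the kinds of derivation listed above, the following sketch bins age and BMI into the stated classes and computes a duration in days between two dates. It is plain Python, not the DataSHIELD code used in the project; the function names are hypothetical, and treating a BMI between 29 and 30 as part of the [25–29] class is an assumption.

```python
from datetime import date


def age_group(age_years):
    """Map an age in years to the age classes listed above."""
    if age_years < 40:
        return "<40"
    if age_years < 50:
        return "[40-49]"
    if age_years < 60:
        return "[50-59]"
    if age_years < 70:
        return "[60-69]"
    return ">=70"


def bmi_group(bmi):
    """Map a BMI in kg/m2 to the BMI classes listed above.

    Assumption: values in [29, 30) are assigned to the [25-29] class.
    """
    if bmi < 25:
        return "<25"
    if bmi < 30:
        return "[25-29]"
    return ">=30"


def days_between(start, end):
    """Duration in days between two dates (e.g. surgery to treatment start)."""
    return (end - start).days


# Hypothetical patient values
example = (age_group(47), bmi_group(31.2),
           days_between(date(2019, 5, 1), date(2019, 5, 31)))
```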

We conducted the following types of analysis:

* Descriptive statistics with mean, standard deviation (SD), median, quantiles for continuous variables and proportions for categorical variables.

* Survival analysis [9] to visualize and identify predictive factors for progression-free survival (PFS).

* Logistic regression [10] to identify predictive factors for pCR.

* Correlation matrix to assess associations between variables.

The analyses were first generated in the centralized environment using R programming, and then reproduced in the privacy preserving federated environment using R-based DataSHIELD programming (Fig 2).

DataSHIELD was evaluated by assessing the capability and accuracy of the reproduced statistical analyses compared to those derived from the centralized approach. The reproducibility was assessed by evaluating which analyses could be produced using DataSHIELD. The accuracy of the results was assessed by comparison with those obtained using the centralized approach. The following evaluations were performed:

* Descriptive statistics included evaluation of difference in proportion for categorical outcome and difference in mean/SD/median for continuous outcome. The selected rules for determining similarity were defined as an absolute difference of 5% or less.

* Survival analysis included checking the survival probabilities from the K-M method, hazard ratio (HR) and global effect p-value from the Cox proportional hazards model.

* Logistic analysis included checking the odds ratio (OR) and global effect p-value from logistic regression.

* In the correlation matrix, the p-value obtained by Chi-square/Fisher’s exact test or ANOVA global effect p-value was checked.

Results

Data transformation

Derivation of class variables from existing numeric variables (such as BMI or age groups) and derivation of duration between two existing date variables (such as age at diagnosis) were successfully performed in the FA environment, using dssDeriveColumn [11], ds.Boole and ds.assign functions.

Survival analysis involves selecting the appropriate event and censoring dates to perform the analysis. The reference start date was defined as the adjuvant initiation start date, the event date was the date of progression or death, and the censoring date for patients without event was the last on-treatment adjuvant date. Due to data disclosure restrictions in DataSHIELD, the selection of event/censor date was considered as sensitive information for patient privacy, and therefore the use of the built-in DataSHIELD function for such derivation was not possible. In order to perform the required derivation, the solution was to develop a specific custom function to allow data selection on the server side. The same situation arose for more complex multi-level derivations, where custom functions were also created.

In addition, subgroup flags had to be created for the selection of patients having received a specific medication. These selections were possible in FA, using the ds.dataFrameSubset function. However, data disclosure restrictions limited the reporting of subgroups with fewer than three patients.

The use of a combination of DataSHIELD base functions and analysis-specific custom functions allowed the derivation of all variables needed to meet the analysis requirements (see functions in Appendix 1). This outcome shows the capability and flexibility of DataSHIELD to manipulate and transform data in an FA environment while maintaining data privacy.

Descriptive statistics

Most of the analyses were descriptive, covering a total of 73 variables, 25 continuous and 48 categorical. In FA, descriptive statistics were calculated simultaneously and in parallel within each local database, and the summary statistics were then returned to the federated analyst. For categorical parameters, the sums of all available observations in each source database (and the corresponding proportions) were calculated (dh.getStats [12] and ds.table functions). For continuous variables, the statistics were weighted by the number of patients with a non-missing value from each source database (ds.meanSdGp function), allowing proportional contribution of each database to the aggregated result (see aggregation methods in Appendix 2). All of these analyses were performed using the built-in DataSHIELD functions.
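To make the weighting concrete, the sketch below pools per-site counts, means and standard deviations into an overall mean and SD. It is a plain Python illustration with hypothetical values and function names, not the aggregation code of Appendix 2; it relies on the standard decomposition of the total sum of squares into within-site and between-site parts, which is why the pooled values coincide with those computed on the combined data.

```python
import math
import statistics


def site_summary(values):
    """What a site could return: count, mean and sample SD of non-missing values."""
    observed = [v for v in values if v is not None]
    return len(observed), statistics.mean(observed), statistics.stdev(observed)


def pool_mean_sd(summaries):
    """Combine (n, mean, sd) triples into the overall n, mean and sample SD."""
    total_n = sum(n for n, _, _ in summaries)
    pooled_mean = sum(n * m for n, m, _ in summaries) / total_n
    within = sum((n - 1) * sd ** 2 for n, _, sd in summaries)
    between = sum(n * (m - pooled_mean) ** 2 for n, m, _ in summaries)
    pooled_sd = math.sqrt((within + between) / (total_n - 1))
    return total_n, pooled_mean, pooled_sd


# Hypothetical ages in three sites; None marks a missing value
site_values = [
    [52.0, 61.0, None, 47.0, 70.0],
    [58.0, 49.0, 66.0],
    [44.0, 73.0, 55.0, 62.0],
]
n, mean, sd = pool_mean_sd([site_summary(v) for v in site_values])
```

Only the three summary triples cross the site boundary; no individual value is needed for the pooled result.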

Table 1 shows the main characteristics of the data obtained from the centralized and federated approaches.

[Figure omitted. See PDF.]

For categorical variables, the results (number of observations and percentages) were exactly the same between centralized and FA. However, the privacy filter threshold (n<3 observations within a category in at least one database) was reached for 28 out of 48 categorical variables (58%). In this situation, the DataSHIELD disclosure filter restricted the analysis and aggregated results were not returned to the analyst. As a workaround, the aggregated results were obtained by tripling the source datasets to reach the threshold cell count of three and then dividing the results by three to obtain the exact count.
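The arithmetic behind the triplication workaround can be sketched as follows. This is a plain Python toy model, not DataSHIELD code: the filter function, the category labels and the counts are hypothetical, and the assumption that only nonzero counts below the threshold are blocked is ours.

```python
THRESHOLD = 3  # minimum cell count, as described in the text


def tabulate(records):
    """Count observations per category."""
    counts = {}
    for category in records:
        counts[category] = counts.get(category, 0) + 1
    return counts


def filtered_table(counts, threshold=THRESHOLD):
    """Return the counts, or None if any nonzero cell is below the threshold."""
    if any(0 < c < threshold for c in counts.values()):
        return None
    return dict(counts)


# Hypothetical site data with one sparse category
site_records = ["N0"] * 40 + ["N1"] * 12 + ["N2"] * 1

blocked = filtered_table(tabulate(site_records))      # sparse cell: no result
tripled = filtered_table(tabulate(site_records * 3))  # every nonzero cell >= 3
recovered = {category: count // 3 for category, count in tripled.items()}
```

Because every count in the tripled data is an exact multiple of three, the integer division recovers the original counts without rounding.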

For continuous variables, the mean and SD were identical for the centralized and FA results. However, we observed five out of 25 continuous variables with an absolute difference of more than 5% for the median, as these statistics depend on the number of observations within each source database (even count of data points: median is the average of the two middle values, odd count of data points: median is the value of the (n+1)/2 observation).
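A small numeric example shows why a median, unlike a mean, cannot in general be recovered exactly from site-level summaries. The sketch below is plain Python with hypothetical data, and it assumes that the federated median is obtained as a size-weighted mean of the site medians; the aggregation actually used is described in Appendix 2 and may differ.

```python
import statistics

# Hypothetical values held by two sites
site_a = [1, 2, 3, 4, 100]
site_b = [5, 6, 7]

# Centralized median: computed on the combined data (even count, so the
# average of the two middle values)
pooled_median = statistics.median(site_a + site_b)


def weighted_site_median(sites):
    """Size-weighted mean of the per-site medians (assumed aggregation)."""
    total = sum(len(s) for s in sites)
    return sum(len(s) * statistics.median(s) for s in sites) / total


federated_median = weighted_site_median([site_a, site_b])
relative_gap = abs(federated_median - pooled_median) / pooled_median
```

In this toy case the centralized median is 4.5 and the aggregated value is 4.125, a relative gap of about 8%, even though every site-level median is exact.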

As visualized in Table 2, more than half of the descriptive categorical variables (58%) would not have been generated in the federated report without the triplication workaround.

[Figure omitted. See PDF.]

Note that the minimum and maximum values are not part of the aggregated result due to data disclosure restrictions in DataSHIELD. Only one continuous variable out of 48 was not summarized in the FA because the variable was completely missing in one of the source databases. There was no workaround for this finding due to privacy restrictions in the DataSHIELD settings at the time of analysis.

The built-in DataSHIELD base functions were sufficient to reproduce similar descriptive statistics in the FA environment; no additional custom function was required.

Survival analysis

Survival K-M estimates (survival probabilities at fixed yearly time points) were obtained within each database using the ds.survfit function from the dsSurvival package [13]. The site-level estimates were then aggregated locally, in R, as a mean weighted by the number of patients included in each database. Four-year survival probabilities were similar, with a maximum difference of 0.3% (see Table 3).
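The aggregation step can be sketched as follows: a Kaplan-Meier survival probability is computed per site at a fixed horizon, and the site-level probabilities are combined as a patient-weighted mean. This is a plain Python illustration with hypothetical follow-up data and function names, not the dsSurvival or project code.

```python
def km_survival(times, events, horizon):
    """Kaplan-Meier survival probability at `horizon`.

    times: follow-up time per patient; events: 1 = event, 0 = censored.
    """
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1 and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        failed = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - failed / at_risk
    return survival


def weighted_mean(values, weights):
    """Mean of `values` weighted by `weights` (here, patients per site)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical (time in years, event flag) data for two sites
site_data = [
    ([1.0, 2.0, 3.5, 5.0, 6.0], [1, 0, 1, 0, 0]),
    ([0.5, 4.5, 5.5], [0, 0, 1]),
]
site_probs = [km_survival(t, e, horizon=4.0) for t, e in site_data]
site_sizes = [len(t) for t, _ in site_data]
aggregated = weighted_mean(site_probs, site_sizes)
```

A weighted mean of site-level estimates is not the same quantity as a Kaplan-Meier estimate on the combined data, which is one reason small differences from the centralized result can appear.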

[Figure omitted. See PDF.]

The variability of the results between databases due to heterogeneity in the number of patients/events may affect the robustness of the aggregated results. The availability of the data and the survival trend within each database must be carefully assessed in advance.

Depending on the privacy settings of the FA, the K-M curve can be considered as a potentially disclosive output, as it can provide the exact time of occurrence of an event for a patient. At the time of the analysis, there was no built-in function in DataSHIELD to produce an aggregated non-disclosive Kaplan-Meier curve. A custom function was written to generate the exact K-M curve from the survival estimates of each source database. Since 2023, the dsSurvival2.0 package has introduced an enhanced version of the dsSurvival package that provides new privacy enhancing survival curves using locally estimated scatterplot smoothing (LOESS) method [14]. However, this package can only provide one curve per source database, not a global curve.

Univariate survival modeling was performed using the dsSurvival package [13]. Cox models were fitted using the ds.coxph.SLMA function, which includes the aggregation of the HR estimates (and standard errors) using the metafor package [15]. The p-values associated with the HRs were not calculated by this package; we therefore adapted the function to obtain the global effect p-value per variable for each database. These global effect p-values were then aggregated using Fisher’s sum of logs method (sumlog function from the metap package) [16].
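Fisher’s sum of logs method combines k independent p-values through the statistic X = −2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below is a plain Python rendering of that calculation with hypothetical p-values; it is not the metap implementation, and it uses the closed-form upper tail available for even degrees of freedom.

```python
import math


def fisher_sumlog(p_values):
    """Combine independent p-values with Fisher's sum of logs method."""
    if not p_values or any(not (0 < p <= 1) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    # Upper tail of a chi-square with 2k df:
    # exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    term, tail = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        tail += term
    return math.exp(-half_stat) * tail


# Hypothetical global effect p-values from three source databases
combined = fisher_sumlog([0.04, 0.20, 0.60])
```

With a single database the function returns the p-value unchanged. With several, one database with a large p-value can pull the combined value above a selection threshold, which illustrates how the choice of aggregation method can influence significance.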

The univariate results are shown in Table 4. The largest difference in HR was observed for N classification (N2&N3 category: centralized HR = 0.59 [0.13–2.60] vs federated HR = 1.37 [0.28–6.69]). The significance at 15% was not maintained for five up to nine variables. This difference was attributed to the heterogeneity of the source databases (the variability in database 1 was higher than in database 3, which affected the aggregated results), and to the aggregation method used to calculate the p-value (sum of logs Fisher’s method).

[Figure omitted. See PDF.]

Using a 15% threshold for the variable selection, we obtained the same final multivariate model as in the centralized analysis (Table 5). However, due to the heterogeneity of the data availability within each database (222 patients (25 events) in total with 107 (12), 70 (8) and 45 (5) per database respectively), the aggregated HR estimates were not robust enough.

[Figure omitted. See PDF.]

Logistic analysis

Univariate logistic modeling was performed using the dsBaseClient package [17] in the FA environment. Logistic models were run using the ds.glmSLMA function, which includes the aggregation of the OR estimates (and standard errors) with the metafor package [15]. As for the HR estimates, a custom function was used to obtain the associated global effect p-value for each database by variable. The global effect p-values were then also aggregated using Fisher’s sum of logs method (sumlog function from the metap package) [16]. The univariate results are shown in Table 6.
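Study-level meta-analysis of this kind combines the site-level coefficients on the log scale. The sketch below shows a fixed-effect inverse-variance combination of log odds ratios in plain Python. The site estimates are hypothetical, and the fixed-effect model is an assumption made for brevity: metafor also supports random-effects models, and the text does not state which one was used.

```python
import math


def pool_log_estimates(estimates, std_errors):
    """Fixed-effect inverse-variance pooling of site-level log estimates."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se


def ratio_with_ci(log_estimate, se, z=1.959964):
    """Back-transform a log estimate to a ratio with a 95% Wald interval."""
    return (math.exp(log_estimate),
            math.exp(log_estimate - z * se),
            math.exp(log_estimate + z * se))


# Hypothetical log odds ratios and standard errors from three sites
site_log_or = [0.10, -0.30, 0.25]
site_se = [0.25, 0.40, 0.50]

pooled, pooled_se = pool_log_estimates(site_log_or, site_se)
or_estimate, ci_low, ci_high = ratio_with_ci(pooled, pooled_se)
```

Sites with larger standard errors, typically the smaller ones, receive less weight, so the pooled estimate need not match the estimate from a model fitted on the combined data.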

[Figure omitted. See PDF.]

The largest difference observed for OR was for N classification (N1 category: centralized OR = 1.09 [0.66–1.81] vs federated OR = 0.86 [0.31–2.36]). Significance at 15% was not maintained in the FA for 2 to 8 variables compared to the centralized analysis. This difference was attributed to the heterogeneity of the source databases (more variability in database 1 than in database 3) and to the aggregation method used for the p-value (Fisher’s sum of logs method). No additional variable was selected for the multivariate model at the 15% threshold.

Correlation matrix

Relationships between two categorical variables were assessed using either the Chi-square test or Fisher’s exact test. These tests are based on the aggregated counts, therefore no DataSHIELD function was required, and the results obtained in FA were exactly equivalent to the centralized results.
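Because the test statistic depends only on cell counts, summing the site-level contingency tables gives the same table as the centralized data, and therefore the same test result. The sketch below illustrates this for a 2×2 table in plain Python with hypothetical counts; it computes Pearson’s chi-square without continuity correction, which is an assumption, since the text does not specify the test settings.

```python
import math


def add_tables(tables):
    """Cell-wise sum of contingency tables with identical layout."""
    rows, cols = len(tables[0]), len(tables[0][0])
    return [[sum(t[r][c] for t in tables) for c in range(cols)] for r in range(rows)]


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 df)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(statistic / 2))  # chi-square upper tail, 1 df
    return statistic, p_value


# Hypothetical aggregated counts returned by three sites
site_tables = [
    [[30, 20], [25, 35]],
    [[18, 12], [14, 20]],
    [[10, 9], [8, 12]],
]
pooled_table = add_tables(site_tables)
statistic, p_value = chi_square_2x2(pooled_table)
```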

Relationships between continuous and categorical variables were assessed using the global effect p-value from ANOVA. This p-value was not available in the built-in DataSHIELD function (ds.glmSLMA). To address this, we customized the function to retrieve the global effect p-value for each database. These global effect p-values were then aggregated using the sum of logs Fisher’s method [16]. The same level of relationship as in the centralized environment was observed at the 5% threshold (Table 7).

[Figure omitted. See PDF.]

Limitations

This proof-of-concept of a federated approach was implemented using only a single synthetic real-world dataset representing a non-interventional study, divided into three virtual healthcare organizations of different sizes in terms of number of patients. This setting was an ideal federated environment: a limited number of sources and data already harmonized. This proof-of-concept should be generalized to a broader federated real-world scenario, representative of practical applications, where data privacy, availability and harmonization issues should be anticipated and mitigated. The availability of the centralized results was also helpful in carrying out the FA programming, which will not be the case in a real-world setting. It would be interesting to evaluate how FA performs under different scenarios in terms of data volumes, data types, data quality levels and privacy-preserving settings.

Federated learning using more complex models (deep learning, neural networks, …) was also outside the scope of our research, as this was not originally planned. These approaches could be investigated using the same methodology of this research.

Discussion

Our research showed a practical implementation of an FA environment in a proof-of-concept configuration and its use. The following key insights can be shared for future implementation of such an approach:

* The implementation of a federated environment in real-world setting needs to be anticipated and discussions with the participating healthcare organizations need to be initiated in advance in order to fit the approach within the existing infrastructure and privacy requirements.

* The level of data disclosure prevention to be set in the FA should be agreed with all participants in accordance with technical and local regulatory requirements, to establish the right balance between privacy and accuracy (e.g., for small cohorts, a trade-off between privacy and feasibility needs to be made).

* Establish a governance framework to control the access to the FA environment, setting appropriate access for each data owner and analyst while maintaining the privacy of the individual data.

* Establish a governance framework for the creation of custom DataSHIELD functions (data server and client machines) to ensure the security of the function and the non-disclosure of sensitive information. All custom functions must be tested and validated by all data custodians prior to local deployment in the FA environment.

* Participating healthcare organizations may use different data structures. To avoid statistical inaccuracies, data standardization should be performed in advance to harmonize the selected data between the FA network organizations (without data transfer). Data quality measurement can also be performed in a federated architecture using non-disclosive data quality measurement functions.

* The variability of the data between individual source databases may affect the accuracy of the FA results. A data review to assess the quality and disposition of each database should be performed in order to minimize the potential statistical bias and understand the data. For organizations with a small number of patients, this means either relaxing FA restriction rules or pooling data before performing the FA analysis (i.e. a hybrid centralized-federated solution).

* It is recommended to prepare a federated statistical analysis plan before performing the analyses. The document should describe the source data (number of databases, number of patients per database and type of variable), the analysis to be performed and the aggregation methods to be used to ensure the feasibility and accuracy of the analysis. An evaluation of the standard/custom functions to be used according to the data structure and DataSHIELD capabilities at the time of the analysis must also be performed.

* As for a centralized approach, variable derivations to support further analysis requirements should be performed prior to using any of the analytical functions of DataSHIELD.

Conclusions

The DataSHIELD open-source solution was used to generate a federated statistical report from three local source databases with different sample sizes but with identical structure (variable and format). In our proof-of-concept situation, the data was partitioned horizontally, with the same data structure across all source databases but on different patients. Common data transformations and statistical analyses were reproduced using DataSHIELD programming and compared with the centralized report generated from the raw database. Both analyses were performed using R-based programming.

All analyses were successfully reproduced using built-in or custom DataSHIELD functions. The FA results were identical to the raw results in terms of descriptive statistics, except for some differences in positional measures (quartiles). The FA estimates of the univariate models were aligned with those from the centralized results, but a loss of accuracy was observed for the multivariate model due to source database variability. The accuracy of the FA results is related to the statistical aggregation method used and the number of data points within each source database. Centers with fewer patients will result in more federated privacy restrictions and a greater loss of accuracy than in a centralized setting.

The capability and accuracy of common data manipulation and statistical analysis were satisfactory with DataSHIELD. The flexibility of the tool (ability to develop new functions) allows a variety of analyses to be carried out while maintaining the privacy of individual sensitive data; no blocking points were identified. The DataSHIELD forum was a practical source of information and support. In order to find the right balance between privacy and accuracy of the analysis, the privacy requirements should be established before starting the analysis. The FA approach is a good alternative when a centralized approach is not feasible due to data access and/or data sharing issues.

This approach is suitable for real-world research using multiple data sources (either at site level or at national cohort level), as long as the data are harmonized beforehand. For prospective studies, it may be preferable to use an electronic Case Report Form (eCRF) to ensure harmonization of data collection, while still using the FA approach for statistical analysis. Finally, the FA approach still requires the necessary research approvals, depending on the regulations in the countries involved in the research.

Supporting information

S1 Table. R functions used for federated and centralized programming.

https://doi.org/10.1371/journal.pone.0312697.s001

(TIF)

S2 Table. Aggregation methods used for federated analysis.

https://doi.org/10.1371/journal.pone.0312697.s002

(TIF)

S1 File. Statistical report centralized.

https://doi.org/10.1371/journal.pone.0312697.s003

(DOCX)

S2 File. Statistical report federated.

https://doi.org/10.1371/journal.pone.0312697.s004

(DOCX)

S3 File. Rmd code data import federated.

https://doi.org/10.1371/journal.pone.0312697.s005

(TXT)

S4 File. Rmd code data management federated.

https://doi.org/10.1371/journal.pone.0312697.s006

(TXT)

S5 File. Rmd code descriptive stat federated.

https://doi.org/10.1371/journal.pone.0312697.s007

(TXT)

S6 File. Rmd code correlations federated.

https://doi.org/10.1371/journal.pone.0312697.s008

(TXT)

S7 File. Rmd code survival federated.

https://doi.org/10.1371/journal.pone.0312697.s009

(TXT)

S8 File. Rmd code logistic federated.

https://doi.org/10.1371/journal.pone.0312697.s010

(TXT)

S9 File. R code dataset derivation centralized.

https://doi.org/10.1371/journal.pone.0312697.s011

(TXT)

S10 File. R code generate analysis centralized.

https://doi.org/10.1371/journal.pone.0312697.s012

(TXT)

References

1. Rosenbaum L. Bridging the Data-Sharing Divide—Seeing the Devil in the Details, Not the Other Camp. N Engl J Med. 2017 Jun 8;376(23):2201–2203. Epub 2017 Apr 26. pmid:28445080.

2. E. Parliament, C. of European Union, Regulation (eu) 2016/679 of the european parliament and council (2016). https://eur-lex.europa.eu/eli/reg/2016/679/oj.

3. Edemekong, Peter F.; Annamaraju, Pavan; Haydel, Micelle J. (2023). Health Insurance Portability and Accountability Act, StatPearls, Treasure Island (FL): StatPearls Publishing.

4. Templ M., Sariyar M. A systematic overview on methods to protect sensitive data provided for various analyses. Int. J. Inf. Secur. 21, 1233–1246 (2022). Available from:

5. Gaye A, Marcon Y, Isaeva J, LaFlamme P, Turner A, Jones EM, et al. DataSHIELD: taking the analysis to the data, not the data to the analysis. Int J Epidemiol. 2014 Dec;43(6):1929–44. Epub 2014 Sep 26. pmid:25261970

6. DataSHIELD list of functions. https://data2knowledge.atlassian.net/wiki/spaces/DSDEV/overview.

7. DataSHIELD community packages. https://www.datashield.org/help/community-packages.

8. DataSHIELD disclosure controls. https://data2knowledge.atlassian.net/wiki/spaces/DSDEV/pages/714768398/Disclosure+control.

9. Cox DR (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society. Series B (Methodological) 34 (2), 187–220.

10. Cox DR. The regression analysis of binary sequences. Journal of the Royal Statistical Society, Series B. 1958;20:215–242.

11. Dragan, I., Sparsø, T., Kuznetsov, D., Slieker, R. & Ibberson, M. dsSwissKnife: An R package for federated data analysis. https://doi.org/10.1101/2020.11.17.386813 (2020).

12. ds-Helper package. https://github.com/lifecycle-project/ds-helper.

13. Banerjee S, Sofack GN, Papakonstantinou T, Avraam D, Burton P, Zöller D, et al. dsSurvival: Privacy preserving survival models for federated individual patient meta-analysis in DataSHIELD. BMC Res Notes. 2022 Jun 3;15(1):197. pmid:35659747

14. Banerjee S, Bishop TRP. dsSurvival 2.0: privacy enhancing survival curves for survival models in the federated DataSHIELD analysis system. BMC Res Notes. 2023 Jun 6;16(1):98. pmid:37280717

15. Schwarzer G. R. (2007). An R package for meta‐analysis. R News, 7, 40–45. Available from: http://www.metafor-project.org/doku.php.

16. Dewey M (2023). metap: Meta-Analysis of Significance Values. R package version 1.9. https://CRAN.R-project.org/package=metap.

17. Developers D (2023). dsBaseClient: DataSHIELD Client Functions. R package version 6.3.0.

Citation: Jégou R, Bachot C, Monteil C, Boernert E, Chmiel J, Boucher M, et al. (2024) Capability and accuracy of usual statistical analyses in a real-world setting using a federated approach. PLoS ONE 19(11): e0312697. https://doi.org/10.1371/journal.pone.0312697

About the Authors:

Romain Jégou

Roles: Formal analysis, Methodology, Writing – original draft, Writing – review & editing

E-mail: [email protected]

Affiliation: Keyrus Life Science, Nantes, France

ORCID: https://orcid.org/0009-0002-8047-1357

Camille Bachot

Roles: Conceptualization, Methodology, Writing – review & editing

Affiliation: Roche Medical Data Center, Boulogne-Billancourt, France

Charles Monteil

Roles: Conceptualization, Methodology, Writing – review & editing

Affiliation: Roche Informatics, Boulogne-Billancourt, France

Eric Boernert

Roles: Writing – review & editing

Affiliation: Roche Federated Open Science Solution, Basel, Switzerland

Jacek Chmiel

Roles: Conceptualization, Methodology, Writing – review & editing

Affiliation: Avenga, Warsaw, Poland

Mathieu Boucher

Roles: Formal analysis, Writing – original draft, Writing – review & editing

Affiliation: Keyrus Life Science, Nantes, France

David Pau

Roles: Conceptualization, Methodology, Writing – review & editing

Affiliation: Roche Medical Data Center, Boulogne-Billancourt, France

[/RAW_REF_TEXT]

References

1. Rosenbaum L. Bridging the Data-Sharing Divide—Seeing the Devil in the Details, Not the Other Camp. N Engl J Med. 2017 Jun 8;376(23):2201–2203. Epub 2017 Apr 26. pmid:28445080.

2. E. Parliament, C. of European Union, Regulation (eu) 2016/679 of the european parliament and council (2016). https://eur-lex.europa.eu/eli/reg/2016/679/oj.

3. Edemekong, Peter F.; Annamaraju, Pavan; Haydel, Micelle J. (2023). Health Insurance Portability and Accountability Act, StatPearls, Treasure Island (FL): StatPearls Publishing.

4. Templ M., Sariyar M. A systematic overview on methods to protect sensitive data provided for various analyses. Int. J. Inf. Secur. 21, 1233–1246 (2022). Available from:

5. Gaye A, Marcon Y, Isaeva J, LaFlamme P, Turner A, Jones EM, et al. DataSHIELD: taking the analysis to the data, not the data to the analysis. Int J Epidemiol. 2014 Dec;43(6):1929–44. Epub 2014 Sep 26. pmid:25261970

6. DataSHIELD list of functions. https://data2knowledge.atlassian.net/wiki/spaces/DSDEV/overview.

7. DataSHIELD community packages. https://www.datashield.org/help/community-packages.

8. DataSHIELD disclosure controls. https://data2knowledge.atlassian.net/wiki/spaces/DSDEV/pages/714768398/Disclosure+control.

9. Cox DR (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society. Series B (Methodological) 34 (2), 187–220.

10. Cox DR. The regression analysis of binary sequences. Journal of the Royal Statistical Society, Series B. 1958;20:215–242.

11. Dragan, I., Sparsø, T., Kuznetsov, D., Slieker, R. & Ibberson, M. dsSwissKnife: An R package for federated data analysis. https://doi.org/10.1101/2020.11.17.386813 (2020).

12. ds-Helper package. https://github.com/lifecycle-project/ds-helper.

13. Banerjee S, Sofack GN, Papakonstantinou T, Avraam D, Burton P, Zöller D, et al. dsSurvival: Privacy preserving survival models for federated individual patient meta-analysis in DataSHIELD. BMC Res Notes. 2022 Jun 3;15(1):197. pmid:35659747

14. Banerjee S, Bishop TRP. dsSurvival 2.0: privacy enhancing survival curves for survival models in the federated DataSHIELD analysis system. BMC Res Notes. 2023 Jun 6;16(1):98. pmid:37280717

15. Schwarzer G. R. (2007). An R package for meta‐analysis. R News, 7, 40–45. Available from: http://www.metafor-project.org/doku.php.

16. Dewey M (2023). metap: Meta-Analysis of Significance Values. R package version 1.9. https://CRAN.R-project.org/package=metap.

17. DataSHIELD Developers (2023). dsBaseClient: DataSHIELD Client Functions. R package version 6.3.0.

© 2024 Jégou et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Methods

The objective of this project was to determine whether a federated analysis approach using DataSHIELD can maintain the level of results of a classical centralized analysis in a real-world setting. This research was carried out on an anonymous synthetic longitudinal real-world oncology cohort randomly split into three local databases, mimicking three healthcare organizations, stored in a federated data platform integrating DataSHIELD. No individual-level data were transferred: statistics were calculated in parallel within each healthcare organization, and only summary statistics (aggregates) were returned to the federated data analyst.

Descriptive statistics, survival analysis, regression models and correlation were first performed on the centralized approach and then reproduced on the federated approach. The results were then compared between the two approaches.
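The aggregate-only exchange described above can be illustrated with a minimal Python sketch. It is not DataSHIELD code and does not reproduce the authors' implementation: the `SiteAggregate` class, the helper names and the toy values are all hypothetical. Each simulated site returns only its count, sum and sum of squares, and the analyst combines these aggregates into a pooled mean and standard deviation.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteAggregate:
    """Summary statistics a site is allowed to return (hypothetical structure)."""
    n: int
    total: float
    total_sq: float


def summarize_site(values):
    """Runs inside a site: reduce individual values to three aggregates."""
    return SiteAggregate(
        n=len(values),
        total=sum(values),
        total_sq=sum(v * v for v in values),
    )


def pooled_mean_sd(aggregates):
    """Runs on the analyst side: combine aggregates, never raw values."""
    n = sum(a.n for a in aggregates)
    total = sum(a.total for a in aggregates)
    total_sq = sum(a.total_sq for a in aggregates)
    mean = total / n
    # Sample variance from the pooled sum of squares.
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(max(variance, 0.0))


# Toy individual-level values held by three simulated sites (invented data).
site_a = [61.0, 58.5, 70.2, 66.1]
site_b = [55.3, 72.8, 64.0]
site_c = [68.4, 59.9, 63.7, 71.5, 60.2]

aggregates = [summarize_site(s) for s in (site_a, site_b, site_c)]
fed_mean, fed_sd = pooled_mean_sd(aggregates)

# Centralized reference: all values pooled in one place.
central = site_a + site_b + site_c
print(fed_mean, statistics.mean(central))
print(fed_sd, statistics.stdev(central))
```

Under these assumptions the pooled mean and standard deviation agree with the centralized values up to floating-point rounding, because both are functions of sums that can be added across sites.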

Results

The cohort was split into three samples (N1 = 157 patients, N2 = 94 and N3 = 64); 11 derived variables and four types of analyses were generated. All analyses were successfully reproduced using DataSHIELD, except for one descriptive variable that was blocked by a data disclosure limitation in the federated environment, which shows the good capability of DataSHIELD. For descriptive statistics, the federated and centralized approaches gave exactly equivalent results, apart from some differences in position measures. Estimates of univariate regression models were similar, whereas a loss of accuracy was observed for multivariate models due to source database variability.
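One plausible reason why position measures can differ is that quantiles, unlike sums, cannot be rebuilt exactly from per-site summaries. The Python sketch below illustrates this with a size-weighted average of site medians. That aggregation rule and the toy data are assumptions chosen for illustration, not a description of what DataSHIELD or the authors did.

```python
import statistics


def weighted_median_of_sites(sites):
    """Combine per-site medians using site sizes as weights (illustrative rule)."""
    n_total = sum(len(s) for s in sites)
    return sum(len(s) * statistics.median(s) for s in sites) / n_total


# Invented individual-level values for three simulated sites.
sites = [
    [1.0, 2.0, 3.0, 4.0, 100.0],
    [10.0, 11.0, 12.0],
    [5.0, 6.0, 7.0, 8.0],
]

# Federated approximation: only site medians and sizes are used.
federated_median = weighted_median_of_sites(sites)

# Centralized reference: median of all pooled values.
centralized_median = statistics.median([v for s in sites for v in s])

print(federated_median)    # about 6.17
print(centralized_median)  # 6.5
```

Here the site medians are 3.0, 11.0 and 6.5, so the weighted combination is 74/12, while the median of the pooled values is 6.5. The gap depends on how differently the values are distributed across sites.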

Conclusion

Our project showed a practical implementation and use case of a real-world federated approach using DataSHIELD. The capability and accuracy of common data manipulation and analysis were satisfactory, and the flexibility of the tool enabled the production of a variety of analyses while preserving the privacy of individual data. The DataSHIELD forum was also a practical source of information and support. To find the right balance between privacy and accuracy of the analysis, privacy requirements should be established before the start of the analysis, and a data quality review of the participating healthcare organizations should be carried out.

Details

Title: Capability and accuracy of usual statistical analyses in a real-world setting using a federated approach

Author: Jégou, Romain; Bachot, Camille; Monteil, Charles; Boernert, Eric; Chmiel, Jacek; Boucher, Mathieu; Pau, David

First page: e0312697

Section: Research Article

Publication year: 2024

Publication date: Nov 2024

Publisher: Public Library of Science

e-ISSN: 19326203

Source type: Scholarly Journal

Language of publication: English

DOI: https://doi.org/10.1371/journal.pone.0312697

ProQuest document ID: 3128584746
