© The Author(s) 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Background

Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier’s performance.

Methods

We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data.
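To make the idea of a sliced Wasserstein discrepancy concrete, the sketch below estimates the generic sliced Wasserstein distance between two equally sized samples, such as a complete dataset and its imputed counterpart. It is an illustration under stated assumptions, not the authors' scoring procedure: the function names, the number of random projections, the choice p = 2, and the toy data are all hypothetical, and only the Python standard library is used.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere (normalised Gaussian)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]


def wasserstein_1d_pow(a, b, p=2):
    """p-th power of the 1D Wasserstein-p distance for equal-size samples.

    In one dimension the optimal coupling pairs the sorted values.
    """
    return sum(abs(x - y) ** p for x, y in zip(sorted(a), sorted(b))) / len(a)


def sliced_wasserstein(X, Y, n_projections=200, p=2, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein-p distance.

    X and Y are lists of rows with the same row count and feature count.
    """
    if len(X) != len(Y) or not X:
        raise ValueError("X and Y must be non-empty with the same number of rows")
    dim = len(X[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_projections):
        # Project both samples onto one random direction and compare in 1D.
        theta = random_unit_vector(dim, rng)
        proj_x = [sum(t * v for t, v in zip(theta, row)) for row in X]
        proj_y = [sum(t * v for t, v in zip(theta, row)) for row in Y]
        total += wasserstein_1d_pow(proj_x, proj_y, p)
    return (total / n_projections) ** (1.0 / p)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical toy data: 300 samples with 2 features.
    truth = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
    # Hide the first feature in every other row, then mean-impute it.
    observed = [row[0] for i, row in enumerate(truth) if i % 2]
    mean0 = sum(observed) / len(observed)
    imputed = [[mean0, r[1]] if i % 2 == 0 else list(r) for i, r in enumerate(truth)]
    print("score vs. mean-imputed data:", sliced_wasserstein(truth, imputed))
    print("score vs. itself:", sliced_wasserstein(truth, truth))
```

A score of this kind compares whole distributions rather than individual imputed values, which is why it can flag an imputation that has a low pointwise error but distorts the shape of the data.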

Results

The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised.

Conclusions

It is imperative to consider the quality of the imputation when performing downstream classification as the effects on the classifier can be considerable.

Plain language summary

Many artificial intelligence (AI) methods aim to classify samples of data into groups, e.g., patients with disease vs. those without. This often requires datasets to be complete, i.e., that all data has been collected for all samples. However, in clinical practice this is often not the case and some data can be missing. One solution is to ‘complete’ the dataset using a technique called imputation to replace those missing values. However, assessing how well the imputation method performs is challenging. In this work, we demonstrate why people should care about imputation, develop a new method for assessing imputation quality, and demonstrate that if we build AI models on poorly imputed data, the model can give different results to those we would hope for. Our findings may improve the utility and quality of AI models in the clinic.

Details

Title
The impact of imputation quality on machine learning classifiers for datasets with missing values
Author
Shadbahr, Tolou 1; Roberts, Michael 2; Stanczuk, Jan 3; Gilbey, Julian 3; Teare, Philip 4; Dittmer, Sören 5; Thorpe, Matthew 6; Torné, Ramon Viñas 7; Sala, Evis 8; Lió, Pietro 6; Patel, Mishal 9; Preller, Jacobus 10; Selby, Ian 8; Breger, Anna 11; Weir-McCall, Jonathan R. 12; Gkrania-Klotsas, Effrossyni 10; Korhonen, Anna 13; Jefferson, Emily 14; Langs, Georg 15; Yang, Guang 16; Prosch, Helmut 15; Babar, Judith 10; Escudero Sánchez, Lorena 8; Wassin, Marcel 17; Holzer, Markus 17; Walton, Nicholas 18; Rudd, James H. F. 19; Mirtti, Tuomas 20; Rannikko, Antti Sakari 21; Aston, John A. D. 22; Tang, Jing 1; Schönlieb, Carola-Bibiane 3

1  University of Helsinki, Research Program in Systems Oncology, Faculty of Medicine, Helsinki, Finland (GRID:grid.7737.4) (ISNI:0000 0004 0410 2071)
2  University of Cambridge, Department of Applied Mathematics and Theoretical Physics, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934); Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK (GRID:grid.417815.e) (ISNI:0000 0004 5929 4381)
3  University of Cambridge, Department of Applied Mathematics and Theoretical Physics, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934)
4  Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK (GRID:grid.417815.e) (ISNI:0000 0004 5929 4381)
5  University of Cambridge, Department of Applied Mathematics and Theoretical Physics, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934); ZeTeM, University of Bremen, Bremen, Germany (GRID:grid.7704.4) (ISNI:0000 0001 2297 4381)
6  University of Manchester, Department of Mathematics, Manchester, UK (GRID:grid.5379.8) (ISNI:0000 0001 2166 2407)
7  University of Cambridge, Department of Computer Science and Technology, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934)
8  University of Cambridge, Department of Radiology, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934)
9  Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK (GRID:grid.417815.e) (ISNI:0000 0004 5929 4381); Clinical Pharmacology & Safety Sciences, AstraZeneca, Cambridge, UK (GRID:grid.417815.e) (ISNI:0000 0004 5929 4381)
10  Cambridge University Hospitals NHS Trust, Addenbrooke’s Hospital, Cambridge, UK (GRID:grid.24029.3d) (ISNI:0000 0004 0383 8386) 
11  University of Cambridge, Department of Applied Mathematics and Theoretical Physics, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934); University of Vienna, Faculty of Mathematics, Vienna, Austria (GRID:grid.10420.37) (ISNI:0000 0001 2286 1424) 
12  University of Cambridge, Department of Radiology, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934); Royal Papworth Hospital, Cambridge, Royal Papworth Hospital NHS Foundation Trust, Cambridge, UK (GRID:grid.412939.4) (ISNI:0000 0004 0383 5994) 
13  University of Cambridge, Language Technology Laboratory, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934) 
14  University of Dundee, Population Health and Genomics, School of Medicine, Dundee, UK (GRID:grid.8241.f) (ISNI:0000 0004 0397 2876) 
15  Computational Imaging Research Lab Medical University of Vienna, Department of Biomedical Imaging and Image-guided Therapy, Vienna, Austria (GRID:grid.22937.3d) (ISNI:0000 0000 9259 8492) 
16  Imperial College London, National Heart and Lung Institute, London, UK (GRID:grid.7445.2) (ISNI:0000 0001 2113 8111) 
17  contextflow GmbH, Vienna, Austria (GRID:grid.5335.0) 
18  University of Cambridge, Institute of Astronomy, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934) 
19  University of Cambridge, Department of Medicine, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934) 
20  University of Helsinki, Research Program in Systems Oncology, Faculty of Medicine, Helsinki, Finland (GRID:grid.7737.4) (ISNI:0000 0004 0410 2071); University of Helsinki and Helsinki University Hospital, Department of Pathology, Helsinki, Finland (GRID:grid.7737.4) (ISNI:0000 0004 0410 2071); iCAN-Digital Precision Cancer Medicine Flagship, Helsinki, Finland (GRID:grid.7737.4) 
21  University of Helsinki, Research Program in Systems Oncology, Faculty of Medicine, Helsinki, Finland (GRID:grid.7737.4) (ISNI:0000 0004 0410 2071); iCAN-Digital Precision Cancer Medicine Flagship, Helsinki, Finland (GRID:grid.7737.4); University of Helsinki and Helsinki University Hospital, Department of Urology, Helsinki, Finland (GRID:grid.7737.4) (ISNI:0000 0004 0410 2071) 
22  University of Cambridge, Department of Pure Mathematics and Mathematical Statistics, Cambridge, UK (GRID:grid.5335.0) (ISNI:0000 0001 2188 5934) 
Pages
139
Publication year
2023
Publication date
Dec 2023
Publisher
Springer Nature B.V.
e-ISSN
2730-664X
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2873641501