Abstract

DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth’s penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 × 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Computation times vary depending on the ML algorithm from <15 min to 5 d using multi-core desktop PCs. Detailed scripts in the open-source R language are freely available on GitHub, targeting users with intermediate experience in bioinformatics and statistics and using R with Bioconductor extensions.

This work compares several ML and calibration algorithms for classifying tumor DNA methylation profiles. The resulting protocol provides workflows for selecting, training and calibrating ML algorithms to generate well-calibrated multiclass probability estimates.
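As an illustration of the Platt scaling step named in the abstract, the following is a minimal Python sketch, not the authors' R implementation. It fits a logistic curve to raw classifier scores for a single class against binary labels, using plain gradient descent on the log loss. The function names (`fit_platt`, `calibrate`), the learning rate, the iteration count and the toy scores are all assumptions made for illustration. The Firth-penalized variant and the multiclass ridge-penalized multinomial calibrator are not shown.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(scores, labels, a, b):
    # Mean negative log-likelihood of labels under p = sigmoid(a * score + b).
    eps = 1e-12
    total = 0.0
    for s, y in zip(scores, labels):
        p = min(max(sigmoid(a * s + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)


def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    # Hypothetical helper: fit slope a and intercept b by gradient descent.
    # In practice the scores should come from held-out data (for example,
    # inner cross-validation folds) so the calibrator is not fit on scores
    # the primary classifier has already seen during training.
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(scores, a, b):
    # Map raw scores to calibrated probability estimates.
    return [sigmoid(a * s + b) for s in scores]


# Toy, made-up held-out scores for one class versus the rest.
scores = [-2.1, -1.3, -0.8, -0.2, 0.1, 0.4, 0.9, 1.5, 2.2, 3.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

a, b = fit_platt(scores, labels)
probs = calibrate(scores, a, b)
print([round(p, 3) for p in probs])
```

The fitted curve is monotone in the raw score, so the ranking of samples is unchanged and only the probability scale is adjusted.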

Details

Title
Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data
Author
Maros, Máté E 1 ; Capper, David 2 ; Jones, David T W 3 ; Hovestadt, Volker 4 ; von Deimling, Andreas 5 ; Pfister, Stefan M 6 ; Benner, Axel 7 ; Zucknick, Manuela 8 ; Sill, Martin 9

1 Institute of Medical Biometry and Informatics (IMBI), University of Heidelberg, Heidelberg, Germany (GRID:grid.7700.0) (ISNI:0000 0001 2190 4373); University Medical Center, Medical Faculty Mannheim of Heidelberg University, Department of Neuroradiology, Mannheim, Germany (GRID:grid.7700.0)
2 German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), partner site Berlin, Berlin, Germany (GRID:grid.7497.d) (ISNI:0000 0004 0492 0584); Charité–Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Department of Neuropathology, Berlin, Germany (GRID:grid.7497.d)
3 Hopp Children’s Cancer Center Heidelberg (KiTZ), Heidelberg, Germany (GRID:grid.7497.d); Pediatric Glioma Research Group, German Cancer Consortium (DKTK) and German Cancer Research Center (DKFZ), Heidelberg, Germany (GRID:grid.7497.d) (ISNI:0000 0004 0492 0584)
4 Division of Molecular Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany (GRID:grid.7497.d) (ISNI:0000 0004 0492 0584); Department of Pathology and Center for Cancer Research, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA (GRID:grid.7497.d); Broad Institute of MIT and Harvard, Cambridge, MA, USA (GRID:grid.66859.34)
5 German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), partner site Berlin, Berlin, Germany (GRID:grid.7497.d) (ISNI:0000 0004 0492 0584); University Hospital Heidelberg, Department of Neuropathology, Heidelberg, Germany (GRID:grid.5253.1) (ISNI:0000 0001 0328 4908)
6 Hopp Children’s Cancer Center Heidelberg (KiTZ), Heidelberg, Germany (GRID:grid.5253.1); Division of Pediatric Neurooncology, German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany (GRID:grid.7497.d) (ISNI:0000 0004 0492 0584); Department of Pediatric Oncology, Hematology and Immunology, University Hospital Heidelberg, Heidelberg, Germany (GRID:grid.5253.1) (ISNI:0000 0001 0328 4908)
7 Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany (GRID:grid.7497.d) (ISNI:0000 0004 0492 0584)
8 Institute of Basic Medical Sciences, University of Oslo, Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, Oslo, Norway (GRID:grid.5510.1) (ISNI:0000 0004 1936 8921)
9 Hopp Children’s Cancer Center Heidelberg (KiTZ), Heidelberg, Germany (GRID:grid.5510.1); Division of Pediatric Neurooncology, German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany (GRID:grid.7497.d) (ISNI:0000 0004 0492 0584); Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany (GRID:grid.7497.d) (ISNI:0000 0004 0492 0584)
Pages
479-512
Publication year
2020
Publication date
Feb 2020
Publisher
Nature Publishing Group
ISSN
17542189
e-ISSN
17502799
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2350321634
Copyright
© The Author(s), under exclusive licence to Springer Nature Limited 2020