Abstract
We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decision trees (GBDT), and important predictors were identified using SHAP values, a feature attribution method based on Shapley values. Cox models controlled for false discovery rate were used for confounder adjustment, interpretability, and further validation. The pipeline was tested using information from 502,506 UK Biobank participants, aged 37–73 years at recruitment and followed over seven years for mortality registrations. Of the 11,639 predictors included in GBDT, 193 potential risk factors had SHAP values ≥ 0.05, passed the correlation test, and were selected for further modelling. Of the summed variable importance, 60% was directly health related, while baseline characteristics, sociodemographics, and lifestyle factors each contributed about 10%. Cox models adjusted for baseline characteristics showed evidence for an association with mortality for 166 of the 193 predictors. These were mostly well-known risk factors (e.g., age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, and many disease outcomes). For 19 predictors we saw evidence for an association in the unadjusted but not the adjusted analyses, suggesting bias by confounding. Our GBDT-SHAP pipeline was able to identify relevant predictors ‘hidden’ within thousands of variables, providing an efficient and pragmatic solution for the first stage of hypothesis-free risk factor identification.
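The abstract does not say which procedure was used to control the false discovery rate across the Cox models. As a minimal sketch, assuming the common Benjamini-Hochberg step-up procedure, the following shows how a set of per-predictor p-values could be screened at a chosen FDR level. The function name `benjamini_hochberg`, the level `q = 0.05`, and the example p-values are illustrative assumptions, not values taken from the study.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Return one flag per p-value: True if rejected at FDR level q.

    Assumption: Benjamini-Hochberg is used here for illustration only;
    the source does not name its FDR procedure.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Step-up rule: reject every hypothesis ranked at or below k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five candidate predictors (not study data).
example_p = [0.001, 0.035, 0.025, 0.20, 0.012]
flags = benjamini_hochberg(example_p, q=0.05)
```

Because the rule is step-up, a p-value that misses its own rank threshold is still rejected when a larger p-value meets a later threshold; for example, `benjamini_hochberg([0.03, 0.04])` flags both entries.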
Details
1 UniSA Clinical and Health Sciences, University of South Australia, Australian Centre for Precision Health, Adelaide, Australia (GRID:grid.1026.5) (ISNI:0000 0000 8994 5086); UniSA STEM, University of South Australia, Computational Learning Systems Laboratory, Mawson Lakes, Australia (GRID:grid.1026.5) (ISNI:0000 0000 8994 5086)
2 UniSA Clinical and Health Sciences, University of South Australia, Australian Centre for Precision Health, Adelaide, Australia (GRID:grid.1026.5) (ISNI:0000 0000 8994 5086); South Australian Health and Medical Research Institute (SAHMRI) Level 8, Adelaide, Australia (GRID:grid.430453.5) (ISNI:0000 0004 0565 2606)
3 UniSA STEM, University of South Australia, Computational Learning Systems Laboratory, Mawson Lakes, Australia (GRID:grid.1026.5) (ISNI:0000 0000 8994 5086)