Data Availability: Restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. The data that support the findings of this study are available from Clinical Practice Research Datalink (CPRD) through a data request application process (https://cprd.com/data-access). Researchers can contact [email protected] for more information.

Funding: RKA was funded by a National Institute for Health Research School for Primary Care Research (NIHR SPCR) PhD Studentship award, supervised by NQ, FWA, and JK. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: SFW has received independent research grant funding from AMGEN. NQ and SFW have previously received honorarium from AMGEN. RKA currently holds an NIHR-SPCR funded studentship (2018-2021). SFW is currently an employee of GSK. FWA is supported by UCL Hospitals NIHR Biomedical Research Centre. The remaining authors have no competing interests.

Introduction

Stroke is a leading cause of death and disability globally with a substantial economic cost due to treatment and post-stroke care [1]. Patients at the time of incident stroke have varied clinical characteristics, demographics, and biochemical profiles. This heterogeneity affects cardiovascular morbidity and mortality outcomes [2]. Phenotyping (subgrouping) people after incident stroke in terms of their risk of various cardiovascular outcomes could help direct better care to individuals with the poorest prognosis. For example, intensive secondary prevention strategies, including the use of novel medications such as proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors and colchicine, could be targeted at patients at very high risk of adverse cardiovascular morbidity and mortality outcomes.

Cluster analysis, a hypothesis-free unsupervised machine learning data-driven approach, has been widely used to analyse clinical data to identify new phenotypic subgroups of complex and heterogeneous diseases including obstructive sleep apnoea [3], asthma [4,5], chronic obstructive pulmonary disease, chronic heart failure [6], dilated cardiomyopathy [7], sepsis [8], Parkinson’s disease [9], breast cancer [10], and diabetes [11]. This approach does not include outcome data, and may be less biased in its results, especially when using retrospectively collected data. Clustering of clinical data may, therefore, be helpful in identifying subgroups of patients with incident stroke and generating new hypotheses. Efforts to determine such phenotypic groups in patients with incident stroke remain limited.

Using a large population-based cohort of adult patients with incident stroke, the objectives of this study are: (i) to identify patterns in linked primary and secondary clinical data and cluster patients based on phenotypic similarities; (ii) to assess the association between phenotypic clusters and subsequent recurrent stroke or CVD-related mortality, recurrent stroke or all-cause mortality, coronary heart disease (CHD), recurrent stroke, peripheral vascular disease (PVD), heart failure, CVD-related mortality, and all-cause mortality.

Methods

Study design and data source

This prospective population-based cohort study used the UK Clinical Practice Research Datalink (CPRD) GOLD database of anonymised longitudinal primary care electronic health records [12], linked to secondary care hospitalisation data (Hospital Episode Statistics [HES]) [13], national mortality data (Office for National Statistics [ONS]) [14], and social deprivation data (Index of Multiple Deprivation (IMD) 2015) [15]. Patients included in the CPRD GOLD database, from a network of general practices across the UK, are representative of the UK general population in terms of sex, age, and ethnicity [12].

Study population

We identified a cohort of patients with incident non-fatal stroke in either primary care (CPRD GOLD) or secondary care (HES) between 1 January 1998 and 31 December 2017. Details about this cohort were previously reported [16]. Patients with a record of coronary heart disease (CHD), peripheral vascular disease (PVD), or heart failure before the incident stroke event were excluded. Patients were followed from the date of incident stroke diagnosis until they developed a major adverse cardiovascular event (MACE), died, ceased contributing data, or reached the last data collection date of their practice, whichever came first. The study flow diagram is shown in Fig 1.

Fig 1. Study flow diagram.

Outcomes

The primary outcome was a composite of recurrent stroke or CVD-related mortality event recorded after incident stroke from across the linked data sources (CPRD, HES or ONS registry). The secondary outcomes included: CHD, recurrent stroke, PVD, heart failure, CVD-related mortality, all-cause mortality, and the composite of recurrent stroke or all-cause mortality.

Subsequent outcomes within 30 days were considered to be representing or relating to the incident stroke event [16]. Analyses were, therefore, restricted to patients with subsequent outcomes occurring after 30 days of incident stroke.
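The 30-day restriction described above can be sketched as a simple date filter. This is a minimal illustration in Python, not the study's own Stata/R code; the function name is hypothetical, and treating an outcome on exactly day 30 as "within 30 days" is an assumption, since the text does not specify the boundary.

```python
from datetime import date

LANDMARK_DAYS = 30  # window stated in the text


def eligible_for_analysis(index_date, first_outcome_date):
    """Return True if the patient stays in the analysis cohort.

    Patients with no recorded outcome (None) are kept. Patients whose first
    outcome falls on or before day 30 after the incident stroke are dropped
    (boundary handling is an assumption).
    """
    if first_outcome_date is None:
        return True
    return (first_outcome_date - index_date).days > LANDMARK_DAYS


# Hypothetical patients: (incident stroke date, first subsequent outcome date)
patients = [
    (date(2010, 1, 1), date(2010, 1, 20)),  # outcome on day 19: excluded
    (date(2010, 1, 1), date(2011, 6, 1)),   # outcome after day 30: kept
    (date(2010, 1, 1), None),               # no outcome recorded: kept
]
kept = [p for p in patients if eligible_for_analysis(*p)]
```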

Potential candidate variables for phenotyping

Based on availability in the electronic health records and established association with CVD, 336 candidate variables were selected. These included demographic data, vital signs, biochemical parameters, comorbid conditions, and prescribed medications (S1 Table). For vital signs and biochemical test results, the most recent values/records within 24 months before incident stroke were extracted. A prescription within 12 months before incident stroke was considered as a medication prescribed. All comorbid conditions were defined based on the latest record of a comorbid condition any time before incident stroke. All code lists used have been published and are available for download [17,18].
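The look-back rules above can be illustrated with a short Python sketch. It is not the authors' extraction code: the function names are hypothetical, the 24- and 12-month windows are approximated as 730 and 365 days, and records dated on the day of the incident stroke are treated as not "before" it. All three choices are assumptions.

```python
from datetime import date, timedelta

MEASUREMENT_WINDOW_DAYS = 730   # ~24 months, approximated in days (assumption)
PRESCRIPTION_WINDOW_DAYS = 365  # ~12 months, approximated in days (assumption)


def latest_measurement(records, index_date, window_days=MEASUREMENT_WINDOW_DAYS):
    """Most recent value recorded in the window before index_date, else None.

    records is a list of (record_date, value) tuples.
    """
    start = index_date - timedelta(days=window_days)
    in_window = [(d, v) for d, v in records if start <= d < index_date]
    if not in_window:
        return None  # left missing; the study handled missing data by imputation
    return max(in_window, key=lambda r: r[0])[1]


def was_prescribed(prescription_dates, index_date,
                   window_days=PRESCRIPTION_WINDOW_DAYS):
    """True if any prescription falls in the window before index_date."""
    start = index_date - timedelta(days=window_days)
    return any(start <= d < index_date for d in prescription_dates)


def has_comorbidity(diagnosis_dates, index_date):
    """True if the condition was recorded at any time before index_date."""
    return any(d < index_date for d in diagnosis_dates)


# Hypothetical systolic blood pressure records for one patient
stroke_date = date(2015, 6, 1)
sbp_records = [
    (date(2012, 1, 1), 150),  # older than the window: ignored
    (date(2014, 3, 1), 142),
    (date(2015, 5, 1), 138),  # most recent value inside the window
    (date(2015, 7, 1), 120),  # after the stroke: ignored
]
baseline_sbp = latest_measurement(sbp_records, stroke_date)
```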

Data processing

The variable distributions and missingness were first assessed. Multiple imputation by chained equations was used to account for missing data (S1 Fig, S2 Table). Ten imputed datasets were generated using all available covariates and all the outcomes, although outcomes were not imputed [19,20]. The imputed datasets were pooled into a single dataset using Rubin’s rules [21].

A dataset with many variables (features) has a high number of dimensions, which is associated with a loss of meaningful differentiation between similar and dissimilar individuals (the ‘curse of dimensionality’) [22]. To improve the cluster analysis process and performance, feature selection was carried out to reduce collinearity, conditional dependence, and variance-inflating noise. Feature selection was based on two widely used data-driven methods (Boruta [23] and Least Absolute Shrinkage and Selection Operator (Lasso) regression [24]; S2 Fig) and clinical expert consensus. An expert group of clinicians from both primary care (Consultant General Practitioners: NQ, JK) and secondary care (Stroke Medicine Consultants/Specialists: GN, GG) were independently consulted to reach consensus on which variables to select for the cluster analysis. Clinical expert consensus was defined as 75% (3 out of 4) agreement among the clinical experts on each variable. In total, 49 variables were rated important by the clinical experts and by at least 1 of the 2 data-driven methods (S1 Table). After evaluating correlation among the 49 selected variables using the mixedCor and Lares functions in R for mixed-type data (S3 Fig and S4 Fig), we excluded 10 highly correlated variables based on clinical judgement/importance. The remaining 39 variables (Box 1) were used for the cluster analysis.
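The combined selection rule (expert consensus plus at least one data-driven method) can be written as a small Python sketch. The function name, variable names, and vote counts are hypothetical, and reading "75% agreement" as "at least 3 of 4 experts" is an assumption. The Boruta and Lasso steps themselves are not reproduced here; their outputs are taken as given sets.

```python
def select_features(expert_votes, boruta_selected, lasso_selected,
                    min_expert_votes=3):
    """Keep variables with expert consensus AND support from >= 1 data-driven method.

    expert_votes maps each variable to the number of experts (0 to 4)
    who rated it important.
    """
    selected = []
    for variable, votes in expert_votes.items():
        consensus = votes >= min_expert_votes  # assumed reading of "3 out of 4"
        data_driven = variable in boruta_selected or variable in lasso_selected
        if consensus and data_driven:
            selected.append(variable)
    return sorted(selected)


# Hypothetical inputs for illustration only
votes = {"age": 4, "sbp": 3, "ldl": 3, "height": 2, "eye_colour": 4}
boruta = {"age", "sbp", "height"}
lasso = {"age", "ldl", "height"}
chosen = select_features(votes, boruta, lasso)
```

In this toy input, "height" fails on expert consensus and "eye_colour" fails on data-driven support, so neither is retained.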

Box 1. Phenotypic domains and phenotypic variables used for cluster analysis

Phenotypic clustering

The prediction strength method of Tibshirani and Walther (2005), as implemented in the kamila function [25], and the Elbow method were used to select the optimal number of clusters (S5 Fig). The kamila algorithm for mixed data clustering (S1 Text) was implemented to identify distinct patient phenotypic clusters. To ensure robustness of the clusters identified, 1,000 initialisations (that is, random starting points) were carried out. A plot of the clusters along the principal component analysis (PCA) dimensions was generated (S6 Fig).
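For readers unfamiliar with prediction strength, the statistic can be sketched in Python as follows. This is not the kamila implementation: it assumes the clustering itself has already been run, and only computes the summary statistic for one candidate number of clusters from two label vectors on a held-out set. The function name and inputs are hypothetical, and skipping single-member clusters (for which no pairs exist) is an assumption.

```python
from itertools import combinations


def prediction_strength(test_labels, predicted_labels):
    """Prediction strength for one candidate number of clusters.

    test_labels:      cluster labels from clustering the test set directly.
    predicted_labels: labels for the same test observations obtained by
                      assigning them with the clustering fitted on the
                      training set.

    For each test cluster, compute the share of member pairs that the
    training-derived assignment also places together, then take the minimum.
    """
    clusters = {}
    for idx, label in enumerate(test_labels):
        clusters.setdefault(label, []).append(idx)

    strengths = []
    for members in clusters.values():
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # single-member cluster: no pairs to evaluate (assumption)
        agree = sum(predicted_labels[i] == predicted_labels[j] for i, j in pairs)
        strengths.append(agree / len(pairs))
    return min(strengths) if strengths else 0.0


# Hypothetical labels for six held-out patients
direct = [0, 0, 0, 1, 1, 1]
from_training = [0, 0, 1, 1, 1, 1]
ps = prediction_strength(direct, from_training)
```

Values near 1 indicate that the training-set clustering reproduces the test-set clusters well; the number of clusters is typically chosen as the largest value for which the statistic stays above a chosen threshold.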

Using the h2o package (http://www.h2o.ai), a gradient boosting model was applied to identify as well as rank the key covariates (candidate variables) that predict each of the identified phenotypic clusters. The respective cluster groupings were coded as 1 –belonging to cluster or 0 –belonging to other clusters. SHAP (SHapley Additive exPlanations) was used to assess the discriminative influence of the variables for each of the identified clusters [26].
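The one-versus-rest coding described above can be illustrated with a few lines of Python. This sketch covers only the construction of the binary targets; the gradient boosting model and SHAP computation are not reproduced, and the function name is hypothetical.

```python
def one_vs_rest_targets(cluster_labels):
    """Build one binary target per cluster.

    Returns a dict mapping each cluster to a list with 1 where the patient
    belongs to that cluster and 0 where the patient belongs to any other.
    """
    clusters = sorted(set(cluster_labels))
    return {
        c: [1 if label == c else 0 for label in cluster_labels]
        for c in clusters
    }


# Hypothetical cluster assignments for five patients
labels = [1, 2, 2, 4, 3]
targets = one_vs_rest_targets(labels)
# targets[2] marks the second and third patients as members of cluster 2
```

Each binary target would then be used as the outcome of its own classifier, which is why four separate AUC values are reported for the four clusters.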

Statistical analysis

For each cluster descriptive characteristics were provided, reporting proportion (%) for categorical variables and mean (SD) or median (IQR) for continuous variables. Kruskal-Wallis and chi-squared tests were used to compare across clusters, for continuous and categorical data, respectively.

The association between phenotypic clusters and adverse cardiovascular morbidity and mortality outcomes was assessed using a Cox proportional hazards regression model. The hazard ratio (HR) for each phenotypic group is presented with 95% confidence intervals (CI) and corresponding p-values. Cumulative incidence plots were derived and differences between phenotypic groups assessed by the log-rank test. All statistical analyses were performed using Stata SE version 17 (StataCorp LP) and R version 4.1.0. An alpha level of 0.05 was used.
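As a reminder of how the reported hazard ratios relate to the model, a Cox model with cluster membership as the exposure can be written as below. Cluster 1 is taken as the reference group, as in the Results. The unadjusted form and the Wald-type interval are assumptions made for illustration, since the text does not state whether further covariates were included or how the intervals were computed.

```latex
h(t \mid c) = h_0(t)\,\exp\!\Big(\sum_{k=2}^{4} \beta_k \,\mathbb{1}[c = k]\Big),
\qquad
\mathrm{HR}_k = e^{\hat\beta_k},
\qquad
95\%\ \mathrm{CI}_k = \exp\!\big(\hat\beta_k \pm 1.96\,\mathrm{SE}(\hat\beta_k)\big)
```

Here $h_0(t)$ is the baseline hazard (the hazard in cluster 1), $c$ is the patient's cluster, and $\mathrm{HR}_k$ compares cluster $k$ with cluster 1.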

Ethics approval and consent to participate

Ethical approval for this study was obtained from the Independent Scientific Advisory Committee (ISAC)–study protocol number 19_023R. De-identified (anonymised) patient data was obtained from the CPRD hence this study was exempt from obtaining informed consent from patients.

Results

Clinical characteristics among phenotypic clusters

We identified 68,642 patients aged ≥18 years old with any incident non-fatal stroke event between 1998 and 2017. A total of 20,528 (29.9%) patients with subsequent clinical outcomes occurring within 30 days of incident stroke event were excluded, as these outcomes were considered to be related to the incident stroke event [16]. Cluster analysis was performed in the remaining 48,114 patients. Four phenotypic clusters with significant differences in clinical characteristics were identified. The identified clusters were numbered from 1 to 4 in ascending order of the overall incidence of the primary outcome, the subsequent composite of recurrent stroke or CVD-related mortality. Table 1 describes and compares the clinical characteristics among the phenotypic clusters.

Table 1. Characteristics of study population at time of incident stroke according to cluster membership (n = 48,114).

The plots of the clusters are shown with the principal component analysis (PCA) dimensions in S6 Fig. The cluster profiles are summarised in Box 2.

Box 2. Summary of cluster profiles

Variable importance for clusters

The supervised gradient boosting model to identify key covariates (candidate variables) that predict the respective phenotypic cluster had excellent prediction accuracy, with an area under the receiver operating characteristic curve (AUC) of 0.985, 0.982, 0.974, and 0.970 for clusters 1, 2, 3 and 4, respectively. The most common variables for predicting the respective phenotypic clusters were age at incident stroke, blood pressure, hypertension, LDL cholesterol, and potency of prescribed statin (Fig 2).

Fig 2. Plot showing the clinical parameters which are the core of each phenotypic cluster. aki: acute kidney injury; dbp: diastolic blood pressure; dm_eye_comp: diabetic ophthalmic complications; sbp: systolic blood pressure; gfr: glomerular filtration rate; hb: haemoglobin; hdl: high-density lipoprotein cholesterol; ldl: low-density lipoprotein cholesterol; hba1c: glycated haemoglobin; nonRH_aortic: non-rheumatic aortic valve disorder; smi: severe mental illness; tg: triglyceride; tia: transient ischaemic attack. SHAP summary plot combines feature/variable importance with feature effects. Each point on the summary plot is a Shapley value for an individual. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value. The colour represents the value from low to high. The features are ordered according to importance.

Association with subsequent clinical outcomes

During the median follow-up time of 12.60 years (IQR, 7.60–16.97 years), there was a total of 24,588 (51.1%) composite recurrent stroke or CVD-related mortality outcome events. The occurrence of recurrent stroke + CVD-related mortality was different across the 4 phenotypic clusters–cluster 1 had the lowest incidence rate (15.13 per 100 person-years; 95% CI, 14.54–15.74), while cluster 4 had the highest incidence rate (23.17 per 100 person-years, 95% CI: 22.67–23.69). The risk of subsequent recurrent stroke + CVD-related mortality was significantly increased in cluster 2 (hazard ratio (HR), 1.07; 95% CI: 1.02–1.12); cluster 3 (HR, 1.20; 95% CI: 1.14–1.26), and cluster 4 (HR, 1.29; 95% CI: 1.26–1.33), when compared with cluster 1. Similar incidence rate and hazard ratio trends were observed for subsequent recurrent stroke + all-cause mortality outcome (cluster 2: HR, 1.07; 95% CI, 1.03–1.12; cluster 3: HR, 1.32, 95% CI, 1.26–1.37; cluster 4: HR, 1.54; 95% CI: 1.48–1.60) and recurrent stroke outcome (cluster 2: HR, 1.10; 95% CI, 1.05–1.16; cluster 3: HR, 1.12, 95% CI, 1.06–1.18; cluster 4: HR, 1.25; 95% CI: 1.19–1.32).
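The incidence rates quoted above are events divided by person-time at risk, scaled to 100 person-years. A minimal Python sketch is given below; the function name and the example numbers are hypothetical, and the log-scale normal approximation for the confidence interval is an assumption, since the text does not state which interval method was used.

```python
import math


def incidence_rate_per_100py(events, person_years, z=1.96):
    """Incidence rate per 100 person-years with an approximate 95% CI.

    Uses the approximation SE(log rate) = 1 / sqrt(events), which requires
    at least one event.
    """
    rate = events / person_years
    se_log = 1 / math.sqrt(events)
    lower = rate * math.exp(-z * se_log)
    upper = rate * math.exp(z * se_log)
    return 100 * rate, 100 * lower, 100 * upper


# Hypothetical cluster: 50 events over 1,000 person-years of follow-up
rate, lower, upper = incidence_rate_per_100py(50, 1000.0)
```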

Different trends in incidence rate and hazard ratios were observed, however, for subsequent CHD, PVD, heart failure, CVD-related and all-cause mortality outcomes–Fig 3 and Table 2. When compared with cluster 1, the risk of subsequent CHD events was significantly decreased in the other 3 clusters (cluster 2: HR, 0.49; 95% CI: 0.44–0.55; cluster 3: HR, 0.64; 95% CI, 0.56–0.73; cluster 4: HR, 0.55; 95% CI, 0.49–0.63). A similar decreased risk in the other 3 clusters when compared to cluster 1 was observed for risk of subsequent PVD.

Fig 3. Incidence rate for the subsequent adverse outcomes by the identified phenotypic clusters.

Table 2. Subsequent major adverse outcomes after incident stroke by phenotypic clusters.

For risk of subsequent heart failure, CVD-related mortality and all-cause mortality, cluster 2 had a significantly decreased risk when compared to cluster 1, while clusters 3 and 4 had a significantly increased risk (Table 2). The occurrence of subsequent cardiovascular morbidity and mortality outcomes across the different phenotypic clusters is presented as Kaplan-Meier plots in Fig 4.

Fig 4. Kaplan-Meier plots for subsequent clinical outcomes stratified by phenotypic clusters. A: Recurrent stroke and CVD-related mortality (log-rank p<0.0001); B: Recurrent stroke and all-cause mortality (log-rank p<0.0001); C: Recurrent stroke (log-rank p<0.0001); D: Coronary heart disease (log-rank p<0.0001); E: Peripheral vascular disease (log-rank p<0.0001); F: Heart failure (log-rank p<0.0001); G: Cardiovascular-related mortality (log-rank p<0.0001); H: All-cause mortality (log-rank p<0.0001).

Discussion

This population-based study explored the phenotypic characteristics of patients with incident stroke using a data-driven cluster analysis approach and identified four clinically meaningful patient clusters based on characteristics at the time of incident stroke. The relationship between the identified phenotypic clusters and the subsequent risk of adverse cardiovascular morbidity and mortality outcomes varied by outcome.

In our study, four distinct and clinically meaningful phenotypic clusters were identified. Smoking, a strong independent modifiable risk factor for cardiovascular morbidity and mortality outcomes [27], was most prevalent in clusters 1 and 2. A preventative strategy that communicates the risks of smoking and the benefits of quitting to patients in these clusters could be an effective means to promote smoking cessation and reduce the risk of subsequent adverse events [28]. With the exception of cluster 2, the clusters had a high prevalence of multiple long-term conditions as well as CVD risk factors at the time of incident stroke. Patients with incident stroke have been shown to commonly have pre-existing long-term conditions [29]. To optimally manage the possible atherogenic effect of these comorbid conditions and reduce the risk of subsequent cardiovascular morbidity and mortality outcomes, both non-pharmacological (that is, lifestyle modification [30,31]) and pharmacological (antihypertensives for blood pressure management [32]; lipid-lowering medications such as statins for cholesterol management [33]; antidiabetics for blood sugar control [30]; and antiplatelets/anticoagulants to manage arrhythmia [34]) strategies need to be prioritised in line with clinical guidelines [35]. Frequent monitoring and review to ensure treatment targets are being met are important [36]. Age, a non-modifiable risk factor, was a key factor for patient cluster membership. Among older adults (typical of cluster 4), the incidence of aortic disease, PVD and venous thromboembolism increases as age-related alterations in vascular structure and function are compounded by longer exposure to CVD risk factors [37].

Clustering is a common approach used to analyse large datasets, to identify both the number of subgroups in the data and the attributes of each subgroup, as has been done in this study. Data analysed in real applications, including healthcare (from electronic health records), are mostly characterised by a mix of continuous and categorical variables. Common approaches applied to mixed data include converting the variables to a single data type, either by coding the categorical variables as numbers or by dummy coding them, and then applying standard distance-based methods designed for continuous variables, such as k-means, to the transformed data [38,39]. Continuous variables have also been converted to categorical variables using interval-based bucketing [40,41]. Similarities that may have been observed in the original data may be lost when the data are transformed in such ways [40]. The Kamila clustering algorithm has, however, been shown to handle high imbalance between continuous and categorical data better than other methods [40,42]. From a computational perspective, when compared with other algorithms, the Kamila algorithm offers the best performance and is the most time-efficient when dealing with large datasets (in relation to both observations and variables) in the setting of heterogeneous data, as was the situation in our study [40,42].

Strengths and limitations

To our knowledge, this is the first time that a data-driven cluster analysis aimed at identifying stroke phenotypes has been carried out in a well-characterised, large population-based cohort of adults with any incident stroke. This allows us to cover a large range of stroke phenotypes. Most importantly, we had a comprehensive linked database with a broad spectrum of clinical data, with many of these variables being explored in cluster analysis for the first time.

There are, however, limitations of this study worth considering. First and foremost, the study was not meant to propose a new classification for stroke, because the clusters are likely to vary according to patient characteristics and available data. These results serve to underscore the need for novel multidimensional stroke classification approaches for improving patient care. Furthermore, they are intended to generate hypotheses for future studies that will integrate clinical and biological data, with the goal of improving the care of patients with stroke. With immense advancement in machine learning, cluster analysis can be performed in a large number of ways [42,43]. However, the knowledge and experience of the relevant experts remain the best judge in the interpretation of findings from cluster analysis, hence the involvement of a diverse group of clinical specialists, clinical researchers, and data experts in our study. Missing data are common in clinical research using electronic health records collected as part of routine care. For example, laboratory tests are typically requested only when considered necessary for a patient’s health condition. Similarly, information on BMI or smoking status may not be consistently recorded, leading to potential bias in patterns of data completeness. To address this issue, multiple imputation by chained equations, as outlined in the methods section, was used to handle missing data in our study, which is the preferred option under any missingness mechanism [19,20].

Implications

Cluster analysis is well suited to address the multidimensional complexity of disease conditions with considerable heterogeneity such as stroke. Population-based cluster analysis could provide further understanding of disease patterns. Additionally, patients could be phenotyped and allocated to specific clusters that could be associated with different risks for various outcomes. Different treatment strategies or interventions could be targeted at specific phenotypic clusters, based on available evidence on risk and possible response. Future clinical trial design could also focus on high-risk clusters or on specific aspects within a cluster.

Conclusions

Using an unsupervised learning data-driven cluster analysis on a broad spectrum of baseline clinical data of patients with incident stroke, we identified four phenotypic and clinically meaningful clusters with respect to risk of subsequent major adverse outcomes. These findings highlight the significant heterogeneity that exists within patients with incident stroke with respect to subsequent adverse outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes. Further exploration in different patient cohorts and populations is needed.

Supporting information

S1 Text. Additional Methods.

https://doi.org/10.1371/journal.pdig.0000334.s001

(DOCX)

S1 Fig. All clinical variables with missing values.

https://doi.org/10.1371/journal.pdig.0000334.s002

(DOCX)

S2 Fig. Feature selection.

https://doi.org/10.1371/journal.pdig.0000334.s003

(DOCX)

S3 Fig. Plot of correlation matrix of 49 selected variables.

https://doi.org/10.1371/journal.pdig.0000334.s004

(DOCX)

S4 Fig. Ranked cross-correlation plot of 49 selected variables.

https://doi.org/10.1371/journal.pdig.0000334.s005

(DOCX)

S5 Fig. Optimal number of clusters.

https://doi.org/10.1371/journal.pdig.0000334.s006

(DOCX)

S6 Fig. Principal component analysis (PCA) plots.

https://doi.org/10.1371/journal.pdig.0000334.s007

(DOCX)

S1 Table. Overview of all variables and the in- or exclusion at the various data processing steps.

https://doi.org/10.1371/journal.pdig.0000334.s008

(DOCX)

S2 Table. Observed versus imputed values after multiple imputation for all clinical variables with missing data.

https://doi.org/10.1371/journal.pdig.0000334.s009

(DOCX)

Acknowledgments

We thank the practices that contributed to the CPRD GOLD.

Citation: Akyea RK, Ntaios G, Kontopantelis E, Georgiopoulos G, Soria D, Asselbergs FW, et al. (2023) A population-based study exploring phenotypic clusters and clinical outcomes in stroke using unsupervised machine learning approach. PLOS Digit Health 2(9): e0000334. https://doi.org/10.1371/journal.pdig.0000334

References

1. Rajsic S, Gothe H, Borba HH, Sroczynski G, Vujicic J, Toell T, et al. Economic burden of stroke: a systematic review on post-stroke care. Eur J Heal Econ. 2019;20: 107–134. pmid:29909569

2. Prosser J, MacGregor L, Lees KR, Diener HC, Hacke W, Davis S. Predictors of early cardiac morbidity and mortality after ischemic stroke. Stroke. 2007;38: 2295–2302. pmid:17569877

3. Joosten SA, Hamza K, Sands S, Turton A, Berger P, Hamilton G. Phenotypes of patients with mild to moderate obstructive sleep apnoea as confirmed by cluster analysis. Respirology. 2012;17: 99–107. pmid:21848707

4. Haldar P, Pavord ID, Shaw DE, Berry MA, Thomas M, Brightling CE, et al. Cluster analysis and clinical asthma phenotypes. Am J Respir Crit Care Med. 2008;178: 218–224. pmid:18480428

5. Siroux V, Basagan X, Boudier A, Pin I, Garcia-Aymerich J, Vesin A, et al. Identifying adult asthma phenotypes using a clustering approach. Eur Respir J. 2011;38: 310–317. pmid:21233270

6. Ahmad T, Pencina MJ, Schulte PJ, O’Brien E, Whellan DJ, Piña IL, et al. Clinical implications of chronic heart failure phenotypes defined by cluster analysis. J Am Coll Cardiol. 2014;64: 1765–1774. pmid:25443696

7. Verdonschot JAJ, Merlo M, Dominguez F, Wang P, Henkens MTHM, Adriaens ME, et al. Phenotypic clustering of dilated cardiomyopathy patients highlights important pathophysiological differences. Eur Heart J. 2021;42: 162–174. pmid:33156912

8. Seymour CW, Kennedy JN, Wang S, Chang CCH, Elliott CF, Xu Z, et al. Derivation, Validation, and Potential Treatment Implications of Novel Clinical Phenotypes for Sepsis. J Am Med Assoc. 2019;321: 2003–2017. pmid:31104070

9. Fereshtehnejad SM, Romenets SR, Anang JBM, Latreille V, Gagnon JF, Postuma RB. New clinical subtypes of Parkinson disease and their longitudinal progression a prospective cohort comparison with other phenotypes. JAMA Neurol. 2015;72: 863–873. pmid:26076039

10. Soria D, Garibaldi JM, Ambrogi F, Green AR, Powe D, Rakha E, et al. A methodology to identify consensus classes from clustering algorithms applied to immunohistochemical data from breast cancer patients. Comput Biol Med. 2010;40: 318–330. pmid:20106472

11. Ahlqvist E, Storm P, Käräjämäki A, Martinell M, Dorkhan M, Carlsson A, et al. Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol. 2018;6: 361–369. pmid:29503172

12. Herrett E, Gallagher AM, Bhaskaran K, Forbes H, Mathur R, van Staa T, et al. Data Resource Profile: Clinical Practice Research Datalink (CPRD). Int J Epidemiol. 2015;44: 827–836. pmid:26050254

13. NHS Digital. Hospital Episode Statistics (HES). In: NHS Digital [Internet]. 2019 [cited 21 Jun 2019]. Available: https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics

14. Office for National Statistics. Deaths Registration Data. In: ONS [Internet]. 2018 [cited 21 Jun 2019]. Available: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths

15. Department of Communities and Local Government. English Indices of Deprivation 2015. 2015 [cited 10 Jul 2016] pp. 1–11. Available: https://www.gov.uk/government/statistics/english-indices-of-deprivation-2015

16. Akyea RK, Vinogradova Y, Qureshi N, Patel RS, Kontopantelis E, Ntaios G, et al. Sex, Age, and Socioeconomic Differences in Nonfatal Stroke Incidence and Subsequent Major Adverse Outcomes. Stroke. 2021;52: 396–405. pmid:33493066

17. Kuan V, Denaxas S, Gonzalez-Izquierdo A, Direk K, Bhatti O, Husain S, et al. A chronological map of 308 physical and mental health conditions from 4 million individuals in the English National Health Service. Lancet Digit Heal. 2019;1: e63–e77. pmid:31650125

18. CPRD @ Cambridge. Codes Lists (GOLD). [cited 6 Mar 2021]. Available: https://www.phpc.cam.ac.uk/pcu/research/research-groups/crmh/cprd_cam/codelists/v11/

19. Royston P. Multiple imputation of missing values: Update of ice. Stata J. 2005;5: 527–536.

20. Kontopantelis E, White IR, Sperrin M, Buchan I. Outcome-sensitive multiple imputation: A simulation study. BMC Med Res Methodol. 2017;17: 1–13. pmid:28068910

21. Rubin DB. Multiple imputation for nonresponse in surveys. Wiley; 1987. https://doi.org/10.1002/9780470316696

22. Altman N, Krzywinski M. The curse(s) of dimensionality. Nat Methods. 2018;15: 399–400. pmid:29855577

23. Kursa MB, Rudnicki WR. Feature selection with the Boruta package. J Stat Softw. 2010;36: 1–13.

24. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996. pp. 267–88.

25. Foss AH, Markatou M. kamila: Clustering mixed-type data in R and hadoop. J Stat Softw. 2018;83: 1–44.

26. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2: 56–67. pmid:32607472

27. Mons U, Müezzinler A, Gellert C, Schöttker B, Abnet CC, Bobak M, et al. Impact of smoking and smoking cessation on cardiovascular events and mortality among older adults: Meta-analysis of Individual participant data from prospective cohort studies of the CHANCES consortium. BMJ. 2015;350: 18. pmid:25896935

28. Duncan MS, Freiberg MS, Greevy RA, Kundu S, Vasan RS, Tindle HA. Association of Smoking Cessation with Subsequent Risk of Cardiovascular Disease. J Am Med Assoc. 2019;322: 642–650. pmid:31429895

29. Gallacher KI, Batty GD, McLean G, Mercer SW, Guthrie B, May CR, et al. Stroke, multimorbidity and polypharmacy in a nationally representative sample of 1,424,378 patients in Scotland: Implications for treatment burden. BMC Med. 2014;12: 1–9. pmid:25280748

30. Kernan WN, Ovbiagele B, Black HR, Bravata DM, Chimowitz MI, Ezekowitz MD, et al. Guidelines for the prevention of stroke in patients with stroke and transient ischemic attack: A guideline for healthcare professionals from the American Heart Association/American Stroke Association. Stroke. 2014;45: 2160–2236. pmid:24788967

31. Billinger SA, Arena R, Bernhardt J, Eng JJ, Franklin BA, Johnson CM, et al. Physical activity and exercise recommendations for stroke survivors: A statement for healthcare professionals from the American Heart Association/American Stroke Association. Stroke. 2014;45: 2532–2553. pmid:24846875

32. Arima H, Chalmers J, Woodward M, Anderson C, Rodgers A, Davis S, et al. Lower target blood pressures are safe and effective for the prevention of recurrent stroke: The PROGRESS trial. J Hypertens. 2006;24: 1201–1208. pmid:16685221

33. Fulcher J, O’Connell R, Voysey M, Emberson J, Blackwell L, Mihaylova B, et al. Efficacy and safety of LDL-lowering therapy among men and women: Meta-analysis of individual data from 174 000 participants in 27 randomised trials. Lancet. 2015;385: 1397–1405. pmid:25579834

34. Gent M. A randomised, blinded, trial of clopidogrel versus aspirin in patients at risk of ischaemic events (CAPRIE). Lancet. 1996;348: 1329–1339. pmid:8918275

35. Kleindorfer DO, Towfighi A, Chaturvedi S, Cockroft KM, Gutierrez J, Lombardi-Hill D, et al. 2021 Guideline for the prevention of stroke in patients with stroke and transient ischemic attack: A guideline from the American Heart Association/American Stroke Association. Stroke. 2021;52: E364–E467. pmid:34024117

36. National Institute for Health and Care Excellence. Multimorbidity: clinical assessment and management. NICE; 2016 [cited 1 Oct 2021]. Available: https://www.nice.org.uk/guidance/ng56

37. Miller AP, Huff CM, Roubin GS. Vascular disease in the older adult. J Geriatr Cardiol. 2016;13: 727–732. pmid:27899936

38. Dougherty J, Kohavi R, Sahami M. Supervised and Unsupervised Discretization of Continuous Features. Mach Learn Proc 1995. 1995; 194–202.

39. Hennig C, Liao TF. How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification. J R Stat Soc Ser C Appl Stat. 2013;62: 309–369.

40. Foss A, Markatou M, Ray B, Heching A. A semiparametric method for clustering mixed data. Mach Learn. 2016;105: 419–458.

41. Ichino M, Yaguchi H. Generalized Minkowski Metrics for Mixed Feature-Type Data Analysis. IEEE Trans Syst Man Cybern. 1994;24: 698–708.

42. Preud’homme G, Duarte K, Dalleau K, Lacomblez C, Bresso E, Smaïl-Tabbone M, et al. Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark. Sci Rep. 2021;11: 1–14. pmid:33603019

43. McLachlan GJ. Cluster analysis and related techniques in medical research. Stat Methods Med Res. 1992;1: 27–48. pmid:1341650

About the Authors:
Ralph K. Akyea

Contributed equally to this work with: Ralph K. Akyea, Stephen F. Weng, Nadeem Qureshi
Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Visualization, Writing – original draft, Writing – review & editing
* E-mail: [email protected]
¶‡ These authors are joint senior authors on this work.
Affiliation: PRISM Research Group, Centre for Academic Primary Care, School of Medicine, University of Nottingham, Nottingham, United Kingdom
https://orcid.org/0000-0003-4529-8237
George Ntaios

Roles: Writing – review & editing
Affiliation: Department of Internal Medicine, Faculty of Medicine, School of Health Sciences, University of Thessaly, Larissa, Greece
Evangelos Kontopantelis

Roles: Writing – review & editing
Affiliations: Division of Population Health, Health Services Research and Primary Care, School of Health Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre (MAHSC), The University of Manchester, Manchester, United Kingdom; Division of Informatics, Imaging and Data Sciences, School of Health Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre (MAHSC), The University of Manchester, Manchester, United Kingdom
Georgios Georgiopoulos

Roles: Writing – review & editing
Affiliation: School of Biomedical Engineering and Imaging Sciences, St Thomas Hospital, King’s College London, London, United Kingdom
Daniele Soria

Roles: Writing – review & editing
Affiliation: School of Computing, University of Kent, Canterbury, United Kingdom
Folkert W. Asselbergs

Roles: Writing – review & editing
Affiliations: Amsterdam University Medical Centers, Department of Cardiology, University of Amsterdam, Amsterdam, The Netherlands; Health Data Research UK and Institute of Health Informatics, University College London, London, United Kingdom
Joe Kai

Roles: Funding acquisition, Writing – review & editing
Affiliation: PRISM Research Group, Centre for Academic Primary Care, School of Medicine, University of Nottingham, Nottingham, United Kingdom
Stephen F. Weng

Contributed equally to this work with: Ralph K. Akyea, Stephen F. Weng, Nadeem Qureshi
Roles: Conceptualization, Funding acquisition, Methodology, Supervision, Writing – review & editing
¶‡ These authors are joint senior authors on this work.
Affiliation: PRISM Research Group, Centre for Academic Primary Care, School of Medicine, University of Nottingham, Nottingham, United Kingdom
Nadeem Qureshi

Contributed equally to this work with: Ralph K. Akyea, Stephen F. Weng, Nadeem Qureshi
Roles: Conceptualization, Funding acquisition, Supervision, Writing – review & editing
¶‡ These authors are joint senior authors on this work.
Affiliation: PRISM Research Group, Centre for Academic Primary Care, School of Medicine, University of Nottingham, Nottingham, United Kingdom

Word count: 4870

© 2023 Akyea et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Individuals developing stroke have varying clinical characteristics, demographics, and biochemical profiles. This heterogeneity in phenotypic characteristics can affect cardiovascular disease (CVD) morbidity and mortality outcomes. This study uses a novel clustering approach to stratify individuals with incident stroke into phenotypic clusters and evaluates the differential burden of recurrent stroke and other cardiovascular outcomes. We used linked clinical data from primary care, hospitalisations, and death records in the UK. A data-driven clustering analysis (kamila algorithm) was applied to 48,114 patients aged ≥ 18 years who had an incident stroke between 1-Jan-1998 and 31-Dec-2017 and no prior history of serious vascular events. Cox proportional hazards regression was used to estimate hazard ratios (HRs) for subsequent adverse outcomes for each of the generated clusters. Adverse outcomes included coronary heart disease (CHD), recurrent stroke, peripheral vascular disease (PVD), heart failure, CVD-related mortality, and all-cause mortality. Four distinct phenotypes with varying underlying clinical characteristics were identified in patients with incident stroke. Compared with cluster 1 (n = 5,201, 10.8%), the risk of composite recurrent stroke and CVD-related mortality was higher in the other three clusters (cluster 2 [n = 18,655, 38.8%]: HR, 1.07; 95% CI, 1.02–1.12; cluster 3 [n = 10,244, 21.3%]: HR, 1.20; 95% CI, 1.14–1.26; and cluster 4 [n = 14,014, 29.1%]: HR, 1.44; 95% CI, 1.37–1.50). Similar trends in risk were observed for the composite outcome of recurrent stroke and all-cause mortality, and for subsequent recurrent stroke alone. However, results were not consistent for the subsequent risk of CHD, PVD, heart failure, CVD-related mortality, and all-cause mortality.
In this proof-of-principle study, we demonstrated how a heterogeneous population of patients with incident stroke can be stratified into four relatively homogeneous phenotypes with differential risk of recurrent stroke and major cardiovascular outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes.

Details

Title: A population-based study exploring phenotypic clusters and clinical outcomes in stroke using unsupervised machine learning approach
Author: Akyea, Ralph K; Ntaios, George; Kontopantelis, Evangelos; Georgiopoulos, Georgios; Soria, Daniele; Asselbergs, Folkert W; Kai, Joe; Weng, Stephen F; Qureshi, Nadeem
First page: e0000334
Section: Research Article
Publication year: 2023
Publication date: Sep 2023
Publisher: Public Library of Science
e-ISSN: 27673170
Source type: Scholarly Journal
Language of publication: English
DOI: https://doi.org/10.1371/journal.pdig.0000334
ProQuest document ID: 3085662650
