Abstract
Cardiovascular disease (CVD) is the leading global cause of death, highlighting the urgent need for early, accurate, and interpretable diagnostic tools. However, many AI-based heart disease prediction models lack transparency, hindering their acceptance in clinical settings. This study proposes XAI-HD, a hybrid framework integrating machine learning (ML), deep learning (DL), and explainable AI (XAI) techniques for heart disease detection. The framework systematically addresses key challenges, including class imbalance, missing data, and feature inconsistency, through advanced preprocessing and class-balancing methods such as OSS, NCR, SMOTEN, ADASYN, SMOTETomek, and SMOTEENN. Comparative performance evaluations across multiple datasets (CHD, FHD, SHD) demonstrate that XAI-HD reduces classification error rates by 20–25% compared to traditional ML-based models, achieving superior accuracy, precision, recall, and F1-score. Additionally, SHAP and LIME-based feature importance analysis enhances model interpretability, fostering trust among medical professionals. The proposed framework holds significant real-world applicability, including seamless integration into hospital decision support systems, electronic health records (EHR), and real-time cardiac risk assessment platforms. Unlike conventional AI-driven cardiovascular risk prediction models, XAI-HD offers a more balanced, interpretable, and computationally efficient solution, ensuring both predictive accuracy and practical feasibility in clinical environments. Statistical validation using Wilcoxon signed-rank tests confirms the performance gains, and complexity analysis shows the framework is scalable for large-scale deployment.
Introduction
Cardiovascular diseases (CVDs), including coronary heart disease (CHD), heart failure (HF), and stroke, remain the leading cause of mortality worldwide (Zhu et al. 2024). The World Health Organization (WHO) reports that CVDs result in about 17.9 million deaths each year, constituting nearly 32% of total global mortality (Naeem et al. 2024). The significant prevalence of cardiovascular diseases is influenced by modifiable risk factors like hypertension, obesity, smoking, diabetes, sedentary behavior, and inadequate dietary practices, impacting both individual health and public healthcare systems. The economic impact of CVDs is most pronounced in low- and middle-income countries, where diagnostic resources are scarce and timely clinical interventions are frequently postponed (Mbanze et al. 2025). In this context, early diagnosis and effective prediction tools are vital not only for improving patient survival rates but also for reducing the economic burden on healthcare systems.
Challenges in traditional CVD risk assessment models
Timely detection and accurate risk assessment are essential to reduce mortality and improve treatment outcomes (Meera and Devi 2025). Conventional scoring models, such as the Framingham Risk Score (FRS), are widely used for CVD prediction, evaluating established parameters such as age, cholesterol levels, and blood pressure (Abdullahi et al. 2024). Nonetheless, these models rest on overly simplistic linear assumptions that do not adequately represent the intricate, nonlinear interactions among risk factors in diverse patient populations. Moreover, they handle missing data, class imbalance, and changing risk indicators poorly, which may result in biased and incorrect predictions (Tsoumplekas et al. 2024).
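To make the linearity limitation concrete, the sketch below mimics an FRS-style score as a fixed weighted sum of risk factors passed through a logistic link. The weights, intercept, and feature set are hypothetical, chosen purely for illustration (the published Framingham model uses sex-specific point tables); because every factor contributes additively and independently, such a score cannot represent interactions, e.g., smoking amplifying the effect of hypertension.

```python
import math

# Hypothetical coefficients for illustration only -- NOT the published
# Framingham coefficients, which use sex-specific point tables.
WEIGHTS = {"age": 0.05, "total_cholesterol": 0.004, "systolic_bp": 0.01}
INTERCEPT = -7.0

def linear_risk_score(patient: dict) -> float:
    """Linear log-odds: a weighted sum of risk factors plus an intercept."""
    z = INTERCEPT + sum(WEIGHTS[k] * patient[k] for k in WEIGHTS)
    # Logistic link maps the linear score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

patient = {"age": 60, "total_cholesterol": 240, "systolic_bp": 150}
risk = linear_risk_score(patient)
```

Each feature shifts the risk by a fixed amount regardless of the other features, which is exactly the modeling assumption the nonlinear ML and DL approaches discussed below are meant to relax.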
AI-driven CVD prediction and the need for explainability
To address these constraints, artificial intelligence (AI), especially machine learning (ML) and deep learning (DL), has gained prominence in cardiovascular disease (CVD) diagnosis due to its capacity to evaluate high-dimensional data and reveal complex patterns that surpass traditional statistical methods (Talukder 2025). Although AI-driven models often achieve high predictive accuracy, their clinical use is hindered by limited interpretability. Numerous high-performing ML and DL models function as “black boxes,” constraining medical personnel’s comprehension of model decisions and thereby undermining confidence and reliability in healthcare applications (Vinora et al. 2025). Moreover, real-world medical data poses further obstacles, such as diverse feature distributions, class imbalance, and inconsistent data quality (Carvalho et al. 2025). Without adequate data preprocessing, class balancing, and explainability procedures, AI-driven models may produce biased and non-generalizable results, diminishing their efficacy in clinical environments.
Role of explainable AI (XAI) in HD prediction
Explainable Artificial Intelligence (XAI) has emerged as a disruptive innovation in the medical AI sector, particularly regarding heart disease prognosis (Talukder et al. 2025). Although traditional AI models exhibit superior prediction accuracy, they frequently lack transparency, complicating doctors’ ability to discern the impact of input variables on outcomes.
This lack of transparency is a major barrier to clinical adoption, where decisions must be justifiable and trustworthy. XAI addresses this issue by providing interpretable insights into model behavior through techniques such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) (Ejiyi et al. 2023; Ejiyi et al. 2024a; Ejiyi et al. 2024b). In HD prediction, XAI enables practitioners to understand the contribution of risk factors, such as blood pressure, cholesterol levels, and lifestyle habits, toward a patient’s predicted outcome. This fosters greater confidence in AI-driven decisions, facilitates early intervention strategies, and promotes personalized medicine, making XAI an essential component in modern cardiovascular healthcare solutions.
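The Shapley values that SHAP approximates can be computed exactly for a model small enough to enumerate every feature subset. The toy risk model, patient, and baseline below are invented for illustration; production SHAP libraries replace the single-baseline "missing feature" convention used here with background datasets and fast model-specific approximations.

```python
from itertools import combinations
from math import factorial

# Toy risk model over three features; a stand-in for any black-box f(x).
def model(bp, chol, smoker):
    return 0.4 * bp + 0.3 * chol + 0.2 * smoker + 0.1 * bp * smoker

FEATURES = ["bp", "chol", "smoker"]
x = {"bp": 1, "chol": 1, "smoker": 1}         # patient being explained
baseline = {"bp": 0, "chol": 0, "smoker": 0}  # reference input

def f_subset(subset):
    """Evaluate the model with features in `subset` taken from the patient
    and the rest fixed at the baseline (a simple 'missingness' convention)."""
    z = {k: (x[k] if k in subset else baseline[k]) for k in FEATURES}
    return model(**z)

def shapley(feature):
    """Exact Shapley value: the weighted average marginal contribution of
    `feature` over all subsets of the remaining features."""
    others = [f for f in FEATURES if f != feature]
    n, total = len(FEATURES), 0.0
    for r in range(len(others) + 1):
        for s in combinations(others, r):
            w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += w * (f_subset(set(s) | {feature}) - f_subset(set(s)))
    return total

phi = {f: shapley(f) for f in FEATURES}
# Efficiency property: the attributions sum to f(x) - f(baseline).
```

Note how the interaction term 0.1·bp·smoker is split evenly between `bp` and `smoker`; this fair division of interaction effects is what makes Shapley-based attributions clinically interpretable.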
Existing background study
Sharma et al. (2017) and Abushariah et al. (2014) utilized classical ML algorithms such as Decision Tree (DT) and Artificial Neural Networks (ANN), demonstrating moderate success but limited generalizability and interpretability. Ensemble learning techniques, including bagging, boosting, and stacking, as proposed by Paul (2024a), significantly improved performance metrics, yet lacked the transparency and explainability crucial for clinical adoption. Other studies such as Musa and Muhammad (2022) emphasized the importance of feature selection, while Nursyahrina et al. (2024) and Nahar et al. (2013) integrated domain knowledge using Rough Set Theory and medical feature selection (MFS), respectively, to enhance interpretability. However, these methods often suffer from limited scalability and typically depend on handcrafted features or shallow architectures. Recently, hybrid models combining DL with traditional ML, such as CNN-XGBoost (Sharma et al. 2017), and optimization techniques like OPTUNA (Srinivas and Katarya 2022) have shown promising results. Yet these models still face challenges such as class-imbalance handling and a lack of explainability. Despite these advances, a research gap remains in developing a unified, interpretable, and high-performing HD prediction framework that integrates advanced feature engineering, robust learning algorithms, and explainability from the ground up.
The XAI-HD framework
In response to these challenges, we propose XAI-HD, a comprehensive framework that uniquely combines data balancing techniques and Explainable Artificial Intelligence (XAI) methodologies to improve the interpretability and robustness of heart disease prediction models. Unlike conventional AI-based approaches, XAI-HD systematically enhances predictive accuracy while ensuring transparency in decision-making through the integration of SHAP and LIME interpretability tools. The framework uses rigorous preprocessing and advanced class-balancing strategies to reduce bias, enhance generalization, and increase clinical relevance.
To validate its effectiveness, XAI-HD is evaluated across three diverse datasets (CHD, FHD, and SHD) to demonstrate its adaptability, performance, and interpretability in real-world medical applications. By bridging the gap between AI advancements and practical clinical deployment, this study aims to contribute to the evolution of equitable and explainable AI-driven cardiology solutions.
Problem statements
The following problem statements define the core challenges addressed in this study:
P1: Traditional heart disease prediction models often struggle with class imbalance, missing data, and inconsistent feature scaling, leading to biased and unreliable outcomes in real-world clinical settings.
P2: There is a lack of comparative analysis across diverse ML and DL models to determine which algorithms perform best in heart disease detection across different datasets with varying feature distributions.
P3: Despite the growing use of AI in healthcare, many high-performing models function as black boxes, lacking the transparency and interpretability needed to earn trust and adoption among medical professionals.
Objectives
The primary objectives of this study are as follows:
O1: To evaluate and compare the performance of multiple ML and DL models for heart disease detection using diverse real-world datasets (CHD, FHD, SHD).
O2: To enhance model performance and generalization through effective data preprocessing techniques and the application of advanced class balancing methods.
O3: To integrate explainable AI tools such as SHAP and LIME to interpret the decisions of the best-performing models and identify the most influential features in heart disease prediction.
Contributions
The key contributions of this study are summarized as follows:
XAI-HD Framework Development: This study proposes XAI-HD, an integrated framework combining ML, DL, and explainable AI techniques for accurate and interpretable heart disease detection across multiple datasets (CHD, FHD, SHD).
Comprehensive Pipeline with Advanced Preprocessing and Balancing: The framework includes robust data preprocessing (imputation, normalization, encoding) and advanced balancing strategies (OSS, NCR, SMOTEN, ADASYN, SMOTETomek, SMOTEENN) to address data quality and class imbalance, enhancing model generalization.
Interpretability and Rigorous Evaluation: SHAP- and LIME-based feature importance analysis (FIA) is integrated for transparent model interpretation, while extensive evaluation using statistical tests, complexity analysis, and multiple performance metrics ensures both predictive accuracy and practical feasibility.
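The core idea behind the SMOTE family of balancing strategies listed above (synthesising minority-class samples by interpolating between nearest minority neighbours) can be sketched in a few lines. This is a deliberately simplified illustration, not the SMOTEN, ADASYN, SMOTETomek, or SMOTEENN variants, which additionally handle nominal features, density-adaptive sampling, or post-hoc cleaning of ambiguous samples.

```python
import random

def smote_like(minority, k=2, n_new=4, seed=0):
    """Minimal SMOTE-style oversampler: for each new sample, pick a minority
    point, find one of its k nearest minority neighbours, and interpolate at
    a random position on the segment between them."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist2(base, p))[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + t * (v - b) for b, v in zip(base, nb)))
    return synthetic

# Toy 2-D minority-class feature vectors, invented for illustration.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
new_points = smote_like(minority)
```

Because synthetic points lie on segments between real minority samples, they densify the minority region rather than duplicating records, which is why SMOTE-style balancing tends to improve generalization over naive oversampling.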
Research questions addressed by the framework
The XAI-HD framework is designed to address the following key research questions:
RQ1: Which ML or DL models demonstrate the highest accuracy and reliability for heart disease prediction across different datasets with varying characteristics?
RQ2: How can effective data preprocessing (such as imputation, normalization, and encoding) combined with advanced class balancing techniques improve the overall performance and fairness of heart disease classification models?
RQ3: Can explainable AI methods like SHAP and LIME provide actionable insights into model predictions, making AI-driven heart disease diagnosis more transparent and trustworthy?
RQ4: Is the proposed XAI-HD framework scalable and efficient enough in terms of training and inference time to be considered suitable for real-world clinical applications?
Context behind this research
This research is motivated by the growing need to bridge the gap between the high accuracy of AI-driven heart disease (HD) prediction models and their limited interpretability in clinical practice. While numerous studies have demonstrated the predictive capabilities of ML and DL models, their opaque nature often hinders real-world deployment, particularly in high-stakes domains like healthcare where explainability is essential.
This study presents the XAI-HD framework in response to this challenge, incorporating comprehensive preprocessing, multiple AI algorithms, and explainable AI tools to ensure both accuracy and transparency. The framework seeks to establish a clinically relevant and broadly applicable paradigm for heart disease detection, facilitating informed decision-making by healthcare practitioners.
Organization of the paper
The remainder of this paper is organized as follows: Section 2 reviews related work on cardiovascular disease prediction, highlighting advancements in ML, DL, and explainable AI. Section 3 details the proposed XAI-HD methodology, covering data preprocessing, class-balancing procedures, model architectures, and interpretability techniques. Section 4 presents a comprehensive performance assessment, including accuracy metrics, comparative analysis, and statistical validation of the framework. Section 5 discusses key findings, the real-world applicability of XAI-HD, and its feasibility for clinical deployment, addressing computational considerations. Section 6 concludes the paper by summarizing contributions, acknowledging limitations, and outlining future research directions for AI-driven diagnostics in healthcare.
Related works
HD prediction has been extensively studied using various datasets, including the Cleveland and Framingham datasets. Researchers have explored a wide range of ML and data-driven approaches to enhance predictive accuracy and address the limitations of traditional models. The following discusses the related works based on these datasets and the advancements made in HD prediction.
Related works on CHD dataset
The following review synthesizes research efforts focused on heart disease prediction using the Cleveland Heart Disease dataset, a widely utilized benchmark from the UCI repository comprising 303 instances and 13 key features. The studies are organized by methodological focus (traditional ML, ensemble learning, DL, hybrid and neuro-fuzzy approaches, and explainable AI with comparative analyses) to elucidate trends, advancements, and challenges in cardiovascular disease prediction. Each subsection presents a detailed examination of the methodologies, performance metrics, and contributions to the field.
Traditional machine learning approaches
Sharma et al. (2017) developed an ML-based framework for heart disease prediction, leveraging the Cleveland dataset. The study implemented and compared four models using the R platform: DT, Multivariate Adaptive Regression Splines (MARS), Random Forest (RF), and a Tree-based Model with Genetic Algorithm (TMGA). The DT model demonstrated superior performance, achieving an accuracy of 93.24%, followed by MARS (91.04%), RF (89.95%), and TMGA (88.85%). The authors emphasized the DT’s balance of high accuracy and computational efficiency, making it a practical choice for clinical applications.
The research highlights the capability of conventional ML models to attain strong predictive performance with low computational demands.

Shah et al. (2020) investigated supervised learning methodologies for predicting cardiac disease, assessing Naïve Bayes, DT, K-Nearest Neighbors (KNN), and RF models on the Cleveland dataset. The KNN model attained the highest accuracy at 90.78%, followed by Naïve Bayes at 88.15% and RF at 86.84%. The study emphasized KNN’s efficacy in identifying intricate patterns within the dataset, attributed to its distance-based classification methodology. The authors recommended incorporating these models into clinical decision support systems to enable early diagnosis, emphasizing the significance of preprocessing approaches such as feature scaling to improve model efficacy.
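The distance-based mechanism credited for KNN’s performance can be sketched directly; the training pairs below are toy (age, blood pressure) vectors invented for illustration, not Cleveland records.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Minimal k-nearest-neighbours classifier: majority vote among the k
    training points closest to the query (Euclidean distance)."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy (feature_vector, label) pairs; label 1 = disease, 0 = healthy.
train = [((63, 145), 1), ((67, 160), 1), ((41, 130), 0),
         ((56, 120), 0), ((62, 150), 1), ((45, 125), 0)]
pred = knn_predict(train, query=(60, 148), k=3)
```

Because the vote is decided by raw Euclidean distance, features with large numeric ranges dominate the neighbourhood; this is why the feature scaling recommended in the study directly affects KNN accuracy.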
Perumal and Kaladevi (2020) introduced a methodology that combines Principal Component Analysis (PCA) with classification models such as KNN, Support Vector Machine (SVM), and Logistic Regression (LR). The research employed the Cleveland dataset, applying PCA to reduce dimensionality while preserving predictive power. Logistic Regression achieved the highest accuracy at 87%, followed by SVM at 85% and KNN at 69%. The authors highlighted PCA’s role in alleviating the curse of dimensionality, thereby improving computational efficiency and model interpretability. This study underscores the interplay between feature reduction and classification in the early prediction of coronary heart disease.
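The dimensionality-reduction step at the heart of this approach can be sketched as a bare eigendecomposition of the covariance matrix, here on synthetic data standing in for the 13 Cleveland features; a real pipeline would fit PCA on the training split only and typically standardize features first.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components via eigendecomposition
    of the covariance matrix (the core of PCA, without a library)."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending by variance
    components = eigvecs[:, order[:n_components]]
    return Xc @ components

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 'patients' with 13 correlated features.
X = rng.normal(size=(100, 13)) @ rng.normal(size=(13, 13))
Z = pca_reduce(X, n_components=4)
```

The retained columns of `Z` are ordered by explained variance, so the first few components concentrate most of the signal that the downstream classifier sees.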
Nahar et al. (2013) proposed a medical knowledge-driven feature selection (MFS) method to improve the accuracy of cardiac disease prediction. The research contrasted MFS with computerized feature selection (CFS) and assessed Naïve Bayes, Support Vector Machines (SVM), DT (J48), AdaBoost, and Instance-Based Learning (IBK) using the Cleveland dataset. The MFS-enhanced SVM model attained the highest accuracy at 96.04%, followed by Naïve Bayes at 92.08% and J48 at 88.12%. By emphasizing medically significant characteristics, MFS improved both predictive precision and clinical comprehensibility, demonstrating the importance of integrating domain knowledge into computational frameworks.

Musa and Muhammad (2022) suggested a Bayesian Network technique that incorporates feature selection to enhance predictive performance. The research investigated Naïve Bayes, KNN, Logistic Regression, and Bayesian Networks using the Cleveland dataset, applying three feature selection methodologies: Wrapper Feature Subset Selection, Consistency Subset Evaluation, and Correlation-Based Feature Selection (CFS). The Wrapper technique identified critical features (age, sex, chest pain type, exercise-induced angina, oldpeak, slope, number of major vessels, and thalassemia), allowing the Bayesian Network to attain an accuracy of 88.5%, surpassing Naïve Bayes and Logistic Regression, which achieved 86.8% accuracy. This study highlights the significance of probabilistic modeling and feature selection in improving prediction accuracy.
El-Bialy et al. (2015) examined decision tree methodologies, incorporating Fast DT (FDT) and pruned C4.5 algorithms across various datasets (Cleveland, Hungarian, V.A. Heart Disease, and Statlog Project). The integrated dataset attained an accuracy of 78.06% with FDT and 77.5% with C4.5 by extracting significant features and merging datasets, exceeding the performance of individual datasets. The research emphasizes the capability of dataset integration and feature selection to enhance classification accuracy in diagnosing heart disease.
Ensemble learning approaches
Paul (2024c) proposed an ensemble learning framework utilizing bagging, boosting, and stacking techniques to enhance heart disease prediction. The study employed RF, AdaBoost, and Gradient Boosting (GB), optimizing hyperparameters on the Cleveland dataset. The stacking ensemble model achieved an impressive accuracy of 98.2%, with 97.5% precision and an AUC–ROC of 0.96, outperforming individual models. The authors highlighted the robustness of ensemble methods in handling complex feature interactions, making them suitable for clinical decision-making and early diagnosis. Suryawanshi (2024) introduced a Voting Classifier ensemble combining LR, GB, and SVM models. The study explored various feature combinations and classification strategies on the Cleveland dataset, achieving an accuracy of 97.9%. The ensemble approach significantly improved precision, offering a valuable tool for early cardiovascular disease diagnosis. The authors emphasized the synergy of combining diverse classifiers to enhance predictive performance.
Ogunpola et al. (2024) suggested a hybrid methodology that combines ensemble and DL models, specifically KNN, SVM, Logistic Regression, Convolutional Neural Network (CNN), Gradient Boost, XGBoost, and RF. The research handled imbalanced datasets using oversampling, feature scaling, normalization, and dimensionality reduction. XGBoost attained superior performance on the Cleveland and Cardiovascular Heart Disease datasets, achieving an accuracy of 98.50%, precision of 99.14%, recall of 98.29%, and F1-score of 98.71%. The research highlights the significance of hyperparameter optimization and data preparation in enhancing ensemble models. Srinivas and Katarya (2022) created an optimized XGBoost model utilizing the OPTUNA hyperparameter optimization framework, tested on the Cleveland, Heart Failure Kaggle, and Heart Disease UCI Kaggle datasets. The model attained 94.7% accuracy on the Cleveland dataset, illustrating the effectiveness of tree-based ensemble learning coupled with effective parameter optimization. The research underscores OPTUNA’s contribution to improving prediction accuracy across various datasets.
Mienye et al. (2020) introduced an accuracy-based weighted aging classifier ensemble (AB-WAE) employing mean-based splitting and Classification and Regression Trees (CART). The method attained accuracies of 93% and 91% when assessed on the Cleveland and Framingham datasets, respectively. The authors demonstrated that dividing datasets into smaller subgroups prior to ensemble learning markedly enhanced prediction accuracy, providing a systematic method for medical diagnosis.

Jawalkar et al. (2023) introduced a decision tree-based random forest (DTRF) classifier with loss optimization, attaining 96% accuracy, 86% precision, 86% recall, and an 85% F1-score on the Cleveland dataset. The HDP-DTRF methodology surpassed conventional approaches, underscoring the efficacy of improved ensemble strategies in predicting cardiac disease.

Suryawanshi (2024) presented a hybrid ensemble learning method with a Voting Classifier that combines Logistic Regression, GB, and SVM. By employing feature selection and hyperparameter optimization, the model attained an accuracy of 97.9%, surpassing the individual models (GB: 97.8%, SVM: 97.3%, Logistic Regression: 95.5%). The research underscores the efficacy of ensemble learning in improving clinical decision-making.
Almulihi et al. (2022) proposed a stacking ensemble combining CNN-LSTM and CNN-GRU models with SVM as the meta-learner. Recursive Feature Elimination (RFE) was used for feature selection, and the model achieved 97.17% accuracy on the Cleveland dataset, outperforming traditional ML models. The study highlights the synergy of DL and ensemble techniques for improved prediction accuracy.
Laftah and Al-Saedi (2024) developed an explainable ensemble learning model using a soft voting ensemble of RF, GB, CatBoost, K-Nearest Neighbor, Naïve Bayes, Support Vector Machine, and AdaBoost. The model achieved 98.54% accuracy on the Cleveland dataset, with Local Interpretable Model-Agnostic Explanations (LIME) providing clinical interpretability. The study sets a benchmark for diagnostic accuracy and transparency.
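The soft-voting rule used by several of the ensembles above reduces to averaging the base models’ positive-class probabilities and applying a decision threshold. The probabilities below are hypothetical outputs for a single patient, invented for illustration.

```python
def soft_vote(model_probs, threshold=0.5):
    """Soft-voting ensemble: average the positive-class probabilities
    predicted by the base models, then apply a decision threshold."""
    avg = sum(model_probs) / len(model_probs)
    return (1 if avg >= threshold else 0), avg

# Hypothetical positive-class probabilities from three base models
# (e.g., LR, GB, and SVM with probability calibration) for one patient.
probs = [0.62, 0.55, 0.71]
label, confidence = soft_vote(probs)
```

Averaging calibrated probabilities dampens the idiosyncratic errors of any single model, which is the variance-reduction effect these studies credit for the accuracy gains of voting ensembles.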
Deep learning approaches
Shrivastava et al. (2023) proposed HCBiLSTM, a hybrid DL model integrating Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) with an Extra Tree Classifier for feature selection. The model addressed missing and imbalanced data through preprocessing, achieving 96.66% accuracy, 96.84% precision, 96.66% recall, and 96.63% F1-score on the Cleveland dataset. The CNN extracted spatial features, while BiLSTM captured temporal dependencies, enhancing predictive performance.
Kayalvizhi et al. (2023) proposed an Optimized CNN-BiLSTM model with an Attention Mechanism and Newton–Raphson-based Optimizer (NRO) for parameter tuning. Evaluated on the Cleveland dataset, the model achieved 95.3% accuracy. The attention mechanism improved classification by focusing on critical features, while NRO enhanced model convergence, demonstrating the potential of integrated DL architectures for robust heart disease prediction.
Hybrid and neuro-fuzzy approaches
Abushariah et al. (2014) proposed a hybrid artificial intelligence approach integrating Artificial Neural Networks and Adaptive Neuro-Fuzzy Inference Systems (ANFIS). Using an 80% training and 20% testing split on the Cleveland dataset, the ANN model achieved 87.04% accuracy, while ANFIS reached 75.93%. The study optimized ANN through hidden neuron and epoch tuning and ANFIS via the genfis2 parameter, highlighting the potential of neuro-fuzzy models in medical decision support systems.
Nursyahrina et al. (2024) proposed a Rough Neural Network (RNN) model integrating Rough Set Theory (RST) for feature selection with ANN. RST identified nine key features, reducing data complexity while maintaining predictive accuracy. The RNN model achieved 88.52% accuracy, 88.14% F1-score, and 88.85% AUC, outperforming traditional ANN (86.88% accuracy, 86.67% F1-score, 87.34% AUC). The study demonstrates the efficacy of hybrid intelligent systems for early heart disease detection.
Kahramanli and Allahverdi (2008) proposed a hybrid neural network approach combining ANN and Fuzzy Neural Networks (FNN) for heart disease and diabetes prediction. Evaluated on the Cleveland dataset using k-fold cross-validation, the model achieved 86.8% accuracy. The approach effectively handled both crisp and fuzzy medical data, demonstrating the potential of hybrid systems in medical diagnostics.
Kanagarathinam et al. (2022) introduced the “Sathvi” dataset, combining the Hungarian, Switzerland, Cleveland, and Long Beach datasets to address missing data issues. The study evaluated Naïve Bayes, XGBoost, KNN, MLP, SVM, and CatBoost, with CatBoost achieving a mean accuracy of 94.34% (ranging from 88.67% to 98.11%) via 10-fold cross-validation. The work highlights the importance of dataset enhancement for robust cardiovascular disease prediction.
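The k-fold cross-validation protocol used by these studies is easy to sketch as index splitting; this minimal version omits the shuffling and stratification that a careful evaluation on an imbalanced medical dataset should add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    each sample appears in exactly one test fold."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# The Cleveland dataset has 303 instances; 10 folds as in the study above.
folds = list(k_fold_indices(n_samples=303, k=10))
```

Reporting the mean and range of fold accuracies, as Kanagarathinam et al. do, exposes the variance that a single train/test split would hide.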
Explainable AI and advanced comparative studies
Paul (2024b) proposed an Explainable AI (XAI)-enhanced ML framework integrating SHAP, Local Interpretable Model-agnostic Explanations (LIME), and Feature Importance Analysis (FIA). The study evaluated DT, Support Vector Machines (SVM), RF, and Neural Networks on the Cleveland dataset. RF achieved 92.7% accuracy, followed by Neural Networks (91.5%) and SVM (89.8%). SHAP and LIME highlighted key risk factors (cholesterol, age, resting blood pressure), enhancing model transparency and clinical trust.
Paul (2024a) proposed an advanced ML framework evaluating Support Vector Machines (SVM), RF, XGBoost, and ANN. The study applied preprocessing techniques (missing value imputation, feature scaling, encoding) and used FIA to identify critical features (age, cholesterol, exercise-induced angina, resting blood pressure). XGBoost achieved 89.1% accuracy, followed by RF (87.3%) and ANN (85.5%), reinforcing the superiority of ensemble learning in handling complex feature interactions.
Shrestha (2024a) conducted a comparative study of Logistic Regression, RF, GB, XGBoost, and Long Short-Term Memory (LSTM) networks on the Cleveland dataset. Extensive preprocessing addressed missing values and categorical variables. XGBoost achieved 90% accuracy and 0.94 AUC–ROC, with SHAP values highlighting key risk factors (chest pain type, major vessels, ST depression). The study underscores the potential of ensemble learning for clinical decision-making.
Shrestha (2024b) compared LSTM, RF, GB, XGBoost, and Logistic Regression on the Cleveland dataset. After preprocessing (missing value handling, target binarization), XGBoost achieved 90% accuracy and 0.94 AUC–ROC. SHAP-based interpretability emphasized features like major vessels, chest pain type, and thalassemia, enhancing model transparency.
Paudel et al. (2023) proposed an XAI-based approach for early heart attack detection, integrating LIME with AdaBoost, RF, GB, and LGB. Evaluated on the Heart Attack Classification Dataset, LGB achieved 99.33% accuracy, identifying “kcm” and “troponin” as key predictors. The study demonstrates the power of XAI in improving diagnostic interpretability.
Table 1 summarizes the related works for the CHD dataset.
Table 1. Summary of related works on CHD prediction
| Study | Proposed idea | Models used | Best outcome | Outcome summary |
|---|---|---|---|---|
Sharma et al. (2017) | ML-based heart disease prediction using various models | DT, MARS, RF, TMGA | Accuracy: 93.24% (DT) | DT performed best in terms of accuracy and efficiency, highlighting its suitability for clinical applications |
Nahar et al. (2013) | Medical knowledge-driven feature selection (MFS) for improved classification | Naïve Bayes, SVM, DT (J48), AdaBoost, IBK | Accuracy: 96.04% (SVM) | MFS enhanced SVM performance by selecting medically relevant features, improving accuracy and interpretability |
Abushariah et al. (2014) | Hybrid AI approach integrating neural and neuro-fuzzy models | ANN (MLP), ANFIS | Accuracy: 87.04% (ANN) | ANN outperformed ANFIS; neuro-fuzzy models show potential for medical decision support with optimized parameters |
Kahramanli and Allahverdi (2008) | Hybrid neural network for heart disease and diabetes prediction | ANN, Fuzzy Neural Network (FNN) | Accuracy: 86.8% (Hybrid ANN-FNN) | Hybrid system effectively handled crisp and fuzzy data, improving diagnostic accuracy |
El-Bialy et al. (2015) | Decision tree-based prediction with dataset integration | Fast DT (FDT), Pruned C4.5 | Accuracy: 78.06% (FDT) | Dataset integration and feature selection improved accuracy over individual datasets |
Shah et al. (2020) | Supervised learning for heart disease prediction | Naïve Bayes, DT, KNN, RF | Accuracy: 90.78% (KNN) | KNN excelled in capturing complex patterns, suitable for early diagnosis with proper preprocessing |
Perumal and Kaladevi (2020) | PCA with classification for early coronary heart disease prediction | KNN, SVM, Logistic Regression | Accuracy: 87% (Logistic Regression) | PCA improved efficiency; Logistic Regression balanced accuracy and interpretability |
Mienye et al. (2020) | Accuracy-based weighted aging classifier ensemble (AB-WAE) | CART, AB-WAE Ensemble | Accuracy: 93% (AB-WAE) | Data partitioning before ensemble learning enhanced prediction accuracy, offering a structured approach |
Musa and Muhammad (2022) | Bayesian Networks with feature selection techniques | Naïve Bayes, KNN, Logistic Regression, Bayesian Network | Accuracy: 88.5% (Bayesian Network) | Wrapper Feature Subset Selection improved prediction; probabilistic modeling enhances performance |
Srinivas and Katarya (2022) | Optimized XGBoost with OPTUNA hyperparameter tuning | XGBoost | Accuracy: 94.7% (XGBoost) | OPTUNA-enhanced XGBoost achieved high accuracy, highlighting the role of parameter optimization |
Almulihi et al. (2022) | Stacking ensemble with DL models | CNN-LSTM, CNN-GRU, SVM (meta-learner) | Accuracy: 97.17% (Stacking Ensemble) | DL and ensemble integration with RFE improved accuracy over traditional models |
Kanagarathinam et al. (2022) | Dataset integration (“Sathvi”) with ML models | Naïve Bayes, XGBoost, KNN, MLP, SVM, CatBoost | Mean Accuracy: 94.34% (CatBoost) | CatBoost excelled on integrated dataset, addressing missing data challenges effectively |
Shrivastava et al. (2023) | Hybrid CNN-BiLSTM with feature selection | CNN, BiLSTM, Extra Tree Classifier | Accuracy: 96.66% (HCBiLSTM) | CNN-BiLSTM captured spatial and temporal features, achieving high accuracy with preprocessing |
Kayalvizhi et al. (2023) | Optimized CNN-BiLSTM with attention mechanism | CNN, BiLSTM, Attention Mechanism | Accuracy: 95.3% (CNN-BiLSTM) | Attention mechanism and NRO tuning enhanced classification, improving robustness |
Paudel et al. (2023) | XAI-based early heart attack detection with LIME | AdaBoost, RF, GB, LGB | Accuracy: 99.33% (LGB) | LGB with LIME identified key predictors, enhancing accuracy and interpretability |
Paul (2024c) | Ensemble learning for optimized heart disease prediction | RF, AdaBoost, GB, Stacking Ensemble | Accuracy: 98.2%, Precision: 97.5%, AUC: 0.96 | Stacking ensemble outperformed individual models; ensemble learning boosts robustness and reliability |
Ogunpola et al. (2024) | Hybrid ensemble and DL with data preprocessing | KNN, SVM, Logistic Regression, CNN, Gradient Boost, XGBoost, RF | Accuracy: 98.50% (XGBoost) | XGBoost excelled with preprocessing, highlighting hyperparameter tuning and imbalance handling |
Nursyahrina et al. (2024) | Rough Neural Network with Rough Set Theory | ANN, RNN | Accuracy: 88.52% (RNN) | RST-based feature selection improved RNN performance, enhancing efficiency and interpretability |
Laftah and Al-Saedi (2024) | Explainable ensemble with soft voting and LIME | RF, GB, CatBoost, KNN, Naïve Bayes, SVM, AdaBoost | Accuracy: 98.54% (Soft Voting Ensemble) | Soft voting and LIME ensured high accuracy and clinical interpretability, setting a benchmark |
Paul (2024b) | XAI-enhanced ML with SHAP, LIME, and FIA | DT, SVM, RF, Neural Networks | Accuracy: 92.7% (RF) | RF with XAI tools provided transparent predictions, highlighting key risk factors |
Paul (2024a) | Advanced ML framework with preprocessing | SVM, RF, XGBoost, ANN | Accuracy: 89.1% (XGBoost) | XGBoost outperformed others, with FIA identifying critical features for complex interactions |
Shrestha (2024a) | Comparative study with ensemble and DL | Logistic Regression, RF, GB, XGBoost, LSTM | Accuracy: 90% (XGBoost) | XGBoost with SHAP achieved high accuracy and interpretability, suitable for clinical use |
Shrestha (2024b) | Advanced comparative study with preprocessing | LSTM, RF, GB, XGBoost, Logistic Regression | Accuracy: 90% (XGBoost) | XGBoost with SHAP emphasized key features, enhancing transparency and performance |
Jawalkar et al. (2023) | Decision tree-based random forest with loss optimization | DTRF Classifier | Accuracy: 96% (DTRF) | DTRF outperformed traditional methods, balancing accuracy and performance metrics |
Suryawanshi (2024) | Hybrid ensemble with Voting Classifier | Logistic Regression, GB, SVM | Accuracy: 97.9% (Voting Classifier) | Voting Classifier excelled, reinforcing ensemble learning’s clinical applicability |
Related works on FHD dataset
The Framingham Heart Study (FHS) dataset, a longitudinal cohort study comprising thousands of patient records with key cardiovascular risk factors, has been extensively utilized for heart disease prediction. The following studies explore various ML, ensemble learning, DL, genetic optimization, and explainable AI approaches to predict cardiovascular disease (CVD) and related outcomes using the FHS dataset. They are organized by methodological focus to elucidate advancements, challenges, and trends in predictive modeling for heart disease.
Traditional machine learning approaches
Mahmoud et al. (2021) investigated the application of traditional ML algorithms for heart disease prediction using the FHS dataset. The study evaluated KNN, SVM, DT, LR, and RF. Preprocessing involved median imputation for missing values and outlier handling using boxplots. RF achieved the highest accuracy of 85.05%, followed by LR (84.89%), DT (84.82%), SVM (84.5%), and KNN (83.95%).
The authors attributed RF's robustness to its ensemble design, indicating its suitability for clinical prediction tasks, and the study underscores the importance of preprocessing for improving model efficacy in conventional ML frameworks.

Demir and Selvitopi (2023) performed a comparative analysis of ML and DL methodologies, emphasizing KNN and ANN for heart disease prediction. The FHS dataset was processed with hot-deck (Hotdecking) imputation to address missing values. The ANN surpassed KNN, with an accuracy of 85.85% versus KNN's 85.50%. The study highlighted the strong generalization ability of ANNs and the essential role of imputation techniques in enhancing predictive accuracy, underscoring the advantage of neural models over conventional distance-based algorithms on challenging medical datasets.

Kahouadji (2024) conducted a comparative examination of eight ML classification methods for heart disease prediction, focusing on data distribution shifts and model generalizability.
The evaluated models included Extreme Gradient Boosting (XGB), SVM, RF, Logistic Regression (Logit), Linear Discriminant (LD), Quadratic Discriminant (QD), Double Discriminant Scoring Type 1 (DDS1), and Type 2 (DDS2). The study identified DDS1 as the most generalizable model, achieving a true positive rate above 75% on the FHS dataset. Key predictive features included age, diastolic blood pressure, cigarettes smoked per day, total cholesterol, BMI, and heart rate. The authors proposed a methodology for optimal variable selection, emphasizing fairness and robustness in healthcare AI models.
Ensemble learning approaches
Chen et al. (2020) proposed an entropy-based rule model for cardiovascular disease prediction, integrating five classification techniques: Rough Set (RS), DT, RF, Multilayer Perceptron (MLP), and SVM. The model was evaluated on the FHS dataset and a hospital dataset from Taiwan. In the FHS dataset, RS achieved the highest accuracy of 85.11%, while SVM excelled in the hospital dataset with 99.67% accuracy, 99.93% sensitivity, and 99.71% specificity. The study highlighted the efficacy of entropy-based knowledge rules and classifier selection in tailoring models to specific datasets, demonstrating RS’s strength in handling the FHS dataset’s complexity.
Suhatril et al. (2024) evaluated eight ML algorithms for CVD prediction: DT, Naïve Bayes (NB), KNN, SVM, RF, LR, Neural Network (NN), and GB. The FHS dataset, comprising 4240 records (reduced to 3658 after preprocessing), was split into 75% training and 25% testing sets.
The RF model attained the maximum accuracy of 85%, whilst Naïve Bayes achieved the optimal AUC score of 0.72. The research highlighted the superiority of ensemble approaches such as RF and GB in identifying intricate patterns, recommending their application in clinical risk evaluation.
Gupta and Seth (2022) conducted a comparative analysis of ML and DL models, applying DT, RF, KNN, SVM, and a Multi-Layer Perceptron (MLP) on the FHS and UCI Heart Disease datasets. RF achieved the highest accuracy of 97.13% on the FHS dataset, with MLP performing well across both datasets.
Feature importance analysis indicated age, systolic blood pressure, total cholesterol, body mass index, and glucose levels as significant predictors. The research underscored hyperparameter optimization and feature selection to improve model efficacy, accentuating the robustness of RF in ensemble learning.
Deep learning and neural network approaches
Narain et al. (2016) introduced a Quantum Neural Network (QNN) for cardiovascular disease (CVD) prediction, comparing it with the conventional Framingham Risk Score (FRS). The QNN was trained on 689 patient records and evaluated on the FHS dataset, which included 5,209 CVD patients. It attained an accuracy of 98.57%, markedly surpassing the FRS, which achieved only 19.22%. The study criticized the FRS's outdated risk parameters and demonstrated the QNN's capacity to adapt dynamically to patient data, providing a robust alternative for real-time CVD risk assessment.

Krishnan et al. (2023) proposed an Enhanced Recurrent Neural Network (RNN) employing Gated Recurrent Units (GRU) to mitigate data imbalance and vanishing-gradient problems. The FHS dataset was balanced using the Synthetic Minority Over-sampling Technique (SMOTE), and the model, implemented with TensorFlow and Keras, attained an accuracy of 98.78%. The study emphasized the efficacy of GRU-based RNNs and data balancing in enhancing predictive performance, establishing this method as a state-of-the-art option for heart disease risk prediction.
Genetic and optimization-based approaches
Gupta and Sedamkar (2020) suggested a Genetic Algorithm (GA) approach for feature selection and hyperparameter optimization of SVM and Neural Network (NN) models.
GA optimized SVM’s Radial Basis Function (RBF) kernel parameters (C and G) and NN’s Multi-Layer Perceptron (MLP) architecture (hidden layers, nodes, learning rate, momentum). Using a reduced feature set from the FHS dataset, the approach improved sensitivity and precision compared to Grid Search tuning. The study demonstrated GA’s efficiency in multi-objective optimization, enhancing model performance for heart disease prediction.
Explainable AI and specialized models
Zhang et al. (2022) developed a ML framework to identify DNA methylation-regulated genes as biomarkers for coronary heart disease (CHD) using methylome and transcriptome data from the Framingham Heart Study. Focusing on five genes (ATG7, BACH2, CDKN1B, DHCR24, MPO), the study employed LightGBM, XGBoost, and RF models. LightGBM achieved the highest performance, with an AUC of 0.834, sensitivity of 0.672, and specificity of 0.864 in the validation set. The findings highlight DNA methylation's role in CHD and the efficacy of ML for biomarker discovery, advancing explainable AI in medical diagnostics by elucidating epigenetic mechanisms. Ejiyi et al. (2024a, 2024b) have significantly improved cardiovascular disease prediction using advanced ML techniques, strengthening early detection strategies.
Table 2 summarizes the related works for the FHD dataset.
Table 2. Summary of related works on FHD prediction
Study | Proposed idea | Models used | Best outcome | Outcome summary |
|---|---|---|---|---|
Narain et al. (2016) | Quantum Neural Network for CVD prediction | Quantum Neural Network (QNN), Framingham Risk Score (FRS) | Accuracy: 98.57% (QNN) | QNN significantly outperformed FRS, offering dynamic adaptation for real-time CVD risk assessment |
Chen et al. (2020) | Entropy-based rule model for CVD prediction | Rough Set (RS), DT, RF, Multilayer Perceptron (MLP), SVM | Accuracy: 85.11% (RS) | RS excelled on FHS dataset, leveraging entropy-based rules for robust prediction |
Mahmoud et al. (2021) | Traditional ML for heart disease prediction | KNN, SVM, DT, LR, RF | Accuracy: 85.05% (RF) | RF’s ensemble approach achieved highest accuracy, enhanced by preprocessing techniques |
Gupta and Sedamkar (2020) | Genetic Algorithm for feature selection and hyperparameter optimization | SVM, Neural Network (MLP) | Improved sensitivity and precision (GA-optimized SVM/MLP) | GA enhanced model performance, outperforming Grid Search with reduced feature set |
Gupta and Seth (2022) | Comparative analysis of ML and DL models | DT, RF, KNN, SVM, Multi-Layer Perceptron (MLP) | Accuracy: 97.13% (RF) | RF achieved high accuracy with feature selection, identifying key risk factors |
Zhang et al. (2022) | ML for identifying DNA methylation biomarkers | LightGBM, XGBoost, RF | AUC: 0.834 (LightGBM) | LightGBM excelled in biomarker prediction, highlighting epigenetic regulation in CHD |
Demir and Selvitopi (2023) | Comparative study of ML and DL | KNN, ANN | Accuracy: 85.85% (ANN) | ANN outperformed KNN, with Hotdecking imputation improving predictive accuracy |
Krishnan et al. (2023) | Enhanced RNN with GRU for heart disease prediction | Recurrent Neural Network (RNN), Gated Recurrent Units (GRU) | Accuracy: 98.78% (RNN-GRU) | GRU-based RNN with SMOTE balancing achieved state-of-the-art accuracy |
Orfanoudaki et al. (2020) | Non-Linear Framingham Stroke Risk Score using OCT | Optimal Classification Trees (OCT), Revised Framingham Stroke Risk Score (R-FSRS) | AUC: 87.43% (N-SRS) | N-SRS outperformed R-FSRS, capturing non-linear risk patterns for stroke prediction |
Suhatril et al. (2024) | Evaluation of ML models for CVD prediction | DT, Naïve Bayes (NB), KNN, SVM, RF, LR, Neural Network (NN), GB | Accuracy: 85% (RF) | RF led in accuracy, with Naïve Bayes excelling in AUC, highlighting ensemble strengths |
Kahouadji (2024) | Comparative study addressing data distribution shifts | Extreme Gradient Boosting (XGB), SVM, RF, Logistic Regression (Logit), Linear Discriminant (LD), Quadratic Discriminant (QD), Double Discriminant Scoring Type 1 (DDS1), Type 2 (DDS2) | True Positive Rate: >75% (DDS1) | DDS1 achieved high generalizability, with key features identified for robust prediction |
Related works on SHD dataset
The Switzerland Heart Disease (SHD) dataset, part of the UCI Machine Learning Repository, contains clinical and demographic features for heart disease prediction, often characterized by its imbalanced nature and smaller sample size compared to other datasets like Cleveland. The following studies explore traditional ML, ensemble learning, explainable AI, federated learning, and feature selection approaches to predict cardiovascular disease using the SHD dataset, either standalone or as part of integrated datasets. They are organized by methodological focus to elucidate advancements, challenges, and trends in predictive modeling for heart disease.
Traditional machine learning approaches
Luukka and Lampinen (2010) developed a classification method integrating Principal Component Analysis (PCA) with Differential Evolution (DE) for heart disease diagnosis. The study utilized the SHD dataset alongside the Cleveland and Hungarian datasets, applying PCA for dimensionality reduction and DE for optimizing class vectors and the Minkowski distance parameter. The approach achieved a classification accuracy of 94.5% on the SHD dataset and a mean accuracy of 82.0% across multiple datasets. The authors emphasized the method's capacity to improve accuracy while reducing computational time, making it suitable for large-scale clinical applications.

Chen (2024) introduced a ML methodology for heart disease diagnosis, emphasizing KNN, SVM, and Logistic Regression. These models were applied to the Cleveland and Hungarian datasets, with an SVM using a Radial Basis Function (RBF) kernel attaining 85.56% accuracy on the Cleveland dataset and 87.78% on the Hungarian dataset. The SHD dataset was not explicitly analyzed, though its inclusion in related studies suggests comparable applicability. The research highlighted SVM's capability to enhance diagnostic precision and mitigate the risk of misdiagnosis in clinical environments.

Lewandowicz and Kisiała (2023) performed a comparative study using SVM, Naïve Bayes, and KNN for heart disease classification on the SHD dataset, which includes variables such as cholesterol, blood pressure, and heart rate. The SVM classifier attained the highest accuracy and stability at 82.47%, surpassing Naïve Bayes and KNN. The study emphasized SVM's robustness for early identification of heart disease, demonstrating its reliability in handling the complexity of the SHD dataset.
Ensemble learning approaches
Rahman et al. (2024) created a diagnostic support system utilizing various heart disease datasets, including SHD, Cleveland, and Hungary. Seven ML methods were assessed: SVM, DT, KNN, Naïve Bayes (NB), LR, RF, and Multilayer Perceptron (MLP). The MLP and LR models attained the maximum prediction accuracy of 89.4%. The research emphasized feature selection and hyperparameter optimization as key contributors to improved performance, demonstrating the effectiveness of ensemble and neural approaches for clinical diagnosis.
Explainable AI approaches
Nazary et al. (2024) introduced a heart disease prediction model that combines Large Language Models (LLMs) with conventional ML in zero-shot and few-shot learning frameworks, utilizing the SHD dataset. Explainable AI (XAI) methodologies, including feature importance assessment, improved domain-specific interpretability. In zero-shot learning, RF attained an F1-score of 0.741 and an F3-score of 0.921, whereas XGBoost (XGB) achieved an F1-score of 0.914 and an F3-score of 0.936. In few-shot learning, RF achieved an F1-score of 0.848 and an F3-score of 0.879, while XGB achieved an F1-score of 0.902 and an F3-score of 0.936. The study highlighted the LLMs' risk-sensitive accuracy, outperforming traditional models such as Logistic Regression and KNN.
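The F3-score reported above is an F-beta score with beta = 3, which weights recall nine times as heavily as precision; this emphasis suits screening settings where missing a diseased patient is costlier than a false alarm. A minimal sketch with illustrative labels (not the study's data), assuming scikit-learn's `fbeta_score`:

```python
from sklearn.metrics import f1_score, fbeta_score

# Illustrative labels and predictions (not data from the cited study)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

f1 = f1_score(y_true, y_pred)              # precision and recall weighted equally
f3 = fbeta_score(y_true, y_pred, beta=3)   # recall weighted 9x more than precision

# Here recall (4/5) exceeds precision (2/3), so F3 rewards the model more than F1
print(f"F1 = {f1:.3f}, F3 = {f3:.3f}")
```

Because the F3-score tolerates extra false positives in exchange for fewer missed cases, a model can rank higher on F3 than on F1 even with identical predictions.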
Mesquita and Marques (2024) introduced an explainable ML approach using the IEEEDataPort CHD dataset, which includes SHD. The study employed SHAP to enhance interpretability for healthcare professionals. Classifiers tested included RF, CatBoost, and Extra Trees (ET), with RF, optimized via Bayesian optimization, achieving 95.65% accuracy, 1.0 sensitivity, and 0.960 F1-score. The study demonstrated a balance between high performance and transparency, crucial for clinical decision support systems.
Ningthoujam et al. (2025) developed an explainable ML approach for coronary heart disease prediction using the SHD, Cleveland, and Hungarian datasets, focusing on ECG-derived features (ST slope, chest pain type, old peak). XGBoost and RF models were employed, with SHAP and Local Interpretable Model-Agnostic Explanations (LIME) enhancing interpretability. XGBoost achieved 83.9% accuracy and 0.885 AUROC, outperforming RF (80.5% accuracy). The study emphasized the importance of interpretability alongside performance for clinical applicability.
Federated learning approaches
Tsoumplekas et al. (2024) proposed a federated learning approach with Balanced-MixUp regularization to address class imbalance in cardiovascular disease prediction. The study tested the method across four datasets, including SHD, Cleveland, Long Beach, and Framingham. Balanced-MixUp achieved 83.33% accuracy and 45.45% F-score on the SHD dataset, outperforming other techniques. Federated learning ensured data privacy and resource efficiency, making the approach suitable for decentralized healthcare systems.
Rodriguez and Nafea (2024) compared centralized and federated learning (FL) approaches for heart disease prediction using the UCI dataset, including SHD, Hungarian, and USA data. Interpretability was enhanced through Shapley-values. The centralized SVM classifier achieved 83.3% accuracy, while federated SVM reached 73.8% across aggregation strategies (FedAvg, FedAdam, FedYogi). The study highlighted federated learning’s role in maintaining patient privacy while achieving competitive performance.
Feature selection and optimization approaches
Spencer et al. (2020) proposed a feature selection and classification method using a combined dataset from Cleveland, Long Beach VA, Hungarian, and SHD. Feature selection techniques included Principal Component Analysis (PCA), Chi-Squared, ReliefF, and Symmetrical Uncertainty (SU). Models tested were BayesNet, RF, KNN, and Stochastic Gradient Descent (SGD). The Chi-Squared feature selection with BayesNet classifier achieved 85.00% accuracy, 84.73% precision, and 85.56% recall. The study underscored the importance of feature selection in enhancing predictive power and identifying critical diagnostic features.
Table 3 summarizes the related works for the SHD dataset.
Table 3. Summary of related works on SHD prediction
Study | Proposed idea | Models used | Best outcome | Outcome summary |
|---|---|---|---|---|
Luukka and Lampinen (2010) | PCA with Differential Evolution for classification | PCA, DE-optimized classifier | Accuracy: 94.5% (SHD) | PCA-DE improved accuracy while reducing computational time, suiting large-scale clinical use |
Spencer et al. (2020) | Feature selection on combined heart disease datasets | BayesNet, RF, KNN, Stochastic Gradient Descent (SGD) | Accuracy: 85.00% (Chi-Squared + BayesNet) | Chi-Squared selection with BayesNet enhanced predictive power and identified key diagnostic features |
Lewandowicz and Kisiała (2023) | Comparative classification on SHD dataset | SVM, Naïve Bayes, KNN | Accuracy: 82.47% (SVM) | SVM proved most accurate and stable, supporting early heart disease identification |
Chen (2024) | ML diagnosis with RBF-kernel SVM | KNN, SVM, Logistic Regression | Accuracy: 87.78% (SVM, Hungarian) | RBF-kernel SVM improved diagnostic precision, reducing misdiagnosis risk |
Rahman et al. (2024) | Diagnostic support system across multiple datasets | SVM, DT, KNN, Naïve Bayes (NB), LR, RF, MLP | Accuracy: 89.4% (MLP, LR) | Feature selection and hyperparameter tuning drove performance gains for clinical diagnosis |
Nazary et al. (2024) | LLMs with ML in zero-/few-shot settings plus XAI | LLMs, RF, XGBoost, Logistic Regression, KNN | F1-score: 0.914 (XGB, zero-shot) | LLM-assisted prediction achieved risk-sensitive accuracy, outperforming traditional baselines |
Mesquita and Marques (2024) | Explainable ML with SHAP on IEEEDataPort CHD data | RF, CatBoost, Extra Trees (ET) | Accuracy: 95.65% (RF) | Bayesian-optimized RF balanced high performance with SHAP-based transparency |
Tsoumplekas et al. (2024) | Federated learning with Balanced-MixUp regularization | Federated models with Balanced-MixUp | Accuracy: 83.33% (SHD) | Balanced-MixUp handled class imbalance while preserving privacy in decentralized settings |
Rodriguez and Nafea (2024) | Centralized vs federated learning with Shapley values | SVM (centralized and federated) | Accuracy: 83.3% (centralized SVM) | Federated SVM traded modest accuracy for patient privacy across aggregation strategies |
Ningthoujam et al. (2025) | Explainable ML with ECG-derived features | XGBoost, RF (with SHAP, LIME) | Accuracy: 83.9%, AUROC: 0.885 (XGBoost) | XGBoost with SHAP/LIME combined strong performance with clinical interpretability |
Proof of novelty and background study
Existing studies on heart disease prediction have primarily focused on classical ML models, DL architectures, or explainable AI (XAI) techniques in isolation. While research efforts such as those by Sharma et al. (2017), Paul (2024c), and Musa and Muhammad (2022) have investigated various ML-based classifiers using standard datasets like the Cleveland Heart Disease dataset, they largely overlook hybrid approaches that integrate ML, DL, and XAI in a unified framework. Similarly, advanced ensemble methods proposed by Suryawanshi (2024) and Jawalkar et al. (2023) demonstrate improvements in predictive accuracy but lack the interpretability needed for clinical decision-making.
Furthermore, DL models such as those introduced by Shrivastava et al. (2023) and Kayalvizhi et al. (2023) have leveraged CNN and BiLSTM architectures, achieving competitive results but often treating prediction as a black-box process without detailed feature explainability. This limitation is addressed in studies like those by Paul (2024b) and Laftah and Al-Saedi (2024), who incorporate explainability techniques such as SHAP and LIME but do not explore their interaction with DL models or hybrid frameworks.
The proposed XAI-HD framework distinguishes itself by:
Integration of ML, DL, and XAI: Unlike prior works that examine these approaches in isolation, XAI-HD seamlessly combines classical ML classifiers, DL architectures (CNN, MHA), and explainability tools (SHAP, LIME) to improve both predictive accuracy and model transparency.
Multi-Dataset Utilization: While most existing studies rely on a single dataset, XAI-HD incorporates three major datasets: Cleveland Heart Disease (CHD), Framingham Heart Disease (FHD), and Switzerland Heart Disease (SHD), ensuring a more comprehensive evaluation across diverse patient demographics and feature distributions.
Enhanced Data Balancing Techniques: Many studies address class imbalance using traditional oversampling methods (e.g., SMOTE), whereas XAI-HD incorporates undersampling techniques like One-Sided Selection (OSS) and Neighbourhood Cleaning Rule (NCR), along with hybrid approaches such as SMOTETomek and SMOTEENN, leading to improved generalization.
Rigorous Statistical Validation and Complexity Analysis: Unlike previous works that primarily report accuracy and AUC–ROC, our framework extends evaluation using Cohen’s Kappa, Matthews Correlation Coefficient (MCC), confidence intervals, Wilcoxon signed-rank test (p-value), and complexity metrics like training and inference time analysis.
Proposed methodology
The proposed framework, XAI-HD (Explainable Artificial Intelligence for Heart Disease Detection), is designed to deliver both high predictive performance and interpretability in heart disease diagnosis by integrating classical ML, DL, and XAI techniques, as illustrated in Fig. 1. The methodology begins with the acquisition of three widely used heart disease datasets: Cleveland Heart Disease (CHD), Framingham Heart Disease (FHD), and Switzerland Heart Disease (SHD), each containing diverse clinical, demographic, and lifestyle attributes. An initial Exploratory Data Analysis (EDA) is conducted to gain insights into feature correlations and distributions. This includes generating heatmaps separately for demographic, clinical, and lifestyle attributes, aiding in the understanding of inter-feature relationships.
[See PDF for image]
Fig. 1
Proposed XAI-HD Approach for Heart Disease Detection: This diagram outlines the XAI-HD architecture, integrating ML, DL and explainable AI techniques. The framework enhances heart disease detection through advanced data preprocessing, class balancing, and interpretability tools like SHAP and LIME
Following EDA, the data undergoes rigorous preprocessing. Missing values in categorical columns are imputed using the mode, while numerical columns are filled using the mean. Z-score normalization is applied to standardize the numerical features, and categorical variables are encoded using label encoding to ensure compatibility with ML models. To address the issue of class imbalance—a common challenge in medical datasets—various data balancing techniques are employed. These include undersampling methods such as One-Sided Selection (OSS) and Neighbourhood Cleaning Rule (NCR), oversampling strategies like SMOTEN and ADASYN, and hybrid approaches that combine both, namely SMOTETomek and SMOTEENN. These techniques help in producing a more balanced class distribution, thereby reducing model bias toward the majority class and enhancing generalization to both normal and diseased cases.
The heart disease classification task is then approached using a combination of ML and DL models, including RUSBoostClassifier (RBC), BalancedBaggingClassifier (BBC), Logistic Regression (LRC), MLPClassifier (MLP), LGBMClassifier (LGB), Convolutional Neural Networks (CNN), and Multi-Head Attention (MHA) mechanisms. These models are chosen for their demonstrated efficacy in capturing complex, nonlinear interactions within structured medical data. XAI-HD incorporates SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) to ensure model interpretability, a vital requirement in healthcare, by providing clear insights into model decisions and highlighting the features that most influence predictions. Performance evaluation uses a comprehensive set of metrics, encompassing accuracy, precision, recall, F1-score, Cohen's Kappa, Matthews Correlation Coefficient (MCC), sensitivity, specificity, confusion matrices, and ROC and precision-recall curves. Furthermore, statistical validation employs confidence intervals and the Wilcoxon signed-rank test (p-value) to assess the significance of the findings. Complexity analysis further evaluates model efficiency by measuring training and inference times. This combination of predictive power, statistical robustness, computational efficiency, and interpretability makes XAI-HD an effective and dependable framework for heart disease detection in real-world scenarios.
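The evaluation metrics and the Wilcoxon-based validation step can be sketched with scikit-learn and SciPy; the labels and per-fold accuracies below are illustrative placeholders, not results from this study:

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, matthews_corrcoef,
                             confusion_matrix)

# Illustrative labels and predictions (not results from the paper)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)  # sensitivity equals recall for the positive class

print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall/sensitivity:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("specificity:", specificity)

# Paired per-fold accuracies of two models (hypothetical values) compared with
# the Wilcoxon signed-rank test, as in the framework's statistical validation.
model_a = [0.91, 0.93, 0.90, 0.94, 0.92, 0.95, 0.91, 0.93, 0.92, 0.94]
model_b = [0.88, 0.90, 0.89, 0.91, 0.90, 0.92, 0.88, 0.90, 0.89, 0.91]
stat, p = wilcoxon(model_a, model_b)
print(f"Wilcoxon p-value: {p:.4f}")
```

The Wilcoxon test is paired and non-parametric, so it compares per-fold scores of two models without assuming normally distributed differences, which suits the small sample sizes typical of cross-validation.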
The novelty of our proposed XAI-HD framework lies in its fine-grained customization of class balancing techniques and the clinically grounded use of XAI tools across multiple datasets. Unlike prior works that apply standard oversampling techniques such as SMOTE without further tuning, we systematically optimize SMOTEENN parameters, including the number of nearest neighbors (k), Tomek link removal strategies, and the sampling ratio between minority and majority classes. These customizations are specifically tailored to the clinical complexity of heart disease data, where decision boundaries are often ambiguous. Our approach results in improved generalization performance across the CHD, FHD, and SHD datasets, an aspect not addressed in most prior works, which typically rely on default configurations and a single dataset. Furthermore, while SHAP and LIME have been employed in previous studies, our approach is novel in its application of feature attribution analysis across all three datasets to identify both consistent and dataset-specific predictors of heart disease. In addition to visualizing local and global model explanations, we align the most influential features (e.g., cholesterol, blood pressure, smoking status) with established clinical risk factors, thus reinforcing interpretability from a medical standpoint. This clinical validation of XAI insights is not commonly found in prior models. In summary, XAI-HD presents a rigorously tuned data balancing strategy, robust generalization across diverse datasets, and clinically interpretable XAI integration, contributions that extend beyond existing multimodal pipelines for heart disease detection.
Data collection
To develop and validate the proposed heart disease prediction system, three publicly accessible and frequently cited datasets were utilized: the Cleveland Heart Disease (CHD) (Janosi et al. 1989), Framingham Heart Disease (FHD) (Kaggle 2022), and Switzerland Heart Disease (SHD) (Janosi et al. 1989) datasets. These datasets cover varied populations and include essential clinical, demographic, and lifestyle characteristics, making them well suited for assessing the efficacy and generalizability of ML models in medical diagnosis.
CHD dataset
The Cleveland Heart Disease (CHD) dataset, sourced from the UCI Machine Learning Repository, comprises 303 patient records encompassing 14 attributes. The features include demographic factors, such as age and sex, alongside clinical markers such as resting blood pressure, cholesterol levels, fasting blood sugar, and exercise-induced angina. The target variable indicates the presence or absence of heart disease.
FHD dataset
The Framingham Heart Disease (FHD) dataset, sourced from the longitudinal Framingham Heart Study, contains 4240 entries with 16 attributes. It covers a broader spectrum of cardiovascular risk factors, including smoking status, body mass index (BMI), systolic and diastolic blood pressure, cholesterol levels, glucose levels, and history of hypertension or stroke. The binary target variable indicates the 10-year risk of coronary heart disease.
SHD dataset
The Switzerland Heart Disease (SHD) dataset, similarly obtained from the UCI repository, is relatively small, comprising 123 cases and 14 attributes that mirror the structure of the CHD dataset. Comparable attributes, including chest pain type, cholesterol levels, resting ECG findings, and maximum heart rate, are incorporated. The target attribute indicates the presence or absence of heart disease.
Original visualization
Figure 2 illustrates the original class distributions for the CHD, FHD, and SHD datasets. These visualizations are essential for revealing the inherent class imbalance present in each dataset, where the proportion of normal and disease cases is unevenly distributed. Understanding these disparities is crucial, as they directly influence the choice of preprocessing techniques and the selection of suitable resampling strategies to enhance model performance and mitigate bias during training.
[See PDF for image]
Fig. 2
Original data distribution of CHD, FHD and SHD datasets—the figure illustrates the class distributions for the CHD, FHD, and SHD datasets, showing moderate to severe imbalances with CHD containing 164 normal and 139 diseased cases, FHD with 3596 normal and 644 diseased cases, and SHD having only 8 normal and 115 diseased cases
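The class counts reported in Fig. 2 imply very different imbalance severities across the three datasets; a small sketch computing the imbalance ratio and minority share from those counts:

```python
# Class counts as reported for Fig. 2: (normal, diseased)
counts = {
    "CHD": (164, 139),
    "FHD": (3596, 644),
    "SHD": (8, 115),
}

for name, (normal, diseased) in counts.items():
    majority, minority = max(normal, diseased), min(normal, diseased)
    ratio = majority / minority          # how many majority samples per minority sample
    share = minority / (normal + diseased)
    print(f"{name}: imbalance ratio {ratio:.2f}:1, minority share {share:.1%}")
```

The contrast (roughly 1.2:1 for CHD versus about 14:1 for SHD, where the minority is the normal class) is what motivates choosing different resampling strategies per dataset rather than a single default.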
The novelty of the proposed XAI-HD framework lies in its holistic integration of classical ML, DL, and explainable AI techniques to form a unified, end-to-end pipeline for heart disease detection. Unlike prior works that often focus on either prediction accuracy or interpretability in isolation, XAI-HD advances the state of the art by addressing both simultaneously and systematically. Specifically, the inclusion of hybrid data balancing strategies (such as SMOTETomek and SMOTEENN), the parallel use of diverse models (ML and DL) like RUSBoost, LGB, CNN, and Multi-Head Attention, and the dual explanation mechanisms (SHAP and LIME) within a single framework provide a comprehensive approach to diagnosis. Furthermore, this methodology extends beyond conventional practices by conducting rigorous statistical and complexity analyses to ensure the significance and efficiency of the model’s predictions. This level of integration, especially in the context of multimodal interpretability and balanced data representation, represents a significant advancement over traditional heart disease prediction models and highlights the practical applicability of XAI-HD in clinical settings.
Exploratory data analysis (EDA)
To better understand the underlying relationships and dependencies within our datasets, we conducted an EDA focusing on the correlations among demographic, clinical, and lifestyle features. As illustrated in Figs. 3, 4, and 5, we generated heatmaps for the CHD, FHD, and SHD datasets, respectively. These visualizations highlight significant patterns, such as strong correlations between age, cholesterol levels, blood pressure, and heart disease presence. Furthermore, dependencies between lifestyle factors like smoking and physical activity were found to influence clinical indicators. This analysis not only provides a comprehensive overview of feature interactions but also guides the model interpretation process in subsequent stages of the study.
[See PDF for image]
Fig. 3
Heatmaps for feature dependencies on CHD dataset—the figure displays heatmaps showing the correlation between different features in the CHD dataset, where the color intensity indicates the strength and direction of the relationships between features, helping to identify key patterns and dependencies crucial for predicting heart disease
[See PDF for image]
Fig. 4
Heatmaps for feature dependencies on FHD dataset—the figure shows heatmaps depicting the correlations between various features in the FHD dataset, where the color intensity represents the strength and direction of feature relationships, providing insights into the most influential factors for heart disease prediction in this dataset
[See PDF for image]
Fig. 5
Heatmaps for feature dependencies on SHD dataset—the figure presents heatmaps illustrating the correlations between features in the SHD dataset, with color intensity indicating the strength and direction of relationships, helping to uncover critical feature dependencies relevant for heart disease prediction in this dataset
Data preprocessing
Data preprocessing is a critical step in the ML pipeline that ensures data quality, consistency, and readiness for model training. Raw medical data, such as that used in heart disease prediction, often contains missing values, inconsistent formats, and varying scales across features. These issues can negatively impact model performance, leading to unreliable or biased predictions. To address these challenges, we implemented a series of preprocessing techniques, including imputation, normalization, and encoding. Each step is described below with relevant mathematical formulations.
Handling missing values
Missing data in medical datasets can occur due to non-response, measurement errors, or improper data entry. We addressed missing values separately for categorical and numerical features.
Categorical columns - mode imputation:
For categorical variables, missing values were filled using the mode (most frequent value) of each column. Given a categorical feature $x$ with observed categories $c \in C$, the mode is defined as:
$$\hat{x}_{\text{mode}} = \arg\max_{c \in C} \sum_{i=1}^{n} \mathbb{1}(x_i = c)$$
Each missing value is replaced with $\hat{x}_{\text{mode}}$, preserving the feature's distribution.
Numerical columns - mean imputation:
For numerical features, we replaced missing values with the mean of the respective column to maintain statistical consistency. Given a numerical feature vector $x = (x_1, x_2, \dots, x_n)$, the mean is computed as:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Each missing value is then imputed as $x_i \leftarrow \mu$.
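As a concrete illustration, the two imputation rules can be sketched in plain Python (standard library only; the column values and the `impute_column` helper are illustrative, not part of the original pipeline):

```python
from statistics import mean, mode

def impute_column(values, kind):
    """Fill missing (None) entries: mode for categorical, mean for numerical."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if kind == "categorical" else mean(observed)
    return [fill if v is None else v for v in values]

# hypothetical columns with one missing entry each
chest_pain = impute_column(["typical", "typical", None, "atypical"], "categorical")
cholesterol = impute_column([200.0, None, 240.0, 220.0], "numerical")
```

In practice the same behavior is obtained with pandas `fillna` or scikit-learn's `SimpleImputer`; the sketch only makes the two formulas explicit.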
Z-score normalization
To ensure all numerical features contribute equally to model training and to eliminate scale bias, we applied Z-score normalization (also known as standardization). For a given numerical feature $x$, Z-score normalization transforms each value $x_i$ as:
$$z_i = \frac{x_i - \mu}{\sigma}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature. This transformation results in a new distribution with mean 0 and standard deviation 1, allowing models to converge faster and perform better, especially those sensitive to feature scaling like SVM and KNN.
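The transformation amounts to subtracting the column mean and dividing by its standard deviation; a minimal sketch (the helper name is illustrative):

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize a feature so it has zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [40, 50, 60]
z = z_score(ages)  # centered on 0, symmetric around the mean
```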
Label encoding
Machine learning algorithms require numerical input; hence, categorical features must be converted into numerical form. We used Label Encoding, which maps each unique category label in a feature to a unique integer value.
For a categorical feature with unique labels $\{c_1, c_2, \dots, c_k\}$, Label Encoding is defined as a function:
$$f: \{c_1, c_2, \dots, c_k\} \rightarrow \{0, 1, \dots, k-1\}$$
This transformation retains the uniqueness of each category while ensuring compatibility with ML algorithms.
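A minimal sketch of such a mapping (sorting the categories first makes the encoding deterministic; scikit-learn's `LabelEncoder` behaves the same way):

```python
def label_encode(values):
    """Map each unique category label to a unique integer."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

encoded, mapping = label_encode(["male", "female", "male"])
```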
Data balancing techniques
In the domain of medical diagnostics, particularly for conditions with low prevalence like heart disease, datasets often exhibit significant class imbalance—where the number of negative (healthy) samples far exceeds the number of positive (disease) cases. This imbalance biases ML models toward the majority class, resulting in poor sensitivity and overall generalization. To mitigate this issue, we applied a range of data balancing strategies, categorized into three types: undersampling, oversampling, and hybrid methods (a combination of both). These techniques ensure balanced representation and enable the model to learn more accurate and robust decision boundaries.
Undersampling techniques
Undersampling addresses class imbalance by removing samples from the majority class, aiming to match the minority class distribution. While efficient, this may lead to loss of valuable information.
One-sided selection (OSS):
One-sided selection combines the condensed nearest neighbor (CNN) rule and Tomek Links to filter out redundant and noisy majority samples.
Tomek Links: Two samples $x_i$ and $x_j$ form a Tomek Link if $y_i \neq y_j$ and there exists no sample $x_k$ such that $d(x_i, x_k) < d(x_i, x_j)$ or $d(x_j, x_k) < d(x_i, x_j)$, where $d(\cdot, \cdot)$ denotes Euclidean distance. If a Tomek Link exists, the majority class instance is removed to reduce boundary noise.
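The Tomek Link condition (two mutually nearest samples of opposite classes) can be checked directly; this is a brute-force sketch for small toy data, not the library implementation:

```python
import math

def is_tomek_link(i, j, X, y):
    """True if x_i and x_j have opposite classes and no third sample
    is closer to either of them than they are to each other."""
    if y[i] == y[j]:
        return False
    d_ij = math.dist(X[i], X[j])
    for k in range(len(X)):
        if k in (i, j):
            continue
        if math.dist(X[i], X[k]) < d_ij or math.dist(X[j], X[k]) < d_ij:
            return False
    return True

X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]  # toy feature vectors
y = [0, 1, 0]                              # toy class labels
```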
Neighbourhood cleaning rule (NCR):
NCR enhances data quality using the edited nearest neighbors (ENN) rule to eliminate ambiguous or misclassified majority class instances. A sample $x_i$ is removed if its label disagrees with the majority of its $k$-nearest neighbors:
$$\sum_{j \in \mathrm{kNN}(x_i)} \mathbb{1}(y_j = y_i) < \frac{k}{2}$$
where $\mathbb{1}(\cdot)$ is an indicator function and $y_j$ is the class label of the $j$-th neighbor.
Oversampling techniques
Oversampling generates synthetic samples from the minority class to balance the dataset without discarding existing data, improving minority class learning.
SMOTEN (SMOTE for nominal features):
SMOTEN is tailored for nominal (categorical) data. It generates new samples by choosing a random minority class instance and replacing each categorical feature with the most frequent value among its $k$-nearest neighbors:
$$\tilde{x}_f = \mathrm{mode}\{x_f^{(1)}, x_f^{(2)}, \dots, x_f^{(k)}\}$$
ADASYN (adaptive synthetic sampling):
ADASYN emphasizes generating more synthetic data in regions where the minority class is harder to learn. The imbalance degree for a minority sample $x_i$ is defined as:
$$r_i = \frac{\Delta_i}{k}$$
where $\Delta_i$ is the number of majority class neighbors among its $k$-nearest neighbors. After normalizing $\hat{r}_i = r_i / \sum_j r_j$, the number of synthetic samples generated for $x_i$ is:
$$g_i = \hat{r}_i \times G$$
with $G$ being the total number of synthetic examples.
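The allocation step of ADASYN can be sketched as follows, given precomputed difficulty ratios $r_i$ (the ratios and budget below are illustrative, not values from this study):

```python
def adasyn_allocation(r, G):
    """Distribute G synthetic samples over minority points in proportion
    to their difficulty ratios r_i = Delta_i / k."""
    total = sum(r)
    r_hat = [ri / total for ri in r]                 # normalized difficulty
    return [round(ri_hat * G) for ri_hat in r_hat]   # g_i = r_hat_i * G

# three minority samples with k = 5 neighbours; the first lies in the
# hardest region (4 of its 5 neighbours belong to the majority class)
g = adasyn_allocation([4/5, 2/5, 2/5], G=8)
```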
Combine techniques (oversampling + undersampling)
Hybrid techniques combine the benefits of both oversampling and undersampling to simultaneously enrich the minority class and remove noisy samples.
SMOTETomek:
SMOTETomek integrates SMOTE-based oversampling with Tomek Link-based cleaning to enhance class separability.
Step 1: Apply SMOTE to synthetically generate new samples for the minority class. For each selected pair $(x_i, x_{\mathrm{nn}})$, a synthetic instance is created as:
$$x_{\text{new}} = x_i + \lambda \, (x_{\mathrm{nn}} - x_i), \quad \lambda \in [0, 1]$$
Step 2: Apply Tomek Links to the augmented dataset. If a pair of instances from different classes form a Tomek Link, the majority class sample is removed.
SMOTEENN (Proposed Technique):
Our proposed technique for heart disease detection is SMOTEENN—a hybrid of SMOTE and Edited Nearest Neighbors (ENN). It simultaneously increases minority class representation and reduces noise in the dataset.
Step 1: Apply SMOTE to generate synthetic samples:
$$x_{\text{new}} = x_i + \lambda \, (x_{\mathrm{nn}} - x_i)$$
where $x_i$ is a minority class instance, $x_{\mathrm{nn}}$ is one of its $k$-nearest neighbors, and $\lambda$ is a random number in the range $[0, 1]$.
Step 2: Apply ENN to remove misclassified instances. For each sample $x_i$ (majority or minority), compute its $k$-nearest neighbors. If the sample's class does not match the majority class among its neighbors, it is removed:
$$y_i \neq \mathrm{mode}\{y_j : j \in \mathrm{kNN}(x_i)\} \;\Rightarrow\; \text{remove } x_i$$
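The two steps can be sketched in plain Python (illustrative helper names and toy data; imbalanced-learn's `SMOTEENN` implements the full procedure):

```python
import math
import random
from collections import Counter

def smote_point(x_i, x_nn, lam=None):
    """Step 1: interpolate a synthetic minority sample between x_i
    and one of its nearest neighbours x_nn."""
    lam = random.random() if lam is None else lam
    return tuple(a + lam * (b - a) for a, b in zip(x_i, x_nn))

def enn_keep(i, X, y, k=3):
    """Step 2 (ENN rule): keep sample i only if its class matches the
    majority class among its k nearest neighbours."""
    neighbours = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(X[i], X[j]))[:k]
    majority = Counter(y[j] for j in neighbours).most_common(1)[0][0]
    return y[i] == majority

synthetic = smote_point((1.0, 2.0), (3.0, 4.0), lam=0.5)  # midpoint
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]      # toy data
y = [0, 0, 0, 1]                                           # lone noisy positive
```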
Balanced visualization
Figure 6 illustrates the post-processing class distribution for the CHD, FHD, and SHD datasets after applying the SMOTEENN technique. As depicted in the subfigures, the previously imbalanced class distributions have been effectively equalized, resulting in a more uniform representation of both classes in terms of count. This balance plays a crucial role in enhancing the learning capacity of classification models, allowing them to identify patterns in the minority class without bias toward the majority. By combining synthetic minority oversampling with the Edited Nearest Neighbors cleaning strategy, SMOTEENN not only introduces new informative minority samples but also removes noisy and ambiguous data points. This results in a cleaner, more balanced dataset that supports more reliable and generalized model performance across all three datasets.
[See PDF for image]
Fig. 6
Data distribution of CHD, FHD, and SHD Datasets using SMOTEENN—the figure illustrates the class distribution of Normal and Diseased cases in the CHD, FHD, and SHD datasets after applying the SMOTEENN technique, showing a more balanced representation of both classes through synthetic sample generation and noise reduction, which enhances model training for imbalanced datasets
Justification of data balancing techniques
In the context of heart disease prediction, imbalanced datasets pose a significant challenge by biasing models toward the majority class, often leading to underdiagnosis of minority (diseased) cases. To overcome this, XAI-HD integrates a carefully curated set of data balancing techniques, each chosen for its unique strengths in handling different imbalance scenarios.
One-sided selection (OSS) and neighbourhood cleaning rule (NCR) are efficient undersampling techniques that eliminate noisy or redundant majority samples while maintaining essential decision boundaries. Conversely, oversampling methodologies such as SMOTEN (designed for nominal attributes) and ADASYN (which adaptively produces minority samples according to learning difficulty) enhance the representation of the minority class without merely replicating existing data. Moreover, hybrid techniques like SMOTETomek and SMOTEENN integrate oversampling with data cleansing, striking a balance between synthetic data augmentation and noise reduction. The selected combinations are grounded in existing literature demonstrating their efficacy in healthcare-related imbalanced classification problems, especially in cardiovascular research, where misclassifying at-risk patients can lead to serious consequences. Empirical comparisons in preliminary tests showed that hybrid techniques such as SMOTEENN produced higher F1-scores and MCC values across all three datasets, validating their incorporation into the final framework. Consequently, the integration of these varied balancing procedures ensures strong generalization and reduces model bias, thereby improving the dependability of predictions across different population distributions.
ML and DL models
This study assesses various machine learning (ML) and deep learning (DL) models to identify the most effective method for heart disease identification. The objective of the experiment is to examine several models, evaluate their efficacy, and choose the most appropriate one for heart disease classification tasks. ML and DL models are frequently utilized in healthcare applications, where their capacity to manage imbalanced datasets and generate precise predictions is essential. A comprehensive delineation of each model employed in our experiment, accompanied by its mathematical formulation, is given below:
RBC - RUSBoostClassifier: RBC is an ensemble technique that integrates Random Under-Sampling (RUS) with Boosting to address the problem of class imbalance. It operates by re-weighting misclassified cases over successive training rounds, combining weak classifiers into a more robust predictive model. The model can be mathematically expressed as:
$$F(x) = \operatorname{sign}\!\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)$$
where $F(x)$ is the final prediction function, $M$ is the number of boosting iterations, $\alpha_m$ is the weight assigned to the $m$-th classifier, and $h_m(x)$ is the base classifier at the $m$-th iteration.
BBC - BalancedBaggingClassifier: BBC is a variant of Bagging (Bootstrap Aggregating) designed to address class imbalance by resampling the dataset. It uses random sampling and trains multiple classifiers to reduce variance and increase the model's stability; the class distribution is rebalanced during training. The model aggregates its base classifiers as:
$$F(x) = \frac{1}{K} \sum_{k=1}^{K} h_k(x)$$
where $K$ is the number of base classifiers in the ensemble and $h_k(x)$ is the $k$-th base classifier.
LRC - Logistic Regression Classifier: Logistic Regression is a linear model employed for binary classification. It forecasts the likelihood that an instance belongs to a specific class utilizing a logistic (sigmoid) function. The mathematical representation of the logistic regression prediction is as follows:
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$
where $\mathbf{x}$ is the feature vector, $\mathbf{w}$ is the weight vector, $b$ is the bias term, and $P(y = 1 \mid \mathbf{x})$ is the probability of the positive class.
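As a worked example, the sigmoid prediction can be computed directly (the weights here are illustrative, not fitted values):

```python
import math

def predict_proba(x, w, b):
    """P(y = 1 | x) via the logistic (sigmoid) function of w.x + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

p_zero = predict_proba(x=[1.0, 2.0], w=[0.0, 0.0], b=0.0)  # zero score -> 0.5
p_neg = predict_proba(x=[1.0], w=[-3.0], b=0.0)            # negative score -> below 0.5
```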
MLP - Multi-Layer Perceptron Classifier: A multi-layer perceptron (MLP) is a type of feedforward artificial neural network that consists of an input layer, one or more hidden layers, and an output layer. It employs backpropagation for training and is capable of modeling complex, non-linear relationships between input features and target variables. Mathematically, the output of an MLP for a given input $\mathbf{x}$ is computed layer by layer as:
$$\mathbf{h}^{(i)} = \phi\!\left(W^{(i)} \mathbf{h}^{(i-1)} + \mathbf{b}^{(i)}\right), \quad \mathbf{h}^{(0)} = \mathbf{x}$$
where $\phi$ is the activation function (commonly ReLU or sigmoid), and $W^{(i)}$ and $\mathbf{b}^{(i)}$ are the weight matrix and bias vector of the $i$-th layer. The MLP architecture facilitates the learning of intricate patterns via backpropagation, making it especially useful for applications like heart disease detection that involve non-linear interactions. Its advantages encompass the capacity to identify complex patterns within extensive datasets, resilience to noisy data, and enhanced efficacy when combined with data balancing methodologies such as SMOTEENN, establishing it as the most optimal selection among competing models.
LGB - Light Gradient Boosting Machine: LGB is a gradient boosting framework that uses a tree-based model to boost weak learners. It optimizes for both speed and accuracy by utilizing leaf-wise growth, making it particularly suitable for large datasets with high-dimensional features. Mathematically, LGB minimizes the following objective function:
$$\mathcal{L} = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) + \sum_{t=1}^{T} \Omega(f_t)$$
where $\ell$ is the loss function (commonly squared error or log loss), $\Omega$ is a regularization term that penalizes the complexity of the trees, and $f_t$ are the trees in the ensemble.
CNN - Convolutional Neural Network: CNN is a DL model particularly effective for image and sequence data. It uses convolutional layers to automatically extract features from the input data and applies pooling to reduce dimensionality. The mathematical operation for a convolutional layer is:
$$\mathbf{y} = \phi(\mathbf{W} * \mathbf{x} + b)$$
where $\mathbf{x}$ is the input data (e.g., an image or feature map), $\mathbf{W}$ is the filter (or kernel), $*$ denotes the convolution operation, $b$ is the bias term, and $\phi$ is the activation function (commonly ReLU or sigmoid).
MHA - Multi-Head Attention: Multi-Head Attention is a mechanism used in transformers to learn relationships between different parts of the input sequence. It computes attention scores for multiple heads simultaneously, allowing the model to focus on different aspects of the data in parallel. The attention operation for a single head can be expressed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimension of the key vectors. The softmax function ensures that the attention scores are normalized; multi-head attention concatenates the outputs of several such heads computed in parallel.
The goal is to identify the most suitable model based on a comprehensive evaluation of performance metrics such as accuracy, precision, recall, and F1-score, ultimately selecting the model that offers the most robust and effective solution for heart disease classification.
The MLPClassifier was chosen for its strong ability to learn and generalize from complex, high-dimensional, and non-linear clinical datasets. The model is applied after rigorous preprocessing, normalization, and class balancing steps, which help optimize its performance on heterogeneous data drawn from three diverse heart disease datasets: Cleveland, Framingham, and Switzerland. MLP is particularly well-suited to structured medical data as it can learn intricate patterns across clinical, demographic, and lifestyle attributes through its multi-layer feedforward neural architecture. Unlike tree-based models, which depend on hierarchical decision rules, MLP uses backpropagation and activation functions that enable it to discover deep, abstract feature representations, especially when paired with balanced datasets and normalized features.
We re-implemented all baseline models within a unified experimental framework using the same three datasets, Cleveland Heart Disease (CHD), Framingham Heart Disease (FHD), and Switzerland Heart Disease (SHD), to enable fair and consistent comparisons. Each model was trained and evaluated using an identical preprocessing pipeline, which included imputation of missing values (mean for numerical and mode for categorical features), z-score normalization for continuous variables, label encoding for categorical features, and stratified 5-fold cross-validation to maintain statistical reliability across varying class distributions. The selection of baseline models was guided by their frequent application and reported effectiveness in recent literature on heart disease prediction. The comparative models include: RUSBoost Classifier (RBC): robust against class imbalance through undersampling. Balanced Bagging Classifier (BBC): employs ensemble learning with balanced subsamples. Logistic Regression Classifier (LRC): a widely used statistical baseline. Light Gradient Boosting (LGB): a high-performance gradient boosting framework. Multi-Layer Perceptron (MLP): captures complex nonlinear patterns in structured tabular medical data. Convolutional Neural Network (CNN) and Multi-Head Attention (MHA): DL architectures used to capture complex feature interactions.
Hyperparameter optimization for each model was conducted using a combination of grid search and random search techniques to ensure optimal and comparable performance across all methods. By maintaining consistent preprocessing, hyperparameter tuning, and evaluation metrics (accuracy, precision, recall, F1-score, AUC, etc.), this experimental setup ensures that the reported comparisons are rigorous, reproducible, and meaningful. This enables a clear and fair assessment of our proposed XAI-HD framework against established baseline methods.
XAI-HD model and data balancing techniques
This study explored the impact of integrating various data balancing methods into the XAI-HD pipeline, focusing on how these techniques affect both feature distributions and classifier performance. Given the class imbalance prevalent in heart disease datasets (e.g., CHD, FHD, SHD), addressing this challenge is critical for building reliable models.
Effect of balancing methods on feature distributions and model accuracy
We employed a diverse set of resampling strategies, including One-Sided Selection (OSS), Neighborhood Cleaning Rule (NCR), SMOTEN, ADASYN, SMOTE+Tomek Links (SMOTETomek), and SMOTE+Edited Nearest Neighbor (SMOTEENN). Each method was applied after preprocessing but prior to model training. The effects on feature space were visually and statistically analyzed through distribution plots and variance analysis. It was observed that hybrid sampling strategies such as SMOTEENN preserved minority class structure more effectively and improved class separation in feature space, leading to enhanced generalization.
Comparative effectiveness of SMOTEENN
Among the evaluated methods, SMOTEENN consistently outperformed other hybrid approaches such as SMOTE+ADASYN in terms of improving classifier performance metrics (Precision, Recall, F1-score, and ROC–AUC). This is primarily due to SMOTEENN’s two-phase process: synthetic sample generation (SMOTE) followed by noise and borderline instance cleaning (ENN), which collectively reduce overlapping classes and enhance model discriminability. Empirical evidence from experiments across all three datasets (CHD, FHD, SHD) showed that models trained with SMOTEENN-balanced data delivered the most stable and accurate predictions.
Statistical significance and confidence intervals
To confirm the statistical significance of the performance improvements brought by data balancing and model enhancements, nonparametric tests such as the Wilcoxon signed-rank test were employed. Furthermore, we computed 95% confidence intervals for key performance metrics (e.g., accuracy, F1-score, ROC–AUC) across multiple runs using bootstrapping techniques. These statistical validations support the claim that XAI-HD’s performance improvements are not only consistent but also statistically significant when compared with baseline and alternative balancing strategies.
In summary, the integration of hybrid resampling methods, particularly SMOTEENN, into the XAI-HD framework significantly contributes to better class balance, improved model robustness, and statistically validated predictive gains, affirming its reliability for real-world cardiovascular risk assessment.
Explainable AI (XAI) techniques
To enhance the interpretability and trustworthiness of the proposed heart disease prediction framework, we incorporated state-of-the-art Explainable AI (XAI) techniques, including SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-Agnostic Explanations), and Feature Importance Analysis (FIA).
SHAP was utilized to provide global and local interpretability by assigning each feature an importance value based on cooperative game theory. SHAP values explain how much each feature contributes—either positively or negatively—to the final prediction for each individual instance. We adopted SHAP because of its theoretical consistency, local accuracy, and strong alignment with human intuition. This makes it particularly suitable for clinical applications where model decisions must be transparent and justifiable to medical professionals.
LIME provides localized interpretability by perturbing input data around a specific prediction and constructing an interpretable surrogate model to approximate the local decision boundary. LIME was chosen to complement SHAP in explaining feature contributions for particular patient cases, offering insight into the logic behind individual forecasts. This is especially valuable for personalized diagnosis and decision-making in a medical setting.
Feature Importance Analysis (FIA) was utilized to prioritize input features according to their total impact on the model's output across the entire dataset. This enabled us to identify the key risk factors for predicting heart disease. The most influential variables included age, chest pain type, resting blood pressure, cholesterol level, fasting blood sugar, and maximum heart rate achieved. The FIA helped validate the relevance of clinically known risk indicators, ensuring our model's predictions are not only data-driven but also medically consistent.
The combination of SHAP and LIME with FIA provides both global and local interpretability, allowing healthcare professionals to better understand model behavior, gain confidence in the predictions, and potentially uncover new clinical insights. These XAI techniques were chosen for their complementary nature and widespread validation in medical AI research.
Necessity of XAI techniques
The integration of SHAP and LIME with FIA into the XAI-HD framework was strategically designed to ensure comprehensive interpretability of the heart disease prediction models at both global and individual levels. SHAP enhances transparency by quantifying the contribution of each feature to the final prediction using Shapley values derived from cooperative game theory. This method not only identifies key predictors across the dataset but also explains how and why individual predictions are made, making it particularly valuable in clinical decision support, where each case must be interpretable and justifiable. LIME complements SHAP by emphasizing local fidelity: it perturbs data points adjacent to a prediction to construct a reduced, interpretable model that approximates the original model's behavior around that instance. This enables clinicians to identify the features that most significantly influenced a particular prediction, such as a diagnosis for an individual patient, thereby fostering trust and facilitating individualized insights. FIA offers a comprehensive summary by ranking features according to their overall impact on model performance across the dataset. This not only demonstrates the model's consistency with established medical knowledge (e.g., the significance of age, chest pain type, and cholesterol levels) but also affirms that the model is discerning clinically pertinent patterns. Collectively, these three XAI methodologies provide a comprehensive, stratified framework for interpretability: SHAP guarantees global consistency and equity, LIME facilitates localized, case-specific explanations, and FIA links the model's emphasis with domain expertise. Their joint use enables physicians to make informed, transparent, and accountable decisions, thereby enhancing trust in the AI-driven diagnostic system.
Hyperparameter optimization strategy for XAI-HD framework
To ensure optimal predictive performance and generalization in the XAI-HD framework for heart disease diagnosis, we implement a robust hyperparameter tuning strategy. The selection of hyperparameters is guided by cross-validation techniques, performance metrics, and domain-specific considerations. Our methodology includes:
Grid search and random search
We employ a combination of grid search and random search for hyperparameter selection across different classifiers. Grid search systematically explores predefined hyperparameter values, while random search efficiently samples values to cover a broader range.
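The difference between the two strategies can be sketched with a toy search space (the hyperparameter names and values below are illustrative, not the tuned settings reported in this study):

```python
import itertools
import random

space = {"hidden_layers": [(64,), (64, 32)],
         "activation": ["relu", "tanh"],
         "learning_rate": [0.001, 0.01]}

# Grid search: enumerate every combination (2 * 2 * 2 = 8 candidates)
keys = list(space)
grid_candidates = [dict(zip(keys, combo))
                   for combo in itertools.product(*(space[k] for k in keys))]

# Random search: sample a fixed budget of configurations
random.seed(0)  # for reproducibility
random_candidates = [{k: random.choice(v) for k, v in space.items()}
                     for _ in range(4)]
```

Each candidate would then be scored by cross-validation, keeping the configuration with the best validation metric.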
Cross-validation approach
A 5-fold cross-validation strategy is used to mitigate overfitting and ensure model stability. This divides the dataset into five subsets, iteratively training and validating models on different partitions.
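The index bookkeeping behind k-fold validation can be sketched as follows (plain, unstratified folds for brevity; the stratified variant used in this study additionally preserves class proportions within each fold):

```python
def k_fold_indices(n, k=5):
    """Partition indices 0..n-1 into k folds; each fold serves once as
    the validation set while the remaining folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits

splits = k_fold_indices(10)  # 5 (train, validation) pairs
```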
Classifier-specific hyperparameters
Each model within the framework undergoes tailored hyperparameter tuning:
RUSBoostClassifier (RBC): Number of estimators ranging from 10 to 100 to balance the bias-variance tradeoff.
BalancedBaggingClassifier (BBC): Optimized for ensemble stability.
Logistic Regression (LRC): Adjusted regularization parameter (C) to refine model complexity, testing different penalties (l1, l2) and solvers (liblinear, lbfgs).
MLPClassifier (MLP): Optimized hidden layer configurations and activation functions (relu, tanh) to enhance non-linearity learning.
LGBClassifier (LGB): Fine-tuned boosting parameters to improve convergence speed and prediction accuracy.
Performance analysis
The performance evaluation of the proposed XAI-HD framework was carried out using three publicly available heart disease datasets: CHD, FHD, and SHD. These datasets encompass diverse patient populations and clinical features, offering a comprehensive testbed for assessing the generalizability and robustness of our approach. To ensure balanced learning and mitigate the effects of class imbalance inherent in the datasets, we employed various data balancing techniques, including undersampling, oversampling, and hybrid approaches. Among them, the SMOTEENN method was selected as the primary technique due to its consistent improvement in classification accuracy and class distribution balance across datasets.
Implementation details
Comprehensive experiments were conducted on Google Colaboratory (Colab), a cloud-based platform offering scalable computational resources. The Colab environment provided an Intel Xeon CPU with 2 virtual cores running at 2.20 GHz and approximately 13 GB of RAM. For GPU-accelerated workloads, Colab offered NVIDIA Tesla T4 GPUs featuring 16 GB of GDDR6 VRAM, 2560 CUDA cores, a memory bandwidth of 320 GB/s, and FP32 performance of 8.1 TFLOPS. The software environment was standardized to ensure consistency. Python served as the principal programming language, with key libraries including NumPy and Pandas for data manipulation, Matplotlib and Seaborn for visualization, and Scikit-learn for the ML algorithms. DL models were built with the TensorFlow and Keras libraries. The Imbalanced-learn library was used to mitigate class imbalance in the datasets, providing approaches such as SMOTE and its derivatives. All experiments were run in Colab notebooks, ensuring an interactive and reproducible workflow.
Experimental metrics
The proposed heart disease detection models were thoroughly assessed using a wide range of performance criteria, providing a solid evaluation on both balanced and imbalanced datasets. Accuracy served as a fundamental metric of the model's overall correctness, calculated as the ratio of correctly predicted instances to the total number of predictions. To mitigate the limitations of accuracy in imbalanced settings, supplementary metrics such as precision, recall, and F1-score were employed. Precision quantifies the ratio of correctly detected positive instances to all predicted positives, whereas recall (or sensitivity) evaluates the model's capacity to detect true positive examples. The F1-score, the harmonic mean of precision and recall, offers a balanced assessment under class imbalance. Cohen's Kappa statistic was utilized to measure the concordance between predicted and true labels while factoring in chance agreement, and the Matthews Correlation Coefficient (MCC) provided a robust assessment of binary classification efficacy, even on imbalanced datasets. Sensitivity and specificity were computed to assess the model's proficiency in correctly identifying positive and negative cases, respectively. Alongside scalar metrics, visual assessment through Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves elucidated the model's discriminative ability at different thresholds. Statistical rigor was maintained by calculating confidence intervals and employing non-parametric tests, such as the Wilcoxon signed-rank test and the Friedman test, facilitating robust comparisons among classifiers and experimental conditions. Together, these measures provide a comprehensive framework for assessing the model's predictive accuracy, dependability, and generalizability.
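Most of the scalar metrics above derive from the four confusion-matrix counts; a self-contained sketch (the toy labels are illustrative):

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall/sensitivity, specificity, F1 and MCC
    from the confusion-matrix counts of a binary classifier."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "specificity": specificity,
            "f1": f1, "mcc": mcc}

m = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
```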
Result analysis
This work thoroughly assessed the efficacy of the proposed SMOTEENN+MLP model utilizing three established benchmark datasets for heart disease prediction. The aim was to evaluate the impact of diverse ML and DL models alongside various data balancing techniques. The assessment was conducted utilizing an extensive array of performance criteria, encompassing accuracy, precision, recall, F1-score, Cohen’s Kappa, MCC, sensitivity, and specificity. These metrics offer a comprehensive assessment of categorization performance, facilitating a rigorous comparative investigation of the predictive efficacy of the proposed hybrid framework relative to existing methods.
Analysis of CHD
Quantitative analysis: The performance evaluation of the CHD dataset, as presented in Table 4, reveals significant differences among various ML and DL models when coupled with distinct data balancing strategies. Among all tested configurations, the combination of SMOTEENN and MLP achieved outstanding results, attaining a perfect score (100%) across all evaluation metrics—accuracy, precision, recall, F1-score, kappa, MCC, sensitivity, and specificity—demonstrating the robustness and predictive strength of this hybrid approach. Notably, the LRC and CNN models, also using SMOTEENN, yielded competitive accuracies of 94.87 and 92.31%, respectively. Other balancing methods such as SMOTETomek and SMOTEN, when combined with MLP and CNN, also showed promising results, reaching up to 91.80 and 86.36% accuracy, respectively. In contrast, the ADASYN and OSS techniques led to relatively weaker performance, particularly in DL models such as CNN and MLP, indicating that these methods may be less effective for addressing class imbalance in CHD classification tasks.
Statistical analysis: The statistical significance of the model performances was assessed using non-parametric tests, specifically the Wilcoxon signed-rank test and the Friedman test. Across all configurations, the Wilcoxon P-value remained consistently at 0.016, indicating statistically significant differences in model performance across balancing methods. The Friedman test further substantiated these findings, yielding robust statistics such as 37.200 (OSS), 40.209 (NCR), and 28.763 (SMOTEENN), among others, confirming that model ranking is not random and that some configurations consistently outperform others. These statistical insights reinforce the conclusion that the SMOTEENN+MLP configuration offers a significantly more reliable and generalizable solution for CHD classification, effectively addressing the underlying class imbalance issue.
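Both non-parametric tests are available in SciPy; the per-run scores below are made up for illustration. As an aside on the constant 0.016 reported in the tables, a two-sided exact Wilcoxon test on seven paired observations whose differences all share the same sign yields p = 2/2^7 = 0.015625 ≈ 0.016, which plausibly explains why the value repeats across balancing groups:

```python
# Applying the Wilcoxon signed-rank and Friedman tests to hypothetical
# per-run scores of competing models.
from scipy.stats import wilcoxon, friedmanchisquare

# Paired scores of two models over the same seven evaluation runs;
# every difference is positive and distinct, so the exact test applies.
model_a = [0.70, 0.72, 0.74, 0.76, 0.78, 0.80, 0.82]
model_b = [0.81, 0.84, 0.87, 0.90, 0.93, 0.96, 0.99]
stat, p = wilcoxon(model_a, model_b)
print(p)  # exact two-sided p = 2 / 2**7 = 0.015625

# The Friedman test compares three or more models over the same runs.
model_c = [0.73, 0.76, 0.79, 0.82, 0.85, 0.88, 0.91]
f_stat, f_p = friedmanchisquare(model_a, model_b, model_c)
print(f_stat, f_p)
```

With so few paired observations, the exact Wilcoxon p-value is quantized, so identical p-values across comparisons do not imply identical effect sizes.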
Table 4. Performance analysis of ML & DL models on CHD dataset
Data Balancing | Model | Accuracy | Precision | Recall | F1score | Kappa | MCC | Sensitivity | Specificity | Confidence Interval | Wilcoxon P-value | Friedman Statistic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
OSS | RBC | 78.95 | 78.95 | 78.95 | 78.95 | 57.88 | 57.88 | 78.94 | 78.94 | 0.046 | 0.016 | 37.200 |
BBC | 73.68 | 73.90 | 73.68 | 73.59 | 47.25 | 47.52 | 73.58 | 73.58 | 0.050 | |||
LRC | 77.19 | 77.54 | 77.19 | 77.15 | 54.46 | 54.76 | 77.28 | 77.28 | 0.047 | |||
MLP | 78.95 | 79.12 | 78.95 | 78.93 | 57.93 | 58.08 | 79.00 | 79.00 | 0.046 | |||
LGB | 80.70 | 80.73 | 80.70 | 80.69 | 61.37 | 61.41 | 80.67 | 80.67 | 0.044 | |||
CNN | 68.42 | 70.82 | 68.42 | 67.31 | 36.43 | 38.94 | 68.10 | 68.10 | 0.052 | |||
MHA | 68.42 | 68.71 | 68.42 | 68.23 | 36.67 | 37.04 | 68.29 | 68.29 | 0.052 | |||
NCR | RBC | 78.43 | 79.95 | 78.43 | 78.43 | 57.27 | 58.39 | 79.19 | 79.19 | 0.046 | 0.016 | 40.209 |
BBC | 76.47 | 78.52 | 76.47 | 76.42 | 53.57 | 55.07 | 77.41 | 77.41 | 0.048 | |||
LRC | 78.43 | 79.07 | 78.43 | 78.48 | 56.95 | 57.34 | 78.80 | 78.80 | 0.046 | |||
MLP | 82.35 | 83.00 | 82.35 | 82.39 | 64.77 | 65.22 | 82.76 | 82.76 | 0.043 | |||
LGB | 76.47 | 78.52 | 76.47 | 76.42 | 53.57 | 55.07 | 77.41 | 77.41 | 0.048 | |||
CNN | 78.43 | 81.24 | 78.43 | 78.32 | 57.60 | 59.82 | 79.58 | 79.58 | 0.046 | |||
MHA | 80.39 | 84.16 | 80.39 | 80.21 | 61.60 | 64.73 | 81.75 | 81.75 | 0.045 | |||
SMOTEN | RBC | 80.30 | 81.45 | 80.30 | 80.42 | 60.61 | 61.31 | 81.02 | 81.02 | 0.045 | 0.016 | 38.160 |
BBC | 81.82 | 82.62 | 81.82 | 81.92 | 63.47 | 63.94 | 82.33 | 82.33 | 0.043 | |||
LRC | 80.30 | 81.45 | 80.30 | 80.42 | 60.61 | 61.31 | 81.02 | 81.02 | 0.045 | |||
MLP | 77.27 | 78.42 | 77.27 | 77.40 | 54.55 | 55.18 | 77.91 | 77.91 | 0.047 | |||
LGB | 84.85 | 85.12 | 84.85 | 84.90 | 69.27 | 69.40 | 84.96 | 84.96 | 0.040 | |||
CNN | 86.36 | 86.85 | 86.36 | 86.43 | 72.47 | 72.78 | 86.75 | 86.75 | 0.039 | |||
MHA | 84.85 | 84.85 | 84.85 | 84.85 | 68.98 | 68.98 | 84.49 | 84.49 | 0.040 | |||
ADASYN | RBC | 76.56 | 76.50 | 76.56 | 76.51 | 52.19 | 52.22 | 75.99 | 75.99 | 0.048 | 0.016 | 30.406 |
BBC | 79.69 | 79.64 | 79.69 | 79.64 | 58.57 | 58.60 | 79.17 | 79.17 | 0.045 | |||
LRC | 70.31 | 70.45 | 70.31 | 70.36 | 39.92 | 39.94 | 70.04 | 70.04 | 0.051 | |||
MLP | 67.19 | 68.92 | 67.19 | 67.24 | 35.14 | 35.99 | 68.06 | 68.06 | 0.053 | |||
LGB | 84.38 | 84.63 | 84.38 | 84.42 | 68.50 | 68.64 | 84.52 | 84.52 | 0.041 | |||
CNN | 70.31 | 72.11 | 70.31 | 70.36 | 41.31 | 42.31 | 71.23 | 71.23 | 0.051 | |||
MHA | 68.75 | 71.91 | 68.75 | 68.60 | 38.93 | 40.88 | 70.24 | 70.24 | 0.052 | |||
SMOTETomek | RBC | 86.89 | 87.03 | 86.89 | 86.86 | 73.74 | 73.89 | 86.83 | 86.83 | 0.038 | 0.016 | 36.295 |
BBC | 86.89 | 87.06 | 86.89 | 86.88 | 73.79 | 73.95 | 86.94 | 86.94 | 0.038 | |||
LRC | 85.25 | 85.61 | 85.25 | 85.22 | 70.53 | 70.87 | 85.32 | 85.32 | 0.040 | |||
MLP | 91.80 | 91.85 | 91.80 | 91.80 | 83.61 | 83.66 | 91.83 | 91.83 | 0.031 | |||
LGB | 83.61 | 83.61 | 83.61 | 83.61 | 67.20 | 67.20 | 83.60 | 83.60 | 0.042 | |||
CNN | 78.69 | 78.73 | 78.69 | 78.69 | 57.39 | 57.42 | 78.71 | 78.71 | 0.046 | |||
MHA | 88.52 | 88.57 | 88.52 | 88.52 | 77.06 | 77.10 | 88.55 | 88.55 | 0.036 | |||
SMOTEENN | RBC | 94.87 | 94.87 | 94.87 | 94.87 | 89.57 | 89.57 | 94.79 | 94.79 | 0.025 | 0.016 | 28.763 |
BBC | 76.92 | 76.84 | 76.92 | 76.83 | 52.76 | 52.83 | 76.20 | 76.20 | 0.047 | |||
LRC | 94.87 | 95.41 | 94.87 | 94.89 | 89.71 | 90.19 | 95.45 | 95.45 | 0.025 | |||
MLP | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 0.000 | |||
LGB | 82.05 | 82.02 | 82.05 | 81.98 | 63.26 | 63.34 | 81.42 | 81.42 | 0.043 | |||
CNN | 92.31 | 92.37 | 92.31 | 92.28 | 84.25 | 84.37 | 91.84 | 91.84 | 0.030 | |||
MHA | 92.31 | 92.37 | 92.31 | 92.28 | 84.25 | 84.37 | 91.84 | 91.84 | 0.030 |
Analysis for FHD
Quantitative analysis: The performance evaluation of ML and DL models across various data balancing techniques on the FHD dataset (Table 5) reveals that the Multilayer Perceptron (MLP) consistently outperformed all other models, particularly under the SMOTEENN balancing method. MLP achieved the highest accuracy (92.71%), precision (92.79%), recall (92.71%), F1-score (92.62%), kappa (83.93), and MCC (84.18), demonstrating superior predictive capability and balanced classification performance. Among the other models, LGB and BBC also performed strongly, with LGB reaching an accuracy of 91.86% and BBC achieving 91.1% under SMOTEENN. The lowest-performing models included RBC and LRC, especially when paired with under-sampling techniques such as OSS and NCR. Overall, the DL-based MLP consistently outperformed the traditional ML models, especially when combined with synthetic oversampling techniques such as SMOTEN, SMOTEENN, and ADASYN.
Statistical analysis: The statistical robustness of the performance outcomes was assessed using the Wilcoxon signed-rank test and the Friedman test across all model and balancing combinations. The Friedman test yielded statistically significant values for each balancing group (e.g., 36.514 for SMOTEENN, 38.26 for ADASYN, and 34.395 for OSS), indicating substantial differences in performance rankings across models. The associated Wilcoxon p-values (0.016) were all below 0.05, and the narrowest confidence intervals were observed for MLP and LGB under SMOTEENN (0.008), confirming that the performance improvements are statistically significant rather than products of random variation. These results suggest that both the choice of model and the data balancing technique have a statistically significant impact on classification performance, and that hybrid models combined with synthetic balancing methods can provide a statistically superior solution for imbalanced datasets.
Table 5. Performance analysis of ML & DL models on FHD dataset
Data balancing | Model | Accuracy | Precision | Recall | F1score | Kappa | MCC | Sensitivity | Specificity | Confidence interval | Wilcoxon P-value | Friedman statistic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
OSS | RBC | 70.97 | 80.97 | 70.97 | 74.64 | 17.59 | 19.52 | 51.3 | 74.21 | 0.01 | 0.016 | 34.395 |
BBC | 73.55 | 79.27 | 73.55 | 75.97 | 13.44 | 14.02 | 37.39 | 79.51 | 0.01 | |||
LRC | 86.47 | 83.79 | 86.47 | 81.89 | 13.52 | 21.2 | 9.57 | 99.14 | 0.01 | |||
MLP | 84.87 | 79.84 | 84.87 | 81.02 | 10.84 | 13.57 | 10.43 | 97.13 | 0.01 | |||
LGB | 85.73 | 81.32 | 85.73 | 81.42 | 11.78 | 16.49 | 9.57 | 98.28 | 0.01 | |||
CNN | 85.98 | 81.95 | 85.98 | 81.41 | 11.37 | 17.08 | 8.7 | 98.71 | 0.01 | |||
MHA | 85.61 | 80.89 | 85.61 | 81.17 | 10.53 | 14.98 | 8.7 | 98.28 | 0.01 | |||
NCR | RBC | 69.34 | 78.53 | 69.34 | 71.91 | 30 | 32.86 | 70 | 69.17 | 0.014 | 0.016 | 31.091 |
BBC | 71.97 | 76.73 | 71.97 | 73.68 | 28.63 | 29.6 | 57.69 | 75.83 | 0.014 | |||
LRC | 80.82 | 78.26 | 80.82 | 77.75 | 28.09 | 31.43 | 26.92 | 95.42 | 0.012 | |||
MLP | 81.97 | 80.07 | 81.97 | 80.08 | 36.64 | 38.55 | 36.92 | 94.17 | 0.012 | |||
LGB | 81.31 | 79.03 | 81.31 | 78.38 | 30.18 | 33.63 | 28.46 | 95.62 | 0.012 | |||
CNN | 81.48 | 80.29 | 81.48 | 76.98 | 24.71 | 31.92 | 20 | 98.12 | 0.012 | |||
MHA | 80 | 77.87 | 80 | 78.4 | 31.97 | 32.98 | 36.15 | 91.88 | 0.012 | |||
SMOTEN | RBC | 83.04 | 83.06 | 83.04 | 83.05 | 66.08 | 66.09 | 83.52 | 82.59 | 0.011 | 0.016 | 33.267 |
BBC | 88.19 | 88.4 | 88.19 | 88.16 | 76.32 | 76.56 | 83.95 | 92.24 | 0.01 | |||
LRC | 78.39 | 78.39 | 78.39 | 78.38 | 56.75 | 56.75 | 77.41 | 79.32 | 0.012 | |||
MLP | 85.82 | 85.82 | 85.82 | 85.82 | 71.63 | 71.63 | 84.94 | 86.67 | 0.011 | |||
LGB | 88.46 | 88.65 | 88.46 | 88.44 | 76.88 | 77.08 | 84.52 | 92.24 | 0.01 | |||
CNN | 84.02 | 84.91 | 84.02 | 83.88 | 67.91 | 68.84 | 75.28 | 92.38 | 0.011 | |||
MHA | 77.69 | 77.87 | 77.69 | 77.68 | 55.43 | 55.57 | 80.82 | 74.69 | 0.013 | |||
ADASYN | RBC | 76.07 | 76.11 | 76.07 | 76.06 | 52.14 | 52.18 | 78.12 | 74.01 | 0.013 | 0.016 | 38.26 |
BBC | 89.84 | 90.03 | 89.84 | 89.83 | 79.68 | 79.87 | 93.34 | 86.32 | 0.009 | |||
LRC | 64.21 | 64.34 | 64.21 | 64.12 | 28.4 | 28.54 | 69.29 | 59.1 | 0.014 | |||
MLP | 84.39 | 85.55 | 84.39 | 84.26 | 68.76 | 69.92 | 93.48 | 75.24 | 0.011 | |||
LGB | 90.18 | 90.37 | 90.18 | 90.17 | 80.37 | 80.56 | 86.82 | 93.57 | 0.009 | |||
CNN | 70.69 | 71.81 | 70.69 | 70.29 | 41.33 | 42.46 | 82.2 | 59.1 | 0.014 | |||
MHA | 66.39 | 67.69 | 66.39 | 65.73 | 32.72 | 34.03 | 80.16 | 52.53 | 0.014 | |||
SMOTETomek | RBC | 78.8 | 78.8 | 78.8 | 78.8 | 57.59 | 57.59 | 78.81 | 78.78 | 0.012 | 0.016 | 31.859 |
BBC | 88.73 | 88.73 | 88.73 | 88.73 | 77.47 | 77.47 | 88.84 | 88.63 | 0.01 | |||
LRC | 64.94 | 64.94 | 64.94 | 64.94 | 29.87 | 29.87 | 64.27 | 65.6 | 0.014 | |||
MLP | 84.53 | 84.82 | 84.53 | 84.51 | 69.09 | 69.36 | 88.84 | 80.31 | 0.011 | |||
LGB | 89.99 | 90.15 | 89.99 | 89.98 | 79.97 | 80.13 | 86.72 | 93.2 | 0.009 | |||
CNN | 71.17 | 71.27 | 71.17 | 71.12 | 42.29 | 42.41 | 67.09 | 75.17 | 0.014 | |||
MHA | 69.21 | 70.58 | 69.21 | 68.75 | 38.55 | 39.83 | 81.64 | 57 | 0.014 | |||
SMOTEENN | RBC | 80.97 | 81.5 | 80.97 | 81.13 | 59.74 | 59.91 | 82.27 | 78.7 | 0.012 | 0.016 | 36.514 |
BBC | 91.1 | 91.07 | 91.1 | 91.08 | 80.7 | 80.71 | 93.59 | 86.75 | 0.009 | |||
LRC | 75.57 | 75.06 | 75.57 | 75.02 | 45.33 | 45.75 | 85.54 | 58.18 | 0.013 | |||
MLP | 92.71 | 92.79 | 92.71 | 92.62 | 83.93 | 84.18 | 97.02 | 85.19 | 0.008 | |||
LGB | 91.86 | 91.88 | 91.86 | 91.77 | 82.11 | 82.29 | 95.98 | 84.68 | 0.008 | |||
CNN | 81.16 | 81.39 | 81.16 | 81.25 | 59.75 | 59.79 | 83.76 | 76.62 | 0.012 | |||
MHA | 79.92 | 79.64 | 79.92 | 79.62 | 55.54 | 55.8 | 87.63 | 66.49 | 0.012 |
Analysis for SHD
Quantitative analysis: The comparative performance analysis of ML and DL models across various data balancing techniques on the SHD dataset reveals several important trends (Table 6). Among the balancing techniques, SMOTEENN and SMOTEN consistently deliver superior classification results. Notably, the MLP model achieves a perfect score (100%) across all evaluated metrics under SMOTEENN, highlighting its effectiveness in capturing the data distribution and class boundaries. Additionally, models like LGB and BBC perform strongly when used with SMOTEN, with BBC reaching an accuracy and F1-score of 95.65%. Conversely, under the under-sampling strategies OSS and NCR, most models exhibit lower performance; although LRC, MLP, LGB, CNN, and MHA report around 90% accuracy, their zero Kappa, MCC, and specificity values show that they simply predict the majority class, so this apparent accuracy is not clinically meaningful. These results emphasize the crucial role of data balancing in improving model generalizability, especially for imbalanced health datasets.
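The majority-class failure mode visible in the OSS and NCR rows of Table 6 is easy to reproduce: a degenerate classifier that always predicts "Disease" on a test split with 19 positive and 2 negative cases (illustrative counts chosen to match the 90.48% accuracy in the table) scores high accuracy while Kappa and MCC collapse to zero:

```python
# Why ~90% accuracy can be meaningless on an imbalanced test set:
# a classifier that always predicts the majority class.
from sklearn.metrics import accuracy_score, cohen_kappa_score, matthews_corrcoef

y_true = [1] * 19 + [0] * 2   # heavily imbalanced hypothetical test split
y_pred = [1] * 21             # degenerate majority-class predictor

acc = accuracy_score(y_true, y_pred)      # 19/21 ≈ 0.9048
kappa = cohen_kappa_score(y_true, y_pred) # 0.0 — no agreement beyond chance
mcc = matthews_corrcoef(y_true, y_pred)   # 0.0 — no correlation
print(acc, kappa, mcc)
```

This is exactly why the paper pairs accuracy with chance-corrected metrics: Kappa and MCC expose predictors that never identify a single negative case.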
Statistical analysis: Statistical tests further validate the performance differences observed across models and balancing methods. The Wilcoxon signed-rank test reports a consistent p-value of 0.016 across all balancing strategies, indicating statistically significant differences in model performance at the 5% significance level. The Friedman test reinforces this observation, with statistic values ranging from 25.658 under SMOTEENN to 40.356 under SMOTEN, demonstrating meaningful variation across model-balancing combinations. The confidence intervals also narrow significantly in high-performing configurations (e.g., 0.000 under MLP-SMOTEENN), indicating robust model stability. Collectively, these statistical measures confirm that advanced oversampling methods, particularly SMOTEN and SMOTEENN, substantially enhance classification reliability and should be considered vital preprocessing steps in ML/DL pipelines for SHD prediction.
Table 6. Performance analysis of ML & DL models on SHD dataset
Data balancing | Model | Accuracy | Precision | Recall | F1score | Kappa | MCC | Sensitivity | Specificity | Confidence interval | Wilcoxon P-value | Friedman Statistic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
OSS | RBC | 80.95 | 93.65 | 80.95 | 84.59 | 41.67 | 51.30 | 78.95 | 100.0 | 0.069 | 0.016 | 29.842 |
BBC | 61.90 | 92.38 | 61.90 | 69.52 | 20.75 | 34.03 | 57.89 | 100.0 | 0.086 | |||
LRC | 90.48 | 81.86 | 90.48 | 85.95 | 0.00 | 0.00 | 100.0 | 0.00 | 0.052 | |||
MLP | 90.48 | 81.86 | 90.48 | 85.95 | 0.00 | 0.00 | 100.0 | 0.00 | 0.052 | |||
LGB | 90.48 | 81.86 | 90.48 | 85.95 | 0.00 | 0.00 | 100.0 | 0.00 | 0.052 | |||
CNN | 90.48 | 81.86 | 90.48 | 85.95 | 0.00 | 0.00 | 100.0 | 0.00 | 0.052 | |||
MHA | 90.48 | 81.86 | 90.48 | 85.95 | 0.00 | 0.00 | 100.0 | 0.00 | 0.052 | |||
NCR | RBC | 81.82 | 88.13 | 81.82 | 84.37 | 24.14 | 26.09 | 85.00 | 50.00 | 0.068 | 0.016 | 32.621 |
BBC | 72.73 | 86.74 | 72.73 | 78.03 | 13.16 | 16.14 | 75.00 | 50.00 | 0.079 | |||
LRC | 90.91 | 82.64 | 90.91 | 86.58 | 0.00 | 0.00 | 100.0 | 0.00 | 0.051 | |||
MLP | 90.91 | 82.64 | 90.91 | 86.58 | 0.00 | 0.00 | 100.0 | 0.00 | 0.051 | |||
LGB | 90.91 | 82.64 | 90.91 | 86.58 | 0.00 | 0.00 | 100.0 | 0.00 | 0.051 | |||
CNN | 90.91 | 82.64 | 90.91 | 86.58 | 0.00 | 0.00 | 100.0 | 0.00 | 0.051 | |||
MHA | 90.91 | 82.64 | 90.91 | 86.58 | 0.00 | 0.00 | 100.0 | 0.00 | 0.051 | |||
SMOTEN | RBC | 93.48 | 93.55 | 93.48 | 93.47 | 86.91 | 86.99 | 95.83 | 90.91 | 0.044 | 0.016 | 40.356 |
BBC | 95.65 | 95.99 | 95.65 | 95.64 | 91.25 | 91.61 | 100.0 | 90.91 | 0.036 | |||
LRC | 91.30 | 91.30 | 91.30 | 91.30 | 82.58 | 82.58 | 91.67 | 90.91 | 0.050 | |||
MLP | 95.65 | 95.99 | 95.65 | 95.64 | 91.25 | 91.61 | 100.0 | 90.91 | 0.036 | |||
LGB | 95.65 | 95.99 | 95.65 | 95.64 | 91.25 | 91.61 | 100.0 | 90.91 | 0.036 | |||
CNN | 95.65 | 95.65 | 95.65 | 95.65 | 91.29 | 91.29 | 95.83 | 95.45 | 0.036 | |||
MHA | 95.65 | 95.99 | 95.65 | 95.64 | 91.25 | 91.61 | 100.0 | 90.91 | 0.036 | |||
ADASYN | RBC | 93.62 | 93.61 | 93.62 | 93.56 | 85.57 | 85.67 | 87.50 | 96.77 | 0.043 | 0.016 | 27.039 |
BBC | 93.62 | 93.61 | 93.62 | 93.56 | 85.57 | 85.67 | 87.50 | 96.77 | 0.043 | |||
LRC | 82.98 | 82.71 | 82.98 | 82.68 | 60.91 | 61.21 | 68.75 | 90.32 | 0.066 | |||
MLP | 97.87 | 97.94 | 97.87 | 97.85 | 95.19 | 95.30 | 93.75 | 100.0 | 0.026 | |||
LGB | 95.74 | 96.00 | 95.74 | 95.67 | 90.23 | 90.66 | 87.50 | 100.0 | 0.036 | |||
CNN | 72.34 | 84.74 | 72.34 | 72.67 | 48.53 | 56.60 | 100.0 | 58.06 | 0.079 | |||
MHA | 61.70 | 74.08 | 61.70 | 61.94 | 29.62 | 35.39 | 87.50 | 48.39 | 0.086 | |||
SMOTETomek | RBC | 84.78 | 85.47 | 84.78 | 84.76 | 69.68 | 70.28 | 79.17 | 90.91 | 0.063 | 0.016 | 36.764 |
BBC | 84.78 | 85.47 | 84.78 | 84.76 | 69.68 | 70.28 | 79.17 | 90.91 | 0.063 | |||
LRC | 76.09 | 76.67 | 76.09 | 76.05 | 52.35 | 52.80 | 70.83 | 81.82 | 0.075 | |||
MLP | 93.48 | 94.26 | 93.48 | 93.47 | 87.01 | 87.75 | 87.50 | 100.0 | 0.044 | |||
LGB | 91.30 | 92.64 | 91.30 | 91.27 | 82.71 | 83.97 | 83.33 | 100.0 | 0.050 | |||
CNN | 71.74 | 77.76 | 71.74 | 69.66 | 42.17 | 48.35 | 95.83 | 45.45 | 0.080 | |||
MHA | 71.74 | 82.24 | 71.74 | 69.71 | 44.73 | 53.67 | 45.83 | 100.0 | 0.080 | |||
SMOTEENN | RBC | 95.00 | 95.00 | 95.00 | 95.00 | 89.97 | 89.97 | 94.74 | 95.24 | 0.039 | 0.016 | 25.658 |
BBC | 95.00 | 95.00 | 95.00 | 95.00 | 89.97 | 89.97 | 94.74 | 95.24 | 0.039 | |||
LRC | 97.50 | 97.61 | 97.50 | 97.50 | 94.97 | 95.10 | 94.74 | 100.0 | 0.028 | |||
MLP | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 0.000 | |||
LGB | 97.50 | 97.62 | 97.50 | 97.50 | 95.00 | 95.12 | 100.0 | 95.24 | 0.028 | |||
CNN | 92.50 | 93.52 | 92.50 | 92.49 | 85.07 | 86.04 | 100.0 | 85.71 | 0.047 | |||
MHA | 72.50 | 75.21 | 72.50 | 71.36 | 43.88 | 46.98 | 52.63 | 90.48 | 0.079 |
Significance of the study
The significance of our study lies in its comprehensive evaluation of the performance of advanced ML and DL models on various healthcare datasets, specifically focusing on the impact of different data balancing strategies. By exploring the effectiveness of techniques such as OSS, NCR, SMOTEN, ADASYN, SMOTETomek, and SMOTEENN, our analysis highlights the critical role of hybrid balancing methods in improving model accuracy and generalization, particularly when addressing class imbalance in complex classification tasks like disease prediction. The results demonstrate that the combination of SMOTEENN with MLP consistently outperforms other configurations, achieving perfect or near-perfect performance across key evaluation metrics, including accuracy, precision, recall, F1-score, Kappa, and MCC. This underscores the exceptional robustness and reliability of the SMOTEENN+MLP model, which stands out as the most effective solution across multiple datasets, including SHD, FHD, and CHD. Moreover, the statistical validation using Wilcoxon and Friedman tests further strengthens the reliability and significance of our findings, confirming the efficacy of the proposed hybrid approach. Our study contributes to the growing body of research that emphasizes the importance of data preprocessing and balancing techniques, showcasing their transformative impact on model performance in medical and healthcare applications. By improving the predictive power and generalization ability of models, particularly in the face of imbalanced datasets, our work paves the way for more accurate and reliable disease prediction systems, ultimately advancing the field of ML for healthcare.
Classification analysis
The analysis of the classification reports for ML and DL models employing the SMOTEENN balancing technique on the CHD, FHD, and SHD datasets offers a comprehensive assessment of model performance across precision, recall, and F1-score for both the normal and disease classes (Table 7). The MLP model consistently surpassed all others, attaining perfect scores (100%) in both precision and recall on the CHD and SHD datasets, underscoring its exceptional capability to accurately classify both normal and diseased instances. This performance is especially apparent on the SHD dataset, where MLP attained perfect scores across all criteria, demonstrating strong generalization. LRC and LGB exhibited competitive outcomes, with LRC achieving a perfect recall of 100% for the disease class in both the CHD and SHD datasets. The remaining models, including CNN and MHA, demonstrated varying levels of efficacy, with CNN attaining excellent results on the SHD dataset and MHA displaying comparatively better performance on the FHD dataset. Overall, the combination of SMOTEENN and MLP yields the most dependable and high-performing model across datasets, markedly enhancing classification results, especially on imbalanced data. The consistency of these results across multiple performance metrics highlights the significance of data balancing methods such as SMOTEENN in improving the predictive efficacy of ML and DL models for healthcare applications.
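Per-class reports of the kind shown in Table 7 can be generated directly with scikit-learn's `classification_report`; the labels below are hypothetical and serve only to show the report shape:

```python
# Producing per-class precision/recall/F1, as in a classification report,
# on hypothetical binary labels (0 = Normal, 1 = Disease).
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

report = classification_report(
    y_true, y_pred, target_names=["Normal", "Disease"], output_dict=True
)
disease_precision = report["Disease"]["precision"]
disease_recall = report["Disease"]["recall"]
print(disease_precision, disease_recall)  # 0.8 0.8
```

Using `output_dict=True` returns a nested dictionary, which is convenient for assembling per-class tables across many models and datasets.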
The classification error distributions for the CHD, FHD, and SHD datasets, shown in Figs. 7, 8, and 9, respectively, provide significant insights into the model’s prediction efficacy. These visualizations illustrate variations in classification accuracy among several datasets, facilitating the evaluation of the model’s strengths and areas needing enhancement.
The boxplot (a) showcases the variation in precision, recall, and F1-score across the Normal and Disease classes, highlighting differences in predictive accuracy. The histogram (b) further emphasizes the frequency distribution of these classification errors, offering a detailed perspective on error trends. Together, these visualizations enhance interpretability by presenting the model’s strengths and weaknesses in differentiating between Normal and Disease cases, thereby supporting informed clinical decision-making.
Table 7. Classification reports of ML & DL model using SMOTEEN on CHD, FHD and SHD datasets
CHD | FHD | SHD | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Model | Class | Precision | Recall | F1-score | Class | Precision | Recall | F1-score | Class | Precision | Recall | F1-score |
RBC | Normal | 94.12 | 94.12 | 94.12 | Normal | 71.80 | 78.70 | 75.09 | Normal | 95.24 | 95.24 | 95.24 |
Disease | 95.45 | 95.45 | 95.45 | Disease | 87.07 | 82.27 | 84.60 | Disease | 94.74 | 94.74 | 94.74 | |
BBC | Normal | 75.00 | 70.59 | 72.73 | Normal | 88.59 | 86.75 | 87.66 | Normal | 95.24 | 95.24 | 95.24 |
Disease | 78.26 | 81.82 | 80.00 | Disease | 92.49 | 93.59 | 93.04 | Disease | 94.74 | 94.74 | 94.74 | |
LRC | Normal | 89.47 | 100.00 | 94.44 | Normal | 69.78 | 58.18 | 63.46 | Normal | 95.45 | 100.00 | 97.67 |
Disease | 100.00 | 90.91 | 95.24 | Disease | 78.10 | 85.54 | 81.65 | Disease | 100.00 | 94.74 | 97.30 | |
MLP | Normal | 100.00 | 100.00 | 100.00 | Normal | 94.25 | 85.19 | 89.50 | Normal | 100.00 | 100.00 | 100.00 |
Disease | 100.00 | 100.00 | 100.00 | Disease | 91.95 | 97.02 | 94.42 | Disease | 100.00 | 100.00 | 100.00 | |
LGB | Normal | 81.25 | 76.47 | 78.79 | Normal | 92.35 | 84.68 | 88.35 | Normal | 100.00 | 95.24 | 97.56 |
Disease | 82.61 | 86.36 | 84.44 | Disease | 91.61 | 95.98 | 93.74 | Disease | 95.00 | 100.00 | 97.44 | |
CNN | Normal | 93.75 | 88.24 | 90.91 | Normal | 73.02 | 76.62 | 74.78 | Normal | 100.00 | 85.71 | 92.31 |
Disease | 91.30 | 95.45 | 93.33 | Disease | 86.20 | 83.76 | 84.96 | Disease | 86.36 | 100.00 | 92.68 | |
MHA | Normal | 93.75 | 88.24 | 90.91 | Normal | 75.52 | 66.49 | 70.72 | Normal | 67.86 | 90.48 | 77.55 |
Disease | 91.30 | 95.45 | 93.33 | Disease | 82.01 | 87.63 | 84.73 | Disease | 83.33 | 52.63 | 64.52 | |
[See PDF for image]
Fig. 7
Classification error distributions on CHD dataset: Boxplots and histograms display classification metrics (Precision, Recall, and F1-score) for both Normal and Disease classes. The x-axis represents classification metrics, while the y-axis shows performance values. Higher values indicate better classification performance
[See PDF for image]
Fig. 8
Classification error distributions on FHD dataset: Boxplots and histograms display classification metrics (Precision, Recall, and F1-score) for both Normal and Disease classes. The x-axis represents classification metrics, while the y-axis shows performance values. Higher values indicate better classification performance
[See PDF for image]
Fig. 9
Classification error distributions on SHD dataset: Boxplots and histograms display classification metrics (Precision, Recall, and F1-score) for both Normal and Disease classes. The x-axis represents classification metrics, while the y-axis shows performance values. Higher values indicate better classification performance
Confusion matrix analysis
The confusion matrix analysis highlights the classification performance of the SMOTEENN+MLP model across three datasets: CHD, FHD, and SHD (shown in Fig. 10). The model achieves perfect classification for CHD and SHD, with no misclassifications, demonstrating its strong predictive capability. For FHD, the model maintains high accuracy, though minor misclassifications occur, where 57 Normal cases are incorrectly predicted as Disease, and 20 Disease cases are misclassified as Normal. Overall, the analysis suggests that SMOTEENN+MLP is highly effective in distinguishing between Normal and Disease instances, with FHD exhibiting slight room for improvement. This reinforces the model’s reliability in handling imbalanced datasets and ensuring precise disease classification.
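Reading such a matrix reduces to the four cell counts; a minimal sketch with hypothetical labels shows how scikit-learn orders them (rows are actual labels, columns are predictions, with the diagonal holding correct classifications):

```python
# Extracting TN/FP/FN/TP counts from a binary confusion matrix
# on hypothetical labels.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 1 5
```

The off-diagonal counts (`fp`, `fn`) correspond to the misclassification numbers discussed for the FHD dataset above.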
[See PDF for image]
Fig. 10
Confusion matrix of SMOTEENN+MLP model on CHD, FHD and SHD datasets: each confusion matrix visualizes classification results, with the x-axis representing predicted labels and the y-axis showing actual labels. The diagonal elements indicate correct classifications, while off-diagonal elements denote misclassifications
ROC curves and precision-recall curves
The ROC curve and precision-recall curve are essential tools for evaluating the performance of the SMOTEENN + MLP model for heart disease prediction on the CHD dataset (shown in Fig. 11a). The ROC curve illustrates the trade-off between sensitivity (true positive rate) and the false positive rate, providing insight into the model’s ability to distinguish between patients with and without heart disease. A high AUC score, which is 1.00 in this instance, indicates near-ideal classification performance. The precision-recall curve is more informative for imbalanced datasets, as it emphasizes the model’s ability to sustain precision while enhancing recall. The MLP model attains an AP of 1.00, underscoring its ability to balance false positives and false negatives efficiently. The evaluation indicates that the SMOTEENN + MLP model surpasses the other classifiers, establishing it as a highly dependable predictive instrument for heart disease detection. These curves confirm the model’s efficacy in healthcare applications, where precise prediction is essential for timely action.
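Both summary scores are computed from predicted probabilities rather than hard labels; a minimal sketch with hypothetical scores shows the perfect-ranking case that yields AUC = AP = 1.00:

```python
# AUC and average precision from hypothetical predicted probabilities:
# all positives are scored above all negatives, so both metrics are 1.0.
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]

auc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)
print(auc, ap)  # 1.0 1.0
```

Note that AUC and AP depend only on the ranking of the scores, not on any particular decision threshold, which is why they complement the threshold-dependent metrics reported earlier.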
[See PDF for image]
Fig. 11
ROC and Precision-Recall curve of SMOTEENN+MLP model on CHD dataset—the x-axis represents the False Positive Rate (FPR), and the y-axis shows the True Positive Rate (TPR). The area under the curve (AUC) quantifies the model’s classification accuracy, where higher AUC values indicate better performance
The ROC curve and precision-recall curve for the SMOTEENN + MLP model on the FHD dataset offer essential insights into its predictive efficacy for heart disease detection (shown in Fig. 11b). The ROC curve shows the model’s efficacy in distinguishing between positive and negative cases, with an AUC of 0.96, signifying robust classification performance. The precision-recall curve, especially beneficial for imbalanced datasets, demonstrates the model’s capacity to retain precision while enhancing recall, resulting in an AP score of 0.96. Comparisons with other classifiers, such as LGB (AUC = 0.98, AP = 0.99) and BBC (AUC = 0.96, AP = 0.98), demonstrate competitive results, but the SMOTEENN + MLP model remains a robust choice for accurate and reliable heart disease prediction. These findings emphasize the importance of balancing techniques and DL approaches in optimizing predictive accuracy, making the model a valuable tool for healthcare applications (Fig. 12).
[See PDF for image]
Fig. 12
ROC and Precision-Recall curve of SMOTEENN+MLP model on FHD dataset—the x-axis represents the False Positive Rate (FPR), and the y-axis shows the True Positive Rate (TPR). The area under the curve (AUC) quantifies the model’s classification accuracy, where higher AUC values indicate better performance
The ROC curve and precision-recall curve for the SMOTEENN + MLP model on the SHD dataset offer valuable insights into its classification performance for heart disease detection (shown in Fig. 13).
The ROC curve illustrates the model’s efficacy in differentiating between positive and negative examples, with models such as MLP, LRC, and CNN attaining an AUC of 1.00, signifying perfect classification. Simultaneously, the precision-recall curve, especially advantageous for imbalanced datasets, underscores the model’s capacity to preserve high accuracy while enhancing recall, demonstrating an average precision (AP) of 1.00 for MLP, CNN, and LRC, indicating exceptional reliability. Comparisons with alternative classifiers, like BBC (AUC = 0.97, AP = 0.94) and MHA (AUC = 0.89, AP = 0.89), underscore the robust predictive efficacy of the SMOTEENN + MLP model on the SHD dataset. These findings validate that this method improves classification accuracy, rendering it a promising instrument for efficient cardiac disease detection and healthcare applications.
[See PDF for image]
Fig. 13
ROC and Precision-Recall curve of SMOTEENN+MLP model on SHD dataset—the x-axis represents the False Positive Rate (FPR), and the y-axis shows the True Positive Rate (TPR). The area under the curve (AUC) quantifies the model’s classification accuracy, where higher AUC values indicate better performance
Stratified 5-fold cross validation reports
Table 8 presents the performance of five models (BBC, LGB, LRC, MLP, and RBC) evaluated using stratified 5-fold cross-validation on three distinct datasets: CHD, FHD, and SHD. The metrics reported include accuracy, precision, recall, F1-score, Kappa, MCC, MAE, MSE, RMSE, AUC, sensitivity, and specificity, each expressed as mean ± standard deviation. Across all datasets, the MLP model consistently outperforms the others, achieving the highest accuracy (e.g., 0.9897 ± 0.014 on CHD), perfect precision (1.0), and perfect AUC (1.0) in several cases, indicating its robustness and generalization ability. Notably, the MLP model also maintains the lowest error values (MAE, MSE, RMSE), demonstrating high prediction accuracy and stability. Other models such as LGB and RBC also show competitive performance but remain marginally below MLP. Interestingly, the LRC model performs reasonably well on CHD and SHD but underperforms on FHD, suggesting dataset-specific sensitivity. Overall, the table highlights the superior capability of MLP across diverse datasets, validating its effectiveness for classification tasks in the studied domain.
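A stratified 5-fold evaluation reporting mean ± SD, analogous to Table 8, can be sketched as follows; the synthetic dataset and the logistic-regression stand-in are placeholders for the study's models and data:

```python
# Stratified 5-fold cross-validation reporting mean ± SD of accuracy
# (dataset and model are illustrative placeholders).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)
print(f"{scores.mean():.4f} \u00b1 {scores.std():.4f}")
```

Stratification preserves the class ratio in every fold, which matters for imbalanced medical data; without it, a fold could contain too few minority-class cases to estimate recall or specificity reliably.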
Table 8. Performance of stratified 5-fold cross validation of mean ± SD of metrics using SMOTEENN
Datasets | Model | Accuracy | Precision | Recall | F1-score | Kappa | MCC | MAE | MSE | RMSE | AUC | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CHD | BBC | 0.9282 ± 0.0556 | 0.9484 ± 0.0359 | 0.9105 ± 0.0964 | 0.9271 ± 0.061 | 0.8565 ± 0.1105 | 0.86 ± 0.1056 | 0.0718 ± 0.0556 | 0.0718 ± 0.0556 | 0.2518 ± 0.1023 | 0.9897 ± 0.0097 | 0.9105 ± 0.0964 | 0.9468 ± 0.0372 |
CHD | LGB | 0.959 ± 0.05 | 0.9814 ± 0.0255 | 0.94 ± 0.1084 | 0.9568 ± 0.0567 | 0.9181 ± 0.0993 | 0.924 ± 0.0885 | 0.041 ± 0.05 | 0.041 ± 0.05 | 0.1677 ± 0.127 | 0.9953 ± 0.0054 | 0.94 ± 0.1084 | 0.9784 ± 0.0296 |
CHD | LRC | 0.9846 ± 0.014 | 1.0 ± 0.0 | 0.9705 ± 0.027 | 0.9849 ± 0.0138 | 0.9692 ± 0.0281 | 0.97 ± 0.0274 | 0.0154 ± 0.014 | 0.0154 ± 0.014 | 0.0961 ± 0.0877 | 0.9984 ± 0.0024 | 0.9705 ± 0.027 | 1.0 ± 0.0 |
CHD | MLP | 0.9897 ± 0.014 | 1.0 ± 0.0 | 0.9805 ± 0.0267 | 0.99 ± 0.0137 | 0.9795 ± 0.0281 | 0.98 ± 0.0274 | 0.0103 ± 0.014 | 0.0103 ± 0.014 | 0.0641 ± 0.0877 | 1.0 ± 0.0 | 0.9805 ± 0.0267 | 1.0 ± 0.0 |
CHD | RBC | 0.9744 ± 0.0314 | 0.9814 ± 0.0255 | 0.97 ± 0.0671 | 0.9743 ± 0.0332 | 0.9487 ± 0.0626 | 0.9512 ± 0.0585 | 0.0256 ± 0.0314 | 0.0256 ± 0.0314 | 0.1195 ± 0.1191 | 0.9989 ± 0.0014 | 0.97 ± 0.0671 | 0.9784 ± 0.0296 |
FHD | BBC | 0.9136 ± 0.0094 | 0.9234 ± 0.011 | 0.9422 ± 0.0065 | 0.9327 ± 0.0071 | 0.812 ± 0.021 | 0.8125 ± 0.0208 | 0.0864 ± 0.0094 | 0.0864 ± 0.0094 | 0.2936 ± 0.0158 | 0.964 ± 0.0049 | 0.9422 ± 0.0065 | 0.8637 ± 0.0209 |
FHD | LGB | 0.9209 ± 0.009 | 0.9282 ± 0.0116 | 0.949 ± 0.0031 | 0.9385 ± 0.0066 | 0.8278 ± 0.0202 | 0.8283 ± 0.0199 | 0.0791 ± 0.009 | 0.0791 ± 0.009 | 0.2809 ± 0.016 | 0.974 ± 0.006 | 0.949 ± 0.0031 | 0.872 ± 0.0222 |
FHD | LRC | 0.7697 ± 0.0133 | 0.7952 ± 0.0126 | 0.8589 ± 0.0079 | 0.8258 ± 0.0092 | 0.4877 ± 0.0317 | 0.4909 ± 0.031 | 0.2303 ± 0.0133 | 0.2303 ± 0.0133 | 0.4797 ± 0.0139 | 0.8306 ± 0.0137 | 0.8589 ± 0.0079 | 0.6143 ± 0.0289 |
FHD | MLP | 0.9333 ± 0.0087 | 0.9234 ± 0.0057 | 0.976 ± 0.009 | 0.949 ± 0.0067 | 0.8531 ± 0.019 | 0.8558 ± 0.0194 | 0.0667 ± 0.0087 | 0.0667 ± 0.0087 | 0.2577 ± 0.0175 | 0.961 ± 0.0048 | 0.976 ± 0.009 | 0.8591 ± 0.0106 |
FHD | RBC | 0.7626 ± 0.0138 | 0.7878 ± 0.0106 | 0.8577 ± 0.0339 | 0.8209 ± 0.0136 | 0.4703 ± 0.0255 | 0.4754 ± 0.0277 | 0.2374 ± 0.0138 | 0.2374 ± 0.0138 | 0.4871 ± 0.0141 | 0.8164 ± 0.0167 | 0.8577 ± 0.0339 | 0.5968 ± 0.0359 |
SHD | BBC | 0.9592 ± 0.0291 | 0.9436 ± 0.0375 | 0.9647 ± 0.0526 | 0.9532 ± 0.0342 | 0.9171 ± 0.0595 | 0.9185 ± 0.059 | 0.0408 ± 0.0291 | 0.0408 ± 0.0291 | 0.1775 ± 0.1076 | 0.9899 ± 0.0177 | 0.9647 ± 0.0526 | 0.9553 ± 0.0308 |
SHD | LGB | 0.9641 ± 0.0429 | 0.9638 ± 0.0333 | 0.9529 ± 0.0767 | 0.9575 ± 0.0522 | 0.9265 ± 0.0884 | 0.9277 ± 0.0867 | 0.0359 ± 0.0429 | 0.0359 ± 0.0429 | 0.1414 ± 0.141 | 0.992 ± 0.0098 | 0.9529 ± 0.0767 | 0.9727 ± 0.0249 |
SHD | LRC | 0.9644 ± 0.0291 | 1.0 ± 0.0 | 0.9176 ± 0.0671 | 0.956 ± 0.0369 | 0.9263 ± 0.0605 | 0.93 ± 0.0562 | 0.0356 ± 0.0291 | 0.0356 ± 0.0291 | 0.1642 ± 0.1041 | 0.9761 ± 0.0301 | 0.9176 ± 0.0671 | 1.0 ± 0.0 |
SHD | MLP | 0.9899 ± 0.0139 | 1.0 ± 0.0 | 0.9765 ± 0.0322 | 0.9879 ± 0.0166 | 0.9792 ± 0.0285 | 0.9797 ± 0.0278 | 0.0101 ± 0.0139 | 0.0101 ± 0.0139 | 0.0636 ± 0.0872 | 0.9995 ± 0.0012 | 0.9765 ± 0.0322 | 1.0 ± 0.0 |
SHD | RBC | 0.959 ± 0.0532 | 0.9521 ± 0.0517 | 0.9529 ± 0.0767 | 0.9522 ± 0.0629 | 0.9163 ± 0.1089 | 0.9168 ± 0.1085 | 0.041 ± 0.0532 | 0.041 ± 0.0532 | 0.1489 ± 0.1535 | 0.9914 ± 0.0116 | 0.9529 ± 0.0767 | 0.9636 ± 0.038 |
Comparative analysis with baseline ML models
Comparison with advanced deep learning architectures
The proposed XAI-HD framework evaluates a diverse set of both traditional ML models (e.g., RBC, BBC, LRC, LGB) and DL architectures such as Convolutional Neural Networks (CNN) and Multi-Head Attention (MHA). These models were tested on structured/tabular clinical datasets (CHD, FHD, SHD) rather than time-series data. Although XAI-HD does not currently target time-series cardiovascular event prediction, the included DL models are capable of capturing complex feature interactions, and future extensions may incorporate temporal modeling techniques.
Statistical validation of improvements
To ensure the statistical significance of performance improvements, we employed nonparametric statistical testing using the Wilcoxon signed-rank test. This test confirmed that XAI-HD consistently outperforms baseline models across all three datasets. While the Wilcoxon test validates pairwise model improvements, we acknowledge the importance of multi-model comparisons and plan to incorporate additional tests like the Friedman test and Nemenyi post-hoc analysis in future studies to provide more comprehensive validation.
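As an illustration of this procedure, the pairwise comparison can be run with SciPy's implementation of the signed-rank test. The fold-wise scores below are placeholders for illustration only, not the study's actual values:

```python
# Hypothetical fold-wise accuracy scores for XAI-HD and one baseline model;
# the study's actual per-fold values appear in the results tables.
from scipy.stats import wilcoxon

xai_hd   = [0.99, 0.98, 0.99, 0.97, 0.98, 0.99, 0.98, 0.97, 0.99, 0.98]
baseline = [0.93, 0.92, 0.94, 0.91, 0.93, 0.92, 0.94, 0.90, 0.93, 0.92]

# One-sided test: H1 says XAI-HD's median score exceeds the baseline's
stat, p_value = wilcoxon(xai_hd, baseline, alternative="greater")
significant = p_value < 0.05  # reject H0 of equal median performance
```

Because the test is paired and nonparametric, it makes no normality assumption about the fold-wise score differences, which is why it suits small cross-validation samples.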
Handling of missing data in real-world EHR datasets
XAI-HD applies systematic preprocessing techniques to address missing data—a common issue in electronic health record (EHR) datasets. Categorical variables are imputed using the mode, while numerical features are filled using the mean, ensuring completeness for model training. Although the datasets used in this study are structured and not longitudinal in nature, we recognize the need for more sophisticated imputation methods (e.g., KNN, MICE, or model-based imputation) and plan to integrate dynamic approaches for handling time-dependent missing data in future iterations targeting EHR-based time-series applications.
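A minimal standard-library sketch of this imputation scheme (mode for categorical columns, mean for numerical ones); the column names and the `kind` flag are illustrative, not part of the framework's API:

```python
from statistics import mean, mode

def impute_column(values, kind):
    """Fill missing (None) entries: mode for categorical columns, mean for
    numerical ones, mirroring the preprocessing scheme described above."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if kind == "categorical" else mean(observed)
    return [fill if v is None else v for v in values]

# Example: a numerical column (e.g. cholesterol) and a categorical one (e.g. sex)
chol = impute_column([210.0, None, 190.0], "numerical")    # None -> 200.0
sex  = impute_column(["M", "M", None, "F"], "categorical")  # None -> "M"
```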
Ablation study
To systematically evaluate the impact of different components of our proposed XAI-HD framework, we conducted an ablation study focusing on two critical dimensions: data balancing techniques and model selection.
Balancing techniques exploration
Imbalanced data significantly impairs model learning and prediction in medical domains, particularly in heart disease classification. We examined several resampling strategies including OSS, NCR, SMOTEN, ADASYN, SMOTETomek, and SMOTEENN. Among these, the hybrid SMOTEENN technique demonstrated superior performance by effectively balancing the dataset while preserving essential class characteristics. It consistently led to improvements in key metrics such as recall, F1-score, and ROC–AUC across CHD, FHD, and SHD datasets, highlighting its ability to address class imbalance without sacrificing model performance.
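The two stages of SMOTEENN, synthetic minority oversampling followed by edited-nearest-neighbour cleaning, can be sketched in a few lines of NumPy. This is an illustrative reimplementation under simplified assumptions; in practice, libraries such as imbalanced-learn provide a ready-made `SMOTEENN` class:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating between a
    random minority point and one of its k nearest minority neighbours (SMOTE step)."""
    rng = np.random.default_rng(rng)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest minority neighbours
    synth = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(k)]
        lam = rng.random()                      # interpolation factor in [0, 1)
        synth[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synth

def enn_clean(X, y, k=3):
    """Drop samples whose label disagrees with the majority of their k nearest
    neighbours (the Edited Nearest Neighbours cleaning step)."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    keep = np.array([(y[nn[i]] == y[i]).mean() >= 0.5 for i in range(len(X))])
    return X[keep], y[keep]
```

Oversampling first equalizes class counts, and the ENN pass then removes noisy or borderline samples, which is why the hybrid tends to preserve class characteristics better than oversampling alone.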
Model selection evaluation
To determine the most effective learning algorithm, we experimented with various ML and DL models including RBC, BBC, LRC, MLP, LGB, CNN, and MHA. The ablation results confirmed that the MLP significantly outperformed other models in terms of accuracy, generalization, and robustness. Its architecture facilitated complex feature extraction and non-linear pattern recognition, making it highly effective for cardiovascular disease prediction tasks. This supports the selection of MLP as the backbone of our XAI-HD framework.
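The model-selection protocol can be sketched with scikit-learn; the synthetic data and hyperparameters below are assumptions for illustration, not the study's exact configuration:

```python
# Sketch of the model-selection ablation: train candidate classifiers on the
# same split and compare held-out accuracy (illustrative setup only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# 13 features, roughly mimicking the Cleveland-style attribute count
X, y = make_classification(n_samples=600, n_features=13, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "LRC": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```

The same loop extends to the other candidates (RBC, BBC, LGB, CNN, MHA) given appropriate wrappers, keeping the comparison on identical splits.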
Overall, the ablation study validates that the integration of SMOTEENN with MLP offers a powerful and balanced solution, achieving both high predictive accuracy and resilience to data imbalance.
Key exploration note
Impact of Balancing Techniques: Our exploration revealed that hybrid data balancing strategies, particularly SMOTEENN, significantly improved the models’ capacity to handle class imbalance. This was evident from marked gains in evaluation metrics such as recall, F1-score, and ROC–AUC across all three datasets (CHD, FHD, and SHD), thereby improving predictive performance on imbalanced real-world data.
Superior Performance of MLP: The MLP, as the fundamental element of our proposed XAI-HD system, consistently surpassed other ML and DL models in accuracy, robustness, and generalization. Its deep layered architecture, coupled with backpropagation-driven learning, facilitated efficient feature abstraction and the sophisticated pattern recognition essential to heart disease identification.
Generalization Across Datasets: The proposed XAI-HD model demonstrated its flexibility and scalability across a variety of cardiovascular prediction scenarios, delivering consistently strong performance on all datasets (CHD, FHD, and SHD). This consistent behavior underscores the model’s potential for practical application in diverse healthcare settings and enhances the reliability of its predictions in real-world imbalanced scenarios.
XAI visualization and analysis
To improve the interpretability of our heart disease prediction models, we utilized two popular Explainable AI (XAI) methodologies: SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations). SHAP quantifies the contribution of each input feature to the model’s output, thereby providing global feature importance, whereas LIME delivers local explanations for individual predictions. These methodologies facilitated the visualization and validation of the impact of key features, including age, cholesterol levels, and smoking status, across the CHD, FHD, and SHD datasets. The insights derived from XAI analysis enhance confidence in our MLP-based model and facilitate its practical use in healthcare environments.
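For intuition, the attribution values SHAP approximates are grounded in the exact Shapley formula; for a model with only a handful of features they can be computed by direct coalition enumeration. This is a brute-force educational sketch, not the optimized estimators the shap library actually uses:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions.
    v(S) evaluates f with features in S taken from x and the rest from baseline."""
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for r in range(n):
            for S in combinations(others, r):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                z_with = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                z_without = [x[j] if j in S else baseline[j] for j in range(n)]
                phi += weight * (f(z_with) - f(z_without))
        phis.append(phi)
    return phis

# For a linear model, phi_i reduces to w_i * (x_i - baseline_i)
model = lambda z: 2 * z[0] + 3 * z[1] - z[2]
phi = shapley_values(model, [1, 2, 3], [0, 0, 0])  # -> [2.0, 6.0, -3.0]
```

The efficiency property holds by construction: the attributions sum to the difference between the model's output at the instance and at the baseline.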
SHAP analysis for CHD dataset
Figure 14 presents the SHAP-based FIA for the CHD dataset, offering an in-depth view of how individual features contribute to the classification of Normal and Disease cases. In the SHAP - Normal Class visualization, features such as trestbps, thal, and sex exhibit the highest positive SHAP values, indicating their strong influence in classifying individuals as normal. Conversely, attributes such as restecg, slope, and cp carry negative SHAP values, pushing predictions away from the normal class. For the SHAP - Disease Class, the ranking of influential features shifts, with cp, thalach, and exang emerging as dominant contributors toward disease prediction. These features demonstrate a substantial positive SHAP impact, reinforcing their significance in detecting CHD cases. Meanwhile, factors such as slope, restecg, and trestbps exhibit negative SHAP values, pushing predictions away from the disease class. By leveraging SHAP, this analysis enhances model interpretability, ensuring transparency in feature contributions. These insights are crucial for identifying key risk factors, improving predictive accuracy, and supporting early detection strategies for CHD in clinical practice.
[See PDF for image]
Fig. 14
SHAP-based FIA on CHD Dataset—the x-axis represents SHAP values, indicating the impact of each feature on classification decisions, while the y-axis lists features such as trestbps, thal, sex, age, oldpeak, ca, slope, exang, cp, thalach, chol, fbs, and restecg. Higher absolute SHAP values denote greater influence on predictions
LIME analysis for CHD dataset
Figure 15 illustrates the local feature contributions obtained using LIME for the CHD dataset. The visualization highlights how individual features influence the model’s decision-making process. Notably, attributes such as thal, trestbps, and sex show positive contributions toward the disease prediction, while features like restecg, slope, and cp negatively impact the classification output. LIME’s interpretability at the instance level allows domain experts to understand and validate the rationale behind predictions, reinforcing the reliability and transparency of the model in sensitive applications like cardiovascular disease detection.
[See PDF for image]
Fig. 15
LIME-based FIA on CHD dataset—the x-axis represents the magnitude of feature contributions, indicating their influence on classification decisions, while the y-axis lists key features. Larger bars correspond to features with a stronger local effect on predictions
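LIME explanations of this kind come from fitting a proximity-weighted linear surrogate around one instance; a minimal NumPy sketch of the idea (illustrative, not the lime package's exact algorithm) is:

```python
import numpy as np

def lime_local(f, x, n_samples=500, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around instance x; the
    resulting coefficients approximate each feature's local effect on f."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))      # perturb around x
    y = np.array([f(z) for z in Z])                              # query the model
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / kernel_width**2)  # proximity weights
    A = np.hstack([Z, np.ones((n_samples, 1))])                  # intercept column
    coef = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))
    return coef[:-1]                                             # per-feature local weights
```

The sign of each coefficient matches the bar direction in plots like Fig. 15: positive weights push the prediction toward the explained class, negative weights away from it.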
SHAP analysis for FHD dataset
Figure 16 illustrates the SHAP-based FIA for the FHD dataset, highlighting how various features influence the classification of Normal and Disease cases. In the SHAP - Normal Class visualization, attributes such as age, prevalent hypertension, and BMI exhibit strong positive SHAP values, signifying their major contributions to the classification of normal individuals. Meanwhile, features such as male, cigsPerDay, and glucose show negative SHAP values, suggesting a lower impact or contrasting influence. On the other hand, the SHAP - Disease Class representation emphasizes the role of key risk factors, with age, BMI, and blood pressure emerging as dominant contributors toward disease prediction. Additionally, variables such as diabetes, heart rate, and glucose levels display notable effects, underscoring their relevance in identifying individuals at risk. By utilizing SHAP values, this analysis enhances model interpretability, offering a transparent view of how specific features drive predictions. Such insights are crucial for improving early diagnosis strategies and optimizing clinical decision-making in heart disease risk assessment.
[See PDF for image]
Fig. 16
SHAP-based FIA FHD dataset—the x-axis represents SHAP values, quantifying each feature’s influence on model predictions, while the y-axis lists key features such as age, BMI, cholesterol, blood pressure, and smoking status. Higher absolute SHAP values indicate stronger importance in determining classification outcomes
LIME analysis for FHD dataset
Figure 17 illustrates the local feature contributions obtained using LIME for the FHD dataset. The visualization highlights how individual features influence the model’s decision-making process. Notably, attributes such as prevalent stroke, age, diabetes, and systolic blood pressure (sysBP) show positive contributions toward the disease prediction, while features like BPMeds, education, and total cholesterol negatively impact the classification output. LIME’s interpretability at the instance level allows domain experts to understand and validate the rationale behind predictions, reinforcing the reliability and transparency of the model in sensitive applications like cardiovascular disease detection.
[See PDF for image]
Fig. 17
LIME-based FIA on FHD dataset—the x-axis represents the magnitude of feature contributions, indicating their influence on classification decisions, while the y-axis lists key features. Larger bars correspond to features with a stronger local effect on predictions
SHAP analysis for SHD dataset
Figure 18 illustrates the SHAP-based FIA for the SHD dataset, offering insights into how different attributes contribute to the classification of Normal and Disease cases. In the SHAP - Normal Class visualization, features such as cp, exang, and thalach exhibit the highest negative SHAP values, indicating their strong influence in maintaining a normal classification. Meanwhile, attributes like age, sex, and oldpeak show minimal impact, reinforcing their lesser significance in differentiating normal cases. On the other hand, the SHAP - Disease Class representation highlights features such as cp, exang, and thalach as having prominent positive SHAP values, strongly contributing to disease predictions. Variables such as slope, restecg, and trestbps demonstrate moderate influences, further refining the classification process. This analysis utilizes SHAP to improve model interpretability, ensuring transparency in feature contributions. These insights are essential for understanding individual risk factors and enhancing predictive accuracy in heart disease identification.
[See PDF for image]
Fig. 18
SHAP-based FIA on SHD dataset—the x-axis represents SHAP values, quantifying each feature’s influence on model predictions, while the y-axis lists key features such as cp, exang, thalach, restecg, trestbps, slope, ca, fbs, thal, oldpeak, sex, and age. Higher absolute SHAP values indicate stronger importance in determining classification outcomes
LIME analysis for SHD dataset
Figure 19 shows the local feature contributions derived from LIME for the SHD dataset, highlighting the impact of specific variables on model predictions. The visualization emphasizes critical attributes such as cp, thalach, and exang, which show strong positive contributions to the heart disease classification. Meanwhile, features like slope, restecg, and trestbps contribute negatively to disease prediction, reinforcing their role in distinguishing normal cases. LIME’s instance-level interpretability aids in understanding and validating the rationale behind the model’s predictions, ensuring transparency and reliability in cardiovascular disease detection and risk assessment.
[See PDF for image]
Fig. 19
LIME-based FIA on SHD Dataset—the x-axis represents the magnitude of feature contributions, indicating their influence on classification decisions, while the y-axis lists key features. Larger bars correspond to features with a stronger local effect on predictions
It is important to note that SHAP provides a global explanation by aggregating feature importance across the entire dataset, while LIME offers local interpretations by focusing on specific instances. As a result, certain features such as thal, cp, or restecg may appear in LIME’s output but not be prominently ranked in SHAP’s global importance due to their context-specific influence. These variations are expected and are, in fact, beneficial, as they allow clinicians to understand both generalized trends and patient-specific predictions. Additionally, our model substantiated the inclusion of features like age and sex (male) by aligning them with established medical literature. Numerous clinical studies, including those from the Framingham Heart Study, confirm that advancing age and male sex are significant risk factors for cardiovascular disease. Our model’s high attribution of importance to these variables, evident in both SHAP and LIME analyses, reinforces its alignment with empirical evidence. This not only validates the model’s learning patterns but also strengthens its clinical credibility, ensuring that predictions are grounded in medically relevant associations.
Computational complexity analysis
To evaluate the efficiency and scalability of the proposed XAI-HD approach, which leverages MLP as its core model, we performed a comprehensive computational complexity analysis across the CHD, FHD, and SHD datasets. Table 9 summarizes the key performance indicators: training time, prediction speed, total computational cost, and memory usage. Compared to traditional ML models (e.g., LRC, RBC, BBC), XAI-HD introduces moderate computational overhead due to its DL and explainability components. However, it maintains a favorable balance between transparency, performance, and resource efficiency. While logistic regression (LRC) demonstrated the lowest overhead (total time 0.281 s, memory 0.33 MB on SHD), the proposed XAI-HD (MLP-based) model delivers superior predictive performance with minimal trade-offs in computational demand. On both the CHD and SHD datasets, XAI-HD achieves low prediction latency (0.002 s) and moderate memory consumption (0.4–0.6 MB), making it feasible for real-time clinical inference.
In comparison, advanced DL models like CNN and MHA imposed significantly higher costs. For example, MHA required 43.893 s and 7.653 MB on FHD, indicating potential limitations for deployment in low-resource environments. XAI-HD, by contrast, strikes a practical trade-off by offering high accuracy with lower resource usage.
Table 9. Computational complexity analysis of ML & DL models on CVD datasets
Dataset | Model name | Training time (s) | Prediction time (s) | Total time (s) | Memory utilization (MB) |
|---|---|---|---|---|---|
CHD | RBC | 4.105 | 0.003 | 4.108 | 0.332 |
CHD | BBC | 3.397 | 0.006 | 3.402 | 0.428 |
CHD | LRC | 0.285 | 0.001 | 0.287 | 0.310 |
CHD | MLP | 3.542 | 0.002 | 3.544 | 0.402 |
CHD | LGB | 3.317 | 0.005 | 3.321 | 1.409 |
CHD | CNN | 9.612 | 0.681 | 10.293 | 5.046 |
CHD | MHA | 17.380 | 1.310 | 18.690 | 7.417 |
FHD | RBC | 3.658 | 0.036 | 3.693 | 3.053 |
FHD | BBC | 11.754 | 0.053 | 11.806 | 3.246 |
FHD | LRC | 0.385 | 0.005 | 0.389 | 2.030 |
FHD | MLP | 57.678 | 0.004 | 57.682 | 1.778 |
FHD | LGB | 12.052 | 0.013 | 12.065 | 2.526 |
FHD | CNN | 24.586 | 0.405 | 24.991 | 4.575 |
FHD | MHA | 43.306 | 0.587 | 43.893 | 7.653 |
SHD | RBC | 1.390 | 0.003 | 1.393 | 0.316 |
SHD | BBC | 2.309 | 0.006 | 2.315 | 0.243 |
SHD | LRC | 0.280 | 0.001 | 0.281 | 0.330 |
SHD | MLP | 3.766 | 0.002 | 3.768 | 0.634 |
SHD | LGB | 1.207 | 0.002 | 1.210 | 1.334 |
SHD | CNN | 6.881 | 0.484 | 7.365 | 5.146 |
SHD | MHA | 16.596 | 1.309 | 17.905 | 7.529 |
Feasibility for real-time clinical use
Due to its rapid prediction time and compact memory footprint, XAI-HD demonstrates strong potential for real-time use in clinical decision-support systems. Its ability to deliver near-instantaneous predictions (within milliseconds) makes it viable for integration into high-frequency health monitoring tools or hospital information systems.
Cloud and edge deployment potential
The lightweight inference capabilities of XAI-HD make it highly suitable for deployment on cloud-based platforms and edge-AI devices, such as portable diagnostic tools or embedded systems in healthcare IoT ecosystems. The model’s small size and fast execution time align well with the hardware constraints typically associated with edge computing.
Optimization strategies
To further enhance XAI-HD’s scalability and responsiveness, ongoing work includes model pruning, quantization, and knowledge distillation. These strategies are aimed at reducing model size and speeding up inference while retaining accuracy. Additionally, we plan to implement asynchronous data streaming and parallel computation techniques to optimize real-time performance in large-scale deployments.
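As one concrete example of these strategies, post-training int8 quantization shrinks weight tensors roughly fourfold at a small accuracy cost. The symmetric-quantization sketch below is illustrative only, not necessarily the scheme the framework will adopt:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric post-training quantization: map float weights onto int8
    using a single per-tensor scale factor."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the int8 representation."""
    return q.astype(np.float32) * scale
```

Storing `q` instead of 32-bit floats reduces memory by about 4x, and the round-trip error is bounded by half the scale step, which is often negligible for inference.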
In summary, XAI-HD provides a stable and interpretable framework for predicting heart disease, demonstrating significant computational efficiency. Its streamlined architecture guarantees dependable performance under resource limitations, while also being appropriately designed for future integration into real-time, edge, and cloud-based clinical applications.
Ethical considerations
The use of artificial intelligence (AI) in healthcare, especially in predictive functions like heart disease detection, requires careful ethical consideration to ensure responsible and reliable implementation. The primary concerns are data privacy, transparency, potential biases, and patient safety.
Data Privacy: Securing patient information is essential, particularly due to the delicate nature of medical records. Ensuring data anonymization and adherence to regulations like HIPAA and GDPR is crucial when managing patient information. Previous research, like that conducted by Li et al. (2024), emphasizes the necessity for rigorous processes in managing patient data, especially with sensitive diagnostic information.
Transparency and Explainability: To gain clinical trust and regulatory approval, AI models must be transparent and interpretable. Black-box models risk being untrustworthy, particularly in high-stakes environments like cardiology. Our study incorporates SHAP, LIME, and Feature Importance Analysis to enhance interpretability, ensuring that clinicians understand and trust the model’s decision-making process. This aligns with recommendations from Jia et al. (2024), who emphasized explainable model designs in biomedical image analysis tasks.
Bias and Fairness: Algorithmic bias is a critical ethical concern, especially when datasets may be unbalanced across gender, age, or ethnicity. Models trained on biased data can lead to unequal treatment outcomes. For instance, Huang et al. (2022) highlight the importance of balanced gene selection in phenotype classification, a concept that parallels the fairness requirement in cardiovascular diagnosis. Zhao et al. (2023) also note the significance of personalized approaches in cardiovascular interventions, implying the risk of generic model training without demographic consideration.
Clinical Relevance and Safety: The predictive outputs of AI models should not replace clinical judgment but rather assist in early detection and risk assessment. Models must be validated extensively across multiple cohorts to prevent misdiagnosis. Ethical implementation also includes ensuring that interventions based on predictions are actionable and do not induce harm, as emphasized in works like Bing et al. (2024) and Zhang et al. (2024).
Long-Term Monitoring and Accountability: AI-driven models should be subject to continuous monitoring and periodic auditing. Liu et al. (2024) advocate for longitudinal assessments in pediatric cardiology, which is equally relevant to AI deployment in adult cardiology for sustained evaluation of system performance and outcome impact.
Discussion
XAI-HD represents a significant advancement in heart disease prediction, integrating ML, DL, and explainable AI into a unified and interpretable framework. As evidenced in Table 10, the proposed model surpasses numerous state-of-the-art (SOA) approaches on the Cleveland dataset, achieving an unprecedented accuracy of 100% using an MLP model in conjunction with the SMOTEENN data balancing technique. This result highlights the critical role of rigorous preprocessing, particularly class balancing, in mitigating model bias and enhancing generalization performance. In contrast to many existing models that overlook the impact of data imbalance or feature representation, XAI-HD applies tailored strategies, such as Z-score normalization and hybrid balancing methods, to ensure robustness and fairness across diverse patient profiles. Although most studies utilized high-performing classifiers such as XGBoost, ensemble models, or CNN variations, few integrated explainability techniques, with only a limited number employing SHAP or LIME. This underlines XAI-HD’s novel contribution in pairing top-tier performance with model transparency, crucial in clinical applications. Additionally, the incorporation of statistical validation and complexity analysis further strengthens the reliability and reproducibility of the results. Overall, the integration of performance, interpretability, and computational rigor positions XAI-HD as a highly effective and trustworthy tool for real-world deployment in healthcare diagnostics.
Table 10. Comparison analysis of our proposed model with existing state-of-the-art (SOA) works on the Cleveland dataset
No. | Author | Dataset | Data balancing | Model | Accuracy | XAI |
|---|---|---|---|---|---|---|
1 | Sharma et al. (2017) | Cleveland | – | DT | 93.24% | – |
2 | Paul (2024c) | Cleveland | – | Stacking Ensemble | 98.2% | – |
3 | Musa and Muhammad (2022) | Cleveland | – | BN | 88.5% | – |
4 | Abushariah et al. (2014) | Cleveland | – | ANN | 87.04% | – |
5 | Nahar et al. (2013) | Cleveland | – | SMO | 96.04% | – |
6 | Ogunpola et al. (2024) | Cleveland | RO | XGBoost | 98.50% | – |
7 | Nursyahrina et al. (2024) | Cleveland | – | RNN | 88.52% | - |
8 | Srinivas and Katarya (2022) | Cleveland | – | XGBoost, OPTUNA | 94.7% | – |
9 | El-Bialy et al. (2015) | Cleveland | – | FDT, C4.5 | 78.06% | – |
10 | Mienye et al. (2020) | Cleveland | Mean-based partitioning | CART, AB-WAE | 93% | – |
11 | Shah et al. (2020) | Cleveland | – | KNN | 90.78% | – |
12 | Perumal and Kaladevi (2020) | Cleveland | PCA | SVM | 87% | – |
13 | Shrestha (2024a) | Cleveland | - | XGBoost | 90% | SHAP |
14 | Paul (2024b) | Cleveland | – | RF | 92.7% | SHAP, LIME |
15 | Paul (2024a) | Cleveland | – | XGBoost | 89.1% | (FIA) |
16 | Shrestha (2024b) | Cleveland | – | XGBoost | 90% | SHAP |
17 | Suryawanshi (2024) | Cleveland | – | Voting Classifier | 97.9% | – |
18 | Shrivastava et al. (2023) | Cleveland | – | CNN-BiLSTM | 96.66% | – |
19 | Kanagarathinam et al. (2022) | Cleveland | – | CatBoost | 94.34% | – |
20 | Kahramanli and Allahverdi (2008) | Cleveland | – | ANN-FNN | 86.8% | – |
21 | Almulihi et al. (2022) | Cleveland | – | Stacking | 97.17% | – |
22 | Kayalvizhi et al. (2023) | Cleveland | – | Hybrid | 95.3% | – |
23 | Laftah and Al-Saedi (2024) | Cleveland | – | Soft Voting | 98.54% | LIME |
24 | Proposed Model | Cleveland | SMOTEENN | MLP | 100.0% | SHAP, LIME
Table 11 presents a comprehensive comparison of the proposed XAI-HD framework with several existing state-of-the-art (SOA) methods applied to the Framingham dataset. Our proposed model achieved a remarkable accuracy of 92.71% using an MLP architecture combined with SMOTEENN for data balancing, outperforming the majority of previous works that typically employed classical ML models without advanced preprocessing techniques. While some models, such as the Quantum Neural Network (QNN) by Narain et al. and the RNN+GRU model by Krishnan et al., reported slightly higher accuracy, these methods often lack reproducibility, interpretability, or computational feasibility in clinical environments. Moreover, many prior studies either neglected data imbalance or relied on basic models like RFs and CNNs, resulting in lower predictive performance. Our results emphasize the critical impact of hybrid balancing strategies like SMOTEENN in addressing class skewness, a known challenge in clinical datasets such as Framingham. Although few models incorporated explainability techniques (e.g., SHAP or Saliency Maps), our framework prioritizes model robustness and transparency, which is essential for deployment in sensitive healthcare settings. The XAI-HD model not only provides competitive accuracy but also offers a balanced approach by integrating preprocessing, DL, and interpretability, ensuring that clinical predictions are both reliable and understandable.
Table 11. Comparison analysis of our proposed model with existing state-of-the-art (SOA) works on the Framingham dataset
No. | Author | Dataset | Data balancing | Model | Accuracy | XAI |
|---|---|---|---|---|---|---|
1 | Mahmoud et al. (2021) | Framingham | – | RF | 85.05% | – |
2 | Chen et al. (2020) | Framingham | – | RS | 85.11% | – |
3 | Narain et al. (2016) | Framingham | – | QNN | 98.57% | – |
4 | Suhatril et al. (2024) | Framingham | – | RF | 85% | – |
5 | Gupta and Sedamkar (2020) | Framingham | GA | GA-Optimized MLP | 84% | – |
6 | Gupta and Seth (2022) | Framingham | – | Optimized RF | 97.13% | FIA |
7 | Demir and Selvitopi (2023) | Framingham | – | ANN | 85.85% | – |
8 | Zhang et al. (2022) | Framingham | – | LightGBM | AUC: 0.834 | SHAP |
9 | Krishnan et al. (2023) | Framingham | SMOTE | RNN + GRU | 98.78% | – |
10 | Orfanoudaki et al. (2020) | Framingham | – | N-SRS (OCT) | 87.43% | – |
11 | Kahouadji (2024) | Framingham | – | DDS1 | 75% | – |
12 | Proposed Model | Framingham | SMOTEENN | MLP | 92.71% | SHAP, LIME |
In the case of the Switzerland dataset, as presented in Table 12, the proposed XAI-HD framework significantly outperforms existing state-of-the-art models by achieving a perfect accuracy of 100.0% through the combination of MLP and SMOTEENN. This marks a substantial improvement over the best-reported accuracy of 95.65% by Mesquita and Marques using an RF classifier with SHAP-based explainability. Other competitive approaches, such as PCA+DE by Luukka and Lampinen (94.50%) and XGBoost (Few-shot) by Nazary et al. (F1-score = 0.902), failed to reach similar predictive certainty. The remarkable gain in accuracy (4.35–10.6 percentage points) underscores the pivotal role of hybrid preprocessing, particularly the impact of SMOTEENN in mitigating class imbalance, which was largely unaddressed in prior works. Furthermore, while some models integrated explainability through tools like SHAP or feature importance, they lacked deep neural architectures or robust balancing methods. The XAI-HD model uniquely combines both elements, DL for high predictive power and the groundwork for integrating interpretability, making it a compelling benchmark. This blend of preprocessing precision, modeling strength, and interpretability establishes new performance standards for heart disease prediction on the Switzerland dataset.
Table 12. Comparison analysis of our proposed model with existing state-of-the-art (SOA) works on the Switzerland dataset
No. | Author | Dataset | Data balancing | Model | Accuracy | XAI |
|---|---|---|---|---|---|---|
1 | Nazary et al. (2024) | Switzerland | – | XGBoost (Few-shot) | F1 = 0.902 | Feature Importance |
2 | Mesquita and Marques (2024) | Switzerland | – | RF | 95.65% | SHAP |
3 | Rahman et al. (2024) | Switzerland | – | MLP | 89.40% | – |
4 | Luukka and Lampinen (2010) | Switzerland | – | PCA + DE | 94.50% | – |
5 | Lewandowicz and Kisiała (2023) | Switzerland | – | SVM | 82.47% | – |
6 | Tsoumplekas et al. (2024) | Switzerland | Balanced-MixUp | Federated Model | 83.33% | – |
7 | Chen (2024) | Switzerland | – | SVM | 87.78% | – |
8 | Ningthoujam et al. (2025) | Switzerland | – | XGBoost | 83.90% | SHAP, LIME |
9 | Rodriguez and Nafea (2024) | Switzerland | – | Centralized SVM | 83.30% | SHAP |
10 | Spencer et al. (2020) | Switzerland | – | BayesNet | 85% | – |
11 | Proposed Model | Switzerland | SMOTEENN | MLP | 100.0% | SHAP, LIME |
The XAI-HD framework demonstrates superior performance across all three datasets, achieving 100% accuracy on both the Cleveland and Switzerland datasets and 92.71% on the Framingham dataset. These results are complemented by consistent improvements in precision, recall, and F1-score, underscoring the model’s reliability in detecting true cases while minimizing false positives. As detailed in Tables 10, 11, and 12, XAI-HD surpasses high-performing models such as Paul’s stacking ensemble (98.2%), Krishnan’s RNN-GRU (98.78%), and Mesquita’s RF (95.65%) not only in predictive accuracy but also in transparency, fairness, and clinical deployability. This performance is largely attributed to the strategic integration of Z-score normalization, hybrid balancing using SMOTEENN, and the use of explainable AI tools like SHAP and LIME, components often missing in prior approaches. Notably, our model outperforms traditional and DL models, including XGBoost, QNN, and hybrid voting classifiers, by addressing critical gaps in data imbalance handling and interpretability. Furthermore, statistical validations and complexity analyses confirm the robustness and generalizability of our approach, making XAI-HD a replicable and trustworthy solution for real-world healthcare applications.
To evaluate the scalability and generalizability of the XAI-HD framework across diverse populations, regions, and healthcare systems, we rigorously tested it on three heterogeneous heart disease datasets: Cleveland (CHD), Framingham (FHD), and Switzerland (SHD), each reflecting distinct demographic and regional characteristics. The model consistently delivered strong performance, achieving 100% accuracy on CHD and SHD and 92.71% on FHD, demonstrating its robust adaptability to varied data distributions.
The scalability of XAI-HD is driven by key design principles. First, the adoption of universally applicable preprocessing strategies, such as Z-score normalization, hybrid balancing (SMOTEENN), and explainability tools (SHAP and LIME), enables effective handling of diverse data scales, distributions, and missing-value patterns commonly encountered in real-world clinical datasets. Second, the use of modular and interpretable classifiers facilitates straightforward substitution or fine-tuning according to infrastructure constraints or population-specific characteristics. This modularity is particularly beneficial when extending the framework to new healthcare systems with differing levels of digitization, data quality, or regulatory requirements. Overall, XAI-HD has been intentionally designed with scalability and adaptability at its core, ensuring its applicability across a wide range of clinical environments.
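To make the preprocessing stage concrete, the sketch below shows median imputation of missing values followed by Z-score normalization, the scaling step named above (a minimal NumPy sketch on hypothetical toy data, not the study's actual pipeline or dataset schema):

```python
import numpy as np

def preprocess(X):
    """Median-impute missing values, then Z-score normalize each column."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        med = np.nanmedian(col)          # median ignores missing entries
        col[np.isnan(col)] = med         # impute NaNs with the column median
        X[:, j] = col
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0] = 1.0              # guard against constant columns
    return (X - mu) / sigma              # Z-score: zero mean, unit variance

# Toy clinical-style matrix (age, cholesterol) with one missing value
X = np.array([[63, 233.0], [41, np.nan], [56, 354.0], [57, 192.0]])
Z = preprocess(X)
```

After this step every feature is on a comparable scale, which keeps distance-based balancing methods such as SMOTEENN from being dominated by large-magnitude attributes.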
Validation of research questions
To ensure the robustness and credibility of the XAI-HD framework, each research question is validated through systematic experimentation, statistical analysis, and interpretability assessments.
Validation of RQ1: model performance evaluation
To determine which ML or DL models achieve the highest accuracy and reliability in heart disease detection, the following steps are undertaken:
Performance Comparison: A diverse set of ML and DL models, including RBC, BBC, LRC, MLP, LGB, CNN, and MHA, is trained and evaluated on multiple datasets (CHD, FHD, SHD).
Metric-Based Assessment: Models are assessed using accuracy, precision, recall, F1-score, Cohen’s Kappa, MCC, sensitivity, specificity, and precision-recall and AUC-ROC curves.
Cross-Dataset Generalization: To validate reliability across diverse datasets, models are trained and tested on datasets with different feature distributions.
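The threshold-based metrics listed above all derive from the four confusion-matrix counts; the sketch below computes accuracy, precision, recall (sensitivity), specificity, F1-score, and MCC from raw counts (a minimal sketch with illustrative labels, not the study's actual predictions):

```python
import math

def binary_metrics(y_true, y_pred):
    """Standard binary classification metrics from a confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1, "mcc": mcc}

# Illustrative labels: 1 = heart disease present, 0 = absent
m = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

Reporting MCC and Cohen's Kappa alongside accuracy matters here because, on imbalanced cardiac datasets, accuracy alone can look strong while minority-class recall is poor.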
Validation of RQ2: effectiveness of data preprocessing and balancing
The impact of data preprocessing and class balancing techniques on model performance and fairness is validated through:
Class Balancing Strategies: Different balancing methods (OSS, NCR, SMOTEN, ADASYN, SMOTETomek, SMOTEENN) are applied, and their effects on model fairness and accuracy are analyzed.
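Hybrid balancers like SMOTEENN combine minority oversampling with neighborhood cleaning. The sketch below illustrates that idea, SMOTE-style interpolation between minority neighbors followed by an ENN-style purge of samples whose nearest neighbors mostly disagree with them, in plain NumPy (a simplified conceptual sketch on toy data, not the imbalanced-learn implementation the study would use in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_enn(X, y, k=3):
    """SMOTE-style oversampling of the minority class, then ENN-style cleaning."""
    minority = int(np.bincount(y).argmin())
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))   # oversample to parity
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        j = rng.choice(np.argsort(d)[1:k + 1])        # a near minority neighbor
        synth.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    X_all = np.vstack([X, synth])
    y_all = np.concatenate([y, np.full(n_new, minority)])
    # ENN step: drop samples whose k nearest neighbors mostly disagree
    keep = []
    for i in range(len(X_all)):
        d = np.linalg.norm(X_all - X_all[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]
        if (y_all[nbrs] == y_all[i]).sum() >= k / 2:
            keep.append(i)
    return X_all[keep], y_all[keep]

# Imbalanced toy data: two well-separated clusters, 20 vs 5 samples
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (5, 2))])
y = np.array([0] * 20 + [1] * 5)
Xb, yb = smote_enn(X, y)
```

The cleaning pass is what distinguishes SMOTEENN from plain SMOTE: it removes borderline or noisy samples from both classes, which is why it tends to yield fairer decision boundaries on overlapping clinical data.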
Validation of RQ3: explainability and trustworthiness
To assess whether SHAP and LIME provide transparent insights into model predictions:
Feature Importance Analysis (FIA): SHAP and LIME are used to identify the most influential features affecting heart disease predictions.
Expert Validation: Domain experts review SHAP and LIME explanations to verify whether they align with clinical knowledge.
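To make the SHAP side of this assessment concrete, the sketch below computes exact Shapley values for a small model by enumerating all feature coalitions, with absent features held at their background means; this is the attribution scheme SHAP approximates at scale (a minimal sketch over a hypothetical linear risk score, not the study's trained model):

```python
import itertools, math
import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley attributions for f at x; features outside a coalition
    are marginalized to their background values (brute force over coalitions)."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                w = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                     / math.factorial(n))
                z_with = background.copy()
                z_without = background.copy()
                for j in S:
                    z_with[j] = x[j]
                    z_without[j] = x[j]
                z_with[i] = x[i]                 # only z_with includes feature i
                phi[i] += w * (f(z_with) - f(z_without))
    return phi

# Hypothetical linear "risk score" over (age, cholesterol, max heart rate)
w = np.array([0.03, 0.002, -0.01])
f = lambda z: float(w @ z)
x = np.array([63.0, 233.0, 150.0])               # one patient (illustrative)
background = np.array([54.0, 246.0, 149.0])      # illustrative population means
phi = shapley_values(f, x, background)
```

For a linear model the attributions reduce to w_i * (x_i - background_i), and they always sum to f(x) - f(background), the efficiency property clinicians can use as a sanity check when reviewing explanations.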
Validation of RQ4: scalability and efficiency
The scalability and real-world applicability of XAI-HD are validated through:
Complexity Analysis: Training and inference times are recorded for each model to determine computational efficiency.
Statistical Significance Tests: Confidence intervals and the Wilcoxon signed-rank test (p-value) are used to verify the significance of performance differences.
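The significance check described above can be reproduced with SciPy's paired Wilcoxon signed-rank test; the sketch below compares per-fold accuracies of a baseline against an improved model (the numbers are illustrative, not the study's actual fold scores):

```python
from scipy.stats import wilcoxon

# Hypothetical per-fold accuracies for a baseline and the proposed model
baseline = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
proposed = [0.90, 0.89, 0.92, 0.91, 0.90, 0.90, 0.88, 0.93, 0.90, 0.93]

# Paired, non-parametric test on the per-fold differences; a small p-value
# indicates the improvement is unlikely to be due to chance
stat, p = wilcoxon(proposed, baseline)
```

The Wilcoxon test is preferred over a paired t-test here because per-fold accuracy differences are few in number and not guaranteed to be normally distributed.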
Clinical implications of the study
The findings of this study hold significant clinical implications for the early diagnosis, risk stratification, and treatment planning of heart disease using AI-driven methods. By integrating ML, DL, and XAI techniques, the proposed XAI-HD framework enhances predictive accuracy while ensuring transparency in medical decision-making.
Improved Early Detection and Risk Assessment
Heart disease remains a leading cause of mortality worldwide, and early identification of at-risk individuals is crucial for preventive interventions. The XAI-HD framework enables clinicians to:
Utilize AI-enhanced predictive models that offer superior accuracy in heart disease classification compared to traditional scoring methods.
Identify high-risk patients at an early stage, facilitating timely lifestyle modifications and medical interventions.
Improve individualized risk stratification by analyzing diverse clinical, demographic, and lifestyle attributes.
Enhanced Decision-Making Through Explainability
A major barrier to AI adoption in healthcare is the lack of interpretability in complex models. By integrating SHAP and LIME, the study provides:
Transparent explanations of model predictions, ensuring that AI-generated results align with clinical reasoning.
Practical insights into the most significant risk variables, enabling physicians to make informed diagnostic and therapeutic decisions.
Enhanced trust among healthcare practitioners, resulting in increased acceptance and incorporation of AI technologies into standard clinical practices.
Addressing Data Bias and Enhancing Generalizability
Medical datasets frequently exhibit class imbalance, which can skew predictions. The integration of advanced data balancing methods in XAI-HD:
Mitigates bias toward majority classes, ensuring fair and equitable predictions across various patient demographics.
Improves model generalization by refining training data distribution, minimizing overfitting, and enhancing performance across multiple clinical environments.
Promotes the creation of more inclusive AI-driven diagnostic models that address diverse populations.
Prospective Integration into Clinical Practice
The research highlights the practicality of deploying AI-based cardiac disease prediction models in real clinical settings. Key areas of integration include:
Electronic Health Records (EHR) Systems: Incorporating the XAI-HD infrastructure within EHR systems for instantaneous risk evaluation and automated notifications.
Telemedicine and Remote Patient Monitoring: Employing AI-driven insights for virtual consultations and ongoing cardiac health surveillance.
Personalized Treatment Optimization: Using model explanations to customize treatment plans according to the unique characteristics of each patient.
Practical applications and deployment feasibility
The XAI-HD architecture has been developed for practical clinical use, particularly for incorporation into Clinical Decision Support Systems (CDSS). Its transparent design and explainable outputs, backed by SHAP and LIME, allow healthcare practitioners to understand predictions and assess cardiovascular risk with confidence. XAI-HD can be integrated into hospital CDSS platforms to support automated risk stratification by providing patient-specific insights based on demographic, clinical, and lifestyle data.
To support continuous learning and adaptability, XAI-HD is designed for dynamic retraining on newly acquired patient data. This allows the model to evolve over time, improving predictive accuracy and remaining relevant as clinical data and population health patterns change. XAI-HD has been evaluated for scalability and demographic generalizability across three distinct datasets (CHD, FHD, SHD), reflecting variations in population characteristics such as age, gender, and ethnicity. The consistent performance across these datasets indicates a high level of generalizability. Nevertheless, for broad deployment in diverse clinical settings, approaches such as federated learning and transfer learning are being investigated as part of future research. These techniques will enable XAI-HD to adapt to institution-specific data distributions while safeguarding patient privacy, improving its applicability in multi-institutional and cross-regional healthcare environments.
The effective incorporation of the XAI-HD model into clinical workflows depends on its ability to deliver interpretable insights that improve decision-making for healthcare practitioners. Explainable AI methods allow clinicians to understand model predictions, supporting informed and transparent medical decisions. User-friendly tools, such as interactive dashboards, visual analytics, and automated report generation, can promote adoption and ensure accessibility for practitioners with diverse technical skills. Moreover, seamless integration with electronic health records (EHR) and clinical decision support systems (CDSS) will enhance practical application, facilitating tailored patient care and improving diagnostic precision. Future work should prioritize improving usability, resolving interoperability issues, and conducting pilot studies in healthcare settings to establish the model’s practical efficacy.
Conclusions
This paper presents XAI-HD, an explainable AI framework for heart disease detection that integrates machine learning, deep learning, and interpretability techniques. By addressing key challenges such as class imbalance, missing data, and model transparency, XAI-HD achieves improved diagnostic accuracy and robust generalization across multiple datasets. Rigorous preprocessing, advanced class balancing, and comprehensive evaluation contribute to its enhanced prediction performance.
The incorporation of SHAP and LIME ensures explainability, providing clinicians with transparent insights into model predictions and fostering trust in AI-assisted diagnostics. Statistical validation and complexity analysis further establish the reliability and clinical relevance of the framework, highlighting its potential for integration into electronic health records, telemedicine platforms, and personalized treatment planning.
While XAI-HD demonstrates strong feasibility for real-world deployment, certain challenges remain. Sensitivity to dataset variations may affect generalizability, and computational efficiency must be optimized to support large-scale, real-time clinical use.
Future research will focus on extending XAI-HD to multi-modal data, including ECG signals and genetic markers, to strengthen its robustness in complex clinical scenarios. Incorporating federated learning will further enhance data privacy and scalability across healthcare institutions.
In summary, XAI-HD bridges the gap between advanced AI techniques and medical practice, paving the way for more transparent, equitable, and effective heart disease prediction. This contribution not only advances AI-driven cardiology but also supports broader efforts to improve early diagnosis, risk stratification, and patient outcomes in cardiovascular healthcare.
Acknowledgements
The authors would like to extend their sincere appreciation to the Ongoing Research Funding program (ORF-2025-301), King Saud University, Riyadh, Saudi Arabia.
Author Contributions
Md. Alamin Talukder: Conceptualization, Data curation, Methodology, Software, Resources, Visualization, Formal Analysis, Supervision, Writing–original draft and review & editing. Amira Samy Talaat: Conceptualization, Methodology, Software, Resources, Visualization, Formal Analysis, Writing–original draft and review & editing. Mohsin Kazi: Formal Analysis, Visualization, Validation, Investigation, Writing–review & editing. Ansam Khraisat: Visualization, Validation, Formal Analysis, Investigation, Writing–review & editing.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions
Data Availability
The selected datasets are sourced from free, open-access repositories: the Cleveland and Switzerland Heart Disease dataset (https://archive.ics.uci.edu/static/public/45/heart+disease.zip) and the Framingham Heart Disease dataset (https://www.kaggle.com/datasets/aasheesh200/framingham-heart-study-dataset).
Declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent to publish
Not applicable.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Abdullahi, A; Ali Barre, M; Hussein Elmi, A. A machine learning approach to cardiovascular disease prediction with advanced feature selection. Indonesian J Electr Eng Comput Sci; 2024; 33,
Abushariah, MA; Alqudah, AA; Adwan, OY; Yousef, RM. Automatic heart disease diagnosis system based on artificial neural network (ann) and adaptive neuro-fuzzy inference systems (anfis) approaches. J Softw Eng Appl; 2014; 7,
Almulihi, A; Saleh, H; Hussien, AM; Mostafa, S; El-Sappagh, S; Alnowaiser, K; Ali, AA; Refaat Hassan, M. Ensemble learning based on hybrid deep learning model for heart disease early prediction. Diagnostics; 2022; 12,
Bing, P; Liu, W; Zhai, Z; Li, J; Guo, Z; Xiang, Y; He, B; Zhu, L. A novel approach for denoising electrocardiogram signals to detect cardiovascular diseases using an efficient hybrid scheme. Front Cardiovasc Med; 2024; 11, 1277123. [DOI: https://dx.doi.org/10.3389/fcvm.2024.1277123]
Carvalho, M; Pinho, AJ; Brás, S. Resampling approaches to handle class imbalance: a review from a data perspective. J Big Data; 2025; 12,
Chen F (2024) Intelligent diagnosis of heart disease based on medical feature data. In: International Conference on Social Development and Intelligent Technology (SDIT2024)
Chen, Y-S; Cheng, C-H; Chen, S-F; Jhuang, J-Y. Identification of the framingham risk score by an entropy-based rule model for cardiovascular disease. Entropy; 2020; 22,
Demir S, Selvitopi H (2023) Machine learning and deep leaning in predicting coronary heart disease. In: International Conference on Deep Learning, Artificial Intelligence and Robotics, pp. 101–108. Springer
Ejiyi CJ, Qin Z, Nneji GU, Monday HN, Agbesi VK, Ejiyi MB, Ejiyi TU, Bamisile OO (2024a) Enhanced cardiovascular disease prediction modelling using machine learning techniques: a focus on cardiovitalnet. Network 36:716–748
Ejiyi CJ, Qin Z, Ukwuoma CC, Nneji GU, Monday HN, Ejiyi MB, Ejiyi TU, Okechukwu U, Bamisile OO (2024b) Comparative performance analysis of boruta, shap, and borutashap for disease diagnosis: a study with multiple machine learning algorithms. Network 36:507–544
Ejiyi, CJ; Qin, Z; Amos, J; Ejiyi, MB; Nnani, A; Ejiyi, TU; Agbesi, VK; Diokpo, C; Okpara, C. A robust predictive diagnosis model for diabetes mellitus using shapley-incorporated machine learning algorithms. Healthc Analyt; 2023; 3, [DOI: https://dx.doi.org/10.1016/j.health.2023.100166] 100166.
Ejiyi CJ, Qin Z, Monday H, Ejiyi MB, Ukwuoma C, Ejiyi TU, Agbesi VK, Agu A, Orakwue C (2024a) Breast cancer diagnosis and management guided by data augmentation, utilizing an integrated framework of shap and random augmentation. BioFactors 50(1):114–134
Ejiyi CJ, Cai D, Ejiyi MB, Chikwendu IA, Coker K, Oluwasanmi A, Bamisile OF, Ejiyi TU, Qin Z (2024b) Polynomial-shap analysis of liver disease markers for capturing of complex feature interactions in machine learning models. Comput Biol Med 182:109168
El-Bialy, R; Salamay, MA; Karam, OH; Khalifa, ME. Feature analysis of coronary artery heart disease data sets. Procedia Comput Sci; 2015; 65, pp. 459-468. [DOI: https://dx.doi.org/10.1016/j.procs.2015.09.132]
Gupta, P; Seth, D. Comparative analysis and feature importance of machine learning and deep learning for heart disease prediction. Indonesian J Electr Eng Comput Sci; 2022; 29,
Gupta S, Sedamkar R (2020) Genetic algorithm for feature selection and parameter optimization to enhance learning on framingham heart disease dataset. In: Intelligent Computing and Networking: Proceedings of IC-ICN 2020, pp. 11–25. Springer
Huang, H; Wu, N; Liang, Y; Peng, X; Shu, J. Slnl: a novel method for gene selection and phenotype classification. Int J Intell Syst; 2022; 37,
Janosi A, Steinbrunn W, Pfisterer M, Detrano R (1989) Heart Disease. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C52P4X
Jawalkar, AP; Swetcha, P; Manasvi, N; Sreekala, P; Aishwarya, S; Kanaka Durga, BP; Anjani, P. Early prediction of heart disease with data analysis using supervised learning with stochastic gradient boosting. J Eng Appl Sci; 2023; 70,
Jia, Y; Chen, G; Chi, H. Retinal fundus image super-resolution based on generative adversarial network guided with vascular structure prior. Sci Rep; 2024; 14,
Kaggle: Framingham Heart Study Dataset. Accessed: 2025-04-22 (2022). https://www.kaggle.com/datasets/aasheesh200/framingham-heart-study-dataset
Kahouadji N (2024) Comparison of machine learning classification algorithms and application to the framingham heart study. arXiv preprint arXiv:2402.15005
Kahramanli, H; Allahverdi, N. Design of a hybrid system for the diabetes and heart diseases. Expert Syst Appl; 2008; 35,
Kanagarathinam, K; Sankaran, D; Manikandan, R. Machine learning-based risk prediction model for cardiovascular disease using a hybrid dataset. Data Knowl Eng; 2022; 140, [DOI: https://dx.doi.org/10.1016/j.datak.2022.102042] 102042.
Kayalvizhi, S; Nagarajan, S; Deepa, J; Hemapriya, K. Multi-modal iot-based medical data processing for disease diagnosis using heuristic-derived deep learning. Biomed Signal Process Control; 2023; 85, [DOI: https://dx.doi.org/10.1016/j.bspc.2023.104889] 104889.
Krishnan, S; Magalingam, P; Ibrahim, R. Enhanced recurrent neural network (rnn) for heart disease risk prediction using framingham datasets. Open Int J Informatics; 2023; 11,
Laftah, RH; Al-Saedi, KHK. Explainable ensemble learning models for early detection of heart disease. J Robot Control (JRC); 2024; 5,
Lewandowicz B, Kisiała K (2023) Comparison of support vector machine, naive bayes, and k-nearest neighbors algorithms for classifying heart disease. In: International Conference on Information and Software Technologies, pp. 274–285. Springer
Li, X; Liang, J; Hu, J; Ma, L; Yang, J; Zhang, A; Jing, Y; Song, Y; Yang, Y; Feng, Z. Screening for primary aldosteronism on and off interfering medications. Endocrine; 2024; 83,
Liu, Q; Li, C; Yang, L; Gong, Z; Zhao, M; Bovet, P; Xi, B. Weight status change during four years and left ventricular hypertrophy in chinese children. Front Pediatr; 2024; 12, 1371286. [DOI: https://dx.doi.org/10.3389/fped.2024.1371286]
Luukka P, Lampinen J (2010) A classification method based on principal component analysis and differential evolution algorithm applied for prediction diagnosis from clinical emr heart data sets. In: Computational Intelligence in Optimization: Applications and Implementations, pp. 263–283. Springer
Mahmoud, WA; Aborizka, M; Amer, FAE. Heart disease prediction using machine learning and data mining techniques: application of framingham dataset. Turk J Comput Math Educ; 2021; 12,
Mbanze, I; Spracklen, TF; Jessen, N; Damasceno, A; Sliwa, K. Heart failure in low-income and middle-income countries. Heart; 2025; 111,
Meera T, Devi SP (2025) Integrating machine learning and deep learning approaches for accurate cardiovascular disease prediction from electronic health records. In: 2025 International Conference on Multi-Agent Systems for Collaborative Intelligence (ICMSCI), pp. 1090–1096. IEEE
Mesquita, F; Marques, G. An explainable machine learning approach for automated medical decision support of heart disease. Data Knowl Eng; 2024; 153, [DOI: https://dx.doi.org/10.1016/j.datak.2024.102339] 102339.
Mienye, ID; Sun, Y; Wang, Z. An improved ensemble learning approach for the prediction of heart disease risk. Informatics Med Unlocked; 2020; 20, [DOI: https://dx.doi.org/10.1016/j.imu.2020.100402] 100402.
Musa, U; Muhammad, S. Enhancing the performance of heart disease prediction from collecting cleveland heart dataset using bayesian network. J Appl Sci Environ Manag; 2022; 26,
Naeem, A; Abbas, SH; Yousaf, M; Ishtiaq, A; Murtaza, I. Global impact and strategies to reduce the mortality from cardiovascular diseases. Integrated science for sustainable development goal 3: empowering global wellness initiatives; 2024; Berlin, Springer: pp. 283-306. [DOI: https://dx.doi.org/10.1007/978-3-031-64288-3_12]
Nahar, J; Imam, T; Tickle, KS; Chen, Y-PP. Computational intelligence for heart disease diagnosis: a medical knowledge driven approach. Expert Syst Appl; 2013; 40,
Narain, R; Saxena, S; Goyal, AK. Cardiovascular risk prediction: a comparative study of framingham and quantum neural network based approach. Patient Prefer Adherence; 2016; 10, pp. 1259-1270. [DOI: https://dx.doi.org/10.2147/PPA.S108203]
Nazary F, Deldjoo Y, Di Noia T, Di Sciascio E (2024) Xai4llm. let machine learning models and llms collaborate for enhanced in-context learning in healthcare. arXiv preprint arXiv:2405.06270
Ningthoujam AS, Sharma S, Nandi A (2025) Explainable ai based coronary heart disease prediction: Enhancing model transparency in clinical decision making. bioRxiv, 2025-03
Nursyahrina, N; Sahri, A; Hafizhah, NA. Modeling heart disease classification using rough neural network: a data-driven approach to the cleveland heart disease dataset. J Sist Informasi Ilmu Komput; 2024; 7,
Ogunpola, A; Saeed, F; Basurra, S; Albarrak, AM; Qasem, SN. Machine learning-based predictive models for detection of cardiovascular diseases. Diagnostics; 2024; 14,
Orfanoudaki, A; Chesley, E; Cadisch, C; Stein, B; Nouh, A; Alberts, MJ; Bertsimas, D. Machine learning provides evidence that stroke risk is not linear: the non-linear framingham stroke risk score. PLoS ONE; 2020; 15,
Paudel P, Karna SK, Saud R, Regmi L, Thapa TB, Bhandari M (2023) Unveiling key predictors for early heart attack detection using machine learning and explainable ai technique with lime. In: Proceedings of the 10th International Conference on Networking, Systems and Security, pp. 69–78
Paul J (2024a) A comprehensive study of advanced machine learning algorithms for predicting heart disease using the cleveland dataset
Paul J (2024b) Analyzing the role of explainable ai in heart disease diagnosis using machine learning models and the cleveland dataset
Paul J (2024c) Optimizing heart disease prediction: Ensemble learning techniques with the cleveland heart dataset
Perumal, R; Kaladevi, A. Early prediction of coronary heart disease from cleveland dataset using machine learning techniques. Int J Adv Sci Technol; 2020; 29,
Rahman B, Mantoro T, Andryana S, Gunaryati A, Rishiwal V (2024) Heart disease prediction: A comprehensive exploration of optimal predictive ai. In: International Conference on Machine Learning, Advances in Computing, Renewable Energy and Communication, pp. 197–212. Springer
Rodriguez MP, Nafea M (2024) Centralized and federated heart disease classification models using uci dataset and their shapley-value based interpretability. arXiv preprint arXiv:2408.06183
Shah, D; Patel, S; Bharti, SK. Heart disease prediction using machine learning techniques. SN Comput Sci; 2020; 1,
Sharma, T; Verma, S et al. Prediction of heart disease using cleveland dataset: A machine learning approach. Int J Recent Res Asp; 2017; 4,
Shrestha D (2024a) Comparative analysis of machine learning algorithms for heart disease prediction using the Cleveland Heart Disease dataset. Preprints
Shrestha D (2024b) Advanced machine learning techniques for predicting heart disease: A comparative analysis using the cleveland heart disease dataset. Appl Med Informatics 46(3):91–102
Shrivastava, PK; Sharma, M; Kumar, A. Hcbilstm: A hybrid model for predicting heart disease using cnn and bilstm algorithms. Measur Sensors; 2023; 25, [DOI: https://dx.doi.org/10.1016/j.measen.2022.100657] 100657.
Spencer, R; Thabtah, F; Abdelhamid, N; Thompson, M. Exploring feature selection and classification methods for predicting heart disease. Digital Health; 2020; 6, 2055207620914777. [DOI: https://dx.doi.org/10.1177/2055207620914777]
Srinivas, P; Katarya, R. Hyoptxg: optuna hyper-parameter optimization framework for predicting cardiovascular disease using xgboost. Biomed Signal Process Control; 2022; 73, [DOI: https://dx.doi.org/10.1016/j.bspc.2021.103456] 103456.
Suhatril, RJ; Syah, RD; Hermita, M; Gunawan, B; Silfianti, W. Evaluation of machine learning models for predicting cardiovascular disease based on framingham heart study data. ILKOM J Ilmiah; 2024; 16,
Suryawanshi NS (2024) Accurate prediction of heart disease using machine learning: A case study on the cleveland dataset
Talukder, MA. A hybrid multiscale feature fusion model for enhanced cardiovascular arrhythmia detection. Results Eng; 2025; 25, [DOI: https://dx.doi.org/10.1016/j.rineng.2025.104244] 104244.
Talukder, MA; Talaat, AS; Kazi, M. Hxai-ml: a hybrid explainable artificial intelligence based machine learning model for cardiovascular heart disease detection. Results Eng; 2025; 25, [DOI: https://dx.doi.org/10.1016/j.rineng.2025.104370] 104370.
Tsoumplekas G, Siniosoglou I, Argyriou V, Moscholios ID, Sarigiannidis P (2024) Enhancing performance for highly imbalanced medical data via data regularization in a federated learning setting. In: International Conference on AI in Healthcare, pp. 302–315. Springer
Vinora, A; Lloyds, E; Soundarya, M. A complete analysis of explainable ai and its methods for healthcare prediction. Edge AI for industry 5.0 and healthcare 5.0 applications; 2025; Boca Raton, Auerbach Publications: pp. 104-118. [DOI: https://dx.doi.org/10.1201/9781003442066-7]
Zhang, X; Wang, C; He, D; Cheng, Y; Yu, L; Qi, D; Li, B; Zheng, F. Identification of dna methylation-regulated genes as potential biomarkers for coronary heart disease via machine learning in the framingham heart study. Clin Epigenetics; 2022; 14,
Zhang, Z; Wu, K; Wu, Z; Xiao, Y; Wang, Y; Lin, Q; Wang, C; Zhu, Q; Xiao, Y; Liu, Q. A case of pioneering subcutaneous implantable cardioverter defibrillator intervention in timothy syndrome. BMC Pediatr; 2024; 24,
Zhao, Y; Xiong, W; Li, C; Zhao, R; Lu, H; Song, S; Zhou, Y; Hu, Y; Shi, B; Ge, J. Hypoxia-induced signaling in the cardiovascular system: pathogenesis and therapeutic targets. Signal Transduct Target Ther; 2023; 8,
Zhu, F; Boersma, E; Tilly, M; Ikram, MK; Qi, H; Kavousi, M. Trends in population attributable fraction of modifiable risk factors for cardiovascular diseases across three decades. Eur J Prev Cardiol; 2024; 31,