Coronary illness is one of the major causes of mortality globally. Timely and precise diagnosis of the type of disease is significant for therapy and prognosis. Research scientists are working rigorously in their respective fields to reduce the death rate. Even though a great deal of research has taken place in this area, there is still scope for increasing the prediction accuracy. The fundamental aim of our proposed work is to build a hybrid methodology using a genetic algorithm (GA) with a radial basis function (RBF) network (GA-RBF) for the detection of coronary sickness with increased accuracy using a feature selection mechanism. The proposed system achieved an accuracy of 85.40% using 14 attributes, and the prediction accuracy increased to 94.20% with nine attributes, showing that the system performs much better after attribute reduction.
Introduction
Healthcare has become one of the fastest-growing sectors in the economy today and is becoming more expensive every day. To address this, technologies like machine learning and big data have come into the picture to help doctors provide the best treatment for diseases at minimal cost. With the growing population, large datasets are available to support further investigation. We made a comparative analysis of the most popular machine learning algorithms, which helps researchers make decisions while choosing among them. One of the underlying reasons for death from coronary illness is ignoring the disease at its initial phase. An ordinary clinical exam can assist with finding the disease at an early stage. If coronary disease is found in time, it can be adequately managed or cured; an appropriate diet, medication, and exercise can substantially treat coronary illness.
Coronary disease is among the most complex and deadliest human sicknesses on earth. When the heart fails to push the necessary amount of blood to the different organs that sustain the regular functions of the body, cardiovascular breakdown eventually occurs [1]. The side effects of coronary illness include physical fatigue, swollen feet, shortness of breath, and tiredness, together with signs such as raised jugular venous pressure and peripheral edema brought about by cardiac or non-cardiac abnormalities [2].
The diagnosis and treatment of coronary illness are perplexing, particularly in developing nations, because of the scarce availability of doctors and other issues such as a lack of devices, which affect the prediction and treatment of heart ailments [3]. Proper and exact identification of coronary illness risk in patients is vital for improving heart health and diminishing the risk of acute heart problems [4].
Based on the latest WHO survey, only 75% of heart diseases are correctly predicted by medical researchers. So, the prediction of heart problems with better accuracy is a good topic for emerging medical researchers in the healthcare field. About 50% of people with coronary illness lose their lives in the initial 1–2 years [5]. A delay in diagnosis results in weak identification of the disease at an early stage. Besides, diagnosis is computationally intricate, expensive, and time-consuming [6].
Heart-based illness is a broad term that includes different kinds of afflictions affecting various parts of the heart, and it has emerged as the top killer disease in India. The death rate brought on by heart disease is growing, as it is the most threatening disease today. Around 25% of estimated deaths in the age group of 69 years happen because of cardiovascular sickness. In urban territories, 32.8% of deaths occur because of heart illness; in rural regions, it is 22.9% [21]. The Asian population ranks fifth globally in the death rate for heart-based sickness.
The capability of machine learning brings a great deal of scope and strength to various opportunities in the area of medical science. ML is categorized under artificial intelligence and is also known as predictive modelling or predictive analytics. At its simplest, it consists of algorithms that learn from data to forecast values within a suitable range programmed by users [7, 8]; such models, created by various analysts, are used for illness identification [9] and for the performance comparison of machine learning algorithms [10]. The Cleveland coronary illness dataset from the University of California Irvine (UCI) machine learning repository is used for this research.
Heart-based disease is a broad term that includes different kinds of diseases affecting various parts of the heart, and it has emerged as the top killer disease in India. The death rate brought on by heart disease is growing continuously, as it is the most threatening disease today. Around 25% of estimated deaths in the age group of 69 years happen because of cardiovascular maladies. In urban territories, 32.8% of deaths happen because of heart illness; in rural regions, it is 22.9%. According to the CADI (Coronary Artery Disease among Asian Indians) Research Foundation [11], the Asian Indian population ranks fifth globally in the death rate for heart-based sickness.
For emerging researchers in the field of cardiology, the main challenge is the proper diagnosis of the presence of heart disease in a person. Existing methods and techniques are not very efficient and correct in identifying heart diseases, although many research scientists in the field have proposed effective algorithms for predicting heart-based diseases. There are a variety of medical instruments available in the healthcare field, but they fail in two respects: first, they are too costly, and second, they are not accurate in predicting heart disease. Based on the latest WHO survey, only 78% of heart diseases are correctly predicted by medical researchers, so predicting heart problems with better accuracy at an early stage is a fair topic for emerging medical researchers in the healthcare field.
The WHO has stressed that CVD is the chief cause of death throughout the world, accounting for more than 30% of all fatalities worldwide. CVD impacts individuals disproportionately in low- and middle-income nations: about 82% of deaths due to CVD occur in low- and middle-income countries, and deaths occur approximately equally in men and women. It is anticipated that by the year 2030 about 23.6 million individuals will die from CVD annually [12]. The biggest growth in the number of deaths will occur in the South-East Asia region. It is essential for policy makers in these countries to take the prevention of this disease into account. Not only is the death rate from this condition high, but it also costs a great deal of money and energy for therapy and surgery. Considering the direct healthcare expense, we find that CVD is one of the costliest health conditions, so work in this area will definitely help the medical industry and human wellbeing.
Today, physicians take a much more advanced strategy to analysing an individual's risk factors for cardiovascular disease, such as diabetes mellitus and hypertension. Clinical experts state that there are more precise ways to evaluate and identify diseases that are severe threats to an individual [13]. Tests include chest CT scans, angiograms, MRI, etc.
The capability of machine learning brings a great deal of scope and strength to various opportunities in the area of medical science. ML is a subtype of AI and is also known as predictive modelling or predictive analytics.
Related work
Aniruddha Dutta et al. [14] proposed a system with the LASSO regression technique for increasing accuracy and obtained classification accuracies of 77% and 76% for SVM and RF, respectively. Yang et al. [11] presented a technique for the early diagnosis of cardiac failure. The implementation is based on SVM and Bayesian principal component analysis for imputing missing data values. The implemented mechanism categorized subjects into three different groups: those with a healthy heart, a diseased group, and a cardiac failure group. The accuracy obtained by the model is 74.40%.
Mike Mastanduno et al. [15] recognized five key differentiators that lead to progress with expert system techniques in the medical field. Utilizing Bayesian classifiers, they obtained accuracy values of 52.33%, 52%, and 45.67% for Naïve Bayes, Decision Tree, and K-NN, respectively. Gudadhe et al. [16] utilized MLP and SVM for coronary illness classification. The proposed technique acquired an accuracy of 80.41%.
Kanikar et al. [17] proposed a forecast model for cardiovascular ailments using SVM and Bayesian classification and achieved accuracy values of 79% and 80%. Muhammad Saqlain et al. [18] proposed the recognition of coronary failure from unstructured data of people with coronary illness and obtained accuracies of 80% and 69% using logistic regression and random forest algorithms.
Manavalan et al. [19], in their review of computational approaches for heart disease prediction, extensively reviewed the different data mining methods introduced for heart disease prediction and discussed their computational results. Prasad et al. [20] proposed a heart disease prediction mechanism with a logistic regression algorithm, achieving an accuracy of 79%, and with the K-NN algorithm, achieving an accuracy of 78%.
Yadav et al. [21] achieved accuracies of 94.19% and 98% by introducing a regularization model that contributed to increasing the overall accuracy with a neural network and fuzzy KNN by controlling overfitting. Amin et al. [22] contributed work on classifying healthy people from diseased ones; for their study, they used seven benchmark algorithms, three feature selection mechanisms, and several performance metrics for the analysis, and succeeded in achieving 89% accuracy.
Garate-Escamilla et al. [23] proposed a model with random forests (RF) that achieved a high accuracy of 98.7% on the UCI repository dataset. Cengiz Gazeloğlu et al. [24], based on their result analysis, found SVM (PolyKernel), with an accuracy of 85.14%, to be the most successful machine learning algorithm without feature selection.
The work referred to above covers several authors who experimented with various machine learning algorithms; a few of these research findings are displayed in Table 1.
Table 1. Model accuracy comparison in existing systems with benchmark dataset
| Authors | Method used | Published year | Accuracy achieved |
|---|---|---|---|
| Yang et al. | SVM | 2010 | 74% |
| P. Kanikar et al. | SVM | 2016 | 79% |
| Muhammad Saqlain et al. | LR & RF | 2016 | 80% |
| Aniruddha Dutta et al. | LASSO, SVM, RF | 2019 | 77% |
| Gudadhe et al. | MLP | 2019 | 80.41% |
| Samir S Yadav et al. | Fuzzy KNN | 2020 | 94.14% |
| Amin Ul Haq | Case study | 2018 | 89% |
| Garate-Escamilla | CHI-PCA | 2020 | 98.7% |
| Cengiz Gazeloğlu | SVM (PolyKernel) | 2020 | 85.14% |
For emerging researchers in the field of cardiology, the major challenge, and one of grave concern, is the proper diagnosis of the presence of heart disease in a human. Existing methods and techniques are not very efficient and accurate in identifying heart diseases, even though many research scientists in the field are efficient in predicting heart-based conditions. Based on the WHO's latest survey, only 75% of heart diseases are predicted by medical researchers. So, the prediction of heart disease with good accuracy is a reasonable topic for emerging medical researchers in the medical field.
The fundamental aim of our proposed work is to build a methodology for the detection of coronary sickness with better accuracy. Our proposed work aims at attribute reduction, which is one solution for achieving better efficiency. For feature selection, we used the genetic algorithm, which influences the target predicted value. Furthermore, we evaluated the performance of all classifiers on the selected features with respect to accuracy, sensitivity, specificity, etc.
Methods
This section explains the methods and materials used in the proposed work.
Dataset description
The coronary disease dataset from the Cleveland repository, made available by the University of California, Irvine [24], is used for this research work. It consists of 303 records with 76 characteristics. Six records with missing values were removed during the analysis phase. The output target variable is used for identifying coronary disease. Out of the 76 characteristics, a subset of 14 attributes is used for this study, giving a total of 297 records with 14 features; the complete description of the dataset is shown in Table 2 below.
Table 2. Cleveland heart disease dataset features and description [24]
| Name | Code | Description | Domain range of values |
|---|---|---|---|
| Patient's age | AGE | Age in years | 30–77 |
| Patient's gender | SEX | 1 = male; 0 = female | 0, 1 |
| Chest pain type | CPT | 1 = atypical angina; 2 = typical angina; 3 = asymptomatic; 4 = non-anginal pain | 1–4 |
| Resting blood pressure | RBP | In mm Hg on admission to the hospital | 94–200 |
| Serum cholesterol | SCH | In mg/dl | 120–154 |
| Fasting blood sugar | FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | 0, 1 |
| Resting ECG results | RES | 0 = normal; 1 = having ST-T wave abnormality; 2 = hypertrophy | 0–2 |
| Maximum heart rate | MHR | – | 71–202 |
| Exercise-induced angina | EIA | 1 = yes; 0 = no | 0, 1 |
| Old peak (ST depression induced by exercise relative to rest) | OPK | – | 0–6.2 |
| Slope of peak exercise ST segment | PES | 1 = up sloping; 2 = flat; 3 = down sloping | 1–3 |
| Number of major vessels (0–3) coloured by fluoroscopy | VCA | – | 0–3 |
| Thallium scan | THA | 3 = normal; 6 = fixed defect; 7 = reversible defect | 3, 6, 7 |
| Target | TGT | Diagnosis (angiographic disease status): 0 = absence; 1 = presence | 0, 1 |
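As a hedged illustration of how the 297 × 14 working set described above could be assembled, the sketch below loads the raw Cleveland file with pandas. The file name, the "?" missing-value marker, and the binarization of the 0–4 disease code are assumptions about the standard UCI distribution, not a verbatim reproduction of the authors' pipeline.

```python
import pandas as pd

# Column names follow the codes in Table 2; the file name is an assumption,
# and the raw Cleveland file from the UCI repository marks missing entries with "?".
cols = ["AGE", "SEX", "CPT", "RBP", "SCH", "FBS", "RES",
        "MHR", "EIA", "OPK", "PES", "VCA", "THA", "TGT"]
df = pd.read_csv("processed.cleveland.data", names=cols, na_values="?")

# Binarize the target (the raw file codes disease severity 0-4; any value > 0 = presence)
df["TGT"] = (df["TGT"] > 0).astype(int)

# Drop the six records with missing values, leaving 297 records with 14 attributes
df = df.dropna()
print(df.shape)  # expected: (297, 14)
```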
Framework of the proposed system
The proposed system mainly concentrates on identifying heart-based diseases with better accuracy using selected features. The performance of different machine learning models has been tested on the Cleveland dataset for both the complete and the selected features, using popular benchmark machine learning models. The workflow of the system is implemented in four stages: pre-processing of the dataset, feature selection and reduction, cross validation, and classification with performance evaluation. Figure 1 shows the proposed system architecture.
Fig. 1
Proposed framework architecture
Dataset pre-processing
Processing the data plays a prominent role in representing the data efficiently and in making the classifier perform effectively. A MinMax scaler has been applied to the dataset so that the values of all the attributes range between 0 and 1.
If the missing values in a column or feature are numerical, they can be imputed with the mean of the available instances of the variable; the mean can be replaced by the median if the attribute is suspected to have outliers. For a categorical attribute, the missing values can be replaced by the mode of the column [25]. The major downside of this method is that it decreases the variance of the imputed variables. It also lowers the correlation between the imputed variables and other variables, because the imputed values are just estimates and will not be inherently associated with other values.
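A minimal sketch of these pre-processing steps with pandas and scikit-learn is shown below; the column codes follow Table 2, but the values and the choice of which attribute receives mean, median, or mode imputation are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Small illustrative frame; column codes follow Table 2, values are made up
df = pd.DataFrame({
    "RBP": [120, 140, None, 160],
    "SCH": [210, None, 260, 300],
    "THA": [3, 7, None, 6],
})

# Numeric attribute: impute with the mean (or the median if outliers are suspected)
df["RBP"] = df["RBP"].fillna(df["RBP"].mean())
df["SCH"] = df["SCH"].fillna(df["SCH"].median())

# Categorical attribute: impute with the mode of the column
df["THA"] = df["THA"].fillna(df["THA"].mode()[0])

# MinMax scaling so that every attribute lies in [0, 1]
df[df.columns] = MinMaxScaler().fit_transform(df)
```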
Feature selection algorithms
Feature selection has a huge impact on the performance of a model. The information set we choose to train the model makes a difference in achieving better accuracy. Feature selection is required to eliminate unwanted features, because they have a negative impact on the performance of the model. For this research study we have chosen the genetic algorithm for feature selection [26].
Genetic Algorithm
Feature selection is carried out with the genetic algorithm, which mainly focuses on:
Abstraction of real biological evolution.
Solving complex problems.
Optimization.
Maintaining a population of possible solutions for a given problem.
Genetic algorithms (GA) are a mathematical model influenced by Charles Darwin's famous idea of natural selection [26]. Natural selection preserves only the fittest individuals over successive generations. In machine learning, one of the uses of genetic algorithms is to pick the best set of variables in order to create a predictive model. Getting the ideal subset of variables is a problem of combinatorics and optimization. The benefit of this technique over others is that it allows the best solution to emerge from the best of the prior solutions: an evolutionary procedure that improves the solution over time. The idea of GA is to combine the different solutions generation after generation to extract the most effective genes (variables) from each one, thereby creating new and better-equipped individuals. GA has other uses as well, such as hyper-parameter tuning, finding the maximum (or minimum) of a function, or searching for a proper neural network architecture (neuroevolution), among others.
Equation (1) is used to calculate the fitness probability of a single genotype:

$$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j} \tag{1}$$

where $p_i$ is the ith genotype's fitness probability and $f_i$ is the ith genotype's fitness value.
The summation of the cumulative fitness probabilities should be equal to 1. For chromosome selection, random probabilities are generated. In the crossover phase, chromosomes are expressed in terms of genes; in this phase, values are converted to binary strings. The mutation parameter decides how many genes are mutated. Convergence is the state where we reach an optimal solution with high fitness values.
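As a small illustration of Equation (1) and the roulette-wheel style selection described above, the following sketch (an assumption, not code from the paper) draws parent indices with probability proportional to fitness.

```python
import numpy as np

def roulette_select(fitness_values, n_parents, rng=np.random.default_rng(0)):
    """Pick parent indices with probability proportional to fitness, as in Eq. (1)."""
    f = np.asarray(fitness_values, dtype=float)
    p = f / f.sum()                     # p_i = f_i / sum_j f_j
    cumulative = np.cumsum(p)           # cumulative probabilities sum to 1
    draws = rng.random(n_parents)       # random probabilities used for selection
    return np.minimum(np.searchsorted(cumulative, draws), len(f) - 1)

# e.g. roulette_select([0.2, 0.5, 0.9, 0.1], n_parents=2) -> indices of two chromosomes
```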
Genetic algorithm for attribute selection:
1. Produce the population from the initial attribute set.
2. Calculate an individual's fitness value using CFS (correlation-based feature selection).
3. Apply crossover and mutation operations to the population.
4. After the mutation and crossover process, create an entirely new population set.
5. Select the better individuals from the newly created population based on fitness values.
6. Repeat steps 3 to 5 until the generation count is reached.
The genetic algorithm can be treated as one of the most advanced algorithms used for feature selection. Figure 2 represents the workflow of the genetic algorithm.
Fig. 2
Workflow of genetic algorithm [27]
Terms of genetics
Selection: individuals are selected to produce the next generation.
Chromosome: a string of genes.
Genes: hold a particular trait of an individual.
Individual: same as a chromosome.
Population: the number of individuals present with a comparable chromosome length.
Fitness: the value assigned to an individual.
Fitness function: a function f(x) that assigns a fitness value to an individual.
Crossover: genes from the parents combine to create a new chromosome.
Mutation: randomly changing a gene in an individual.
Algorithm
START
CREATE an initial population of n chromosomes
ASSIGN fitness f(x) to all chromosomes
DO UNTIL a feasible solution is found
SELECT individuals to produce the next generation
CREATE new offspring through crossover and/or mutation
COMPUTE new fitness for all individuals
REMOVE the unfit individuals to make room for the new offspring
CHECK for the final best solution
END LOOP
END
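A compact Python sketch of this loop is given below. The fitness function here uses the cross-validated accuracy of a logistic-regression classifier on the selected columns as a stand-in for the CFS merit described later, and the population size, crossover, and mutation settings are illustrative assumptions rather than the values used in the reported experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Fitness of a chromosome: CV accuracy on the selected columns (a proxy for CFS merit)."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(LogisticRegression(max_iter=1000), X[:, mask], y, cv=5).mean()

def ga_feature_selection(X, y, pop_size=20, generations=30, mutation_rate=0.05):
    n_features = X.shape[1]
    # CREATE an initial population of chromosomes (one boolean gene per attribute)
    population = rng.integers(0, 2, size=(pop_size, n_features)).astype(bool)
    for _ in range(generations):
        # ASSIGN fitness f(x) to all chromosomes
        scores = np.array([fitness(ind, X, y) for ind in population])
        # SELECT the fitter half as parents (truncation selection keeps the sketch short)
        parents = population[np.argsort(scores)[-(pop_size // 2):]]
        children = []
        while len(parents) + len(children) < pop_size:
            # CREATE new offspring through single-point crossover and mutation
            a, b = parents[rng.integers(len(parents), size=2)]
            point = rng.integers(1, n_features)
            child = np.concatenate([a[:point], b[point:]])
            flip = rng.random(n_features) < mutation_rate
            child[flip] = ~child[flip]
            children.append(child)
        # REMOVE the unfit individuals: next generation = parents + new offspring
        population = np.vstack([parents] + children)
    scores = np.array([fitness(ind, X, y) for ind in population])
    return population[scores.argmax()]  # boolean mask of the selected attributes
```

Here X and y are assumed to be NumPy arrays holding the scaled attributes and the binary target.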
Attribute reduction
A heuristic measure, mostly identified as a merit function, assesses how well a candidate feature subset complies with the data and how suitable it is for the selection criterion. By convention, the merit value is small when the fit is good.
Good attribute subsets generally have a high correlation with the class, but a low correlation with each other. The following equation gives the merit of a subset S containing k features:
$$\mathrm{Merit}_{S_k} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \tag{2}$$

Here, $\overline{r_{cf}}$ denotes the mean value of the feature–class correlations and $\overline{r_{ff}}$ is the average of the feature–feature associations.

The genetic algorithm is mainly advantageous in performing better than traditional mechanisms. It is capable of managing datasets with a large number of features, and it can easily be parallelized on any cluster.
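For illustration, Equation (2) can be evaluated directly from pairwise Pearson correlations; the sketch below is an assumed way of computing the merit of a candidate subset, not the exact CFS implementation used in this study.

```python
import numpy as np
import pandas as pd

def cfs_merit(X: pd.DataFrame, y: pd.Series) -> float:
    """Merit of a feature subset: k * mean(feature-class corr) / sqrt(k + k(k-1) * mean(feature-feature corr))."""
    k = X.shape[1]
    r_cf = X.corrwith(y).abs().mean()                                  # mean feature-class correlation
    ff = X.corr().abs().values
    r_ff = ff[np.triu_indices(k, 1)].mean() if k > 1 else 0.0          # mean feature-feature correlation
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

# Example (column codes from Table 2/3, hypothetical usage):
# merit = cfs_merit(df[["MHR", "OPK", "THA"]], df["TGT"])
```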
The reduced attribute set with nine characteristics is shown in Table 3 below.
Table 3. List of attributes after feature selection
| Name | Code | Description | Domain range of values |
|---|---|---|---|
| Patient's gender | SEX | 1 = male; 0 = female | 0, 1 |
| Type of chest pain | CPT | 1 = atypical angina; 2 = typical angina; 3 = asymptomatic; 4 = non-anginal pain | 1–4 |
| Maximum heart rate | MHR | – | 71–202 |
| Exercise-induced angina | EIA | 1 = yes; 0 = no | 0, 1 |
| Old peak (ST depression induced by exercise relative to rest) | OPK | – | 0–6.2 |
| Slope of peak exercise ST segment | PES | 1 = up sloping; 2 = flat; 3 = down sloping | 1–3 |
| Number of major vessels (0–3) coloured by fluoroscopy | VCA | – | 0–3 |
| Thallium scan | THA | 3 = normal; 6 = fixed defect; 7 = reversible defect | 3, 6, 7 |
| Target | TGT | Diagnosis (angiographic disease status): 0 = absence; 1 = presence | 0, 1 |
Machine learning classifiers
Support vector machine (SVM)
This algorithm is used for either classification or regression on both linear and non-linear data. It separates the data according to their labels. The kernel trick is used to match new data to the best of the trained data in order to predict an unknown target label.
Binary classification instances are separated with a hyperplane

$$w^{T}x + b = 0 \tag{3}$$

where $w$ is the d-dimensional coefficient vector, $b$ is the offset from the origin, and $x_i$ are the dataset values. The solution for $w$ is obtained by introducing Lagrangian multipliers $\alpha_i$ in the linear case, and the data points on the borders are used as support vectors:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i \tag{4}$$

where $n$ is the number of support vectors and $y_i$ are the target labels of $x_i$. The linear discriminant function can be written as

$$f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{n} \alpha_i y_i\, x_i^{T}x + b\right) \tag{5}$$

With the kernel trick, the decision function is

$$f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{n} \alpha_i y_i\, K(x_i, x) + b\right) \tag{6}$$
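To make the kernel trick of Equation (6) concrete, the following sketch fits a scikit-learn SVC with an RBF kernel on synthetic stand-in data and reproduces its decision function manually from the support vectors and dual coefficients; it is only an illustration of the standard formulation, not code from the original experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Synthetic stand-in data (the real study uses the Cleveland attributes)
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# gamma="scale" corresponds to 1 / (n_features * Var(X))
gamma = 1.0 / (X.shape[1] * X.var())

# Eq. (6): f(x) = sign( sum_i alpha_i y_i K(x_i, x) + b );
# dual_coef_ already stores the products alpha_i * y_i for the support vectors.
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)
manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()
assert np.allclose(manual, clf.decision_function(X))
```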
Decision tree (DT)
A decision tree consists of internal nodes, which perform the decision making and link to child nodes for visiting the next node, and leaf nodes, which have no children and are associated with a class label.
7
Logistic regression (LR)
Logistic regression is popular and mostly used for predictive analysis. It models the relationship between the dependent variable (y) and the independent variables (x) through a linear combination passed through the sigmoid function, producing an output between 0 and 1.
The sigmoid function is

$$h_\theta(x) = \frac{1}{1 + e^{-z}} \tag{8}$$

where

$$z = \theta^{T}x \tag{9}$$

The LR cost function is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big] \tag{10}$$
Naïve Bayes
One of the simplest and most effective classification algorithms, it can also be called a probabilistic classifier. It performs well in multi-class prediction compared with other algorithms, and it can also be used for real-time prediction because the Naïve Bayes classifier is an eager learner.
$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \tag{11}$$

where $P(c \mid x)$ is the posterior probability, $P(x \mid c)$ the likelihood, $P(c)$ the class prior probability, and $P(x)$ the predictor prior probability.
For this research work, the RBF network is used for classification.
RBF network
The RBF network is a three-layered architecture [28] in which each neuron in the input layer corresponds to one predictor variable. Each neuron in the hidden layer computes a radial basis function. The output layer forms a weighted sum over the hidden layer to produce the network outputs. Figure 3 shows the representation of the RBF network.
Fig. 3
RBF (radial basis function) network architecture [28]
The major advantages of this architecture are its simple layout and its excellent generalization and learning ability. The properties of this network make it very appropriate for designing adaptive control systems [29]. Finally, a number of problems are used to evaluate its major properties, and RBF networks are also compared with traditional neural networks as well as fuzzy inference systems.
$$\phi(x) = \exp\!\left(-\frac{\lVert x - c \rVert^{2}}{2r^{2}}\right) \tag{12}$$

is the Gaussian activation function with parameters $r$ (the spread, or standard deviation) and $c$ (the centre chosen from the input space), defined independently at each RBF unit.

$$y(x) = \sum_{i=1}^{N} w_i\,\phi_i(x) \tag{13}$$

is the function for computing the output as the weighted sum of the hidden-unit activations.
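A compact NumPy sketch of Equations (12) and (13) is given below: Gaussian hidden units whose centres are picked from the input space (here via k-means), followed by a weighted-sum output layer fitted by least squares. The centre-selection and width choices are illustrative assumptions, not the exact settings of the proposed GA-RBF model.

```python
import numpy as np
from sklearn.cluster import KMeans

class SimpleRBFNetwork:
    """Minimal RBF network: Gaussian hidden layer (Eq. 12) + linear output layer (Eq. 13)."""

    def __init__(self, n_units=10, radius=1.0):
        self.n_units, self.radius = n_units, radius

    def _hidden(self, X):
        # phi_j(x) = exp(-||x - c_j||^2 / (2 r^2)) for every centre c_j
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * self.radius ** 2))

    def fit(self, X, y):
        # Centres c taken from the input space with k-means (an illustrative choice)
        self.centers = KMeans(n_clusters=self.n_units, n_init=10).fit(X).cluster_centers_
        H = self._hidden(X)
        # Output weights = least-squares solution of the weighted sum in Eq. (13)
        self.weights, *_ = np.linalg.lstsq(H, y, rcond=None)
        return self

    def predict(self, X):
        # Threshold the continuous output at 0.5 for a binary 0/1 target
        return (self._hidden(X) @ self.weights > 0.5).astype(int)
```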
Validation mechanism
We have used the k-fold cross validation method for the current research work. Here the dataset is split into k equally sized parts, where k−1 groups are used for training and the leftover part is used to check the performance in each step. In this work we have taken k = 10, which yields better performance. In this process, 90% of the data is used for training and 10% for testing. The process is repeated 10 times by dividing the instances into training and testing groups randomly, and the average over all iterations is taken as the result.
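With scikit-learn, the 10-fold procedure described above can be written as follows; the synthetic data and the SVC classifier are placeholders for the Cleveland features and the models compared in this work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Placeholder data; in the actual study X, y come from the 297 x 14 Cleveland table
X, y = make_classification(n_samples=297, n_features=13, random_state=0)

# 10-fold cross validation: each fold trains on 90% of the data and tests on 10%;
# the reported accuracy is the average over the 10 folds.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(SVC(), X, y, cv=cv, scoring="accuracy")
print(f"Mean 10-fold accuracy: {scores.mean():.3f}")
```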
Classification and accuracy computation
Figure 4 illustrates the procedure of classification and accuracy computation. For classification, the data is partitioned into training and testing sets. The score is a comparison between actual and predicted values obtained by feeding the data to a specific model.
Fig. 4
Process of data classification and model accuracy computation
Model accuracy
Prediction of the disease is identified based on the accuracy values generated by the proposed hybrid GA-RBF technique. The accuracy achieved by the proposed system is the average of the accuracies over all cross-validation folds.
Performance evaluation metrics
To serve the purpose of performance evaluation, we have used the confusion matrix (Fig. 5).
Fig. 5
Understanding the Confusion Matrix
With the help of the confusion matrix, we obtain the following values:
True positive (TP): unhealthy subjects that are properly classified as having heart disease.
True negative (TN): healthy subjects that are properly classified as having no heart disease.
False positive (FP): healthy subjects that are improperly classified, i.e. they do not have heart disease but are predicted as diseased.
False negative (FN): diseased subjects that are improperly classified as healthy, i.e. they have heart disease but are predicted as healthy.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{14}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{15}$$

The overall performance of the classifier is identified by accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{16}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{17}$$

$$\text{F-Score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{18}$$

$$\text{Classification error} = \frac{FP + FN}{TP + TN + FP + FN} \tag{19}$$

For the Matthews correlation coefficient (MCC), +1 describes a perfect prediction, 0 means the model is unable to return any valid information (no better than random prediction), and −1 describes complete inconsistency between prediction and observation:

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{20}$$
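Equations (14)–(20) can be checked numerically from a confusion matrix, for example with scikit-learn as sketched below; the labels are illustrative only and are not the predictions reported in the paper.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Illustrative labels only; these are not the predictions reported in this work
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)                                        # Eq. (14)
specificity = tn / (tn + fp)                                        # Eq. (15)
accuracy = (tp + tn) / (tp + tn + fp + fn)                          # Eq. (16)
precision = tp / (tp + fp)                                          # Eq. (17)
f_score = 2 * precision * sensitivity / (precision + sensitivity)   # Eq. (18)
class_error = (fp + fn) / (tp + tn + fp + fn)                       # Eq. (19)
mcc = matthews_corrcoef(y_true, y_pred)                             # Eq. (20)
```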
Results and discussion
In this exploratory work, the RBF network used for the detection of coronary illness is tested on the dataset acquired from the UCI data repository. The dataset comprises fourteen (14) features and 303 samples. After processing the data, a total of 297 samples with 14 features were retained for the study. The dataset is partitioned into test and train sets using a 70:30 ratio, i.e. 70% of the dataset for training and 30% for testing of the model, which is the standard proportion for partitioning a dataset. The advantage of this division is that it provides adequate data to train and test the framework while avoiding the underfitting that may occur if the training dataset is smaller than the testing dataset. Additionally, if the training dataset is far larger than the testing dataset, this can bring about overfitting of the framework.
In this research work we used Python on Windows 10 with 8 GB RAM for the exploratory investigation of the cardiovascular disease dataset and for the performance assessment of every classifier applied to it. The features considered in the dataset are discussed in Table 3.
Table 4 below showcases the results of a few benchmark algorithms on the dataset used in the study.
Table 4. Result achieved through different classifiers
| Classification techniques | Accuracy achieved with 14 attributes (%) |
|---|---|
| Naïve Bayes | 84 |
| Decision Tree | 78 |
| Logistic Regression | 84 |
| Support Vector Machine | 83 |
| Random Forest | 82 |
| K-Nearest Neighbor | 84 |
A total of 14 features and 297 samples were analysed; the detailed correlation of the attributes in the dataset is represented in Fig. 6.
Fig. 6
Plotting the correlation of attributes
Proposed system performance calculation
The performance of the RBF network before and after attribute selection is demonstrated in Table 5. An accuracy of 85.40% with 14 attributes is achieved, and the prediction accuracy increases to 94.20% with nine characteristics, showing that the RBF network performs much better after attribute reduction.
Table 5. Performance of the proposed system
| No of attributes | Accuracy | Mean absolute error (MAE) |
|---|---|---|
| 14 attributes | 85.40% | 0.0915 |
| 9 attributes | 94.20% | 0.0907 |
Evaluation metrics
Different metrics were considered to outline the performance of the classifiers. Table 6 below presents the performance evaluation metrics (PEM) of all the classifiers studied and proposed in this research work (Fig. 7).
Table 6. Performance metrics of all the algorithms after attribute reduction
| Classification technique | Sensitivity | Specificity | Accuracy | Precision | F-Score | Classification error | MCC |
|---|---|---|---|---|---|---|---|
| Naïve Bayes | 89 | 80 | 84 | 87 | 88 | 14.6 | 0.56 |
| Decision Tree | 90 | 81 | 78 | 84 | 87 | 15.7 | 0.52 |
| Logistic Regression | 89 | 80 | 84 | 86 | 87 | 13.4 | 0.59 |
| Support Vector Machine | 91 | 83 | 83 | 84 | 87 | 13.4 | 0.59 |
| Random Forest | 89 | 79 | 82 | 86 | 87 | 14.6 | 0.56 |
| KNN | 87 | 78 | 84 | 88 | 87 | 15.7 | 0.65 |
| Proposed GA + RBF | 96 | 93 | 94 | 95 | 95 | 5.61 | 0.87 |
Fig. 7
Graphical representation of performance evaluation metrics
Performance comparison between different models
Both before and after attribute reduction, the proposed network performs better than all the other models. The accuracy of the proposed system is 85% for 14 attributes and 94% for nine attributes. Table 7 shows the performance comparison of the different classification techniques used, and Fig. 8 shows the model accuracy comparison graph.
Table 7. Result comparison between proposed and existing classification techniques
| Classification techniques | Accuracy (%) with 14 attributes | Accuracy (%) with 9 attributes |
|---|---|---|
| Naïve Bayes | 78 | 84 |
| Decision Tree | 77 | 78 |
| Logistic Regression | 80 | 84 |
| Support Vector Machine | 80 | 83 |
| Random Forest | 75 | 82 |
| KNN | 81 | 84 |
| Proposed GA + RBF | 85 | 94 |
Fig. 8
Model accuracy comparison graph
Performance comparison of benchmark algorithms with proposed system
For all benchmark algorithms considered in the study and for the proposed GA-RBF model, we present the performance evaluation with the help of AUC-ROC curves. Figure 9 shows the ROC curves representing the performance of the models.
Fig. 9
ROC curve
Conclusion and future scope
We proposed a hybrid machine learning mechanism to develop a heart disease prognosis system with better accuracy. We tested the dataset on six benchmark algorithms, as shown in Table 6. Compared with existing models, our proposed system yields the most accurate results. The coronary disease prediction method produced better accuracy after reducing the number of attributes, which also helps save time and the unnecessary expenses of patients by minimizing the tests to be taken. The proposed system has given good results over other classification techniques, as represented in Fig. 6. One identified limitation of this research work is that the benchmark dataset used contains only 297 records with 14 attributes; the performance of the algorithm needs to be verified on a dataset with a larger volume of data. In future, this work can be carried further to increase the performance of predictive models for coronary illness detection with the help of other feature selection algorithms and optimization mechanisms.
References
1. Bui, AL; Horwich, TB; Fonarow, GC. Epidemiology and risk profile of heart failure. Nat. Rev. Cardiol.; 2011; 8,
2. Mourão-Miranda, J; Bokde, ALW; Born, C; Hampel, H; Stetter, M. Classifying brain states and determining the discriminating activation patterns: support vector machine on functional MRI data. Neuroimage; 2005; 28,
3. Vanisree, K; Singaraju, J. Decision support system for congenital heart disease diagnosis based on signs and symptoms using neural networks. Int. J. Comput. Appl.; 2011; 19,
4. Nazir, S; Shahzad, S; Septem Riza, L. Birthmark-based software classification using rough sets. Arab. J. Sci. Eng.; 2017; 42,
5. Methaila, A; Kansal, P; Arya, H; Kumar, P. Early heart disease prediction using data mining techniques. Proc. Comput. Sci. Inf. Technol. (CCSIT-2014); 2014; 24, pp. 53-59.
6. Detrano, R; Janosi, A; Steinbrunn, W. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol.; 1989; 64,
7. Edmonds, B.: Using localized 'Gossip' to structure distributed learning. In: Proceedings of AISB symposium on socially inspired computing, pp. 1–12, Hatfield, UK (2005)
8. Gudadhe, M., Wankhade, K., Dongre, S.: Decision support system for heart disease based on support vector machine and artificial neural network. In: 2010 international conference on computer and communication technology, ICCCT-2010, pp 741–745 (2010). https://doi.org/10.1109/ICCCT.2010.5640377
9. Kahramanli, H; Allahverdi, N. Design of a hybrid system for diabetes and heart diseases. Expert Syst. Appl.; 2008; 35,
10. Palaniappan S., Awang, R: Intelligent heart disease prediction system using data mining techniques. In: Proceedings of IEEE/ACS international conference on computer systems and applications (AICCSA 2008), Doha, Qatar, pp. 108–115 (2008)
11. Yang, G., Yinzi, R., Qing, P., Gangmin N., Gong, S., Cai, G., Zhang, Z., Li, L. Yan, L. A heart failure diagnosis model based on support vector machine. In: 2010 3rd international conference on biomedical engineering and informatics. IEEE, vol. 3, pp. 1105–1108. (2010)
12. The British Heart Foundation Statistics database. http://www.heartstats.org. Accessed 5 Nov 2021
13. https://www.montefiorenyack.org/highland/press/early-detection-of-heart-disease-can-keep-you-healthy. Accessed 8 Jan 2020
14. Dutta, A., Batabyal, T., Basu, M., Acton, S.T.: An efficient convolutional neural network for coronary heart disease prediction. Expert Syst. Appl. 159, 113408 (2020)
15. Mastanduno, M.: Data scientist: health catalyst applications of machine learning in health care (©2019 HealthCare.ai). https://healthcare.ai/what-models-has-health-catalyst-created-with-healthcare-ai/. Accessed 20 Jan 2020
16. Ngare, K.N.: A project proposal submitted for the study leading to a project report in partial fulfilment of the requirements for the award of a Bachelor of Science in Computer Science at St. Paul’s University. (2019)
17. Kanikar, P; Rajesh Kumar, D. Prediction of cardiovascular diseases using support vector machine and Bayesian classification. Int. J. Comput. Appl.; 2016; 156,
18. Saqlain, M., Hussain, W., Saqib, N.A., Khan, M.A. Identification of heart failure by using unstructured data of cardiac patients. In: Proceedings of the international conference on parallel processing workshops, pp 426–431, (2016). https://doi.org/10.1109/ICPPW.2016.66.
19. Manavalan, R; Saranya, S. computational approaches for heart disease prediction—a review 2. Comput. Model-Based Heart; 2018; [DOI: https://dx.doi.org/10.21917/ijsc.2018.0234]
20. Prasad, R; Anjali, P; Adil, S; Deepa, N. Heart disease prediction using logistic regression algorithm using machine learning. Int. J. Eng. Adv. Technol.; 2019; 8,
21. Yadav, S.M., Jadhav, S.M, Nagrale, S., Patil, N.: Application of machine learning for the detection of heart disease. In: Second international conference on innovative mechanisms for industry applications (ICIMIA 2020).
22. Haq, AU; Li, JP; Memon, MH; Nazir, S; Sun, R. A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mob. Inf. Syst.; 2018; [DOI: https://dx.doi.org/10.1155/2018/3860146]
23. Gárate-Escamila, AK; El Hassani, AH; Andrès, E. Classification models for heart disease prediction using feature selection and PCA. Inf. Med. Unlocked; 2020; [DOI: https://dx.doi.org/10.1016/j.imu.2020.100330]
24. Dua, D., Graff, C.: UCI machine learning repository [http://archive.ics.uci.edu/ml]. University of California, School of Information and Computer Science, Irvine (2019)
25. Peng, L; Lei, L. A review of missing data treatment methods. Intell. Inf. Manag. Syst. Technol.; 2005; 1, pp. 412-419.
26. https://towardsdatascience.com/feature-selection-using-genetic-algorithms-in-r-3d9252f1aa66. Accessed 25 Oct 2020
27. https://www.neuraldesigner.com/blog/genetic_algorithms_for_feature_selection. Accessed 25 Oct 2020
28. Radial Basis Function Networks. https://www.saedsayad.com/artificial_neural_network_rbf.htm. Accessed 8 Jan 2021
29. Yu, H; Xie, T; Paszczynski, S; Wilamowski, BM. Advantages of radial basis function networks for dynamic system design. IEEE Trans. Ind. Electron.; 2011; 58,
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021.