1. Introduction
A person is considered overweight when her/his Body Mass Index (BMI) is higher than 25 and obese if it is over 30. BMI is computed by dividing the weight of the person (in kilograms) by the square of her/his height (in meters) [1]. According to the Spanish National Institute of Statistics [2], the rate of individuals with obesity has increased from 7.4% in 1987 to 17.4% in 2017; in just 30 years, it has increased 2.4-fold, taking the data for Spain as a reference. This health problem affects all sectors of the population, although it is not equally distributed. In particular, men are more prone to developing overweight/obesity than women. The number of cases of childhood obesity has also increased, reaching 10.3% of children between 2 and 17 years old. The World Health Organization [3] also shows that overweight people are more prone to developing cerebrovascular and respiratory problems and gallbladder disease, and may have an increased risk of different types of cancer. Hence, it is necessary to prevent future cases of overweight or obesity.
Some people may have a certain genetic predisposition to suffer from overweight/obesity. However, overweight and obesity are usually a consequence of social behaviors, such as high-fat meals and sedentary lifestyles. Prevention should begin at early ages to consolidate healthy lifestyle habits, and those who suffer from any of these disorders should seek the help and advice of medical staff. Regarding prevention, it would be essential to know the relationships among lifestyle indicators, overweight, and obesity. In this context, a system predicting the risk of developing overweight and obesity could be very useful. In this paper, we show that it is possible to analyze the different factors and habits of a person and obtain the relationships among them using machine-learning techniques. In particular, we investigate the performance of a set of machine-learning classification algorithms when classifying people as overweight/obese versus non-overweight/non-obese using information about lifestyle habits. This information was collected by a pool of 14 different institutions participating in the consortium of the GenObIA project (work financed by the Community of Madrid and the European Social Fund through the GENOBIA-CM project with reference S2017/BMD-3773).
In machine learning (ML), classification is related to the assignment of class labels to data in the problem domain. A critical element in this type of problem is the selection of the variables used as predictors (feature selection). Using all variables is not always the best option; sometimes it is necessary to discard some of them in order to avoid noise in the data and also to reduce the computational load. Feature-selection (FS) methods aim to reduce dimensionality, reduce execution times, and improve model results. In this work, four different FS approaches have been tested: no FS, a classical FS method, and two FS methods based on genetic algorithms (GAs). In particular, we investigate FS by a GA using two selection operators: a classical tournament selection and Stud selection, a new method proposed in this paper. Experimental results show that machine-learning algorithms are good classifiers when combined with evolutionary feature-selection methods for this particular problem.
The main contributions of this work are:
We evaluated different configurations of a set of ML classifiers that predict whether a person is overweight or obese based on information concerning lifestyle and dietary habits.
We use a dataset containing data from more than 1170 people, which, to the best of our knowledge, makes this the largest study of its kind in Spain.
We explore four different feature-selection methods.
We propose the Stud selection operator, a variant of the stud GA algorithm presented in [4], which could be adapted to other evolutionary algorithms.
The rest of the paper is organized as follows. Section 2 provides a brief review of related work. Section 3 describes the evolutionary algorithms for feature selection and implementation details. In Section 4, we explain the experimental setup and Section 5 collects the experimental results. In Section 6, the statistical analysis can be found and finally Section 7 contains the conclusions and future work.
2. Literature Review
Machine learning [5] is a constantly evolving branch of artificial intelligence related to those algorithms that try to simulate human intelligence using information from their environment. Techniques based on machine learning have been used in different fields, such as finance [6], pattern recognition [7,8], and medical applications [9,10].
In machine learning, it is important to carefully choose which features should be used and which should be discarded to construct the models. Medical datasets commonly present a small number of cases with a large number of variables, thus introducing different problems such as dimensionality and high computational requirements [11]. To deal with these obstacles, the use of feature-selection techniques has been proposed to select the variables that provide the greatest value. According to [12], three common FS categories are: filters, wrappers, and embedded methods.
Filter methods stand out mainly for their speed and scalability, being of great help in extremely large datasets. Through a series of statistical processes, scores are assigned to the different variables; the ones with the highest scores will be used to create the model. The great limitation of these methods is that they do not consider the relationship between variables. Two examples of this category are Pearson’s correlation coefficient, which allows us to quantify the linear dependence between two variables, and mutual information, which seeks to reduce the uncertainty of a random variable through knowledge of another random variable [12,13].
Wrapper methods search for the most appropriate subset of variables using the selected predictor as a black box to score the different subsets of variables generated throughout the different iterations. Wrapper methods [13] have a high computational cost since they require training and testing for each possible subset. Sequential Feature Selection (SFS) [14] starts with an empty set and adds the different variables individually, looking for the one that contributes the most value to the set. Once identified, this variable will be permanently incorporated into the subset, and then the next iteration will follow, repeating the same process until the desired number of variables is obtained. Based on this implementation, it is possible to find different variants, such as Sequential Backward Selection (SBS), which starts with the complete set of variables and reduces it through iterations, or Sequential Floating Forward Selection (SFFS) [15], which is based on the SFS method and incorporates a backward component with the SBS.
An example of the use of this type of technique is [16], where the authors apply a wrapper method, called Recursive Feature Elimination with Cross-Validation (RFECV), to select the best variables for a classification problem in the medical domain and obtain an accuracy improvement. In our work, RFECV is one of the techniques against which we compare the performance of our proposal, since it is a good FS alternative and has the benefits of wrapper methods. Heuristic search methods applied to feature selection can also be considered part of the wrapper methods. Examples of the use of genetic algorithms for feature selection can be found in the medical field, these methods being of particular interest for large datasets. In [17], an optimization algorithm based on a genetic algorithm is proposed, which optimizes the values of the SVM parameters, obtaining an optimal subset of features and improving the classification accuracy. Embedded methods aim to reduce the computational time required to reclassify different subsets of variables. To do so, they try to combine the advantages of the filter and wrapper methods. One of the main characteristics of embedded methods is the introduction of feature selection as part of the training process rather than as a separate phase, i.e., the feature-selection process becomes an integral part of the model.
Our proposal, Stud selection, is a variant of the stud GA algorithm presented in [4]. In stud GA, the fittest individual is considered a Stud and the rest of the population is crossed with it to obtain the offspring. In addition, in stud GA, diversity is maintained using the Hamming distance between the Stud and the individual that will serve as the second parent. If the diversity is above a set threshold, the crossover is performed to produce an offspring; otherwise, the current second parent is mutated to produce the offspring.
In this work we have adapted this strategy to the particularities of our data. We found that crossover based on the Hamming distance produced a large computational load without significantly improving the solutions. Therefore, we decided to use a simple one-point crossover and compensate for the possible loss of diversity with four studs.
In relation to the study of factors related to obesity, the work of Hudson Reddon [18] deals with physical activity and genetic predisposition to obesity. With this purpose, the impact of exercise on 14 variants predisposing to obesity is analyzed. Physical activity is able to reduce the impact of the fat mass and obesity-associated (FTO) gene variation and of obesity genetic risk scores. The results include the identification of an interaction between physical activity and FTO SNP rs1421085, a single nucleotide polymorphism of the gene associated with fat mass and obesity, in a prospective cohort of six ethnic groups. According to this study, prevention programs with a heavy physical activity load can be a very important resource to combat obesity.
In [19], the authors try to identify risk factors for overweight and obesity using machine-learning techniques (regression and classification). Their main contributions are the identification of factors related to obesity/overweight, the analysis of these factors, and their respective variable analysis. The issue we find with this work is the use of the variable “weight” since it is a variable that is unknown in the future and is part of the BMI formula (the value to be predicted). In our opinion, weight cannot be used in a classification method.
There are examples of the use of evolutionary computation in similar environments. Ref. [20] deals with the prediction of obesity in children using a hybrid approach, combining supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features, a.k.a. Naïve Bayes, and a GA. In this case, the use of Naïve Bayes in prediction presented problems when dealing with zero-value parameters, and as a solution, the author proposes to use GA for parameter optimization. The initial experiment to identify the usability of their approach indicated a 75% improvement in accuracy. Similarly, this work proposes creating a genetic algorithm to support the classification model in use by selecting the most useful features.
Among the studies dealing with overweight or obesity, datasets typically have few cases and limited information. There are also occasions where the modeling decisions made may be questionable, such as the use of weight in the dataset.
The work presented here seeks to predict the risk of overweight and obesity in Madrid. With this aim, a binary classification approach with evolutionary feature selection is proposed. Hence, we provide the most relevant variables for the classification algorithm through an evolutionary process. A particular structure of the feature-selection process has been developed. Additionally, a high-quality dataset has been used, composed of detailed information about the habits of the individuals and their health.
3. Methodology
This section explains the methodology applied in this work and how the feature selection for the classification problem was performed using genetic algorithms. Figure 1 shows a diagram of the feature-selection process and the generation of the classification model.
In order to apply the methodology, we need to select three main items:
The machine-learning technique.
The dataset, which defines the features.
The FS method, and if it applies, the parameters of the genetic algorithm.
From the original dataset, a curation process is performed, which also defines the initial dataset to be used. After that, the FS process is applied, and the selected features will be used to train the ML algorithm. The best classification ML model will be chosen after analyzing the results, their accuracy, and the number of false negatives.
The objective of a GA is to find the best solution of a problem through the iterative transformation (using crossover and mutation operators [21]) of an initial set of potential solutions (population). For each solution (individual), its performance (fitness) is evaluated, and, based on this value, the fittest ones will have a higher probability of passing to the next iteration. After a certain number of iterations, one of the candidates will be selected as the solution to the problem.
Four key concepts need to be considered when designing GAs:
The encoding of the problem.
The size of the population and the initialization method.
The selection method including the fitness function.
The processes by which the changes are introduced in the next iteration, including the probabilities and parameters [22].
After the execution of the GA, the fittest individual of the last generation represents the set of features (variables) selected to train the ML models.
Evaluation Metrics
Table 1 shows the structure of a confusion matrix. It is a numeric matrix where we can see the number of correct and incorrect predictions of the model for the different classes. The confusion matrix of Table 1 is for algorithms classifying data into two classes (binary classification): Positive and Negative. To construct it, it is necessary to compute the following values:
Number of True Positives (TPs): the class assigned to the sample by the model is Positive and it is also the real class.
Number of False Negatives (FNs): the class assigned to the sample by the model is Negative, but the real class is Positive.
Number of False Positives (FPs): the class assigned to the sample by the model is Positive, but the real class is Negative.
Number of True Negatives (TNs): the class assigned to the sample by the model is Negative and it is also the real class.
Number of Total Samples (Total).
From those values, we can obtain different metrics to measure the model's performance.
-
Accuracy: Percentage of data correctly classified.
$\text{Accuracy} = \frac{TP + TN}{Total}$ (1)
-
Misclassification Rate: Percentage of misclassified data.
$\text{Misclassification Rate} = \frac{FP + FN}{Total}$ (2)
-
Precision: The percentage of positive predictions that are actually positive.
$\text{Precision} = \frac{TP}{TP + FP}$ (3)
-
Recall: True positive rate, the percentage of samples of the positive class that the model manages to classify correctly.
$\text{Recall} = \frac{TP}{TP + FN}$ (4)
Precision and recall can be associated with the positive and negative classes. In our case, we will focus on reducing the number of false negatives since these would be cases of overweight or obesity that our model does not detect. In other words, we seek to obtain a high recall of the positive class.
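As an illustration of these metrics, the following sketch computes them from a confusion matrix with scikit-learn; the labels y_true and y_pred are placeholders, and class 1 plays the role of the positive (overweight/obese) class.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Placeholder ground-truth labels and model predictions (1 = overweight/obese, 0 = not).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total = tn + fp + fn + tp

accuracy = (tp + tn) / total           # Equation (1)
misclassification = (fp + fn) / total  # Equation (2)
precision_pos = tp / (tp + fp)         # Equation (3), positive class
recall_pos = tp / (tp + fn)            # Equation (4), positive class; the value we want high

# The same recall can be obtained directly from scikit-learn.
assert recall_pos == recall_score(y_true, y_pred, pos_label=1)
```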
The GA selects the best set of features for prediction. A solution is represented by a binary string (chromosome), with as many positions as features available in the curated dataset. Each of the positions (genes) in the chromosome corresponds to a feature. The value of the gene indicates whether the feature is selected (1) or not (0) as a predictor variable. The initial population is generated randomly.
Since we have a balanced dataset, to evaluate an individual (a model) we used a classical stratified 10-fold cross-validation scheme [23]. Each model is trained using only the features expressed as 1s in the individual's genotype. The average accuracy rate of the 10 folds, F, is used as the fitness function:

$F = \frac{1}{10} \sum_{i=1}^{10} A_i$,

where $A_i$ is the accuracy obtained for each one of the cross-validation folds [24].
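A minimal sketch of this fitness evaluation is given below, assuming that X is a NumPy feature matrix, y is the vector of class labels (BMI ≥ 25 or not), and Gradient Boosting is the classifier being wrapped; other classifiers can be plugged in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def fitness(genotype, X, y):
    """Mean accuracy of a stratified 10-fold CV using only the features marked with 1."""
    selected = np.flatnonzero(genotype)   # indices of the genes set to 1
    if selected.size == 0:                # an empty feature mask cannot train a model
        return 0.0
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(GradientBoostingClassifier(), X[:, selected], y,
                             cv=cv, scoring="accuracy")
    return scores.mean()                  # this is F in the equation above
```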
In this paper, we propose a variation of the Stud GA selection method and we compare its performance with a traditional tournament implementation.
The Stud selection method works as follows. First, the four best individuals of the generation are selected, and form the Stud candidates group. Second, the two best individuals pass to the following iteration without crossover. Finally, the rest of the population is completed by applying the crossover operator to a pair formed by: (i) a member of the Stud candidates group and (ii) another individual of the population in the event that the probability of crossover is met.
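A minimal sketch of this selection scheme is shown below; the group size of four studs and the two elitist survivors follow the description above, whereas the behavior when the crossover probability is not met (the second parent is simply copied) is an assumption of the sketch. The crossover function is passed in (see the single-point crossover described later in this section).

```python
import random

def stud_selection(population, fitness_values, p_crossover, crossover):
    """Build the next generation using the Stud selection scheme."""
    # Sort the individuals from best to worst fitness.
    ranked = [ind for _, ind in sorted(zip(fitness_values, population),
                                       key=lambda pair: pair[0], reverse=True)]
    studs = ranked[:4]                                 # four best: the Stud candidates group
    next_generation = [ind[:] for ind in ranked[:2]]   # two best pass unchanged (elitism)

    while len(next_generation) < len(population):
        stud = random.choice(studs)
        other = random.choice(ranked)
        if random.random() < p_crossover:       # crossover probability is met
            next_generation.append(crossover(stud, other))
        else:                                   # assumption: otherwise, copy the second parent
            next_generation.append(other[:])
    return next_generation
```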
For the tournament selection, we use a simple implementation [25]. As is usual in the literature [26], we use the term selection pressure to denote the size of the tournament pool. Adjusting this parameter allows us to find a trade-off between exploration and exploitation of the fitness landscape of the algorithm. In this study, we use a selection pressure of five, a value that prioritizes exploitation over exploration and that worked well in the preliminary experiments with our datasets.
After the selection of the individuals, we apply a single-point crossover, choosing a point in the chromosome of the two selected individuals and generating one offspring. With this purpose, we combine the information from one of them up to the crossover point and complete it with the remaining information from the other individual.
A random mutation is introduced in the individual with a very low probability. This mutation affects a gene of the individual, flipping its value.
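The two variation operators can be sketched as follows, assuming that individuals are plain Python lists of 0/1 genes; whether the mutation probability is applied per individual or per gene is not stated above, so the sketch flips a single randomly chosen gene with the given probability.

```python
import random

def single_point_crossover(parent_a, parent_b):
    """Combine parent_a up to a random point with the remainder of parent_b."""
    point = random.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(individual, p_mutation):
    """With probability p_mutation, flip the value of one randomly chosen gene."""
    mutant = individual[:]
    if random.random() < p_mutation:
        gene = random.randrange(len(mutant))
        mutant[gene] = 1 - mutant[gene]
    return mutant
```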
The parameters used for the GA were a crossover probability of 0.82, a mutation probability of 0.09, a population of 50 individuals, and 100 generations.
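Putting the pieces together, the overall evolutionary feature-selection loop can be sketched as below, reusing the fitness, selection, crossover, and mutation sketches given above and the parameters just reported; the random initialization and the decision to exempt the two elitist individuals from mutation are assumptions of the sketch.

```python
import random

def run_ga(X, y, pop_size=50, generations=100, p_crossover=0.82, p_mutation=0.09):
    """Sketch of the GA that returns the fittest feature mask of the last generation."""
    n_features = X.shape[1]
    population = [[random.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind, X, y) for ind in population]
        offspring = stud_selection(population, scores, p_crossover,
                                   single_point_crossover)
        # Keep the two elitist individuals unchanged and mutate the rest.
        population = offspring[:2] + [mutate(ind, p_mutation) for ind in offspring[2:]]
    # The fittest individual of the last generation encodes the selected features.
    return max(population, key=lambda ind: fitness(ind, X, y))
```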
4. Experimental Setup
4.1. Dataset
The original dataset is the result of surveys carried out in the different centers that are part of the GenObIA consortium, including universities and hospitals. The information collected in these surveys includes lifestyle, nutrition habits, and information about pathologies suffered by the person in the past.
4.1.1. Data Curation
The original dataset, Appendix A.2, Table A2, is composed of a total of 93 variables and 1179 subjects, among which we find:
One subject identifier;
Thirteen variables of general information about the subject such as weight, age, education, stress, etc.;
Seven variables related to alcoholic drinks, distinguishing between distilled and fermented drinks;
Seven variables on smoking habits, such as the number of cigarettes, pipes, and cigars smoked and, for ex-smokers, the time since quitting;
Fifteen variables related to pathologies, such as types of cancer, sleep apnea, and type 2 diabetes mellitus, among others;
Thirty-four variables on nutritional habits; we found information on the portions of different types of food and the points of adherence to the Mediterranean diet according to these portions;
Sixteen variables related to physical exercise and its intensity.
The dataset is balanced in terms of the predicted variable, overweight/obesity (BMI ≥ 25), with 48% being obese/overweight individuals and 52% being non-obese/non-overweight individuals. Therefore, we considered unnecessary the use of classification techniques focused on imbalanced datasets.
In order to avoid repeated information that may introduce noise in the system, a reduction in the number of variables was performed. An example of a reduction is the case of the variables referring to food intake, which were replaced by a unique variable, namely adherence to the Mediterranean Diet (ADH). The original dataset contains a set of variables related to food: 16 associated with the servings and, derived from these, another 16 measuring adherence to the Mediterranean diet using points. If there are more than eight points in total, the subject is considered to have a high ADH.
In addition, some redundant variables were eliminated. For instance, the dataset initially contains two variables referring to exercise that were computed from a set of features of the survey: Cal_IPAQ, which reports the calories burned as a function of physical exercise, and IPAQ, which reports the information on the exercise performed and its intensity. IPAQ takes into account the duration and intensity of exercise, weighting the value of sedentary, moderate, and vigorous exercise. Cal_IPAQ includes weight as a variable for its calculation and therefore cannot be used, since weight is also present in the formula for computing BMI. Hence, we use only IPAQ as a training variable.
There are also some features, such as the place or institution where the sample was obtained (center), that were removed from the dataset, since a high correlation with the BMI was observed due to differences in the nature of the population at each place (policemen, sports teams, retired people, etc.).
After processing the data, which in our case was supervised by the medical staff that participated in the project, the total number of variables was reduced, from 93 to 41, as shown in Appendix A.1, Table A1. This table shows the different features of the study, providing its identifier, name, short description, and type.
4.1.2. Dataset with Pathologies
Two datasets were generated using the variables of Appendix A.1. One of them, called the dataset with pathologies, includes the variables related to pathologies. This dataset includes most of the features; in particular, all the variables excluding variables 37 to 40. The objective of this dataset is to evaluate the classifiers with the standard data of the surveys, which include information on the health record of the person.
4.1.3. Dataset without Pathologies
There is a set of variables related to pathologies. When dealing with this type of variable, it is necessary to consider whether a pathology is a cause or a consequence of overweight/obesity. An example is variable number 33, Apnea, which indicates whether the subject suffers from sleep apnea. Usually, overweight or obese people suffer from this problem. However, it is not necessarily true that, because they suffer from sleep apnea, they are overweight/obese. The same applies to other pathologies. In order to evaluate this kind of artifact, we created a new dataset by selecting all the variables in Appendix A.1, excluding those of pathological type and variable 11.
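As a rough illustration (the file name and column names below are placeholders, not the actual identifiers of Appendix A.1), the two datasets can be derived from the curated table as follows:

```python
import pandas as pd

df = pd.read_csv("genobia_curated.csv")  # placeholder path to the curated 41-variable table

# Placeholder column names standing in for variables 37-40 and the pathology-related
# variables (including variable 11); see Appendix A.1, Table A1 for the real identifiers.
excluded_always = ["var_37", "var_38", "var_39", "var_40"]
pathology_vars = ["apnea", "diabetes", "metabolic_syndrome", "cancer", "var_11"]

with_pathologies = df.drop(columns=excluded_always, errors="ignore")
without_pathologies = with_pathologies.drop(columns=pathology_vars, errors="ignore")
```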
5. Experimental Results
We performed experiments combining ten different algorithms as classifiers and four different feature-selection strategies (two evolutionary feature-selection methods, one feature selection from the literature, and no feature selection) on the two datasets explained above (With and Without Pathologies).
Table 2 and Table 3 present the experimental results. These tables contain one row for each configuration, identified by an acronym (ID), and 11 additional columns including: the name of the algorithm (ALGORITHM), the feature-selection strategy (FS), the number of variables of the dataset (VARIABLES), the accuracy of the best solution (BEST), the accuracy of the worst solution (WORST), the mean (MEAN), and the standard deviation (STD) of 30 runs for the configuration in the row. In addition, the four last columns show the precision (PRECISION_0 and PRECISION_1) and recall values (RECALL_0 and RECALL_1) for class 0 (non-overweight/non-obese) and class 1 (overweight or obese) for the best solution with this algorithm.
The interpretation of the FS column is:
Stud-GA: Evolutionary feature selection with Stud selection operator.
Tournament-GA: Evolutionary feature selection with tournament selection operator.
RFECV: Feature selection with recursive feature elimination (RFE) with cross-validation (CV); a minimal usage sketch is given after this list.
No-FS: No feature selection applied in the configuration.
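As a rough usage sketch of the RFECV strategy (assuming scikit-learn defaults, a Gradient Boosting estimator, accuracy scoring, and a stratified 10-fold split; the actual settings may differ), where X and y denote the curated feature matrix and the overweight/obesity labels:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

selector = RFECV(estimator=GradientBoostingClassifier(),
                 step=1,                                   # remove one feature per iteration
                 cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
                 scoring="accuracy")
selector.fit(X, y)

print(selector.n_features_)          # number of variables kept by RFECV
X_reduced = selector.transform(X)    # dataset restricted to the selected variables
```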
As mentioned, ten classification algorithms were used, using the implementations available in the Scikit-learn Python library [27] (XGBoost is provided by its own package with a Scikit-learn-compatible interface):
Decision Tree (DT): Its objective is to create a model to predict the value of a target variable by learning simple decision rules from data characteristics.
Gradient Boosting (GB): An additive model is created in a stepwise way, allowing the optimization of arbitrary differentiable loss functions. At each step, a regression tree is fitted to the negative gradient of the given loss function.
Adaboost (ADB): A meta-estimator that starts by fitting a classifier on the original dataset and next fits additional copies of the classifier for the same dataset where the weights of the misclassified instances are adjusted in order to make the subsequent classifiers concentrate on the difficult cases.
Bagging (BG): Fits base classifiers on random subsets of the original dataset and aggregates its individual predictions into a final prediction.
Bernoulli Naive Bayes (BNB): This classifier is useful for discrete data and is designed to handle binary/boolean features.
Extra Trees (ET): A meta estimator that fits random Decision Trees on multiple subsamples of the dataset and uses the average to improve predictive accuracy and control overfitting.
Gaussian Naive Bayes (GNB): Another Naïve Bayes model. This classifier is used when the values of the predictors are continuous.
Logistic Regression (LR): This algorithm attempts to predict the probability that a given data entry will belong to a category. Just as linear regression assumes that the data follow a linear function, logistic regression models the data using the sigmoid function.
Random Forest (RFC): This technique fits several Decision Tree classifiers on multiple subsamples of the dataset and uses averaging to improve predictive accuracy and control overfitting.
XGBoost (XGB): A tree boosting system that stands out for its scalability and is widely used by data scientists.
Appendix A.3, Table A3 provides a table with information on the parameters used for each model.
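For reference, the ten classifiers can be instantiated as follows; this sketch uses default hyperparameters (plus a higher iteration limit for Logistic Regression), whereas the parameters actually used are those reported in Table A3. XGBoost comes from its own package rather than Scikit-learn.

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

classifiers = {
    "DT":  DecisionTreeClassifier(),
    "GB":  GradientBoostingClassifier(),
    "ADB": AdaBoostClassifier(),
    "BG":  BaggingClassifier(),
    "BNB": BernoulliNB(),
    "ET":  ExtraTreesClassifier(),
    "GNB": GaussianNB(),
    "LR":  LogisticRegression(max_iter=1000),
    "RFC": RandomForestClassifier(),
    "XGB": XGBClassifier(),
}
```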
Regarding evolutionary feature-selection methods, we focused their application on three models. The first one is Gradient Boosting [28], as it consistently obtained good results among the set of classifiers without feature selection. Next is XGBoost [29], which is a state-of-the-art machine-learning algorithm that allows us to measure the goodness of the results of the other algorithms. Finally, there is Decision Tree [30], because the use of tree-based models provides a solution with a straightforward interpretation by the medical staff. The development of understandable solutions for medical doctors is one of the main objectives of our research.

Understanding why a model makes a particular prediction can be as crucial as prediction accuracy in the medical field. In some cases, the best results are obtained with complex models that are difficult to interpret. Thanks to the SHAP library [31], it is possible to obtain an importance value for each feature for a particular prediction. The SHAP algorithm aims to explain the outcome of machine-learning models, representing the results by means of graphs. It is based on the Shapley values of game theory. In particular, in this paper we will focus on the graphs that allow us to show the impact of the different variables on the model. To understand these graphs, two factors must be taken into account: the position of the points on the horizontal axis and the color. Let us take Figure 2 as an example.
-
The color of the dots denotes the numerical value of the variable. In the case of age, the redder the dot, the higher the age, and the bluer the dot, the lower the age. In the case of sex, the red color represents the female sex and the blue color represents the male sex. In the case of education, the red color represents higher levels of education while the blue color represents people with low levels of education or no education.
-
The position of the points on the horizontal axis in our study indicates the probability of overweight/obesity. The further to the right the point is (positive values), the greater the probability that the person suffers from this problem. The further to the left (negative values), the lower the probability.
Thus, we can see as an example that Figure 2 provides us with the following information:
Age is an important factor for the probability of being overweight/obese. We can clearly see that higher ages (red dots) have a higher probability than lower ages (blue dots).
Gender is an unequivocal factor, with male gender (blue dots) being related to a higher probability of being overweight/obese while female gender (red dots) is similarly distributed but in the negative range.
Education is another important factor to take into account. Low levels of education (blue dots) have a higher probability while high levels (red dots) are related to a lower probability.
In the case of other variables, such as job, for example, we can see that the color of the points is intermixed, indicating no correlation with the probability of being overweight/obese.
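For completeness, a minimal sketch of how such summary plots can be produced with the SHAP library is given below; model stands for any fitted tree-based classifier (e.g., the Random Forest of Figure 2) and X for the corresponding feature matrix, ideally a pandas DataFrame so that the feature names appear on the vertical axis.

```python
import shap

explainer = shap.TreeExplainer(model)      # model: a fitted tree-based classifier
shap_values = explainer.shap_values(X)     # X: the feature matrix used for training

# Each dot is a sample: the horizontal position is the impact on the prediction and
# the colour is the feature value (red = high, blue = low), as in Figure 2.
# Some SHAP versions return one array per class for binary classifiers; in that case,
# plot the array of the positive (overweight/obese) class, e.g. shap_values[1].
shap.summary_plot(shap_values, X)
```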
5.1. Results Using the Dataset with Pathologies
5.1.1. Results without Feature Selection
In the scenario with pathologies, Table 2 summarizes the results of the algorithms with and without feature selection. Without feature selection (No-FS), the average accuracy rate is 0.6953 and the standard deviation is 0.0297, with DT, ADB, BNB, and GNB being below this average. The algorithms with the best results are GB and RFC, with means close to 0.73.
Figure 2 shows the impact of the different variables of the RFC model. The age variable is in first place; younger people, represented by the lowest values (blue), are mainly found in the left part of the graph, indicating a lower risk of being overweight/obese. In contrast, the higher values of this variable, corresponding to older people (red), indicate a higher risk of suffering from these health problems. In the case of sex, being male implies a higher probability of being classified in the overweight/obesity class. Third, the first variable associated with the type of pathology, apnea, appears. The variable apnea has all the blue points very close to the zero point, while the red points extend along the positive part of the axis. This indicates that, in cases of suffering from this pathology, it is likely to suffer from overweight/obesity, but otherwise, it does not have a significant impact on the prediction.
5.1.2. Results for Gradient Boosting with Classical Feature Selection (GB-RFECV)
As also shown in Table 2, using RFECV, an average accuracy rate of 0.7324 was achieved, reaching a maximum of 0.7797. A total of 17 variables were selected, including stress, some of the variables related to alcoholic drinks, education, time since quitting smoking, and physical exercise. The most commonly selected variables, such as sex, age, or apnea, are among the variables obtained. Figure 3 shows the evolution of the accuracy rate for the different algorithms and datasets, taking the average of the cross-validation runs with RFECV in relation to the number of variables chosen. The average accuracy rate increases until it reaches 17 variables and then starts to decrease, probably due to the selection of variables that introduce noise.
5.1.3. Results for Gradient Boosting with Evolutionary Feature Selection
The results presented below are obtained with GB using evolutionary FS.
-
GA with Stud selection for Gradient Boosting (GB-S): In this case, the selection method reduced the number of variables used from 37 to 19, keeping its classification rate, and even obtaining a slight improvement, reaching an average accuracy of 0.7382, as shown in Table 2. Three of the variables selected were age, sex, and apnea, these being the most frequently chosen. Other variables to be highlighted are those related to smoking, education, earning, and adherence to the Mediterranean diet.
Figure 4 shows the graph corresponding to the Gradient Boosting model with Stud selection for the dataset with pathologies. The first variables that appear are age, sex, and apnea (as in the previous cases). In the case of education, it is shown that a lower level of education increases the probability of being overweight or obese. On the other hand, those individuals who suffer from metabolic syndrome are also more likely to be overweight or obese, but as in the case of apnea, if the individual does not suffer from this pathology, the variable does not have such a strong impact. Again, this can be seen in that the blue dots are clustered next to the zero point, while the red dots are spread over the positive values.
-
GA with tournament selection for Gradient Boosting (GB-T): Using the tournament selection method with selection pressure of five, 23 variables were selected by the GA, achieving an average accuracy of 0.7332, as can be found in Table 2. Similar to the previous case, among the features selected, some of the most common ones are sex, age, adherence to the Mediterranean diet, and some new additions, which were variables related to heart disease and different types of cancer. Other variables to note are the appearance of distilled/fermented beverages, education, earnings, and stress.
5.1.4. Results for Decision Tree with Classical Feature Selection (DT-RFECV)
Using RFECV, a total of 13 variables were selected, reaching an average accuracy rate of 0.6962 and, in the best case, up to 0.7559, as shown in Table 2. Some of the variables are education, alcoholic drinks, physical exercise, and diabetes, among others. Again, the most common variables are included: sex, age, and apnea. Figure 3 shows the evolution of the accuracy rate; in this case, the maximum is obtained with 13 variables, with the accuracy decreasing as the number of variables increases.
5.1.5. Results for Decision Tree with Evolutionary Feature Selection
The results obtained with DT using evolutionary FS are presented below.
-
GA with Stud selection for Decision Tree (DT-S): Fourteen variables were selected, obtaining an average accuracy of 0.7150 and, in the best case, up to 0.7661, as can be verified in Table 2. Among the variables used were sex, age, and seven variables related to pathologies.
Figure 5 shows the impact of the different variables in the model Stud with Decision Tree with pathologies. Once again, age, sex and apnea are at the top of the list. In this case, diabetes appears with a similar distribution to apnea, although with less impact. The appearance of fermented beverages is also interesting.
-
GA with tournament selection for Decision Tree (DT-T): Thanks to the feature selection, with only 16 variables, an accuracy reaching 0.7492 was achieved in the best cases and an average of 0.7103, as shown in Table 2.
Some of the variables used for this case are metabolic syndrome, those related to cancer, fermented drinks, specifically those related to wine, and the variable heart_angina. Moreover, as in the previous cases, sex and age appear.
Figure 6 shows a heat map representing the frequency with which the different FS models (columns) selected each of the variables (rows). Depending on the color of each cell, it is possible to get an idea of the number of times a variable was selected, with the cool colors being those cases with the lowest number of occurrences and the warm colors being those that appeared most frequently. In the case of the Decision Tree, there is greater diversity in the variables chosen by the models, since the colors are not so intense, unlike Gradient Boosting. Despite the differences between the variables chosen for the models, the extremes (top and bottom) of the heat map show a similar range of colors in most cases. The most frequent variables were sex, age, education, apnea, alcoholic beverages, both fermented and distilled, and metabolic syndrome.
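A heat map like the one in Figure 6 can be drawn, for instance, with seaborn; the selection counts below are placeholders, not the actual frequencies.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder counts: rows are variables, columns are FS configurations, and each cell
# holds how many of the 30 runs selected that variable.
counts = pd.DataFrame({"GB-S": [30, 28, 5], "GB-T": [29, 27, 8],
                       "DT-S": [22, 20, 10], "DT-T": [21, 19, 12]},
                      index=["sex", "age", "job"])

sns.heatmap(counts, cmap="coolwarm", annot=True, fmt="d")
plt.show()
```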
5.1.6. Results for XGBoost with Classical Feature Selection (XGB-RFECV)
This time the average accuracy rate achieved was 0.7085, reaching a maximum of 0.7424, as shown in Table 2. A total of 20 variables were used, including sex, age, education, metabolic syndrome, apnea, IPAQ, and different variables associated with alcoholic beverages, among others. Figure 3 shows the evolution of the accuracy rate with the number of variables chosen. In this case, the maximum value is obtained with 20 variables.
5.1.7. Results for XGBoost with Evolutionary Feature Selection
The results obtained by applying the evolutionary FS to XGBoost are presented below.
-
GA with Stud selection for XGBoost (XGB-S): As shown in Table 2, an average accuracy rate of 0.7216 and a maximum of 0.7831 were achieved using 20 variables. Similar to other cases, variables such as sex, age, edu, and those related to alcoholic beverages appear. Nine of the selected variables belong to the group of pathologies. Among them, we find several types of cancer, apnea, diabetes, and metabolic syndrome.
-
GA with tournament selection for XGBoost (XGB-T): In this case, using a total of 19 variables, a mean accuracy rate of 0.7029 was obtained, and a maximum of 0.7559 was reached, as seen in Table 2. Among the most common variables, we found sex, edu, and apnea, but age did not appear. In addition, five variables related to smoking and up to nine related to pathologies were selected.
5.2. Results without Pathologies
A new batch of experiments was performed with the dataset without pathologies, using the same algorithms as in the previous section. The results with this new dataset are worse than those obtained using the dataset with pathologies. It may indicate that the models obtained using the dataset with pathologies gave more importance to these variables, which may be considered a consequence of overweight/obesity rather than a cause.
5.2.1. Results without Feature Selection
After testing the dataset without pathologies with the algorithms without feature selection, a reduction in the accuracy rate of the models was observed with respect to the results with pathologies, obtaining an average accuracy rate of 0.6825 and a standard deviation of 0.0408. RFC obtained the highest result, reaching 0.7331 accuracy, as can be seen in Table 3.
Figure 7 shows the impact of the variables for the RFC model, using the dataset without pathologies. Age and sex maintain the highest positions with the same characteristics seen in previous cases. Among the variables referring to eating habits, vegetables and soda stand out. The values of vegetables seem to have an inversely proportional relationship with the possibility of being overweight or obese while in the case of soda this relationship is direct. In Figure 8, we can see the Pearson Correlation Matrix of the variables in Figure 7. As we can see, no significant correlation can be appreciated between variables with the exception of smoke (smoker or non-smoker) and nsmoke (number of cigarettes per day).
5.2.2. Results for Gradient Boosting with Classical Feature Selection (GB-RFECV)
In this configuration, the number of variables selected was 9, with an average accuracy rate of 0.7171, reaching 0.7695 in the best case, as shown in Table 3. Legume intake, vegetable intake, physical exercise, and time as an ex-smoker are some of the selected variables. On the other hand, there are also the more common ones, such as sex, age, and education. In Figure 3, the accuracy reaches its maximum value at nine variables.
5.2.3. Results for Gradient Boosting with Evolutionary Feature Selection
The results obtained using GB with evolutionary FS are presented below.
-
GA with Stud selection for Gradient Boosting (GB-S): Using only 12 variables, an average accuracy of 0.7295 (0.7797 in the best case) was achieved, a slight improvement compared to the version without feature selection, as shown in Table 3. In this new set of variables, sex and age continue to dominate, similarly to the set with pathologies. Other variables used are those related to hours of sleep, fermented/distilled drinks, soft drinks, legume portions, and education.
The graph of the impact of the variables corresponding to the Gradient Boosting model with Stud selection for the case without pathologies is shown in Figure 9. In the highest positions we find sex, age, education, and soda with the same performance as above. A new variable to highlight is legumes, whose highest values appear in the left zone of the graph. As for alcoholic beverages, we find apparent differences between distilled and fermented beverages; for the higher values of the variable wineWEEK, the probability of being overweight/obese is lower. In the case of spirits, it seems that their intake may be associated with overweight/obesity.
-
GA with tournament selection for Gradient Boosting (GB-T): Using the tournament selection method, the number of variables used was also 12 with an average accuracy of 0.7307 and for the best-case scenario up to 0.7763, as shown in Table 3. New variables were included: adherence to the Mediterranean diet and the population. Other variables seen previously such as education, soft drinks, distilled/fermented beverages, sex, age, and hours of sleep, among others, also appear.
5.2.4. Results for Decision Tree with Classical Feature Selection (DT-RFECV)
For this case, RFECV selected a single variable, age, achieving an average accuracy rate of 0.6821 and a maximum of 0.7186, as shown in Table 3. Classical feature-selection algorithms can produce solutions with very few variables; in our case, we are looking for models that explain more extensively the causes of overweight and obesity, so we explore other solutions, such as those based on evolutionary algorithms, that allow us to obtain models with a greater number of variables and similar accuracy. Looking at Figure 3, the accuracy reaches its maximum value with only one variable, but here we have to consider two facts:
This value is not very far from the case of using all the variables;
From a medical point of view, the use of a single variable model does not provide any help to clinical practice.
Figure 3 shows the variation of the cross-validation score (inside RFECV strategy) for the different algorithms for both the pathology and non-pathology cases. This value is of little interest if we look at the small difference in the values on the horizontal axis and at the fact that we are comparing two different datasets. What really interests us is to see the differences in the number of selected features (peaks marked with a red dot) and how these values vary depending on the algorithms and the datasets.
5.2.5. Results for Decision Tree with Evolutionary Feature Selection
The following results were obtained with the DT using the evolutionary GA.
-
GA with Stud selection for Decision Tree (DT-S): Using the Stud selection method, a total of 14 variables were chosen, with an average accuracy of 0.6934 and 0.7322 for the best case, as can be seen in Table 3. In this case, the age variable was selected but not the sex variable. The variables chosen include education, population, hours of sleep, alcoholic drinks, soft drinks, legume portions, vegetable portions, and the stress variable as a new incorporation.
Figure 10 shows the impact of the different variables for the Decision Tree model with Stud selection for the set without pathologies. First, age appears again; the next variable is education, and it is observed that, for higher values (higher levels of education), the probability of being overweight/obese is lower. In the case of soft drinks, higher consumption implies a higher probability of being overweight or obese. Another variable to note is vegetable, where it is observed that most of the red dots are on the left side of the graph, so a higher vegetable intake can be associated with a lower probability of being overweight/obese. It would be interesting to study in the future the causes of the behaviors of the variables stress and ex-smoker.
-
GA with tournament selection for Decision Tree (DT-T): In this case, the number of variables used was 15, obtaining an average accuracy of 0.6862 and reaching 0.7525 in the best case, as shown in Table 3. Among the variables chosen are portions of legumes, fermented drinks, the time without smoking of an ex-smoker, and the IPAQ, the variable that reflects physical exercise. Other variables found are the most common, such as sex, age, soft drinks, and education, among others.
Figure 11 shows the heat map obtained by representing the frequency of variables selected from the dataset without pathologies using the FS models with genetic algorithms. There is a closer correlation between the variables selected with the GB and DT models, compared to the case of the dataset with pathologies. Again the most commonly selected variables are age, sex, and education followed by alcoholic beverages. As a new addition among the most selected variables, we find the variable soda.
5.2.6. Results for XGBoost with Classical Feature Selection (XGB-RFECV)
With only two variables, sex and age, an average accuracy rate of 0.6850 and a maximum of 0.7186 were obtained, as shown in Table 3. The evolution of the accuracy rate as a function of the number of variables used is shown in Figure 3. From a medical point of view, the selection of only these two variables does not provide any information.
5.2.7. Results for XGBoost with Evolutionary Feature Selection
Results obtained with the XGBoost using the evolutionary GA were as follows.
-
GA with Stud selection for XGBoost (XGB-S): As shown in Table 3, a mean accuracy rate of 0.6989 and a maximum of 0.7424 were achieved using 12 variables. In this case, among the variables selected, we found some of the most common ones, such as sex, age, edu, or soda. In addition, there are variables such as vegetable or those related to alcoholic beverages.
-
GA with tournament selection for XGBoost (XGB-T): Using the 11 selected variables, an average accuracy rate of 0.6948 and a maximum of 0.7288 were obtained, as can be seen in Table 3. In addition to the most common variables, such as sex, age, education, or soda, we can find variables related to alcoholic beverages, both distilled and fermented, vegetable intake, and earning.
Table 4 gives a sample of the times, in seconds, taken by the GA for the different algorithms and datasets. One run of each type was carried out individually (no other program was running in parallel) on the same machine to obtain these values. The time is strongly related to the algorithm chosen.
6. Statistical Analysis
We performed a nonparametric statistical analysis once the results were obtained for the two proposed problems, with and without pathologies. The Friedman test [32] is a nonparametric [33] alternative to the two-way parametric analysis of variance that tries to detect significant differences between the behavior of two or more algorithms. It can be used to identify if, in a set of k samples (where k ≥ 2), at least two of the samples represent populations with different median values. The first step is to convert the original results, in Table 5, into ranks to produce Table 6. Table 5 shows the average accuracy rate for each of the algorithms on the two datasets.
-
Gather the observed results for each algorithm/problem pair.
-
For each problem i, rank the values from 1 (best result) to k (worst result). Denote these ranks as:
$r_i^j \quad (1 \leq j \leq k)$, (5)
where k is the number of algorithms.
-
For each algorithm j, average the ranks obtained in all problems to obtain the final rank:
$R_j = \frac{1}{n} \sum_{i=1}^{n} r_i^j$, (6)
where n is the number of datasets.
In Table 6, the best algorithms are Gradient Boosting with the Stud selection method, Gradient Boosting with the tournament method, and Random Forest without feature selection. In both problems, the algorithms with feature selection outperform their counterparts without feature selection.
Under the null hypothesis (H0), which states that all algorithms behave similarly, so their ranks should be equal, the Friedman statistic can be calculated as:
$\chi_F^2 = \frac{12 n}{k (k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k (k+1)^2}{4} \right]$ (7)

which is distributed according to a $\chi^2$ distribution with k − 1 degrees of freedom (k = 19) [33]. As usual, we can define the p-value as the probability of obtaining a result as extreme as, or more extreme than, the observed one, provided that the null hypothesis is true. We have chosen a significance level of 0.05; for this level, the critical value of the $\chi^2$ distribution with 18 degrees of freedom is 28.8693. To reject H0, it is necessary to exceed this value. Applying the formula of Friedman's statistic, Equation (7), a value of 34.6421 is obtained, so the critical value is exceeded and the hypothesis H0 is rejected, confirming that there are significant differences between the models.
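For reference, the same test can be reproduced (a sketch, assuming that the per-dataset mean accuracies of Table 5 are arranged as one list per algorithm; only three of the k = 19 algorithms are shown) with SciPy:

```python
from scipy.stats import friedmanchisquare

# One entry per algorithm; each list holds the mean accuracy on the dataset with
# pathologies and on the dataset without pathologies (values from Tables 2 and 3).
results = {
    "GB-S": [0.7382, 0.7295],
    "GB-T": [0.7332, 0.7307],
    "RFC":  [0.7292, 0.7331],
    # ... one entry for each of the k = 19 configurations of Table 5
}

statistic, p_value = friedmanchisquare(*results.values())
print(statistic, p_value)  # H0 (all algorithms behave similarly) is rejected if p_value < 0.05
```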
7. Conclusions and Future Work
In this work, a set of classification systems for identifying persons at risk of suffering from overweight/obesity has been developed. Four different FS strategies have been employed on the two experimental datasets: one FS method from the literature, two evolutionary FS methods, and no FS. For this work, ten machine-learning models have been employed. For the application of the feature-selection methods, we chose three of these ten models, based on performance and possible application to clinical practice. Thus, the FS methods were successfully applied to the Gradient Boosting, XGBoost, and Decision Tree algorithms. The most important conclusions of this work are:
Although not a surprising finding, we have found GA to be a very competent tool to perform feature selection and thus improve the training of classification models.
The Stud selection, which uses an elitist set that is always part of the crossover process, achieves promising results.
If we look at the fitness of the best individual in each of the final populations, they all maintain a similar standard deviation, perhaps indicating that we might expect to obtain similar results in the future with new data sets.
Regarding the variables related to pathologies, it is necessary to identify what is the cause and what is the effect. A good example would be sleep apnea. On several occasions, the models have relied on apnea when classifying, but this disorder is caused by overweight or obesity in many cases. A person may have apnea due to being overweight or obese but will not necessarily be overweight or obese due to suffering from apnea.
Finally, significant differences were found among the algorithms, with Gradient Boosting with feature selection being the one obtaining the best results.
The models developed in this work will be the basis of a recommendation system. This system will be able to warn people about behavioral tendencies that will end up producing overweight and obesity and recommend healthy habits to replace them.
Future Work
Although the overall accuracy of the model is important, from a medical point of view, classifying an individual at risk of overweight/obesity as not at risk is more detrimental than classifying a healthy person as at risk. The model must take this into account, so it would be interesting to find different fitness functions or develop a multi-objective evolutionary algorithm that could increase accuracy and reduce the number of false negatives for at-risk individuals. Therefore, we intend to test with precision and recall as fitness functions instead of accuracy.
In this work, and for space reasons, we have only used accuracy as a fitness function. Therefore, the investigation of other fitness functions, as well as of heuristic search or dynamic selection, which have already been tested in the literature of other fields [34,35], remains as future work.
It is also necessary to study carefully the parameters of the evolutionary algorithm, testing with different combinations of population size and number of generations.
It would be highly recommended to increase the volume of the dataset as it could improve the accuracy of the models. As part of the project, genetic information of individuals will be incorporated, so it will be necessary to perform a study on the impact of these variables and their possible interaction with the current ones.
In this study, we achieved an accuracy close to 0.8. Can these results be considered good enough? Can they be improved? Are they unacceptable? Although the members of our research team who are specialists in medicine consider these results good, we lack a context with a clear metric to determine where the minimum acceptable level is and how high we can reach. Establishing this scale remains as ambitious future work.
Conceptualization: J.I.H., J.M.V. and J.J.Z.-L.; methodology: J.J.Z.-L., D.P., O.G. and J.I.H.; software: D.P. and A.G.-G.; validation: O.G., J.J.Z.-L. and J.M.V.; formal analysis: J.J.Z.-L., N.d.l.H. and J.I.H.; investigation: D.P. and K.Z.-N.; resources: K.Z.-N., J.J.Z.-L., N.d.l.H. and J.I.H.; data curation: K.Z.-N., D.P., A.G.-G. and N.d.l.H.; writing—original draft preparation: D.P. and A.G.-G.; writing—review and editing: J.M.V., O.G. and J.I.H.; visualization: D.P., A.G.-G. and J.M.V.; supervision: J.I.H.; project administration: J.I.H. and J.J.Z.-L.; funding acquisition: J.J.Z.-L. and J.I.H. All authors have read and agreed to the published version of the manuscript.
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Clinical Research Ethics Committee of the Community of Madrid (Comité Ético de Investigación Clínica de la Comunidad de Madrid (CEIC-R)) and the Clinical Research Ethics Regional Committee (CEIC-R) (Comité Ético de Investigación Clínica-Regional (CEIC-R)). Genetic analyses have always been carried out in compliance with the provisions specified in the Biomedical Research Law (14/2007) and in the Personal Data Protection Law (Law 15/1999) from Spain.
Informed consent was obtained from all subjects involved in the study.
The data that support the findings of this study are available on reasonable request from the corresponding author [J.I. Hidalgo]. The data are not publicly available due to legal restrictions.
We would also like to thank the centers that provided the data and made this work possible. 1. Atención Primaria. 2. Hospital Clínico San Carlos. 3. Hospital Universitario 12 de Octubre. 4. Hospital Universitario La Paz. 5. Hospital General Universitario Gregorio Marañón. 6. Hospital Universitario Ramón y Cajal. 7. Hospital Universitario Infanta Leonor.
The authors declare no conflict of interest.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Figure 2. Impact of variables with SHAP, Random Forest with pathologies, and no FS.
Figure 4. Impact of variables with SHAP, Gradient Boosting with Stud selection, with pathologies.
Figure 5. Impact of variables with SHAP, Decision Tree with Stud selection, with pathologies.
Figure 6. Heat map for frequency of the different variables using FS with GA, dataset with pathologies.
Figure 7. Impact of variables with SHAP, Random Forest without pathologies, and no FS.
Figure 9. Impact of variables with SHAP, Gradient Boosting with Stud selection, without pathologies.
Figure 10. Impact of Variables with SHAP, Decision Tree with Stud Selection, without Pathologies.
Figure 11. Heat map representation of the frequency of different variables obtained using FS with GA, dataset without pathologies.
Table 1. Example of confusion matrix structure for binary classification.
 | Positive Prediction | Negative Prediction |
---|---|---|
Positive Class | (TP) | (FN) |
Negative Class | (FP) | (TN) |
Table 2. Results of the different algorithms for the set of variables with pathologies. The table shows the algorithm, selection method, number of variables used, best case, worst case, mean, and standard deviation. The algorithm with the highest mean is marked with bold font.
ID | ALGORITHM | FS | VARIABLES | BEST | WORST | MEAN | STD +/− | PRECISION_0 | PRECISION_1 | RECALL_0 | RECALL_1 |
---|---|---|---|---|---|---|---|---|---|---|---|
DT-S | Decision Tree | Stud-GA | 14 | 0.7661 | 0.6407 | 0.7150 | 0.0309 | 0.7682 | 0.7639 | 0.7733 | 0.7586 |
DT-T | Decision Tree | Tournament-GA | 16 | 0.7492 | 0.6542 | 0.7103 | 0.0238 | 0.7255 | 0.7746 | 0.7762 | 0.7237 |
DT-RFECV | Decision Tree | RFECV | 13 | 0.7559 | 0.6373 | 0.6962 | 0.0257 | 0.7597 | 0.7518 | 0.7697 | 0.7413 |
DT | Decision Tree | No FS | 37 | 0.7254 | 0.6237 | 0.6914 | 0.0232 | 0.7035 | 0.7561 | 0.8013 | 0.6458 |
XGB-S | XGBOOST | Stud-GA | 20 | 0.7831 | 0.6780 | 0.7216 | 0.0245 | 0.7451 | 0.8239 | 0.8201 | 0.7500 |
XGB-T | XGBOOST | Tournament-GA | 19 | 0.7559 | 0.6746 | 0.7029 | 0.0187 | 0.7614 | 0.7479 | 0.8171 | 0.6794 |
XGB-RFECV | XGBOOST | RFECV | 20 | 0.7424 | 0.6746 | 0.7085 | 0.0201 | 0.7677 | 0.7143 | 0.7484 | 0.7353 |
XGB | XGBOOST | No FS | 37 | 0.7593 | 0.6441 | 0.6981 | 0.0225 | 0.7669 | 0.7500 | 0.7911 | 0.7226 |
GB-S | Gradient Boosting | Stud-GA | 19 | 0.7966 | 0.6881 | **0.7382** | 0.0231 | 0.8204 | 0.7656 | 0.8204 | 0.7656 |
GB-T | Gradient Boosting | Tournament-GA | 23 | 0.7864 | 0.6644 | 0.7332 | 0.0251 | 0.7738 | 0.8031 | 0.8387 | 0.7286 |
GB-RFECV | Gradient Boosting | RFECV | 17 | 0.7797 | 0.6915 | 0.7324 | 0.0199 | 0.7727 | 0.7899 | 0.8447 | 0.7015 |
GB | Gradient Boosting | No FS | 37 | 0.7797 | 0.6814 | 0.7305 | 0.0211 | 0.7683 | 0.7939 | 0.8235 | 0.7324 |
ADB | ADABOOST | No FS | 37 | 0.6746 | 0.6000 | 0.6424 | 0.0189 | 0.6690 | 0.6800 | 0.6690 | 0.6800 |
BG | BAGGING | No FS | 37 | 0.7458 | 0.6678 | 0.7114 | 0.0210 | 0.7486 | 0.7411 | 0.8253 | 0.6434 |
BNB | BERNOULLI NB | No FS | 37 | 0.7220 | 0.6271 | 0.6675 | 0.0198 | 0.7429 | 0.6917 | 0.7784 | 0.6484 |
ET | EXTRA TREES | No FS | 37 | 0.7627 | 0.6508 | 0.7119 | 0.0276 | 0.7419 | 0.7857 | 0.7931 | 0.7333 |
GNB | GAUSSIAN NB | No FS | 37 | 0.7051 | 0.5932 | 0.6603 | 0.0230 | 0.6792 | 0.7711 | 0.8834 | 0.4848 |
LR | LOGISTIC REGRESSION | No FS | 37 | 0.7627 | 0.6780 | 0.7098 | 0.0190 | 0.7603 | 0.7651 | 0.7603 | 0.7651 |
RFC | RANDOM FOREST | No FS | 37 | 0.7763 | 0.6644 | 0.7292 | 0.0240 | 0.7582 | 0.7958 | 0.8000 | 0.7533 |
Results of the different algorithms for the set of variables without pathologies. The table shows, for each algorithm and feature-selection method, the number of variables used, the best, worst, and mean accuracy, the standard deviation, and the precision and recall for each class. The highest mean accuracy is obtained by RFC (Random Forest without feature selection).
ID | ALGORITHM | FS | VARIABLES | BEST | WORST | MEAN | STD +/− | PRECISION_0 | PRECISION_1 | RECALL_0 | RECALL_1 |
---|---|---|---|---|---|---|---|---|---|---|---|
DT-S | Decision Tree | Stud-GA | 14 | 0.7322 | 0.6407 | 0.6934 | 0.0212 | 0.7419 | 0.7214 | 0.7468 | 0.7163 |
DT-T | Decision Tree | Tournament-GA | 15 | 0.7525 | 0.6169 | 0.6862 | 0.0307 | 0.7548 | 0.7500 | 0.7697 | 0.7343 |
DT-RFECV | Decision Tree | RFECV | 1 | 0.7186 | 0.6373 | 0.6821 | 0.0195 | 0.7200 | 0.7172 | 0.7248 | 0.7123 |
DT | Decision Tree | No FS | 26 | 0.7153 | 0.6203 | 0.6799 | 0.0252 | 0.7533 | 0.6759 | 0.7062 | 0.7259 |
XGB-S | XGBOOST | Stud-GA | 12 | 0.7424 | 0.6237 | 0.6989 | 0.0260 | 0.7702 | 0.7090 | 0.7607 | 0.7197 |
XGB-T | XGBOOST | Tournament-GA | 11 | 0.7288 | 0.6542 | 0.6948 | 0.0223 | 0.7235 | 0.7360 | 0.7885 | 0.6619 |
XGB-RFECV | XGBOOST | RFECV | 2 | 0.7186 | 0.6475 | 0.6850 | 0.0164 | 0.6746 | 0.7778 | 0.8028 | 0.6405 |
XGB | XGBOOST | No FS | 26 | 0.7254 | 0.6068 | 0.6803 | 0.0295 | 0.7024 | 0.7559 | 0.7919 | 0.6575 |
GB-S | Gradient Boosting | Stud-GA | 12 | 0.7797 | 0.6814 | 0.7295 | 0.0230 | 0.7578 | 0.8060 | 0.8243 | 0.7347 |
GB-T | Gradient Boosting | Tournament-GA | 12 | 0.7763 | 0.6610 | 0.7307 | 0.0250 | 0.7636 | 0.7923 | 0.8235 | 0.7254 |
GB-RFECV | Gradient Boosting | RFECV | 9 | 0.7695 | 0.6610 | 0.7171 | 0.0236 | 0.7471 | 0.8000 | 0.8355 | 0.6993 |
GB | Gradient Boosting | No FS | 26 | 0.7695 | 0.6678 | 0.7169 | 0.0280 | 0.7857 | 0.7518 | 0.7756 | 0.7626 |
ADB | ADABOOST | No FS | 26 | 0.6678 | 0.5661 | 0.6236 | 0.0262 | 0.6883 | 0.6454 | 0.6795 | 0.6547 |
BG | BAGGING | No FS | 26 | 0.7695 | 0.6644 | 0.7086 | 0.0254 | 0.7709 | 0.7672 | 0.8364 | 0.6846 |
BNB | BERNOULLI NB | No FS | 26 | 0.6881 | 0.5695 | 0.6154 | 0.0295 | 0.7048 | 0.6667 | 0.7312 | 0.6370 |
ET | EXTRA TREES | No FS | 26 | 0.7661 | 0.6542 | 0.7125 | 0.0251 | 0.7197 | 0.8188 | 0.8188 | 0.7197 |
GNB | GAUSSIAN NB | No FS | 26 | 0.6881 | 0.5763 | 0.6488 | 0.0279 | 0.6550 | 0.7339 | 0.7724 | 0.6067 |
LR | LOGISTIC REGRESSION | No FS | 26 | 0.7559 | 0.6644 | 0.7060 | 0.0232 | 0.7697 | 0.7385 | 0.7888 | 0.7164 |
RFC | RANDOM FOREST | No FS | 26 | 0.7695 | 0.6983 | 0.7331 | 0.0188 | 0.7987 | 0.7353 | 0.7791 | 0.7576 |
Measured execution times, in seconds, for each GA-based feature-selection configuration and each problem.
GB-T | GB-S | DT-T | DT-S | XGB-T (GPU) | XGB-S (GPU) | XGB-T (CPU) | XGB-S (CPU) | |
---|---|---|---|---|---|---|---|---|
With pathologies | 4641.42 | 5203.35 | 178.31 | 189.13 | 13,140.59 | 13,492.46 | 2130.07 | 2085.69 |
Without pathologies | 4311.17 | 4384.38 | 200.53 | 194.88 | 13,781.81 | 12,732.41 | 2220.16 | 2558.31 |
Average accuracy of each algorithm for each problem.
Problem | DT-S | DT-T | DT-RFECV | DT | XGB-S | XGB-T | XGB-RFECV | XGB | GB-S | GB-T | GB-RFECV | GB | ADB | BG | BNB | ET | GNB | LR | RFC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
With Pathologies | 0.7150 | 0.7103 | 0.6962 | 0.6914 | 0.7216 | 0.7029 | 0.7085 | 0.6981 | 0.7382 | 0.7332 | 0.7324 | 0.7305 | 0.6424 | 0.7114 | 0.6675 | 0.7119 | 0.6603 | 0.7098 | 0.7292 |
Without Pathologies | 0.6934 | 0.6862 | 0.6821 | 0.6799 | 0.6989 | 0.6948 | 0.6850 | 0.6803 | 0.7295 | 0.7307 | 0.7171 | 0.7169 | 0.6236 | 0.7086 | 0.6154 | 0.7125 | 0.6488 | 0.7060 | 0.7331 |
Friedman ranks for the proposed algorithms. The first nine algorithm columns (GB-S through DT-RFECV) correspond to configurations with feature selection; the remaining ten (RFC through ADB) use no feature selection.
Problem | GB-S | GB-T | GB-RFECV | XGB-S | DT-S | DT-T | XGB-T | XGB-RFECV | DT-RFECV | RFC | GB | ET | BG | LR | XGB | DT | GNB | BNB | ADB |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
With Pathologies | 1 | 2 | 3 | 6 | 7 | 10 | 13 | 12 | 15 | 5 | 4 | 8 | 9 | 11 | 14 | 16 | 18 | 17 | 19 |
Without Pathologies | 3 | 2 | 4 | 9 | 11 | 12 | 10 | 13 | 14 | 1 | 5 | 6 | 7 | 8 | 15 | 16 | 17 | 19 | 18 |
Mean | 2 | 2 | 3.5 | 7.5 | 9 | 11 | 11.5 | 12.5 | 14.5 | 3 | 4.5 | 7 | 8 | 9.5 | 14.5 | 16 | 17.5 | 18 | 18.5 |
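As a rough illustration of how these ranks are obtained, the sketch below ranks per-problem mean accuracies (a subset copied from the average-results table above) and averages the ranks across problems. The `friedmanchisquare` call is only indicative, since two problems is a very small number of blocks for the test.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Mean accuracies per problem (rows) for a subset of algorithms (columns),
# taken from the average-results table above.
algorithms = ["GB-S", "GB-T", "RFC", "DT-S"]
scores = np.array([
    [0.7382, 0.7332, 0.7292, 0.7150],   # with pathologies
    [0.7295, 0.7307, 0.7331, 0.6934],   # without pathologies
])

# Rank within each problem: rank 1 = highest accuracy.
ranks = np.array([rankdata(-row) for row in scores])
print(dict(zip(algorithms, ranks.mean(axis=0))))   # mean Friedman rank per algorithm

# Friedman test over the per-algorithm score vectors (illustrative only).
stat, p = friedmanchisquare(*scores.T)
print(f"chi-square = {stat:.3f}, p = {p:.3f}")
```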
Appendix A
Appendix A.1. Representation of Project Variables
Representation of project variables.
ID | Variable | Description | Type |
---|---|---|---|
1 | sex | Sex of the person | General information |
2 | age | Age of the person in years | General information |
3 | pop | Volume of the population where the person resides | General information |
4 | edu | Academic level attained by the person | General information |
5 | earning | Income level of the person | General information |
6 | job | Work type performed by the person | General information |
7 | stress | Person self-perceived stress | General information |
8 | sleep.8 | The person sleeps more than eight hours | General information |
9 | spirit | The person drinks spirits | Alcoholic drinks |
10 | spiritWEEK | Units of spirit drinks per week | Alcoholic drinks |
11 | wine_beer | The person drinks beer or wine | Alcoholic drinks |
12 | beerWEEK | Units of beer per week | Alcoholic drinks |
13 | wineWEEK | Units of red wine per week | Alcoholic drinks |
14 | whiteWEEK | Units of white wine per week | Alcoholic drinks |
15 | pinkWEEK | Units of rosé wine per week | Alcoholic drinks |
16 | smoke | The person smokes | Tobacco |
17 | nsmoke | Cigarettes consumed per day | Tobacco |
18 | pipe | Pipe tobacco consumed per day | Tobacco |
19 | cigar | Cigars consumed per day | Tobacco |
20 | exsmokerY | Time since a smoker quit smoking in years | Tobacco |
21 | exsmokerUNK | The person has given up smoking but does not remember how long ago. | Tobacco |
22 | cancer | The person has suffered or suffers from cancer | Pathologies |
23 | cancer_mam | The person has suffered or suffers from breast cancer. | Pathologies |
24 | cancer_col | The person has suffered or suffers from colon cancer | Pathologies |
25 | cancer_pros | The person has suffered or suffers from prostate cancer. | Pathologies |
26 | cancer_lung | The person has suffered or suffers from lung cancer. | Pathologies |
27 | cancer_other | The person has suffered or suffers from another type of cancer. | Pathologies |
28 | heart_attack | The person has suffered an acute myocardial infarction. | Pathologies |
29 | heart_angina | The person has suffered angina pectoris. | Pathologies |
30 | heart_failure | The person has suffered heart failure | Pathologies |
31 | diabetes | The person has type 2 diabetes mellitus | Pathologies |
32 | metabolic_syn | The person suffers from metabolic syndrome | Pathologies |
33 | apnea | The person suffers from sleep apnea | Pathologies |
34 | asthma | The person has asthma | Pathologies |
35 | COPD | The person suffers from chronic obstructive pulmonary disease. | Pathologies |
36 | ADH | The person has adherence to the Mediterranean diet. | Nutritional habits |
37 | vege | Servings of vegetables consumed by the individual per day | Nutritional habits |
38 | soda | Servings of carbonated and/or sweetened drinks consumed by the subject per day | Nutritional habits |
39 | legume | Servings of legumes consumed by the subject per week | Nutritional habits |
40 | milk | Servings of milk or dairy products consumed by the subject per day | Nutritional habits |
41 | IPAQ | Subject scores on the International Physical Activity Questionnaire (IPAQ) | Physical exercise |
Appendix A.2. Representation of All Variables in the Original Data Set
Representation of all variables in the original data set.
ID | Variable | Description | Type |
---|---|---|---|
1 | n | Inclusion number | General information |
2 | center | Center | General information |
3 | sex | Sex of the person | General information |
4 | age | Age of the person in years | General information |
5 | height | Height (m) | General information |
6 | weight | Weight (kg) | General information |
7 | IMC | Body Mass Index (BMI) | General information |
8 | waist | Waist circumference (cm) | General information |
9 | pop | Volume of the population where the person resides | General information |
10 | edu | Academic level attained by the person | General information |
11 | earning | Income level of the person | General information |
12 | job | Work type performed by the person | General information |
13 | stress | Person self-perceived stress | General information |
14 | sleep.8 | The person sleeps more than eight hours | General information |
15 | spirit | The person drinks spirits | Alcoholic drinks |
16 | spiritWEEK | Units of spirit drinks per week | Alcoholic drinks |
17 | wine_beer | The person drinks beer or wine | Alcoholic drinks |
18 | beerWEEK | Units of beer per week | Alcoholic drinks |
19 | wineWEEK | Units of red wine per week | Alcoholic drinks |
20 | whiteWEEK | Units of white wine per week | Alcoholic drinks |
21 | pinkWEEK | Units of rosé wine per week | Alcoholic drinks |
22 | smoke | The person smokes | Tobacco |
23 | nsmoke | Cigarettes consumed per day | Tobacco |
24 | pipe | Pipe tobacco consumed per day | Tobacco |
25 | cigar | Cigars consumed per day | Tobacco |
26 | exsmokerY | Time since a smoker quit smoking (years) | Tobacco |
27 | exsmokerM | Time since a smoker quit smoking (months) | Tobacco |
28 | exsmokerUNK | The person has given up smoking but does not remember how long ago | Tobacco |
29 | cancer | The person has suffered or suffers from cancer | Pathologies |
30 | cancer_mam | The person has suffered or suffers from breast cancer. | Pathologies |
31 | cancer_col | The person has suffered or suffers from colon cancer | Pathologies |
32 | cancer_pros | The person has suffered or suffers from prostate cancer. | Pathologies |
33 | cancer_lung | The person has suffered or suffers from lung cancer | Pathologies |
34 | cancer_other | The person has suffered or suffers from another type of cancer. | Pathologies |
35 | heart_attack | The person has suffered an acute myocardial infarction. | Pathologies |
36 | heart_angina | The person has suffered angina pectoris. | Pathologies |
37 | heart_failure | The person has suffered heart failure | Pathologies |
38 | diabetes | The person has type 2 diabetes mellitus | Pathologies |
39 | hemo | Glycosylated hemoglobin (%) | Pathologies |
40 | metabolic_syn | The person suffers from metabolic syndrome | Pathologies |
41 | apnea | The person suffers from sleep apnea | Pathologies |
42 | asthma | The person has asthma | Pathologies |
43 | COPD | The person suffers from chronic obstructive pulmonary disease. | Pathologies |
44 | ADH | The person has adherence to the Mediterranean diet. | Nutritional habits |
45 | ADH_tot | Total ADH points | Nutritional habits |
46 | olive | The person uses olive oil | Nutritional habits |
47 | n_olive | The person uses olive oil (POINTS) | Nutritional habits |
48 | tot_olive | Tablespoons of olive oil consumed in total per day | Nutritional habits |
49 | ntot_olive | Tablespoons of olive oil consumed in total per day (POINTS) | Nutritional habits |
50 | vege | Servings of vegetables consumed per day | Nutritional habits |
51 | n_vege | Servings of vegetables consumed per day (POINTS) | Nutritional habits |
52 | fruit | Pieces of fruit (including natural juice) consumed per day | Nutritional habits |
53 | n_fruit | Pieces of fruit (including natural juice) consumed per day (POINTS) | Nutritional habits |
54 | burger | Red meat portions | Nutritional habits |
55 | n_burger | Red meat portions (POINTS) | Nutritional habits |
56 | cream | Servings of butter, margarine or cream consumed per day | Nutritional habits |
57 | n_cream | Servings of butter, margarine or cream consumed per day (POINTS) | Nutritional habits |
58 | soda | Glasses of carbonated and/or sweetened beverages per day | Nutritional habits |
59 | n_soda | Glasses of carbonated and/or sweetened beverages per day (POINTS) | Nutritional habits |
60 | wine_week | Wine consumed per week | Nutritional habits |
61 | n_wine_week | Wine consumed per week (POINTS) | Nutritional habits |
62 | legume | Servings of legumes per week | Nutritional habits |
63 | n_legume | Servings of legumes per week (POINTS) | Nutritional habits |
64 | fish | Servings of fish or seafood consumed per week | Nutritional habits |
65 | n_fish | Servings of fish or seafood consumed per week (POINTS) | Nutritional habits |
66 | cake | Times per week consuming commercial bakery products | Nutritional habits |
67 | n_cake | Times per week consuming commercial bakery products (POINTS) | Nutritional habits |
68 | nuts | Servings of nuts and dried fruit consumed per week | Nutritional habits |
69 | n_nuts | Servings of nuts and dried fruit consumed per week (POINTS) | Nutritional habits |
70 | chicken | Preferably consume chicken, turkey or rabbit meat instead of beef, pork, hamburgers or sausages | Nutritional habits |
71 | n_chicken | Preferably consume chicken, turkey or rabbit meat instead of beef, pork, hamburgers or sausages (POINTS) | Nutritional habits |
72 | sauce | Times a week eat cooked vegetables, pasta, rice or other dishes seasoned with a tomato, garlic, onion or leek sauce simmered with olive oil | Nutritional habits |
73 | n_sauce | Times a week eat cooked vegetables, pasta, rice or other dishes seasoned with a tomato, garlic, onion or leek sauce simmered with olive oil (POINTS) | Nutritional habits |
74 | milk | Milk or dairy products (yogurts, cheese) consumed per day | Nutritional habits |
75 | n_milk | Milk or dairy products (yogurts, cheese) consumed per day (POINTS) | Nutritional habits |
76 | milk_light | The person takes skimmed dairy products | Nutritional habits |
77 | n_milk_light | The person takes skimmed dairy products (POINTS) | Nutritional habits |
78 | IPAQ | IPAQ Points | Physical exercise |
79 | cal_IPAQ | IPAQ Calories | Physical exercise |
80 | exercise_H | Days of intense physical exercise | Physical exercise |
81 | exercise_H_mets | Days of intense physical exercise (Mets) | Physical exercise |
82 | exercise_H_min | Intense physical exercise in one day (minutes) | Physical exercise |
83 | exercise_H_tot | The person is not sure about the time of intense physical exercise in one day | Physical exercise |
84 | exercise_L | Days of moderate physical exercise | Physical exercise |
85 | exercise_L_mets | Days of moderate physical exercise (Mets) | Physical exercise |
86 | exercise_L_min | Moderate physical exercise in one day (minutes) | Physical exercise |
87 | exercise_L_tot | The person is not sure about the time of moderate physical exercise in one day | Physical exercise |
88 | exercise_walk | Days of sedentary physical exercise | Physical exercise |
89 | exercise_walk_mets | Days of sedentary physical exercise (Mets) | Physical exercise |
90 | exercise_walk_min | Sedentary physical exercise in one day (minutes) | Physical exercise |
91 | exercise_walk_tot | The person is not sure about the time of sedentary physical exercise in one day | Physical exercise |
92 | exercise_sit_min | Time spent sitting during a day (minutes) | Physical exercise |
93 | exercise_sit | The person is not sure how much time was spent sitting during a day (minutes) | Physical exercise |
Appendix A.3. Model Parameters
Details of the model parameters used for each classifier.
MODEL | Objective | ccp_Alpha | Class Weight | Criterion | Learning_Rate | Loss | Max_Depth | Max_Features | n_estimators | Splitter | Bootstrap | Algorithm | Base_Estimator | Max_iter | Solver | Tol | Penalty | Priors | Var_Smoothing | Binarize | Fit_Prior |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XGB | binary:logistic | - | - | - | None | - | None | - | 100 | - | - | - | - | - | - | - | - | - | - | - | - |
GB | - | 0.0 | - | friedman_mse | 0.1 | deviance | 3 | None | 100 | - | - | - | - | - | - | - | - | - | - | - | - |
DT | - | 0.0 | None | gini | - | - | 6 | None | - | best | - | - | - | - | - | - | - | - | - | - | - |
RFC | - | 0.0 | None | gini | - | - | None | auto | 100 | - | True | - | - | - | - | - | - | - | - | - | - |
ADB | - | 0.0 | None | gini | 1.0 | - | None | None | 500 | best | - | SAMME | DecisionTreeClassifier | - | - | - | - | - | - | - | - |
LR | - | - | balanced | - | - | - | - | - | - | - | - | - | - | 100 | lbfgs | 0.0001 | l2 | - | - | - | - |
ET | - | 0.0 | None | gini | - | - | None | auto | 250 | - | False | - | - | - | - | - | - | - | - | - | - |
GNB | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | None | 1 × 10−9 | - | - |
BNB | - | 1.0 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 0.0 | True |
BG | - | - | - | - | - | - | - | auto | 250 | - | True | - | - | - | - | - | - | - | - | - | - |
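The values listed above largely correspond to scikit-learn and XGBoost defaults. As an illustration only (not the authors' configuration code), several of these classifiers could be instantiated as follows; the `binary:logistic` objective and the `base_estimator` argument name are assumptions that depend on the library versions used.

```python
# Illustrative instantiation of several classifiers with the parameter values
# listed in the table above; a sketch under assumed library versions, not the
# authors' exact configuration code.
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = {
    "XGB": XGBClassifier(objective="binary:logistic", n_estimators=100),  # assumed objective
    "GB":  GradientBoostingClassifier(learning_rate=0.1, max_depth=3, n_estimators=100),
    "DT":  DecisionTreeClassifier(criterion="gini", splitter="best", max_depth=6),
    "RFC": RandomForestClassifier(criterion="gini", n_estimators=100, bootstrap=True),
    # In scikit-learn >= 1.2 the argument is `estimator` instead of `base_estimator`.
    "ADB": AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=500,
                              learning_rate=1.0, algorithm="SAMME"),
    "LR":  LogisticRegression(class_weight="balanced", solver="lbfgs", max_iter=100,
                              tol=1e-4, penalty="l2"),
    "ET":  ExtraTreesClassifier(n_estimators=250, bootstrap=False),
    "GNB": GaussianNB(var_smoothing=1e-9),
    "BNB": BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True),
    "BG":  BaggingClassifier(n_estimators=250, bootstrap=True),
}
```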
References
1. Keys, A.; Fidanza, F.; Karvonen, M.J.; Kimura, N.; Taylor, H.L. Indices of relative weight and obesity. J. Chronic Dis.; 1972; 25, pp. 329-343. [DOI: https://dx.doi.org/10.1016/0021-9681(72)90027-6]
2. Spanish Ministry of Health (Ministerio de Sanidad, Consumo y Bienestar Social). Encuesta Nacional de Salud. España 2017. Available online: https://www.mscbs.gob.es/estadEstudios/estadisticas/encuestaNacional/encuestaNac2017/ENSE2017_notatecnica.pdf (accessed on 15 January 2021).
3. World Health Organization. Obesity: Preventing and Managing the Global Epidemic; World Health Organization: Geneva, Switzerland, 2000; 252p.
4. Khatib, W.; Fleming, P.J. The Stud GA: A mini revolution? Proceedings of the Parallel Problem Solving from Nature—PPSN V; Amsterdam, The Netherlands, 27–30 September 1998; Eiben, A.E.; Bäck, T.; Schoenauer, M.; Schwefel, H.P. Springer: Berlin/Heidelberg, Germany, 1998; pp. 683-691.
5. El Naqa, I.; Murphy, M.J. What is machine learning? Machine Learning in Radiation Oncology; Springer: Berlin/Heidelberg, Germany, 2015; pp. 3-11.
6. De Prado, M.L. Advances in Financial Machine Learning; John Wiley & Sons: Hoboken, NJ, USA, 2018.
7. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4.
8. Braga-Neto, U. Fundamentals of Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2020.
9. Kononenko, I. Machine learning for medical diagnosis: History, state of the art and perspective. Artif. Intell. Med.; 2001; 23, pp. 89-109. [DOI: https://dx.doi.org/10.1016/S0933-3657(01)00077-X]
10. Ahsan, M.M.; Luna, S.A.; Siddique, Z. Machine-Learning-Based Disease Diagnosis: A Comprehensive Review. Healthcare; 2022; 10, 541. [DOI: https://dx.doi.org/10.3390/healthcare10030541] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35327018]
11. Pirgazi, J.; Alimoradi, M.; Abharian, T.E.; Olyaee, M.H. An Efficient hybrid filter-wrapper metaheuristic-based gene selection method for high dimensional datasets. Sci. Rep.; 2019; 9, 18580. [DOI: https://dx.doi.org/10.1038/s41598-019-54987-1] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31819106]
12. Chandrashekar, G.; Sahin, F. A survey on feature-selection methods. Comput. Electr. Eng.; 2014; 40, pp. 16-28. [DOI: https://dx.doi.org/10.1016/j.compeleceng.2013.11.024]
13. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res.; 2003; 3, pp. 1157-1182.
14. Reunanen, J. Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res.; 2003; 3, pp. 1371-1382.
15. Pudil, P.; Novovičová, J.; Kittler, J. Floating search methods in feature selection. Pattern Recognit. Lett.; 1994; 15, pp. 1119-1125. [DOI: https://dx.doi.org/10.1016/0167-8655(94)90127-9]
16. Misra, P.; Yadav, A.S. Improving the classification accuracy using recursive feature elimination with cross-validation. Int. J. Emerg. Technol.; 2020; 11, pp. 659-665.
17. Kumar, G.R.; Ramachandra, G.; Nagamani, K. An efficient feature selection system to integrating SVM with genetic algorithm for large medical datasets. Int. J.; 2014; 4, pp. 272-277.
18. Reddon, H.; Gerstein, H.C.; Engert, J.C.; Mohan, V.; Bosch, J.; Desai, D.; Bailey, S.D.; Diaz, R.; Yusuf, S.; Anand, S.S. et al. Physical activity and genetic predisposition to obesity in a multiethnic longitudinal study. Sci. Rep.; 2016; 6, 18672. [DOI: https://dx.doi.org/10.1038/srep18672] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26727462]
19. Chatterjee, A.; Gerdes, M.W.; Martinez, S.G. Identification of Risk Factors Associated with Obesity and Overweight—A Machine Learning Overview. Sensors; 2020; 20, 2734. [DOI: https://dx.doi.org/10.3390/s20092734] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32403349]
20. Muhamad Adnan, M.H.B.; Husain, W.; Abdul Rashid, N. A hybrid approach using Naïve Bayes and Genetic Algorithm for childhood obesity prediction. Proceedings of the 2012 International Conference on Computer Information Science (ICCIS); Chongqing, China, 17–19 August 2012; Volume 1, pp. 281-285. [DOI: https://dx.doi.org/10.1109/ICCISci.2012.6297254]
21. Mirjalili, S. Genetic algorithm. Evolutionary Algorithms and Neural Networks; Springer: Berlin/Heidelberg, Germany, 2019; pp. 43-55.
22. Affenzeller, M.; Winkler, S.; Wagner, S.; Beham, A. Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications; Chapman and Hall/CRC Publishers: London, UK, 2009.
23. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. Proceedings of the 14th International Joint Conference on Artificial Intelligence, IJCAI’95; Montreal, QC, Canada, 20–25 August 1995; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1995; Volume 2, pp. 1137-1143.
24. Rao, R.; Fung, G. On the Dangers of Cross-Validation. An Experimental Evaluation. Proceedings of the 2008 SIAM International Conference on Data Mining; Atlanta, GA, USA, 24–26 April 2008; pp. 588-596. [DOI: https://dx.doi.org/10.1137/1.9781611972788.54]
25. Miller, B.L.; Goldberg, D.E. Genetic algorithms, tournament selection, and the effects of noise. Complex Syst.; 1995; 9, pp. 193-212.
26. Bäck, T. Selective Pressure in Evolutionary Algorithms: A Characterization of Selection Mechanisms. Proceedings of the First IEEE Conference on Evolutionary Computation; Orlando, FL, USA, 27–29 June 1994; pp. 57-62.
27. Jolly, K. Machine Learning with Scikit-Learn Quick Start Guide: Classification, Regression, and Clustering Techniques in Python; Packt Publishing Ltd.: Birmingham, UK, 2018.
28. Friedman, J.H. Stochastic Gradient Boosting. Comput. Stat. Data Anal.; 2002; 38, pp. 367-378. [DOI: https://dx.doi.org/10.1016/S0167-9473(01)00065-2]
29. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining; San Francisco, CA, USA, 13–17 August 2016; pp. 785-794.
30. Myles, A.J.; Feudale, R.N.; Liu, Y.; Woody, N.A.; Brown, S.D. An introduction to Decision Tree modeling. J. Chemom. J. Chemom. Soc.; 2004; 18, pp. 275-285. [DOI: https://dx.doi.org/10.1002/cem.873]
31. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30; Guyon, I.; Luxburg, U.V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; Garnett, R. Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4765-4774.
32. Derrac, J.; García, S.; Molina, D.; Herrera, F. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput.; 2011; 1, pp. 3-18. [DOI: https://dx.doi.org/10.1016/j.swevo.2011.02.002]
33. Eisinga, R.; Heskes, T.; Pelzer, B.; Te Grotenhuis, M. Exact p-values for pairwise comparison of Friedman rank sums, with application to comparing classifiers. BMC Bioinform.; 2017; 18, 68. [DOI: https://dx.doi.org/10.1186/s12859-017-1486-2] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28122501]
34. Chen, H.; Jiang, W.; Li, C.; Li, R. A Heuristic Feature Selection Approach for Text Categorization by Using Chaos Optimization and Genetic Algorithm. Math. Probl. Eng.; 2013; 2013, pp. 1-6. [DOI: https://dx.doi.org/10.1155/2013/524017]
35. Malhotra, R.; Khanna, M. Dynamic selection of fitness function for software change prediction using Particle Swarm Optimization. Inf. Softw. Technol.; 2019; 112, pp. 51-67. [DOI: https://dx.doi.org/10.1016/j.infsof.2019.04.007]
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
In this paper, we experimented with a set of machine-learning classifiers for predicting the risk of a person being overweight or obese, taking into account his/her dietary habits and socioeconomic information. We investigate ten different machine-learning algorithms combined with four feature-selection strategies (two evolutionary feature-selection methods, one feature-selection method from the literature, and no feature selection). We tackle the problem under a binary classification approach with evolutionary feature selection. In particular, we use a genetic algorithm to select the set of variables (features) that optimizes the accuracy of the classifiers. As an additional contribution, we designed a variant of the Stud GA, a particular structure of the selection operator where a reduced set of elitist solutions dominates the process. The genetic algorithm uses a direct binary encoding, allowing a more efficient evaluation of the individuals. We use a dataset with information from more than 1170 people in the Spanish Region of Madrid. Both evolutionary and classical feature-selection methods were successfully applied to Gradient Boosting and Decision Tree algorithms, reaching accuracy values of up to 79% and increasing the average accuracy by two percentage points, respectively.
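To make the evolutionary feature-selection idea concrete, the sketch below encodes each individual as a 0/1 mask over the candidate variables and uses cross-validated accuracy of a classifier as fitness. It is a toy illustration with plain binary tournament selection on synthetic data, not the authors' Stud-GA implementation; all names and settings are assumptions.

```python
# Toy sketch of binary-encoded GA feature selection: an individual is a 0/1
# mask over the variables; fitness is cross-validated accuracy on the subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
n_feat, pop_size, generations = X.shape[1], 30, 20

def fitness(mask):
    """Cross-validated accuracy of a Decision Tree on the selected variables."""
    if mask.sum() == 0:
        return 0.0
    clf = DecisionTreeClassifier(max_depth=6, random_state=0)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=5).mean()

# Random initial population of binary feature masks.
pop = rng.integers(0, 2, size=(pop_size, n_feat))
fit = np.array([fitness(ind) for ind in pop])

for _ in range(generations):
    children = []
    for _ in range(pop_size):
        # Binary tournament selection of two parents.
        a = rng.choice(pop_size, 2, replace=False)
        b = rng.choice(pop_size, 2, replace=False)
        p1, p2 = pop[a[np.argmax(fit[a])]], pop[b[np.argmax(fit[b])]]
        # Uniform crossover followed by per-bit mutation.
        child = np.where(rng.random(n_feat) < 0.5, p1, p2)
        flip = rng.random(n_feat) < 1.0 / n_feat
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)
    fit = np.array([fitness(ind) for ind in pop])

best = pop[np.argmax(fit)]
print("selected variables:", np.flatnonzero(best), "CV accuracy:", round(fit.max(), 4))
```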
1 Computer Architecture and Automation Department, Faculty of Computer Science, Universidad Complutense de Madrid, 28040 Madrid, Spain
2 Department of Medicine, Faculty of Medicine, Universidad Complutense de Madrid, 28040 Madrid, Spain
3 Public Health and Maternal and Child Health Department, Faculty of Medicine, Universidad Complutense de Madrid, 28040 Madrid, Spain
4 Department of Physiology, Faculty of Medicine, Universidad Complutense de Madrid, 28040 Madrid, Spain