1. Introduction
Uric acid is a widely studied biomarker due to its association with various health conditions, including gout, hypertension, and cardiovascular diseases. When uric acid levels exceed the reference range in the blood, the condition is termed hyperuricemia (HU). This condition can be a prevalent metabolic disorder. HU not only increases the risk of developing gout but also serves as a significant risk factor for cardiovascular diseases, which are the leading cause of death in Mexico and worldwide. Monitoring uric acid levels is crucial because it allows for the early detection of HU, helping to prevent the onset of gout and reduce the risk of kidney stones and cardiovascular complications [1,2].
The application of machine learning techniques has significantly influenced the detection and identification of elevated uric acid levels. By analyzing large datasets containing clinical and biochemical parameters, machine learning models were able to predict the likelihood of hyperuricemia with greater accuracy than traditional methods [3,4].
In a study led by Sangwoo Lee et al. [5], key performance metrics were used to evaluate the effectiveness of various machine learning algorithms in predicting HU. The Naïve Bayes algorithm achieved the highest sensitivity at 0.73, making it the most sensitive in detecting elevated uric acid levels. The Random Forest also performed well, with a sensitivity of 0.66 and the highest Balanced Classification Rate at 0.68. Additionally, RFC demonstrated superior performance in overall prediction accuracy with an Area Under the Curve of 0.775, compared to 0.669 for Naive Bayes and 0.568 for logistic regression models. Likewise, in a study by Zeng et al. [6], a stacked multimodal machine learning model was developed to predict HU by integrating genetic and clinical data. The model achieved high accuracy, with an AUC of 0.859 in the training set, effectively stratifying individuals into high and low risk groups for gout and other metabolic disorders. Additionally, the study found that lifestyle changes could mitigate adverse outcomes in high-risk individuals.
Conversely, another study by Sampa et al. [7] aimed to develop a predictive model for blood uric acid levels using data from health checkups, nutritional information, and socio-demographic characteristics to apply machine learning regression algorithms, including Boosted DTR, Decision Forest Regression, Neural Network, Bayesian Linear Regression, and Linear Regression. Among these, the Boosted DTR model demonstrated the best performance, achieving a notably low RMSE of 0.03.
These studies, along with others in the field [8,9,10], underscore the potential of machine learning approaches in accurately predicting uric acid levels. This evidence justifies the application of machine learning for early intervention and cost-effective management of related health conditions. While classification algorithms are typically used to categorize patients into groups or classes, such as normal or elevated uric acid levels, regression algorithms are particularly useful when predicting the exact value of uric acid concentration. In the context of the Tlalpan 2020 project, an analysis was conducted to develop a predictive model for identifying elevated uric acid levels using machine learning regression algorithms, including Boosted DTR, XGBoost, CatBoost, and SHAP. The approach involves applying these algorithms to identify the most relevant features influencing uric acid levels. While previous research has demonstrated performance variations across different machine learning models in metabolic studies, this study adopted an exploratory approach to compare multiple algorithms and identify the most effective predictors of uric acid levels. Given the variability in machine learning model performance across different datasets and study populations, this study did not assume a predefined hypothesis regarding the superiority of any particular model. Instead, an empirical evaluation was conducted to determine which approach performed best under the conditions of the dataset used. After each algorithm generates its respective subset of important features, a cross-feature analysis was performed to combine the top features from each algorithm. This process aimed to enhance the model’s performance by leveraging the strengths of multiple feature sets, leading to more accurate predictions.
2. Materials and Methods
2.1. Data
This study used baseline data from the Tlalpan 2020 cohort at the National Institute of Cardiology Ignacio Chávez in Mexico City [11]. Although the Tlalpan 2020 cohort was designed as a longitudinal study, this analysis focused on baseline data, as it provided the most complete and consistent set of variables across participants. This cross-sectional approach allowed us to explore initial associations between uric acid levels, nutritional intake, and lifestyle factors before assessing long-term changes. The study was approved by the institution’s Bioethics Committee under code 13-802, and the research involved informed and consenting volunteers. The Tlalpan 2020 project was established to identify risk factors for hypertension and other cardiometabolic conditions. Participants, aged 20 to 50 years and clinically healthy at recruitment, have been followed biennially since 2014. The dataset includes clinical, anthropometric data, alcohol and tobacco use, physical activity, economic income, education, family health history, biomedical evaluations, sleep quality, and nutrient intake.
Sleep quality was measured using the MOS Sleep Scale [12], which evaluates sleep through 12 questions on interruptions, snoring, breathing difficulties, headaches on waking, overall sleep quality, and daytime drowsiness, and tracks total sleep hours over the previous four weeks. Clinical and anthropometric data, such as systolic and diastolic blood pressure (SBP and DBP), were collected following the procedure outlined in [13], with waist circumference, height, and weight measured according to ISAK guidelines [14]. Biochemical assessments were conducted using blood and urine samples to analyze various metabolic and lipid markers. Data on lifestyle factors, including smoking, alcohol intake, and physical activity, were gathered, with physical activity levels assessed using the International Physical Activity Questionnaire [15] and categorized as low, moderate, or high based on metabolic equivalent minutes per week. Educational attainment was recorded as primary, secondary, or tertiary (including postgraduate) levels, and economic income levels were segmented into low, medium, and high categories based on monthly income in Mexican pesos. Nutritional habits, including daily meal frequency and nutrient intake over the past year, were tracked and analyzed using the “Evaluation of Nutritional Habits and Nutrient Consumption System” software tool Version 2 [16].
All data mentioned in this section are presented in Table S1, which is available in Supplementary Materials, where the letter D stands for dichotomous and the letter C stands for continuous.
2.2. Methods
2.2.1. Machine Learning Approaches
We applied four machine learning algorithms in this study: XGBoost, Boosted DTR, CatBoost, and SHAP. XGBoost was developed by Tianqi Chen [17], while Boosted DTR was pioneered by Robert E. Schapire and Yoav Freund [18,19]. CatBoost was created by Yandex [20]. Finally, SHAP, introduced by Lundberg and Lee [21], was used for feature selection due to its proven effectiveness in diverse fields such as healthcare.
XGBoost is known for its speed and performance on large datasets, Boosted DTR combines multiple decision trees for a robust predictive model, CatBoost stands out for its remarkable efficiency and performance, particularly excelling in regression problems. SHAP is employed for feature selection due to its demonstrated effectiveness across diverse domains, such as healthcare [22]. It offers valuable insights into the contribution of each feature to the model’s predictions.
2.2.2. Feature and Cross-Feature Selection
Feature selection is crucial in machine learning for enhancing model efficiency and accuracy, particularly in regression algorithms. It helps identify key predictors, reduces overfitting, and improves computational efficiency. After identifying these key features, cross-feature selection can be employed to further refine the model by evaluating the interactions between features across multiple datasets or folds. This process evaluates the interactions between features from multiple datasets or models, which is essential for uncovering complex relationships between variables.
Although cross-feature selection offers a promising approach to uncovering complex interactions between variables in predictive models, its adoption remains relatively limited compared to more conventional methods. However, the cross-feature selection method has shown significant results in studies related to cancer and various diseases [23]. These investigations have demonstrated that this approach can enhance the accuracy of predictive models and facilitate the identification of complex interactions that other conventional methods might overlook.
In this study, cross-feature selection was applied as follows:
Feature subsets (in this case 3), F_1, F_2, and F_3, were derived from the three best performing models. The top 5 features from each subset were then selected for further analysis. Finally, the common features among F_1, F_2, and F_3 were identified to enhance the model’s robustness, as represented by the following equation:
(1)
This approach ensured that the final model incorporated the most influential features from each of the top-performing models, improving its overall performance.
Overfitting was controlled by performing cross-validation and hyperparameter tuning to ensure model generalization. Additionally, feature selection was based on multiple algorithms (Boosted DTR, CatBoost, SHAP, and XGBoost), reducing irrelevant or redundant variables while preserving the most informative ones. Although some selected features may be intercorrelated, machine learning models such as XGBoost and CatBoost can capture non-linear relationships and interactions among predictors, which conventional statistical models may overlook.
2.2.3. Performance Measures
We used the metrics MSE, RMSE, and , which are employed in regression machine learning algorithms, to assess model performance.
The MSE is calculated as the sum of the squares of the differences between the model predictions () and the actual values (y), divided by the number of samples (n), expressed by the equation:
(2)
The RMSE, the square root of the MSE, is obtained by taking the square root of the MSE value, providing an error metric on the same scale as the original data, facilitating the interpretation of the prediction error.
On the other hand, is calculated as the proportion of variance explained by the model relative to the total variance of the data, represented by the equation:
(3)
2.3. Experimental Setup and Process
All experiments were performed using Python 3.11 and Jupyter Notebook version 7. Min–max scaling from the scikit-learn library [24] was used to normalize continuous variables, and dichotomous variables were represented as numbers.
The overall experimental process is illustrated in Figure 1. It began with the extraction of relevant variables, followed by data cleaning and normalization using the min–max method. Once the dataset was prepared, various models were executed, specifically XGBoost, Boosted DTR, SHAP, and CatBoost. During the training process, parameter tuning was performed to optimize the performance of each model. Subsequently, feature selection was applied to identify the most relevant variables for each algorithm, grouping them into specific subsets.
With the subsets defined, the performance of each algorithm was analyzed using the MSE, RMSE, and metrics. To implement the cross-feature selection method, the five most important features from each subset were selected, considering the corresponding group and sex. These five variables were then combined using the XGBoost algorithm, which was chosen for its high performance in regression problems, computational efficiency, compatibility with SHAP, and control over overfitting.
After obtaining the cross-feature selection results and confirming that the metrics were higher than those of the other algorithms used, the most important variables from each subgroup were presented in a spiderweb graph.
3. Results
In this section, we present the results derived from applying Boosted DTR, CatBoost, SHAP, and XGBoost to the dataset, which was initially divided into the following groups: clinical, anthropometric, and lifestyle variables; food frequency questionnaire data; and nutritional and biochemical features obtained from the food frequency questionnaire. The results were segmented by sex, dividing the analysis between men and women. This separation is crucial to identify the factors that influence each group differently, considering sex as a variable that can modify the relationship between risk factors and health outcomes. This approach allowed for a more precise and nuanced understanding of how different factors specifically impact the health of men and women.
3.1. Clinical, Anthropometric, and Lifestyle Variables
Table 1 illustrates how the variables were identified and evaluated based on their relative importance in predicting health outcomes in women. Algorithms utilizing the variable importance metric, such as Boosted DTR and CatBoost, highlighted Body Mass Index (BMI), triglycerides, and creatinine as the most influential factors. On the other hand, SHAP and XGBoost used the mean absolute SHAP value to measure feature contribution, with BMI, triglycerides, and creatinine also standing out as key contributors. Moreover, the subset of features identified by XGBoost shows some unique differences compared to the other models. Variables such as passive smoking, having had a baby weighing more than 4 kg (BB4K), mother with diabetes, and snoring were highlighted by XGBoost as important factors. These features, particularly those related to lifestyle and family health history, suggest that XGBoost is capturing nuances in health outcomes that might be less prominent in the other models. This indicates that, in addition to clinical and anthropometric factors, elements of personal and familial health history play a significant role in predicting health outcomes in women, as identified by XGBoost.
For women, Table 2 shows the performance metrics obtained by each algorithm (CatBoost, Boosted DTR, SHAP, and XGBoost), when applied to the subset of Clinical, Anthropometric, and Lifestyle Features. Each algorithm identifies different sets of important features, and these features were used to evaluate model performance through metrics such as MSE, RMSE, and .
Based on the performance metrics, the top three algorithms for predicting uric acid levels in women were XGBoost, SHAP, and CatBoost. XGBoost stood out as the best performer with the lowest MSE (0.0079) and RMSE (0.0890), and the highest value (0.3170), indicating both superior predictive accuracy and the ability to explain the most variance in the data. SHAP followed closely, with strong predictive accuracy reflected in its low MSE (0.0068) and RMSE (0.0824), although its (0.1688) was slightly lower, meaning it explained less variance compared to XGBoost. CatBoost, while having the lowest MSE (0.0063) and RMSE (0.0793), ranked third overall due to its value of 0.2315, placing it between XGBoost and SHAP in terms of variance explanation. Boosted DTR performed reasonably well, with an MSE of 0.0087 and an RMSE of 0.0930, indicating it was slightly less accurate than XGBoost, SHAP, and CatBoost.
In the case of men, Table 3 illustrates a similar approach highlighting BMI, waist circumference, creatinine, atherogenic index, triglycerides, and being a daily smoker as the most critical factors. However, SHAP and XGBoost revealed additional nuances in the subset of features important for men’s health. For instance, XGBoost identified passive smoking, father with gout, mother with diabetes, and alcoholic beverages such as whiskey and cognac consumption as significant contributors. Moreover, the presence of high cholesterol and the type of health insurance (IMSS) were also emphasized. These findings suggest that, similar to women, a mix of clinical, anthropometric, and lifestyle factors plays a vital role in determining health outcomes in men. Yet, XGBoost’s ability to capture lifestyle and family health history nuances offers a deeper understanding of how these elements uniquely impact men’s health outcomes.
In terms of performance metrics (see Table 4), the top three algorithms based on performance metrics were XGBoost, Boosted DTR, and SHAP. XGBoost led with the highest value of 0.1996, although it had a slightly higher MSE (0.0137) and RMSE (0.1072) compared to the other models. Boosted DTR followed closely with a lower MSE (0.0111) and RMSE (0.1054), and an of 0.1994. SHAP was very similar to Boosted DTR, with identical MSE and RMSE values but a slightly lower of 0.1987.
3.2. Food Frequency Questionnaire Features
Table 5 presents the subset of features derived from the food frequency questionnaire for men. In this case, the Boosted DTR algorithm highlighted foods and beverages that could have a significant impact on uric acid levels.
In the case of CatBoost, medium-sized cola was again highlighted, along with corn tortilla and a portion of pork. Additionally, chocolate powder, white sliced bread, atole with milk, and a bowl of oatmeal were emphasized as key elements in the diet that could contribute to increased uric acid levels.
The SHAP analysis also underscored the importance of a medium-sized cola and a cup of atole with milk. It is interesting to note that banana consistently appeared in all three algorithms. Although generally considered healthy, this fruit is rich in fructose, which could justify its appearance as a predictor in these models, given that fructose can contribute to increased uric acid levels. Additionally, SHAP highlights foods similar to those identified by other algorithms such as margarine, a portion of pork, chorizo, and sweet bread, which are integral parts of the diet and may influence uric acid levels.
Table 6 presents the performance metrics, where Boosted DTR and CatBoost showed very similar performance, both with an MSE of 0.0111 and RMSE values of 0.1051 and 0.1052, respectively. Boosted DTR had a slightly higher value (0.0767) compared to CatBoost (0.0763), indicating that it explained marginally more variance, making it the better of the two. SHAP outperformed both in terms of predictive accuracy, as evidenced by its lower MSE (0.0083) and RMSE (0.0913). However, SHAP’s value was significantly lower at 0.0187, suggesting it explained much less variance in the data, potentially indicating that while SHAP is good at making accurate predictions, it may not capture the underlying patterns as effectively as the other models.
Concerning women, several nutritional risk factors were associated with elevated uric acid levels. A medium-sized cola was particularly concerning due to its high sugar content. Hard liquor, such as rum, brandy, or tequila, was identified as a risk factor for increasing uric acid levels. Carnitas, a dish high in purines due to its pork content, was another noteworthy risk factor. Rice dishes, particularly those with a high glycemic index, can elevate blood glucose levels, potentially impacting uric acid metabolism negatively.
Sugary drinks, like sodas and other sweetened beverages, were also flagged as a concern. Butter is traditionally considered beneficial due to its healthy fats; however, when consumed in combination with high-carbohydrate foods or if sourced from less healthy origins, it can contribute to inflammation and exacerbate uric acid levels. The consumption of safflower oil, which is high in omega-6 fatty acids, could be problematic if consumed in excess, as it may promote inflammation, a factor associated with higher uric acid levels. Lastly, wheat tortillas were identified as a risk factor for increased uric acid levels.
Table 7 shows the performance metrics for CatBoost, Boosted DTR, and SHAP on the food frequency questionnaire features for women. CatBoost and Boosted DTR performed similarly, both with an MSE of 0.0111 and RMSE around 0.105, and their values were close, with Boosted DTR slightly ahead. SHAP had a lower MSE (0.0083) and RMSE (0.0913), indicating better accuracy, but its was much lower (0.0187), meaning it explained less variance in the data. Overall, CatBoost and Boosted DTR offered a better balance between accuracy and variance explanation.
3.3. Nutritional and Biochemical Features
Table 8 presents a subset of nutritional and biochemical features identified in both women and men, using only the CatBoost and SHAP algorithms. These algorithms were chosen due to their outstanding performance with this dataset.
For women, Table 8 shows that CatBoost highlighted glucose as the most important feature, followed by carbohydrates and alcohol. Other important features included maltose, fructose, and cholesterol, which further emphasized the metabolic and nutritional components potentially affecting uric acid levels. SHAP also prioritized glucose, consistent with CatBoost, and additionally underscored the significance of beta-tocopherol and beta-carotene. Alcohol and cholesterol were also highlighted by SHAP, reinforcing the importance of nutritional and lifestyle factors.
Regarding the metrics obtained by the nutrient algorithms in women, SHAP stood out with the lowest error metrics (MSE of 0.0077 and RMSE of 0.0880), indicating it was the most accurate model. However, CatBoost had a higher value of 0.1139, suggesting it explained more variance in the data. While SHAP was more accurate in its predictions, CatBoost provided a broader understanding of the data variance, making it a strong contender for modeling nutritional and biochemical features in women.
In the case of men, Table 8 shows that CatBoost identified fructose as the most influential feature, followed closely by maltose and glucose. Additional significant features included carbohydrates, lactose, and alcohol, indicating that a combination of sugars and alcohol consumption may be pivotal in understanding men’s uric acid metabolism.
SHAP similarly highlighted fructose and glucose, both sugars strongly linked to increased uric acid levels. Additionally, it emphasized cholesterol, whose association with lipids and cardiovascular health may indicate a connection with HU. Lactose and carbohydrates were also key features identified by SHAP. Additionally, maltose was highlighted, further emphasizing the role of specific sugars in influencing uric acid levels.
Based on Table 9 for men, the SHAP model demonstrated a better ability to explain the variance in the data with an of 0.1345, despite having slightly higher error metrics (MSE of 0.0135 and RMSE of 0.1160) compared to CatBoost. CatBoost, with an of 0.0654, showed a lower capacity to explain the variance, but it had slightly better error metrics (MSE of 0.0130 and RMSE of 0.1139). Overall, SHAP provided a better understanding of the underlying data patterns, while CatBoost offered marginally better accuracy in predictions.
3.4. Results of Cross-Feature Selection
Once the feature subsets were obtained from the XGBoost, CatBoost, Boosted DTR, and SHAP models, it was observed that some models had similar metric results. In the case of the anthropometric variables group for both men and women, XGBoost identified different variables compared to Boosted DTR, CatBoost, and SHAP, which captured more variables related to lifestyle, habits, and family medical history. For women, in Table 2, XGBoost stood out as the best algorithm, showing superior metrics in terms of precision and data variability. However, the subset of features selected by XGBoost (see Table 1) excluded important variables such as triglycerides, creatinine, weight, the atherogenic index, and other potentially significant variables. For this reason, the cross-feature selection method was applied, considering the top five variables from each subset obtained by each algorithm in each group, segmented by sex.
For women, the clinical, anthropometric, and lifestyle variables group was analyzed using the CatBoost, SHAP, and XGBoost algorithms, as they demonstrated the highest performance. After removing duplicate features, the top 10 unique variables identified were triglycerides, BMI, creatinine, atherogenic index, urinary sodium, mother with diabetes, passive smoker, baby birth weight over 4 kg, snoring, and glucose. Similarly, for men, the same algorithms were selected based on their performance, and the top features identified were creatinine, BMI, atherogenic index, triglycerides, urinary sodium, father with diabetes, passive smoker, father with gout, mother with diabetes, and weight.
In the food frequency group, data for women were analyzed using the Boosted DTR, CatBoost, and SHAP algorithms, as they demonstrated the highest performance. The top five variables identified were a medium cola soda, a teaspoon of butter, a drink (rum, brandy, or tequila), a teaspoon of hot sauce or chili, a portion of carnitas, a glass of sugary flavored water, a cup of coffee without sugar, a teaspoon of margarine, a bowl with rice, and safflower oil. Likewise, the essential features identified for men were a medium cola soda, a corn, safflower oil, a teaspoon of margarine, corn oil, a banana, a portion of pork, a teaspoon of chocolate powder, a cup of atole with milk, and chicken egg.
The nutritional and biochemical features for women were analyzed using the CatBoost and SHAP algorithms, which showed the best results (see Table 8). The top five variables identified included glucose, carbohydrates, alcohol, fructose, maltose, total cholesterol, starch, and beta cryptoxanthin. Similarly, the essential features identified for men were fructose, maltose, glucose, carbohydrates, lactose, total cholesterol, alcohol, and beta cryptoxanthin.
Table 10 presents the performance metrics (MSE, RMSE, and ) for the cross-feature method, applied using XGBoost to the clinical, anthropometric, lifestyle, food frequency questionnaire, and nutritional and biochemical subsets across both sexes. By combining the top five features from each subgroup, this method consistently outperformed individual algorithms, demonstrating superior predictive performance across all metrics and groups.
To effectively present the features used in the cross-feature selection for each group and sex, radar charts were created (see Figure 2, Figure 3 and Figure 4). Given that the scale of the importance method used by Boosted DTR and CatBoost differed from the MASV method applied by SHAP and XGBoost, it was necessary to implement a series of normalization steps. The min–max normalization technique was applied to scale the importance values within a range of [0, 1], adjusting the variables to a common scale. This ensured that the importance assigned by each algorithm was reflected in a coherent and comparable manner. After normalization, the data for each gender were merged, and the mean of the normalized values was calculated to obtain a representative importance score for each variable.
To further evaluate the stability of the cross-feature selection method, we performed a sensitivity analysis varying max_iter, max_depth, and l2_regularization. Our original analysis was conducted using XGBoost, CatBoost, and Boosted DTR; however, to further validate the robustness of our approach, we applied HistGradientBoostingRegressor for this sensitivity analysis. This method was selected due to its efficiency in handling structured tabular data and its ability to assess model stability under different hyperparameter settings. The results (Figure 5 and Figure 6) confirm that increasing max_iter did not significantly improve predictive performance. Additionally, deeper models tended to perform worse, likely due to overfitting and increased variance, which reduced generalization. These findings suggest that the cross-feature selection method maintained stable performance across different hyperparameter configurations, indicating robustness. The complete numerical results of the sensitivity analysis are provided in Supplementary Materials (Table S2).
4. Discussion
This study offers a comprehensive examination of the factors influencing uric acid levels in the Tlalpan 2020 cohort, with a particular focus on gender-specific differences. In this regard, it is relevant to notice that women constituted a smaller proportion of the cohort, which likely reduced the variability in their data, leading to more accurate predictions. Additionally, the predictors for women might not have carried as much weight, as some variables may not have been as representative or did not capture the full variability in the cohort. This could have resulted in more stable model performance but with a lesser predictive impact. To further improve the algorithms, it would be crucial to continue collecting more data, which could enhance the representation of key variables and ultimately improve the models’ predictive power. Utilizing machine learning models, including Boosted DTR, XGBoost, CatBoost, and SHAP, we identified critical variables across clinical, anthropometric, lifestyle, and nutritional domains. Additionally, the cross-feature selection method was employed to refine the model further by combining the top features identified by each algorithm. To effectively analyze and compare these variables, spiderweb charts were used, providing a clear visualization of the interactions between variables and enhancing the understanding of the predictive factors across different groups.
The degree of bias in cohort populations when generalizing the results observed in the Tlalpan 2020 study depends on several factors, particularly the representativeness of the participants relative to the general population. One key limitation is the age restriction, as all participants were younger than 50 years at baseline, which may prevent the findings from being extrapolated to older populations where the relationship between uric acid and health outcomes differs, such as the observed uric acid paradox in the elderly. Additionally, geographical and socioeconomic factors play a role, as the cohort consisted of residents of Mexico City, potentially limiting the applicability of the results to populations with different nutritional habits, genetic backgrounds, and environmental exposures. Selection bias is another concern, as individuals who chose to participate may have had distinct health statuses, lifestyles, or access to healthcare, making them non-representative of the broader population at risk of hyperuricemia-related conditions. To mitigate these biases, sensitivity analyses, external validation in diverse populations, and age-stratified modeling approaches may enhance the robustness and applicability of the findings.
4.1. Key Clinical, Anthropometric and Lifestyle Predictors
Figure 2 highlighted the different health, hereditary, and lifestyle factors contributing to elevated uric acid levels in men and women.
In men, BMI was a critical factor, indicating that higher body weight and obesity are closely linked to elevated uric acid levels, likely due to insulin resistance impairing uric acid excretion. Age was another important factor, with older men showing a higher risk of elevated uric acid levels due to declining renal function and increased comorbidities. Elevated creatinine, a marker of kidney function, suggested that impaired renal function was a key contributor to HU in men. Alcohol consumption, as indicated by the cognac factor, was notably associated with uric acid levels. Additionally, a hereditary factor such as having a father with gout was significant, highlighting a genetic predisposition to HU. These factors together suggest that managing weight, addressing metabolic health, moderating alcohol intake, and considering hereditary risk are vital strategies for controlling uric acid levels in men.
In women, BMI was a significant factor, emphasizing the role of obesity in the development of HU. Snoring also emerged as important, with poor sleep linked to increased stress and inflammation, both of which can elevate uric acid levels. Additionally, the atherogenic index was significant, as it reflected the relationship between lipid profiles and cardiovascular risk, with a higher index being associated with increased uric acid levels and a greater risk of cardiovascular disease. High cholesterol further complicates the picture, as dyslipidemia is associated with higher uric acid levels and cardiovascular risks. Hereditary factors also played a role, with a mother’s diabetes and a father’s diabetes both contributing to a genetic predisposition to metabolic conditions that may elevate uric acid levels. Finally, exposure to passive smoking was a significant lifestyle factor, as it could increase oxidative stress and inflammation, potentially exacerbating uric acid levels.
4.2. Food Frequency Questionnaires Factors
Figure 3 revealed significant nutritional patterns among men and women, highlighting specific food items strongly associated with the risk of elevated uric acid levels.
Specifically, the consumption of medium-sized cola in men and women was a significant nutritional factor influencing uric acid levels, aligning with existing evidence linking sugary beverages, particularly those high in fructose, to elevated uric acid [25,26]. Fructose metabolism in the liver is crucial in uric acid production, and frequent consumption of such beverages could exacerbate HU.
Figure 3 identified several other nutritional factors that were particularly relevant in men, including the consumption of corn tortillas, corn, pork, and tamales. Corn and corn-based products, such as tortillas, are staples in many diets but are high glycemic index foods, which can lead to rapid spikes in blood glucose and exacerbate insulin resistance, thereby reducing the body’s ability to effectively excrete uric acid. Pork, being rich in purines, directly contributes to increased uric acid levels, raising the risk of gout and related conditions. Tamales, which combine high glycemic corn masa and purine rich pork, represent a potent nutritional source of elevated uric acid levels [27], further compounded by the use of fats like lard in their preparation. These factors suggest that men who frequently consume these traditional foods might benefit from nutritional modifications, such as reducing the intake of high-purine and high-glycemic index foods and incorporating more low-glycemic vegetables, fruits, and lean proteins to manage uric acid levels effectively.
In women, several other nutritional factors identified were a teaspoon of butter, a portion of carnitas, and rice dishes, which are associated with elevated uric acid levels. Butter, while rich in beneficial fats and potentially healthful when consumed in moderation, can contribute to increased uric acid levels, especially when frequently combined with high-carbohydrate foods like rice [28]. Additionally, carnitas, a dish high in purines due to its pork content, directly contributes to uric acid production [27]. Therefore, while foods like butter may offer health benefits, their consumption alongside carbohydrate-heavy foods should be moderated to prevent exacerbating uric acid levels.
4.3. Nutritional and Biochemical Factors
This study provided an analysis of the nutritional and biochemical factors influencing uric acid levels, with a particular focus on gender differences. The findings revealed that while some factors, such as glucose and cholesterol, were common between men and women, others exhibited gender-specific influences, suggesting the need for personalized approaches in the prevention and management of HU.
Fructose was identified as a key factor in men, aligning with the existing literature that links high fructose consumption with increased uric acid production [29,30,31]. Through its metabolism in the liver, fructose leads to the generation of urate, thereby exacerbating uric acid levels in the blood. It is important to note, however, that the impact of fructose on uric acid levels can vary significantly depending on the source and type of fructose consumed. For instance, fructose found in whole fruits is accompanied by fiber, vitamins, and antioxidants, which may mitigate some of its negative effects. In contrast, the fructose in processed foods, such as commercial baked goods, is often present in a more concentrated form and combined with other refined sugars, leading to a more pronounced increase in glucose and urate levels.
In addition to fructose, maltose and lactose also emerged as notable risk factors, specifically in men. Maltose, commonly found in malted products and certain grains, contributes to uric acid elevation through mechanisms similar to those of fructose, by promoting rapid glucose spikes and potentially exacerbating insulin resistance, which can impair uric acid excretion. Similarly, lactose, primarily found in dairy products, was identified as a significant factor, suggesting that men who consume high amounts of dairy may also be at increased risk of elevated uric acid levels.
Understanding the type of fructose consumed by the population, whether simple or complex, is crucial for tailoring nutritional recommendations. Moreover, it is essential to educate the population about the maximum recommended intake of fructose, particularly from added sugars, which should ideally be limited to no more than 10% of daily caloric intake [32] according the World Health Organization. This distinction underscores the importance of not only reducing overall fructose intake but also making informed choices about the sources of fructose, which can significantly influence the risk of developing HU.
Cholesterol was another significant factor in men, consistent with studies in the literature [33,34] that suggest a complex interaction between lipid metabolism and uric acid. Elevated cholesterol levels can contribute to systemic inflammation and oxidative stress, both of which are linked to increased uric acid levels. This finding highlights the need to integrate cholesterol management as part of a broader approach to controlling HU in the male population.
In contrast, carbohydrates and glucose emerged as an influential factor in women. This observation is supported by studies in the literature that have demonstrated a strong connection between high carbohydrate intake, particularly from foods with a high glycemic index, and elevated blood glucose levels [35,36]. Increased glucose levels can lead to insulin resistance, a condition that diminishes the kidneys’ capacity to excrete uric acid efficiently. Research indicates that diets high in high glycemic carbohydrates not only cause significant blood glucose spikes but also aggravate conditions such as metabolic syndrome, which is closely associated with HU [37]. These findings highlight the critical role of glycemic control in nutritional interventions aimed at managing uric acid levels, particularly in women who may be more vulnerable to these metabolic effects [38].
Additionally, it is observed that women consume foods containing beta-tocopherol and beta-carotene, which are antioxidants known for their protective role against oxidative stress [39,40], a factor that contributes to the accumulation of uric acid. Vitamin E, particularly in its various forms including beta-tocopherol, and beta-carotene, as a precursor to vitamin A, also function as antioxidants. However, as seen in Figure 4, the intake of these antioxidants was relatively low. This suggests that although women are including these nutrients in their diet, their consumption may not be sufficient to fully counteract the negative effects of a high carbohydrate intake on uric acid levels.
While machine learning models such as XGBoost and CatBoost improve predictive accuracy, their complexity can limit interpretability. To address this, we leveraged SHAP values to quantify feature importance and enhance model transparency. In this study, we applied SHAP to identify relationships among the most relevant variables and explore sex-specific differences. Additionally, we generated dependency plots and clustering-based SHAP analyses to visualize distinct risk patterns, strengthening the interpretation of our results.
5. Conclusions
This study, framed within the Tlalpan 2020 project, demonstrated the effectiveness of using machine learning models to predict uric acid levels in a diverse cohort from Mexico City. Through the implementation of algorithms such as Boosted DTR, XGBoost, CatBoost, and SHAP, key variables were identified across different categories, including clinical, anthropometric, lifestyle, and nutritional characteristics.
The findings highlighted that XGBoost was particularly effective in identifying critical factors in the anthropometric and clinical variables, while CatBoost excelled in analyzing the food frequency questionnaire, and SHAP provided detailed insights into the complex interactions between nutrients and uric acid levels. These tools not only improved predictive accuracy but also enhanced our understanding of the factors contributing to HU.
Furthermore, the application of the cross-feature selection method, in combination with spiderweb charts for visualization, proved crucial in refining the model and ensuring that the most significant variables were consistently identified and compared across different groups and genders. This approach facilitated a more nuanced analysis of the interactions between variables, highlighting the importance of gender-specific risk factors and nutritional patterns in managing uric acid levels.
The use of feature engineering techniques resulted in significant improvements in performance metrics, validating the importance of careful variable selection in predictive modeling. Moreover, the findings underscore the potential of machine learning models not just for prediction but also for uncovering complex relationships within large datasets, offering insights that could inform targeted interventions.
While the implementation of public health policies is beyond the scope of this study, our findings provide insights that could support future initiatives focused on early detection and prevention of hyperuricemia. Additionally, efforts are being considered to develop predictive models that enhance risk assessment and aid in data-driven decision-making.
This integrated approach, which considers gender differences and demographic factors, provides a strong foundation for future research and the development of personalized interventions. Ultimately, this work contributes significantly to the understanding of hyperuricemia and its public health implications. It also sets a precedent for the application of machine learning techniques in data analysis. The study also suggests practical pathways for the integration of data analytics into public health strategies, particularly in the prevention and management of conditions related to hyperuricemia.
The public health implications of the study could be further strengthened by specifying how the predictive modeling of uric acid levels can inform targeted interventions and policy recommendations. For instance, if machine learning models accurately identify nutritional and metabolic predictors of hyperuricemia, these insights could be used to develop personalized nutritional guidelines or community-based prevention programs aimed at reducing the burden of related conditions such as gout, kidney disease, and cardiometabolic disorders. Additionally, integrating these predictive tools into clinical practice could help healthcare professionals identify at-risk individuals earlier and implement lifestyle modifications before the onset of complications. Policymakers could also leverage these findings to design educational campaigns or refine nutritional recommendations tailored to specific demographic groups. By explicitly outlining these applications, the authors could provide a clearer connection between their findings and actionable public health strategies.
The results of this study could have valuable applications in clinical settings, particularly in cardiology clinics, where hyperuricemia is increasingly recognized as a potential risk factor for cardiovascular disease. If machine learning models can accurately predict uric acid levels based on metabolic and nutritional factors, clinicians could integrate these tools into routine assessments to identify patients at higher risk of developing conditions such as hypertension, atherosclerosis, or heart failure. This predictive capability would allow for earlier interventions, including lifestyle modifications, nutritional counseling, and, when necessary, pharmacological strategies to manage uric acid levels proactively. Incorporating these findings into risk models could improve personalized treatment by tailoring recommendations to individual metabolic profiles. By bridging predictive analytics with clinical decision-making, this research has the potential to improve preventive care and optimize cardiovascular risk management.
Limitations
As every scientific research piece, the present work has some limitations; we can mention the following:
Cohort-specific limitations: The study was based on data from the Tlalpan 2020 cohort, which represents a specific geographic and demographic group (Mexico City). The findings may not be generalizable to populations with different ethnic, socioeconomic, or geographic characteristics. A broader dataset might provide more generalizable results.
Since the baseline participants of the Tlalpan 2020 cohort were younger than 50 years of age, the potential protective effect of uric acid observed in elderly populations (the uric acid paradox) might not be evident. Additionally, the non-linear relationship between uric acid and mortality within the 200–300 μmol/L range suggests that small variations in uric acid levels could have different health implications depending on the age and metabolic context of the population. This age-related effect should be considered when interpreting the predictive modeling outcomes, as the relationships observed in younger adults may not extrapolate directly to older populations.
Potential bias in self-reported data: some data, such as lifestyle habits and nutritional intake from the food frequency questionnaire, are self-reported, which can introduce recall bias or under-reporting.
Limited long-term data: This study focused on a snapshot of uric acid levels rather than on long-term changes. Uric acid concentrations can fluctuate daily and seasonally due to physiological and environmental factors; however, these variations were not captured in our cross-sectional analysis using baseline data. Additionally, individuals with a genetic predisposition to hyperuricemia might experience different fluctuation patterns, but this aspect was not assessed in our dataset.
Limited exploration of confounding factors: while gender and lifestyle factors were considered, other confounders, such as genetic predispositions or concurrent medication use, might also impact uric acid levels and should be more thoroughly explored.
Algorithm-specific limitations: Each machine learning algorithm has its limitations in terms of predictive accuracy and interpretability. For instance, models like CatBoost and XGBoost are powerful but complex, making it harder to interpret the relationships between features and outcomes. These limitations may hinder the clinical applicability of the findings.
L.M.A.-G.: investigation, project administration, supervision, writing—review and editing; M.F.M.-M.: investigation, project administration, writing—review and editing; M.M.-G.: investigation, supervision, data curation, writing—review and editing; M.B.-P.: investigation, project administration, writing—review and editing; E.H.-L.: investigation, supervision, project administration, supervision, writing—review and editing, G.G.-E.: conceptualization, data curation, investigation, software, validation, writing—original draft, writing—review and editing. All authors have read and agreed to the published version of the manuscript.
The study was conducted in accordance with the Declaration of Helsinki, and approved by the Bioethics Committee of the National Institute of Cardiology “Ignacio Chávez” in Mexico City (protocol ID 13-802 and date of approval 19 February 2013).
Informed consent was obtained from all subjects involved in the study.
The original contributions presented in the study are included in the article/
We want to extend our appreciation to Secretaría de Ciencia, Humanidades, Tecnología e Innovación (The Secretariat of Science, Humanities, Technology, and Innovation) under the ’Investigadores por México’ program, No. 7186. The authors gratefully acknowledge Maite Vallejo and the Tlalpan 2020 project advisory group for their logistic support in this work.
The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 2. Radar chart of the most important clinical, anthropometric, and lifestyle features for women and men as identified in Table 1 and Table 3.
Figure 3. Radar chart of the most important features from a food frequency questionnaires for women and men as identified in Table 5 and Table 11.
Figure 4. Radar chart of the most important nutritional and biochemical features for women and men as identified in Table 8.
Subset of clinical, anthropometric, and lifestyle features of women obtained through Boosted DTR, CatBoost, SHAP, and XGBoost.
Boosted DTR | CatBoost | SHAP | XGBoost | ||||
---|---|---|---|---|---|---|---|
Feature | Importance | Feature | Importance | Feature | MASV | Feature | MASV |
Body Mass Index | 0.0375 | Triglycerides | 7.2254 | Creatinine | 0.0138 | Mother with diabetes | 0.0073 |
Triglycerides | 0.0214 | Body Mass Index | 6.8841 | Body Mass Index | 0.0125 | Passive smoker | 0.0045 |
Weight | 0.0198 | Creatinine | 6.6062 | Triglycerides | 0.0109 | BB4K | 0.0044 |
HDL-cholesterol | 0.0140 | Atherogenic index | 5.4440 | Atherogenic index | 0.0093 | Snoring | 0.0040 |
Glucose | 0.0139 | Urinary sodium | 4.9646 | Glucose | 0.0088 | Triglycerides | 0.0038 |
Creatinine | 0.0137 | Glucose | 4.5729 | Urinary sodium | 0.0077 | Menstrual cycle | 0.0036 |
Atherogenic index | 0.0135 | Age | 4.3607 | Weight | 0.0065 | Smoke | 0.0033 |
MGMD | 0.0126 | Weight | 3.3334 | Age | 0.0064 | Father with gout | 0.0031 |
Waist circumference | 0.0120 | Waist circumference | 2.9811 | Iron | 0.0038 | ISSSTE | 0.0030 |
Iron | 0.0102 | Total cholesterol | 1.7219 | Waist circumference | 0.0033 | Father with hypertension | 0.0028 |
Urinary sodium | 0.0100 | Respiratory rate | 1.4676 | Heart rate | 0.0026 | Father with diabetes | 0.0028 |
Father with gout | 0.0100 | Heart rate | 1.3924 | Mother with diabetes | 0.0024 | Mother with dyslipidemia | 0.0027 |
Pregnancies | 0.0093 | Serum sodium | 1.3627 | Pregnancies | 0.0023 | Very low physical activity | 0.0025 |
BB4K | 0.0091 | Height | 1.2105 | Urine potassium | 0.0022 | Skilled worker | 0.0024 |
Father with obesity | 0.0091 | Mother with diabetes | 1.1261 | Complications in childbirth | 0.0021 | Mother with hypertension | 0.0023 |
BB4K: baby birth weight over 4 kg; MGMD: maternal grandmother diabetes; ISSSTE: Institute for Social Security and Services for State Workers.
Performance metrics for the subset of clinical, anthropometric, and lifestyle features for women identified by Boosted DTR, CatBoost, SHAP, and XGBoost.
Model | Parameters | Method | MSE | RMSE | |
---|---|---|---|---|---|
CatBoost | depth: 6 | Importance | 0.0063 | 0.0793 | 0.2315 |
DTR | learning_rate: 0.01, | Importance | 0.0087 | 0.0930 | 0.2768 |
SHAP | alpha: 1, | SHAP | 0.0068 | 0.0824 | 0.1688 |
XGBoost | alpha: 1, | MeanAbsShapValue | 0.0079 | 0.0890 | 0.3170 |
Subset of clinical, anthropometric, and lifestyle features of men obtained through Boosted DTR, CatBoost, SHAP, and XGBoost.
BOOSTED DTR | CATBOOST | SHAP | XGBoost | ||||
---|---|---|---|---|---|---|---|
Feature | Importance | Feature | Importance | Feature | MASV | Feature | MASV |
Body Mass Index | 0.0330 | Creatinine | 8.6576 | Body Mass Index | 0.0181 | Father with diabetes | 0.0103 |
Waist circumference | 0.0193 | Body Mass Index | 5.9170 | Creatinine | 0.0169 | Passive smoker | 0.0097 |
Creatinine | 0.0173 | Atherogenic index | 4.1839 | Atherogenic index | 0.0106 | Father with gout | 0.0077 |
Daily smoker | 0.0159 | Triglycerides | 3.8335 | Triglycerides | 0.0102 | Mother with diabetes | 0.0066 |
Atherogenic index | 0.0158 | Urinary sodium | 3.4797 | Weight | 0.0074 | IMSS | 0.0055 |
MGFHA | 0.0136 | Waist circumference | 3.2791 | Urinary sodium | 0.0068 | Whiskey | 0.0049 |
Currently smokes | 0.0132 | Age | 2.6792 | Age | 0.0063 | High cholesterol | 0.0048 |
Triglycerides | 0.0131 | HDL-Cholesterol | 2.3282 | Father with diabetes | 0.0062 | Cognac | 0.0044 |
Drunk | 0.0131 | Urine potassium | 2.1635 | Iron | 0.0052 | Father with obesity | 0.0038 |
Father with gout | 0.0120 | Weight | 1.8986 | Passive smoker | 0.0049 | Use of electric vehicle | 0.0036 |
Father with diabetes | 0.0117 | Height | 1.8661 | Waist circumference | 0.0047 | Biparental family | 0.0033 |
Passive smoker | 0.0110 | LDL-Cholesterol | 1.8497 | Urine potassium | 0.0045 | PGMD | 0.0032 |
Whiskey | 0.0109 | Total cholesterol | 1.7534 | Height | 0.0042 | MGMD | 0.0032 |
Prof | 0.0106 | Pulse pressure | 1.7148 | IMSS | 0.0034 | Drunk | 0.0031 |
MGFHA: maternal grandfather heart attack; Prof: professional academic level; IMSS: Mexican Social Security Institute; PGMD: paternal grandmother with diabetes; MGMD: maternal grandmother with diabetes.
Performance metrics for the subset of clinical, anthropometric, and lifestyle features for men identified by Boosted DTR, CatBoost, SHAP, and XGBoost.
Model | Parameters | Method | MSE | RMSE | |
---|---|---|---|---|---|
CatBoost | depth: 6, | Importance | 0.0115 | 0.1074 | 0.1687 |
DTR | learning_rate: 0.01, | Importance | 0.0111 | 0.1054 | 0.1994 |
SHAP | learning_rate: 0.1, | SHAP | 0.0111 | 0.1054 | 0.1987 |
XGBoost | alpha: 1, | MeanAbsShapValue | 0.0137 | 0.1072 | 0.1996 |
Subset of features from a food frequency questionnaire for men obtained through Boosted DTR, CatBoost and SHAP.
Boosted DTR | CatBoost | SHAP | |||
---|---|---|---|---|---|
Feature | Importance | Feature | Importance | Feature | MASV |
A medium cola soda | 0.0239 | A medium cola soda | 6.9284 | A medium cola soda | 0.0110 |
Corn | 0.0135 | A corn tortilla | 2.2745 | A banana | 0.0030 |
Safflower oil | 0.0134 | A banana | 2.1508 | A cup of atole with milk | 0.0025 |
A teaspoon of margarine | 0.0130 | A portion of pork | 2.0625 | A teaspoon of margarine | 0.0025 |
Corn oil | 0.0128 | A teaspoon of chocolate powder | 1.7368 | A portion of pork | 0.0023 |
A tamale | 0.0127 | A liver steak or chicken liver | 1.4353 | A piece of chorizo or sausage | 0.0022 |
A teaspoon of honey, jam or caramel | 0.0125 | A cup of atole with milk | 1.4117 | Chicken egg | 0.0022 |
A teaspoon of chocolate powder | 0.0122 | A bowl of oatmeal | 1.3965 | A piece of sweet bread | 0.0020 |
A portion of pork | 0.0121 | A bowl of sardines in tomato sauce | 1.2908 | A cup of beans | 0.0019 |
A banana | 0.0118 | A cup of coffee without sugar | 1.2908 | Half a cup of seafood | 0.0018 |
A bowl of oatmeal | 0.0117 | Olive oil | 1.2625 | Corn | 0.0017 |
A cup of atole with milk | 0.0117 | A slice of white bread | 1.2612 | A liver steak or chicken liver | 0.0017 |
A piece of chorizo or sausage | 0.0115 | Chicken egg | 1.2538 | Safflower oil | 0.0016 |
A portion of carnitas | 0.0115 | Wheat tortilla | 1.2270 | Pork skin | 0.0016 |
A teaspoon of vegetable shortening | 0.0115 | A fresh apple | 1.2131 | a teaspoon of chocolate powder | 0.0014 |
Performance metrics for the subset of food frequency questionnaire features for men.
Model | Parameters | Method | MSE | RMSE | |
---|---|---|---|---|---|
CatBoost | depth: 10, | Importance | 0.0129 | 0.1135 | 0.0714 |
Boosted_DTR | learning_rate: 0.05, | Importance | 0.0128 | 0.1133 | 0.0739 |
SHAP | alpha: 1, | SHAP | 0.0129 | 0.1136 | 0.0702 |
Performance metrics for the subset of food frequency questionnaire features for women.
Model | Parameters | Method | MSE | RMSE | |
---|---|---|---|---|---|
CatBoost | depth: 10, | Importance | 0.0111 | 0.1052 | 0.0763 |
DTR | learning_rate: 0.05, | Importance | 0.0111 | 0.1051 | 0.0767 |
SHAP | alpha: 1, | SHAP | 0.0083 | 0.0913 | 0.0187 |
Subset of nutritional and biochemical features for women and men obtained from a food frequency questionnaire via CatBoost and SHAP.
Women | Men | ||||||
---|---|---|---|---|---|---|---|
CatBoost | SHAP | CatBoost | SHAP | ||||
Feature | Importance | Feature | MASV | Feature | Importance | Feature | MASV |
Glucose | 3.0789 | Glucose | 0.0050 | Fructose | 2.4942 | Fructose | 0.0044 |
Carbohydrates | 2.8389 | Starch | 0.0033 | Maltose | 2.3573 | Cholesterol | 0.0039 |
Alcohol | 2.1420 | Alcohol | 0.0032 | Glucose | 2.3070 | Glucose | 0.0038 |
Fructose | 1.9236 | Fructose | 0.0030 | Carbohydrates | 1.9796 | Maltose | 0.0037 |
Maltose | 1.9102 | Beta cryptoxanthin | 0.0030 | Lactose | 1.9027 | Carbohydrates | 0.0036 |
Cholesterol | 1.8989 | Alcohol | 1.8901 | Maltose | 0.0034 | ||
Beta cryptoxanthin | 0.0020 |
Performance metrics for the subset of nutritional and biochemical features for women and men.
Sex | Model | Parameters | Method | MSE | RMSE | |
---|---|---|---|---|---|---|
Men | CatBoost | depth: 10, | Importance | 0.0130 | 0.1139 | 0.0654 |
Men | SHAP | alpha: 1, | SHAP | 0.0135 | 0.1160 | 0.1345 |
Women | CatBoost | depth: 10, | Importance | 0.0106 | 0.1030 | 0.1139 |
Women | SHAP | alpha: 1, | SHAP | 0.0077 | 0.0880 | 0.0523 |
Performance metrics of cross-feature method for groups in men and women.
Group | Sex | Parameters | MSE | RMSE | |
---|---|---|---|---|---|
Women | colsample_bytree: 0.9, | 0.0034 | 0.0582 | 0.5857 | |
Clinical, anthropometric | Men | colsample_bytree: 0.8, | 0.0093 | 0.0964 | 0.3302 |
Women | colsample_bytree: 0.9, | 0.0091 | 0.0951 | 0.2431 | |
Food frequency | Men | colsample_bytree: 0.8, | 0.0099 | 0.0995 | 0.2865 |
Women | colsample_bytree: 0.8, | 0.0039 | 0.0627 | 0.519 | |
Nutritional | Men | colsample_bytree: 0.8, | 0.0096 | 0.0979 | 0.3098 |
Subset of features from a food frequency questionnaire for women obtained through Boosted DTR, CatBoost and SHAP.
Boosted DTR | CatBoost | SHAP | |||
---|---|---|---|---|---|
Feature | Importance | Feature | Importance | Feature | MASV |
A medium cola soda | 0.0292 | A medium cola soda | 6.0152 | A drink (rum, brandy, or tequila) | 0.0078 |
A teaspoon of butter | 0.0170 | A portion of carnitas | 2.8352 | A medium cola soda | 0.0065 |
A drink (rum, brandy, or tequila) | 0.0165 | A glass of sugary flavored water | 2.0415 | A bowl with rice | 0.0039 |
A teaspoon of hot sauce or chili | 0.0150 | A cup of coffee without sugar | 1.9075 | Safflower oil | 0.0034 |
A portion of carnitas | 0.0141 | A teaspoon of margarine | 1.8767 | A glass of sugary flavored water | 0.0033 |
Pork skin | 0.0136 | A glass of whole milk | 1.7528 | Wheat tortilla | 0.0032 |
A glass of sugary flavored water | 0.0135 | A cup of yogurt or Bulgarian yogurt | 1.7215 | A corn | 0.0032 |
A corn tortilla | 0.0134 | Soy Oil | 1.6787 | A portion of carnitas | 0.0029 |
A taco with al pastor meat | 0.0132 | A portion of dried beans | 1.6536 | A beer | 0.0029 |
Corn oil | 0.0132 | A taco with al pastor meat | 1.6423 | A teaspoon of honey, jam or caramel | 0.0027 |
A bowl of pasta soup | 0.0130 | A corn tortilla | 1.6204 | A cup of yogurt or Bulgarian yogurt | 0.0025 |
Half a cup of seafood | 0.0129 | Olive oil | 1.6147 | A orange | 0.0025 |
Half a cup of peas | 0.0128 | A cup of beans | 1.5657 | A portion of grapes | 0.0025 |
A bowl of oatmeal | 0.0128 | Half a cup of peas | 1.4594 | A slice of papaya | 0.0025 |
A sapote | 0.0126 | A slice of papaya | 1.3905 | A bowl of cream of vegetable soup | 0.0023 |
Supplementary Materials
The following supporting information can be downloaded at:
References
1. INEGI. Estadística de Defunciones Registradas de Enero a Junio de 2021; INEGI: Aguascalientes, Mexico, 2022.
2. American College of Cardiology. Advancing Cardiovascular Health in Mexico: Roundtable Report; American College of Cardiology: Washington, DC, USA, 2022.
3. Brikman, S.; Serfaty, L.; Abuhasira, R.; Schlesinger, N.; Bieber, A.; Rappoport, N. A machine learning-based prediction model for gout in hyperuricemics: A nationwide cohort study. Rheumatology; 2024; 63, pp. 2411-2417. [DOI: https://dx.doi.org/10.1093/rheumatology/keae273] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38895877]
4. Nayan, N.A.; Ab Hamid, H.; Suboh, M.Z.; Aminaldin, N.M.; Azizul, N.I.M.; Nizar, M.N.M. Machine learning application for classifying serum uric acid level with photoplethysmogram. Proceedings of the 2020 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES); Langkawi Island, Malaysia, 1–3 March 2021; pp. 522-527.
5. Lee, S.; Choe, E.K.; Park, B. Exploration of machine learning for hyperuricemia prediction models based on basic health checkup tests. J. Clin. Med.; 2019; 8, 172. [DOI: https://dx.doi.org/10.3390/jcm8020172]
6. Zeng, L.; Ma, P.; Li, Z.; Liang, S.; Wu, C.; Hong, C.; Li, Y.; Cui, H.; Li, R.; Wang, J. et al. Multimodal Machine Learning-Based Marker Enables Early Detection and Prognosis Prediction for Hyperuricemia. Adv. Sci.; 2024; 11, 2404047. [DOI: https://dx.doi.org/10.1002/advs.202404047]
7. Sampa, M.B.; Hossain, M.N.; Hoque, M.R.; Islam, R.; Yokota, F.; Nishikitani, M.; Ahmed, A. Blood uric acid prediction with machine learning: Model development and performance comparison. JMIR Med. Inform.; 2020; 8, e18331. [DOI: https://dx.doi.org/10.2196/18331] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33030442]
8. Gao, Y.; Jia, S.; Huang, C.; Meng, Z.; Yu, M.; Xu, T.; Liu, M.; Sun, J.; Jia, Q.; Zhang, Q. et al. Risk factors of hyperuricemia calculated by random forest machine learning. Res. Sq.; 2020; pp. 1-20.
9. Shi, J.C.; Chen, X.H.; Yang, Q.; Wang, C.M.; Huang, Q.; Shen, Y.M.; Yu, J. A simple prediction model of hyperuricemia for use in a rural setting. Sci. Rep.; 2021; 11, 23300. [DOI: https://dx.doi.org/10.1038/s41598-021-02716-y]
10. Zhang, Y.; Zhang, L.; Lv, H.; Zhang, G. Ensemble machine learning prediction of hyperuricemia based on a prospective health checkup population. Front. Physiol.; 2024; 15, 1357404. [DOI: https://dx.doi.org/10.3389/fphys.2024.1357404]
11. Colín-Ramírez, E.; Rivera-Mancía, S.; Infante-Vázquez, O.; Cartas-Rosado, R.; Vargas-Barrón, J.; Madero, M.; Vallejo, M. Protocol for a prospective longitudinal study of risk factors for hypertension incidence in a Mexico City population: The Tlalpan 2020 cohort. BMJ Open; 2017; 7, e016773. [DOI: https://dx.doi.org/10.1136/bmjopen-2017-016773]
12. Stewart, A.L.; Ware, J.E. Measuring Functioning and Well-Being: The Medical Outcomes Study Approach; Duke University Press: Durham, NC, USA, 1992.
13. Chobanian, A.V.; Bakris, G.L.; Black, H.R.; Cushman, W.C.; Green, L.A.; Izzo Jr, J.L.; Jones, D.W.; Materson, B.J.; Oparil, S.; Wright Jr, J.T. et al. Seventh report of the joint national committee on prevention, detection, evaluation, and treatment of high blood pressure. Hypertension; 2003; 42, pp. 1206-1252. [DOI: https://dx.doi.org/10.1161/01.HYP.0000107251.49515.c2]
14. Marfell-Jones, M.J.; Stewart, A.; De Ridder, J. International Standards for Anthropometric Assessment. 2012; Available online: https://www.researchgate.net/publication/236891109_International_Standards_for_Anthropometric_Assessment (accessed on 10 February 2025).
15. Craig, C.L.; Marshall, A.L.; Sjöström, M.; Bauman, A.E.; Booth, M.L.; Ainsworth, B.E.; Pratt, M.; Ekelund, U.; Yngve, A.; Sallis, J.F. et al. International physical activity questionnaire: 12-country reliability and validity. Med. Sci. Sport. Exerc.; 2003; 35, pp. 1381-1395. [DOI: https://dx.doi.org/10.1249/01.MSS.0000078924.61453.FB]
16. Hernández-Avila, J.; González-Avilés, L.; Rosales-Mendoza, E. Manual de Usuario. SNUT Sistema de Evaluación de Hábitos Nutricionales y Consumo de Nutrimentos; Instituto Nacional de Salud Pública: Cuernavaca, Mexico, 2003.
17. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T. et al. Xgboost: Extreme Gradient Boosting. R Package Version 0.4-2. 2015; pp. 1-4. Available online: https://cran.ms.unimelb.edu.au/web/packages/xgboost/vignettes/xgboost.pdf (accessed on 10 February 2025).
18. Freund, Y.; Schapire, R.E. Experiments with a new boosting algorithm. Proceedings of the 13th International Conference on Machine Learning (ICML); Bari, Italy, 3–6 July 1996; Volume 96, pp. 148-156.
19. Schapire, R.E.; Singer, Y. Improved boosting algorithms using confidence-rated predictions. Proceedings of the Eleventh Annual Conference on Computational Learning Theory; Madison, WI, USA, 24–26 July 1998; pp. 80-91.
20. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Proceedings of the NeurIPS 2018; Montreal, QC, Canada, 2–8 December 2018; Volume 31.
21. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Proceedings of the NeurIPS 2018; Long Beach, CA, USA., 4–9 December 2017; Volume 30.
22. Shaon, M.S.H.; Karim, T.; Shakil, M.S.; Hasan, M.Z. A Comparative Study of Machine Learning Models with LASSO and SHAP Feature Selection for Breast Cancer Prediction. Healthc. Anal.; 2024; 6, 100353. [DOI: https://dx.doi.org/10.1016/j.health.2024.100353]
23. Gómez, O.V.; Herraiz, J.L.; Udías, J.M.; Haug, A.; Papp, L.; Cioni, D.; Neri, E. Analysis of cross-combinations of feature selection and machine-learning classification methods based on [18F] F-FDG PET/CT radiomic features for metabolic response prediction of metastatic breast cancer lesions. Cancers; 2022; 14, 2922. [DOI: https://dx.doi.org/10.3390/cancers14122922] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35740588]
24. Kramer, O.; Kramer, O. Scikit-learn. Machine Learning for Evolution Strategies; O’Reilly Media: Sebastopol, CA, USA, 2016; pp. 45-53.
25. Meneses-León, J.; León-Maldonado, L.; Macías, N.; Torres-Ibarra, L.; Hernández-López, R.; Rivera-Paredez, B.; Flores, M.; Flores, Y.N.; Barrientos-Gutiérrez, T.; Quezada-Sánchez, A.D. et al. Sugar-sweetened beverage consumption and risk of hyperuricemia: A longitudinal analysis of the Health Workers Cohort Study participants in Mexico. Am. J. Clin. Nutr.; 2020; 112, pp. 652-660. [DOI: https://dx.doi.org/10.1093/ajcn/nqaa160] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32644154]
26. Büsing, F.; Hägele, F.A.; Nas, A.; Döbert, L.V.; Fricker, A.; Dörner, E.; Podlesny, D.; Aschoff, J.; Pöhnl, T.; Schweiggert, R. et al. High intake of orange juice and cola differently affects metabolic risk in healthy subjects. Clin. Nutr.; 2019; 38, pp. 812-819. [DOI: https://dx.doi.org/10.1016/j.clnu.2018.02.028]
27. Zhang, Y.; Chen, S.; Yuan, M.; Xu, Y.; Xu, H. Gout and diet: A comprehensive review of mechanisms and management. Nutrients; 2022; 14, 3525. [DOI: https://dx.doi.org/10.3390/nu14173525]
28. Yustisia, I.; Tandiari, D.; Cangara, M.H.; Hamid, F.; Nu’man, A. A high-fat, high-fructose diet induced hepatic steatosis, renal lesions, dyslipidemia, and hyperuricemia in non-obese rats. Heliyon; 2022; 8, e10896. [DOI: https://dx.doi.org/10.1016/j.heliyon.2022.e10896]
29. Zhang, C.; Li, L.; Zhang, Y.; Zeng, C. Recent advances in fructose intake and risk of hyperuricemia. Biomed. Pharmacother.; 2020; 131, 110795. [DOI: https://dx.doi.org/10.1016/j.biopha.2020.110795]
30. Jamnik, J.; Rehman, S.; Mejia, S.B.; de Souza, R.J.; Khan, T.A.; Leiter, L.A.; Wolever, T.M.; Kendall, C.W.; Jenkins, D.J.; Sievenpiper, J.L. Fructose intake and risk of gout and hyperuricemia: A systematic review and meta-analysis of prospective cohort studies. BMJ Open; 2016; 6, e013191. [DOI: https://dx.doi.org/10.1136/bmjopen-2016-013191]
31. Ebrahimpour-Koujan, S.; Saneei, P.; Larijani, B.; Esmaillzadeh, A. Consumption of sugar sweetened beverages and dietary fructose in relation to risk of gout and hyperuricemia: A systematic review and meta-analysis. Crit. Rev. Food Sci. Nutr.; 2020; 60, pp. 1-10. [DOI: https://dx.doi.org/10.1080/10408398.2018.1503155]
32. World Health Organization. Guideline: Sugars Intake for Adults and Children; World Health Organization: Geneva, Switzerland, 2015.
33. Liu, X.Y.; Wu, Q.Y.; Chen, Z.H.; Yan, G.Y.; Lu, Y.; Dai, H.J.; Li, Y.; Yang, P.T.; Yuan, H. Elevated triglyceride to high-density lipoprotein cholesterol (TG/HDL-C) ratio increased risk of hyperuricemia: A 4-year cohort study in China. Endocrine; 2020; 68, pp. 71-80. [DOI: https://dx.doi.org/10.1007/s12020-019-02176-5]
34. Wang, Z.; Wu, M.; Yan, H.; Zhong, S.; Xu, R.; Zhao, Z.; Yang, Q. Association between Remnant Cholesterol and risk of Hyperuricemia: A cross-sectional study. Horm. Metab. Res.; 2024; 56, pp. 882-889. [DOI: https://dx.doi.org/10.1055/a-2299-2914] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38569514]
35. Juraschek, S.P.; McAdams-Demarco, M.; Gelber, A.C.; Sacks, F.M.; Appel, L.J.; White, K.J.; Miller, E.R., III. Effects of lowering glycemic index of dietary carbohydrate on plasma uric acid levels: The OmniCarb randomized clinical trial. Arthritis Rheumatol.; 2016; 68, pp. 1281-1289. [DOI: https://dx.doi.org/10.1002/art.39527] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26636424]
36. Zhang, T.; Bian, S.; Gu, Y.; Meng, G.; Zhang, Q.; Liu, L.; Wu, H.; Zhang, S.; Wang, Y.; Wang, X. et al. Sugar-containing carbonated beverages consumption is associated with hyperuricemia in general adults: A cross-sectional study. Nutr. Metab. Cardiovasc. Dis.; 2020; 30, pp. 1645-1652. [DOI: https://dx.doi.org/10.1016/j.numecd.2020.05.022] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32669242]
37. Lubawy, M.; Formanowicz, D. High-fructose diet–induced hyperuricemia accompanying metabolic syndrome–mechanisms and dietary therapy proposals. Int. J. Environ. Res. Public Health; 2023; 20, 3596. [DOI: https://dx.doi.org/10.3390/ijerph20043596]
38. Dearden, L.; Bouret, S.G.; Ozanne, S.E. Sex and gender differences in developmental programming of metabolism. Mol. Metab.; 2018; 15, pp. 8-19. [DOI: https://dx.doi.org/10.1016/j.molmet.2018.04.007]
39. Zhang, L.; Shi, X.; Yu, J.; Zhang, P.; Ma, P.; Sun, Y. Dietary vitamin E intake was inversely associated with hyperuricemia in US adults: NHANES 2009–2014. Ann. Nutr. Metab.; 2021; 76, pp. 354-360. [DOI: https://dx.doi.org/10.1159/000509628]
40. Roumeliotis, S.; Roumeliotis, A.; Dounousi, E.; Eleftheriadis, T.; Liakopoulos, V. Dietary antioxidant supplements and uric acid in chronic kidney disease: A review. Nutrients; 2019; 11, 1911. [DOI: https://dx.doi.org/10.3390/nu11081911]
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Background/Objectives: Uric acid is a key metabolic byproduct of purine degradation and plays a dual role in human health. At physiological levels, it acts as an antioxidant, protecting against oxidative stress. However, excessive uric acid can lead to hyperuricemia, contributing to conditions like gout, kidney stones, and cardiovascular diseases. Emerging evidence also links elevated uric acid levels with metabolic disorders, including hypertension and insulin resistance. Understanding its regulation is crucial for preventing associated health complications. Methods: This study, part of the Tlalpan 2020 project, aimed to predict uric acid levels using advanced machine learning algorithms. The dataset included clinical, anthropometric, lifestyle, and nutritional characteristics from a cohort in Mexico City. We applied Boosted Decision Trees (Boosted DTR), eXtreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and Shapley Additive Explanations (SHAP) to identify the most relevant variables associated with hyperuricemia. Feature engineering techniques improved model performance, evaluated using Mean Squared Error (MSE), Root-Mean-Square Error (RMSE), and the coefficient of determination (R2). Results: Our study showed that XGBoost had the highest accuracy for anthropometric and clinical predictors, while CatBoost was the most effective at identifying nutritional risk factors. Distinct predictive profiles were observed between men and women. In men, uric acid levels were primarily influenced by renal function markers, lipid profiles, and hereditary predisposition to hyperuricemia, particularly paternal gout and diabetes. Diets rich in processed meats, high-fructose foods, and sugary drinks showed stronger associations with elevated uric acid levels. In women, metabolic and cardiovascular markers, family history of metabolic disorders, and lifestyle factors such as passive smoking and sleep quality were the main contributors. Additionally, while carbohydrate intake was more strongly associated with uric acid levels in women, fructose and sugary beverages had a greater impact in men. To enhance model robustness, a cross-feature selection approach was applied, integrating top features from multiple models, which further improved predictive accuracy, particularly in gender-specific analyses. Conclusions: These findings provide insights into the metabolic, nutritional characteristics, and lifestyle determinants of uric acid levels, supporting targeted public health strategies for hyperuricemia prevention.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details




1 “Researcher for Mexico” Program under SECIHTI, Secretariat of Sciences, Humanities, Technology, and Innovation, Mexico City 08400, Mexico; Division of Diagnostic and Treatment Services, National Institute of Cardiology Ignacio Chávez, Mexico City 04510, Mexico;
2 Department of Immunology, National Institute of Cardiology Ignacio Chávez, Mexico City 04510, Mexico;
3 Division of Diagnostic and Treatment Services, National Institute of Cardiology Ignacio Chávez, Mexico City 04510, Mexico;
4 Computational Genomics Division, National Institute of Genomic Medicine, Mexico City 14610, Mexico; Center for Complexity Sciences, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico