
Abstract

Diabetes Mellitus is a global health concern characterized by high blood sugar levels over a prolonged period, and it leads to severe complications if left unmanaged. Early identification of at-risk individuals is critical for effective intervention and treatment. Traditional diagnostic methods rely heavily on clinical symptoms and biochemical tests, which may not capture underlying genetic predispositions. With the advent of genomics, DNA sequence analysis has emerged as a promising way to uncover genetic markers associated with Diabetes Mellitus. The challenge is to classify DNA sequences accurately enough to predict susceptibility to the disease, given the complexity of genetic data. This study addresses that challenge by using two machine learning models, NuSVC (Nu-Support Vector Classification) and XGBoost (Extreme Gradient Boosting), to classify DNA sequences related to Diabetes Mellitus. The dataset, obtained from reputable sources such as NCBI, was preprocessed using Natural Language Processing (NLP) techniques: DNA sequences were treated as textual data and transformed into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency). SMOTE (Synthetic Minority Over-sampling Technique) was applied to handle class imbalance in the dataset. The models were trained and validated using 10-fold cross-validation. XGBoost was trained with up to 300 boosting rounds, and performance was evaluated using accuracy, precision, recall, F1-score, ROC-AUC, and log loss. XGBoost outperformed NuSVC across all metrics, achieving an accuracy of 98%, a log loss of 0.0650, and an AUC of 1.00, compared with NuSVC’s accuracy of 87%, log loss of 0.2649, and AUC of 0.95. This performance indicates that XGBoost is robust in handling complex genetic data and has potential utility in clinical applications for early diagnosis of Diabetes Mellitus. The findings underscore the importance of advanced machine learning techniques in genomics and suggest that integrating such models into healthcare systems could significantly enhance predictive diagnostics.
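
The step of treating DNA sequences as text and converting them to TF-IDF features can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not say how sequences were tokenized, so the overlapping k-mer split, the value k=3, the smoothed IDF formula, the L2 normalization, the function names, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (assumed tokenization)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(sequences, k=3):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Smoothed IDF (an assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {
        t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1
        for t in vocab
    }
    rows = []
    for d in docs:
        raw = [d[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows


# Toy sequences invented for this example; they are not from the study's dataset.
toy = ["ATGCGATA", "ATGCCCTA", "GGGTTTAA"]
vocab, rows = tfidf(toy, k=3)
```

Each row of `rows` is a fixed-length numeric vector of the kind a classifier such as NuSVC or XGBoost could take as input. The oversampling and cross-validation steps described in the abstract would follow this stage and are not shown here.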

Details

1009240
Business indexing term
Title
DNA sequence classification for diabetes mellitus using NuSVC and XGBoost: A comparative
Publication title
PLoS One; San Francisco
Volume
20
Issue
7
First page
e0328253
Number of pages
16
Publication year
2025
Publication date
Jul 2025
Section
Research Article
Publisher
Public Library of Science
Place of publication
San Francisco
Country of publication
United States
e-ISSN
1932-6203
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Milestone dates
2025-02-18 (Received); 2025-06-28 (Accepted); 2025-07-18 (Published)
ProQuest document ID
3231448959
Document URL
https://www.proquest.com/scholarly-journals/dna-sequence-classification-diabetes-mellitus/docview/3231448959/se-2?accountid=208611
Copyright
© 2025 Salloum et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-07-25
Database
ProQuest One Academic