Content area

Abstract

Cyberattacks include Structured Query Language Injection (SQLi), which represents threats at the level of web applications that interact with the database. These attacks are carried out by executing SQL commands, which compromise the integrity and confidentiality of the data. In this paper, a machine learning (ML)-based model is proposed for identifying SQLi attacks. The authors propose a two-stage personalized software processing pipeline as a novel element. Although individual techniques are known, their structured combination and application in this context represent a novel approach to transforming raw SQL queries into input features for an ML model. In this research, a dataset consisting of 90,000 SQL queries was constructed, comprising 17,695 legitimate and 72,304 malicious queries. The dataset consists of synthetic data generated using the GPT-4o model and data from a publicly available dataset. These were processed within a pipeline proposed by the authors, consisting of two stages: syntactic normalization and the extraction of the eight semantic features for model training. Also, within the research, several ML models were analyzed using the Azure Machine Learning Studio platform. These models were paired with different sampling algorithms for selecting the training set and the validation set. Out of the 15 training-sampling algorithm combinations, the Voting Ensemble model achieved the best performance. It achieved an accuracy of 96.86%, a weighted AUC of 98.25%, a weighted F1-score of 96.77%, a weighted precision of 96.92%, and a Matthews correlation coefficient of 89.89%. These values demonstrate the model’s ability to classify queries as legitimate or malicious. The attack identification rate was only 15 malicious queries missed out of a total of 7200, and the number of false alarms was 211 cases. The results confirm the possibility of integrating this algorithm into an additional security layer within an existing web application architecture. In practice, the authors suggest adding an extra layer of security using synthetic data.

Details

1009240
Business indexing term
Title
Machine Learning Models for SQL Injection Detection
Author
Rosca Cosmina-Mihaela 1   VIAFID ORCID Logo  ; Stancu, Adrian 2   VIAFID ORCID Logo  ; Popescu Catalin 2   VIAFID ORCID Logo 

 Department of Automatic Control, Computers, and Electronics, Faculty of Mechanical and Electrical Engineering, Petroleum-Gas University of Ploiesti, 100680 Ploiesti, Romania; [email protected] 
 Department of Business Administration, Faculty of Economic Sciences, Petroleum-Gas University of Ploiesti, 100680 Ploiesti, Romania 
Publication title
Volume
14
Issue
17
First page
3420
Number of pages
21
Publication year
2025
Publication date
2025
Publisher
MDPI AG
Place of publication
Basel
Country of publication
Switzerland
Publication subject
e-ISSN
20799292
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
 
 
Online publication date
2025-08-27
Milestone dates
2025-07-12 (Received); 2025-08-25 (Accepted)
Publication history
 
 
   First posting date
27 Aug 2025
ProQuest document ID
3249685025
Document URL
https://www.proquest.com/scholarly-journals/machine-learning-models-sql-injection-detection/docview/3249685025/se-2?accountid=208611
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-09-19
Database
2 databases
  • ProQuest One Academic
  • ProQuest One Academic