Structured Query Language injection (SQLi) is a class of cyberattack that targets web applications interacting with a database. These attacks work by getting the database to execute attacker-supplied SQL commands, which compromises the integrity and confidentiality of the data. This paper proposes a machine learning (ML)-based model for identifying SQLi attacks. Its novel element is a two-stage custom processing pipeline: although the individual techniques are known, their structured combination and application in this context represent a novel approach to transforming raw SQL queries into input features for an ML model. A dataset of 90,000 SQL queries was constructed, comprising 17,695 legitimate and 72,304 malicious queries, drawn from synthetic data generated with the GPT-4o model and from a publicly available dataset. The queries were processed by the proposed pipeline in two stages: syntactic normalization, followed by extraction of eight semantic features for model training. Several ML models were then analyzed on the Azure Machine Learning Studio platform, each paired with different sampling algorithms for selecting the training and validation sets. Of the 15 combinations of training and sampling algorithms, the Voting Ensemble model performed best, with an accuracy of 96.86%, a weighted AUC of 98.25%, a weighted F1-score of 96.77%, a weighted precision of 96.92%, and a Matthews correlation coefficient of 89.89%. These values demonstrate the model's ability to classify queries as legitimate or malicious. The model missed only 15 malicious queries out of a total of 7200 queries and raised 211 false alarms. The results confirm that the model could be integrated as an additional security layer within an existing web application architecture.
In practice, the authors suggest adding this extra layer of security with a model trained partly on synthetic data.
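The error counts and the accuracy quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 7200 figure is the size of the whole evaluation set, covering both legitimate and malicious queries; the abstract does not state this explicitly, and the function and variable names are illustrative only.

```python
# Figures quoted in the abstract.
missed_attacks = 15    # malicious queries classified as legitimate (false negatives)
false_alarms = 211     # legitimate queries classified as malicious (false positives)

# ASSUMPTION: 7200 is the total number of evaluated queries,
# not the number of malicious queries alone.
total_queries = 7200


def accuracy(total: int, false_negatives: int, false_positives: int) -> float:
    """Return the fraction of queries that were classified correctly."""
    correct = total - false_negatives - false_positives
    return correct / total


acc = accuracy(total_queries, missed_attacks, false_alarms)
print(f"accuracy = {acc:.2%}")  # prints "accuracy = 96.86%"
```

Under this reading, 6974 of the 7200 queries are classified correctly, which agrees with the reported accuracy of 96.86%.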
Details
Stancu, Adrian 2; Popescu, Catalin 2
1 Department of Automatic Control, Computers, and Electronics, Faculty of Mechanical and Electrical Engineering, Petroleum-Gas University of Ploiesti, 100680 Ploiesti, Romania; [email protected]
2 Department of Business Administration, Faculty of Economic Sciences, Petroleum-Gas University of Ploiesti, 100680 Ploiesti, Romania