Leveraging machine learning to proactively identify phishing campaigns before they strike

Abstract

With the increasing reliance on digital platforms for shopping, communication, and meetings, users are more exposed to cyber threats like phishing. These attacks often involve fraudulent websites designed to steal sensitive information, such as passwords and credit card details, by mimicking legitimate sites. Attackers use various deceptive techniques, including link manipulation, filter evasion, covert redirection, website forgery, and social engineering. This study introduces an advanced phishing detection framework using machine learning (ML) models. A dataset of 1,353 URLs (702 legitimate, 103 suspicious, and 548 phishing) was compiled, with nine key features extracted for classification. Four ML classifiers were employed: Categorical Boosting, Random Forest (RF), Decision Tree (DT), and Extreme Gradient Boosting (XGB), with cross-validation ensuring robust model evaluation. Feature selection was conducted using SHapley Additive Explanations (SHAP) and Recursive Feature Elimination (RFE) to enhance interpretability and computational efficiency. To further refine classification accuracy across legitimate, suspicious, and phishing categories, hyperparameter tuning was performed using four nature-inspired optimization algorithms: Golden Jackal Optimization, Dandelion Optimization, Coati Optimization, and Puma Optimization. These algorithms were chosen for their strong global search capabilities and adaptability to complex datasets, ensuring optimal parameter selection for improved model performance. The study’s main contribution lies in integrating these optimization techniques with ML classifiers, significantly improving phishing detection accuracy while reducing computational complexity. Experimental results demonstrated that XGB-based models, particularly XGPO, achieved the highest performance across two feature-selection scenarios. In Scenario 1, the reported scores were Accuracy = 0.980, Precision = 0.981, Recall = 0.980, F1-score = 0.980, MCC = 0.965, and AUC = 0.985. In Scenario 2, they were Accuracy = 0.984, Precision = 0.985, Recall = 0.984, F1-score = 0.985, MCC = 0.973, and AUC = 0.989. These findings highlight the effectiveness of ML-driven phishing detection in strengthening user security, preventing cyber fraud, and fostering trust in online interactions.
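
As a rough illustration of how the reported metrics relate to a three-class confusion matrix, the following sketch computes accuracy, precision, recall, F1-score, and the multiclass Matthews correlation coefficient (MCC) in plain Python. It is not the authors' evaluation code. The toy labels are invented, the function names `confusion_matrix` and `summarize` are hypothetical, and macro averaging is an assumption, since the abstract does not say which averaging scheme was used. AUC is left out because it needs predicted probabilities rather than hard labels.

```python
from math import sqrt

# Class names taken from the abstract; the ordering is an arbitrary choice.
LABELS = ("legitimate", "suspicious", "phishing")


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Build a matrix whose rows are true classes and columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def summarize(matrix):
    """Return accuracy, macro precision/recall/F1 (assumed), and multiclass MCC."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(matrix[k][k] for k in range(n))
    true_counts = [sum(matrix[k]) for k in range(n)]                       # row sums
    pred_counts = [sum(matrix[r][k] for r in range(n)) for k in range(n)]  # column sums

    precisions = [matrix[k][k] / pred_counts[k] if pred_counts[k] else 0.0
                  for k in range(n)]
    recalls = [matrix[k][k] / true_counts[k] if true_counts[k] else 0.0
               for k in range(n)]
    f1s = [2 * p * r / (p + r) if p + r else 0.0
           for p, r in zip(precisions, recalls)]

    # Multiclass MCC: covariance of true and predicted labels, normalised
    # by the product of their standard deviations.
    cov_tp = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    cov_tt = total ** 2 - sum(t * t for t in true_counts)
    cov_pp = total ** 2 - sum(p * p for p in pred_counts)
    denom = sqrt(cov_tt * cov_pp)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "mcc": cov_tp / denom if denom else 0.0,
    }


# Invented toy labels, for illustration only (not data from the study).
y_true = ["legitimate"] * 4 + ["suspicious"] * 2 + ["phishing"] * 4
y_pred = ["legitimate", "legitimate", "legitimate", "suspicious",
          "suspicious", "phishing",
          "phishing", "phishing", "phishing", "legitimate"]

matrix = confusion_matrix(y_true, y_pred)
scores = summarize(matrix)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

On an imbalanced dataset such as the one described (103 suspicious URLs out of 1,353), MCC can be noticeably lower than accuracy when the minority class is misclassified, which may be why both are reported.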

Details

Business indexing term

Subject:

Machine learning;
Big Data

Identifier / keyword

Phishing cybercrimes; Uniform resource locator; Machine learning classification; Shapley additive explanations; Recursive feature elimination; Hyperparameter optimization

Title

Leveraging machine learning to proactively identify phishing campaigns before they strike

Publication title

Journal of Big Data; Heidelberg

Volume

Issue

Pages

124

Publication year

2025

Publication date

May 2025

Publisher

Springer Nature B.V.

Place of publication

Heidelberg

Country of publication

Netherlands

Publication subject

Computers--Electronic Data Processing

e-ISSN

21961115

Source type

Scholarly Journal

Language of publication

English

Document type

Journal Article

Publication history

Online publication date

2025-05-20

Milestone dates

2025-04-27 (Registration); 2024-11-26 (Received); 2025-04-27 (Accepted)

First posting date

20 May 2025

DOI

https://doi.org/10.1186/s40537-025-01174-x

ProQuest document ID

3206309445

Document URL

https://www.proquest.com/scholarly-journals/leveraging-machine-learning-proactively-identify/docview/3206309445/se-2?accountid=208611

Last updated

2025-11-14

Database

2 databases

Coronavirus Research Database
ProQuest One Academic
