Abstract
The growing prevalence of malware in the digital landscape presents significant risks to the security and integrity of computer networks and devices. Malicious software, designed with harmful intent, can disrupt operations, compromise sensitive data, and undermine critical processes. To counter these ongoing threats, enhanced cyber threat detection systems are essential to identify and mitigate emerging risks proactively. One promising approach to improving cybersecurity involves applying Machine Learning (ML) techniques, which allow systems to detect patterns and make informed predictions. In this paper, we examine the effectiveness of ML in cyber threat detection, focusing on the classification of malicious and benign entities within digital ecosystems. We tested four ML algorithms: Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbors (KNN), and Random Forest (RF). The dataset, sourced from Kaggle, was carefully pre-processed to ensure accurate classification of malware and benign data. We used k-fold cross-validation to split the dataset and manually tuned hyper-parameters to refine model performance, reducing bias and variance. Our results revealed distinct performance differences among the models, with RF emerging as the top performer at an accuracy of 100%. To further enhance the interpretability of the RF model's predictions, we employed Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) as Explainable AI (XAI) techniques. These approaches identify static_prio, nivcsw, vm_truncate_count, shared_vm, and millisecond as the most significant features, supporting the validation of, and trust in, ML-based cybersecurity solutions. The interaction-aware Partial Dependence Plot (PDP) technique is integrated with LIME to demonstrate the impact of individual features on model predictions. We assessed the performance of LIME and SHAP, applying optimization techniques to minimize their overhead. This research also incorporated opinions from cybersecurity experts and employed the Chi-Squared test to validate the XAI explanations. These results reinforce the importance of ML in bolstering cybersecurity by enhancing cyber defense systems against malware and other threats. Ultimately, our research aims to strengthen computer network resilience and protect digital assets, ensuring the integrity and security of digital ecosystems amid evolving cyber threats.
Introduction
In the rapidly evolving landscape of Industry 4.0, the integration of technology into corporate and personal routines has grown tremendously, driven by advancements in the Internet of Things (IoT) and its applications (Bensaoud et al 2024). However, this digital transformation has also heightened security concerns, as cybercriminals increasingly target systems to steal personal data, disrupt services, and cause widespread harm. Such attackers deploy malicious software, or malware, to exploit vulnerabilities and inflict significant damage on systems (Smmarwar et al 2024). Malware is referred to by several names depending on its purpose and behavior, including adware, spyware, viruses, worms, Trojans, rootkits, backdoors, ransomware, and command-and-control (C&C) bots (Aamir et al 2024).
In recent years, the frequency and sophistication of malware attacks have escalated. For instance, global malware attacks surged to over 5.5 billion in 2021 alone, a 5% increase from the previous year, highlighting the persistent and growing danger (Almazroi and Ayub 2024). Ransomware, one of the most notorious forms of malware, saw a staggering 62% increase in attacks in the same year, affecting organizations across multiple sectors (Fernando and Komninos 2024; Gulmez et al 2024). This growing threat underscores the importance of continuous research and development in malware detection and mitigation strategies. Despite advancements in detection technologies, malware creators are constantly evolving their techniques to evade these defenses, making it an ongoing challenge for cybersecurity professionals (Poornima and Mahalakshmi 2024). The expansion of the threat landscape, driven by the introduction of mobile technologies and the widespread use of internet-connected devices, demands strong defenses against emerging cyber threats.
Machine learning has become a pivotal tool in combating the ever-evolving malware threat. By leveraging large datasets and advanced algorithms, ML models can identify patterns and anomalies indicative of malicious activities, often before they fully manifest. This capability is particularly crucial in today’s cybersecurity landscape, where traditional signature-based detection methods struggle to keep pace with the rapid development of new malware variants (Deng et al 2024).
In this study, we investigated the effectiveness of four Machine Learning algorithms: Support Vector Machine, Decision Tree, K-Nearest Neighbors, and Random Forest for cyber threat identification, with a particular focus on the categorization of malicious and benign entities in digital ecosystems. We aim to strengthen cyber defense systems' resistance to the constantly changing cyber threat landscape by utilizing machine learning algorithms. We sought to determine the best practices and algorithms for obtaining high detection accuracy while reducing false positives and false negatives through thorough testing and assessment. We employed LIME and SHAP as Explainable AI techniques to enhance the interpretability of our model predictions. These techniques provide localized, interpretable explanations by approximating the behavior of the complex ML models with simpler, interpretable models around individual predictions. This ensures that the features influencing each classification are transparent and understandable, enabling us to validate and trust the ML models' decisions. Our study integrates interaction-aware techniques, such as Partial Dependence Plots, into the surrogate model to improve the understanding of feature interactions and their collective influence on model predictions within the LIME analysis. This research employed fivefold cross-validation and manual hyper-parameter tuning to mitigate overfitting in the machine learning models. Additionally, early stopping and optimization of various parameters were utilized in LIME and SHAP to minimize performance overhead and enhance the models' effectiveness. Integrating XAI into our analysis allowed us to achieve high detection accuracy and provide clear insights into the factors driving the predictions, thus reinforcing the robustness and reliability of our cyber threat detection system.
Literature review
Numerous authors have used Machine Learning-based techniques for malware detection and classification. Yerima and Sezer (2018) developed DroidFusion, a Machine Learning-based method for Android malware detection. It uses a multi-layer architecture, with base classifiers at the lower level and ranking-based algorithms at the top level, to create a robust final classifier. This method demonstrates the potential of advanced Machine Learning to enhance malware detection in the digital landscape. Taha and Barukab (2022) proposed an optimal ensemble learning technique, RF-GA, designed for Android malware classification. This method uses a Random Forest algorithm optimized with Genetic Algorithms (GA) alongside base classifiers such as Support Vector Machines, Logistic Regression (LR), Gradient Boosting (GB), Decision Trees, and AdaBoost (ADA). Notably, the RF-GA approach outperforms other techniques, achieving an accuracy of 94.15%, a precision of 94.15%, and an Area Under the Curve (AUC) score of 98.10%. These findings suggest that RF-GA is a highly effective solution for classifying Android malware, offering improved accuracy and reliability compared to other methods.
Shabtai et al. (2014) developed a system for detecting malicious behavior by analyzing network traffic. The system logs user-specific network traffic patterns for each app under scrutiny and then identifies deviations from typical patterns, which may indicate malicious activity. To assess the system’s performance, they utilized the C4.5 algorithm, achieving an accuracy rate of up to 94%. Jang et al. (2015) created Andro-AutoPsy, a system that identifies malware based on similarity matching using five features: digital certificate serial number, API call sequence, permissions, intents, and system commands. This approach claims to detect zero-day malware. Coronado-De-Alba et al. (2016) presented a mobile malware detection approach using a meta-ensemble algorithm with an accuracy of 97.5%. Milosevic et al. (2017) focused on detecting malicious patterns in source code and permission sets, achieving a 95.6% accuracy with various classifiers like Random Forest and C4.5. Damshenas et al. (2015) used ensemble learning to boost Android malware detection, obtaining an accuracy of up to 99.8% with Naive Bayes, Decision Tree, and Random Forest.
Alam et al. (2017) developed DroidNative, a scheme for detecting Android malware that analyzes both bytecode and native code, reaching a detection rate of 93.57% and an AUC of up to 99.56%. Kouliaridis et al. (2018) proposed an ensemble malware detection method, attaining an accuracy of 97.8% and an AUC of 97.7%. Potha et al. (2021) examined the impact of ensemble models, demonstrating that larger and homogeneous ensembles can enhance performance, with accuracy rates as high as 99.1%. Alzaylaee et al. (2020) introduced DL-Droid, a deep-learning system for detecting Android malware, achieving a detection rate of up to 97.8% with dynamic analysis alone and 99.6% with additional static analysis. Taheri et al. (2020) developed malware detection methods based on Hamming distance, achieving 90% to 99% accuracy. Millar et al. (2020) presented DANdroid, a deep learning model that uses a mix of op-codes, permissions, and API calls, with an F-score of up to 97.3%. Cai et al. (2021) proposed JOWMDroid, an Android malware detection scheme based on feature weighting, scoring an accuracy of 98.1%.
Basheer et al. (2024) utilized Explainable AI, specifically the SHAP framework, to enhance model transparency and interpretability in malware detection. The study achieved high detection accuracy by training machine learning models like Random Forest, AdaBoost, SVM, and ANN and applying XAI techniques while clearly explaining feature contributions and justifying the model's decisions. Almazroi and Ayub (2024) introduced a BERT-based Feed Forward Neural Network Framework (BEFNet) tailored for IoT scenarios. The framework, optimized using the Spotted Hyena Optimizer (SO), was evaluated across eight diverse malware datasets. BEFNet demonstrated exceptional adaptability and performance, achieving 97.99% accuracy, a 97.96 Matthews Correlation Coefficient, a 97% F1-Score, a 98.37% Area under the ROC Curve (AUC-ROC), and a 95.89 Cohen's Kappa, highlighting its effectiveness in malware detection. Bostani and Moonsamy (2024) introduced EvadeDroid, an advanced adversarial attack that effectively bypasses black-box Android malware detectors in real-world scenarios. This query-efficient optimization algorithm demonstrated high efficacy, achieving 80–95% evasion rates against various malware detectors, including DREBIN and MaMaDroid, with only 1–9 queries.
Additionally, EvadeDroid maintained an average evasion rate of 79% against five popular commercial antiviruses, showcasing its real-world applicability and stealth. Thakur et al. (2024) introduced a hybrid approach combining Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN) to enhance malware analysis. This method converts malware binaries into grayscale images and analyzes them using CNN-LSTM networks. Dynamic features are extracted and reduced via Principal Component Analysis (PCA), with a voting scheme used for final classification. The approach effectively captures temporal dependencies with LSTM and performs parallel feature extraction with CNN. Nasser et al. (2024) introduced DL-AMDet, a deep-learning architecture for detecting Android malware using static and dynamic analysis. It combines CNN-BiLSTM for static detection and deep Autoencoders for dynamic anomaly detection, achieving a high accuracy of 99.935%, outperforming current state-of-the-art methods.
Ksibi et al. (2024) proposed a novel malware classification approach using convolutional neural networks, leveraging pre-trained models like DenseNet169, Xception, InceptionV3, ResNet50, and VGG16. The approach benefits from an end-to-end learning process by converting Android APK files into binary codes and RGB images, eliminating the need for manual feature engineering. Tested on the CICInvesAndMal2019 dataset, the models achieved classification accuracies of 95.24% with DenseNet169 and InceptionV3 and 95.83% with VGG16, demonstrating superior performance and suitability for Android IoT devices due to lower resource consumption. Nobakht et al. (2024) proposed SIM-FED, a privacy-focused IoT malware detection model combining deep and federated learning. Using a lightweight 1D CNN and the FedAvg aggregation strategy, SIM-FED is efficient and robust against cyber-attacks. Tested on the IoT-23 dataset, it achieves 99.52% accuracy, outperforming other models. Seyfari and Meimandi (2024) enhance Android malware detection using simulated annealing for feature selection and fuzzy logic for neighbor generation. Tested on the DREBIN dataset with 410 samples, their method, combined with KNN and permission features, achieved 99.02% accuracy, outperforming many recent studies.
Though many authors have employed Machine Learning for malware detection, these approaches often have limitations. Many existing studies (Taha and Barukab 2022; Shabtai et al 2014; Jang et al 2015; Coronado-De-Alba et al 2016; Damshenas et al 2015; Alam et al 2017; Kouliaridis et al 2018; Potha et al 2021; Alzaylaee et al 2020; Taheri et al 2020) on malware detection rely on limited datasets with few attributes, leading to generalization issues. Many studies (Almazroi and Ayub 2024; Shabtai et al 2014; Jang et al 2015; Coronado-De-Alba et al 2016; Milosevic et al 2017; Damshenas et al 2015; Alam et al 2017; Kouliaridis et al 2018; Potha et al 2021; Alzaylaee et al 2020; Taheri et al 2020; Millar et al 2020; Cai et al 2021; Bostani and Moonsamy 2024; Thakur et al 2024; Nasser et al 2024) do not employ Explainable AI techniques, which can impede comprehension and trust in the model’s decision-making process. Although some research (Basheer et al 2024) incorporates XAI methods, it often lacks the use of interaction-aware techniques. Such techniques are crucial for enhancing the understanding of feature interactions and their combined effect on model predictions. The absence of these advanced methods can limit the depth of insight provided into how features collectively influence model outcomes.
Additionally, some papers (Shabtai et al 2014) focus on a single Machine Learning algorithm, which might not capture the diverse behaviors of malware. Overfitting-reducing techniques are often ignored (Almazroi and Ayub 2024; Shabtai et al 2014; Jang et al 2015; Coronado-De-Alba et al 2016; Milosevic et al 2017; Damshenas et al 2015; Alam et al 2017; Kouliaridis et al 2018; Potha et al 2021; Alzaylaee et al 2020; Taheri et al 2020; Millar et al 2020; Cai et al 2021; Basheer et al 2024; Bostani and Moonsamy 2024; Thakur et al 2024; Nasser et al 2024; Ksibi et al 2024), leading to high variance or bias in the models. Furthermore, many studies (Kouliaridis et al 2018; Potha et al 2021; Alzaylaee et al 2020; Taheri et al 2020) report high accuracy but overlook other critical metrics such as precision, recall, and F1-score, which are crucial for a comprehensive evaluation of model performance. This study addresses these gaps by using a robust dataset from Kaggle containing 35 features, incorporating four machine learning algorithms—Support Vector Machine, Decision Tree, K-Nearest Neighbors, and Random Forest—employing fivefold cross-validation and manual hyper-parameter tuning to mitigate bias, and focusing on a comprehensive set of performance metrics (correlation matrix, confusion matrix, ROC curve, precision, recall, specificity, False Positive Rate (FPR), False Negative Rate (FNR), Negative Predictive Value (NPV), Matthews Correlation Coefficient (MCC), F1 score, Balanced Accuracy, accuracy, Precision-Recall curve, and F1 Score vs. Threshold plot) to ensure a well-rounded assessment of malware detection capabilities. Our study addresses the challenge of model interpretability by utilizing LIME and SHAP to offer clear explanations for model predictions. In LIME, Partial Dependence Plots are employed to illustrate the effects of individual features on forecasts. We evaluated the performance of both LIME and SHAP, implementing various optimization techniques to reduce performance overhead. We also tested our best machine learning models and explanations using real-life datasets from the cybersecurity sector to validate their effectiveness. Through these measures, we aim to enhance our models’ reliability and interpretability.
Methodology
This paper aims to develop a Machine Learning-Based Cyber Threat Detection system for malware identification. The proposed methodology involves several critical steps, including data collection and pre-processing, Machine Learning model selection, data splitting, model training, hyperparameter tuning, model evaluation, comparative analysis, implementing Explainable AI techniques, and exploring real-life testing of the ML models. The workflow diagram illustrating the proposed system is shown in Fig. 1. This diagram clearly represents the entire process, from the initial data collection to the final model evaluation, highlighting the logical flow and critical components involved in building a robust malware detection system.
Fig. 1 [Images not available. See PDF.]
Working flow diagram of the proposed machine learning-based malware detection system
Datasets
The dataset utilized in this study was obtained from the Kaggle library (Kaggle 2024). It consists of 100,000 records, evenly distributed between Malware and Benign data, with each category representing 50% of the total dataset. Figure 2 illustrates samples from the dataset, showcasing the diversity and distribution of the features. The dataset includes 35 distinct features, offering a comprehensive source of information for our analysis. These features encompass a variety of data types, as shown in Fig. 3, including categorical features like hash and classification and numerical features such as millisecond, state, usage_counter, prio, and static_prio.
Fig. 2 [Images not available. See PDF.]
Dataset samples
Fig. 3 [Images not available. See PDF.]
Data types of features
The features capture various aspects of system behavior and resource usage, which are critical for distinguishing between benign and malware activities. For instance, features like vm_truncate_count, task_size, total_vm, and shared_vm provide insights into memory management and virtual memory usage, while features like utime, stime, and gtime reflect different CPU usage patterns. The categorical feature classification is the target variable, indicating whether a given instance is benign or malware.
The balanced nature of the dataset, with 50,000 benign and 50,000 malware instances, as depicted in Fig. 4, ensures an unbiased evaluation of the models and methods applied in this study. A comprehensive breakdown of the dataset, including details on these features, is presented in Table 1.
Fig. 4 [Images not available. See PDF.]
Distribution of target class
Table 1. Dataset description
Features | Description | Range |
|---|---|---|
hash | APK/SHA256 file name | Different |
millisecond | Time | 0–999 ms |
classification | Malware/benign | |
state | Flag of unrunnable/runnable/stopped tasks | 0-Unrunnable task 1-Runnable task |
usage_counter | Task structure usage counter | 0-Unstructured 1-Structured |
prio | Keeps the dynamic priority of a process | Different |
static_prio | Static priority of a process | 13988–14274 |
normal_prio | Priority without taking RT-inheritance into account | Different |
policy | Planning policy of the process | 0 |
vm_pgoff | The offset of the area in the file, in pages | 0 |
vm_truncate_count | Used to mark a vma as now dealt with | 10406–16335 |
task_size | Size of the current task | 0 |
cached_hole_size | Size of free address space hole | 0 |
free_area_cache | First address space hole | 0–28 |
mm_users | Address space users | 616–724 |
map_count | Number of memory areas | 6850–28336 |
hiwater_rss | The peak of resident set size | 0 |
total_vm | Total number of pages | 27–383 |
shared_vm | Number of shared pages | 112–120 |
exec_vm | Number of executable pages | 96–124 |
reserved_vm | Number of reserved pages | 90–442 |
nr_ptes | Number of page table entries | 0 |
end_data | End address of code component | 112–120 |
last_interval | Last interval time before thrashing | 0–7509 |
nvcsw | Number of voluntary context switches | 337688–355440 |
nivcsw | Number of involuntary context switches | 0–172 |
min_flt | Minor page faults | 0–130 |
maj_flt | Major page faults | 114–120 |
fs_excl_counter | It holds file system-exclusive resources | 0–2 |
lock | The read–write synchronization lock used for file system access | Different |
utime | User time | 376718–401058 ms |
stime | System time | 3–5 s |
gtime | Guest time | 0–1 s |
cgtime | Cumulative group time | 0 |
signal_nvcsw | Used as cumulative resource counter | 0 |
Data pre-processing
The initial stage in our process is to classify the dataset into two categories: malware and benign. Each data point in the dataset is labeled according to the type of software entity it represents—either malicious or benign. After categorization, the dataset undergoes several pre-processing steps to ensure data quality and consistency. This includes normalization to standardize the feature values, making them comparable across the dataset, and duplicate value checking to remove any redundant or repeated records. These pre-processing steps are crucial for achieving accurate model training and evaluation results.
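As an illustration, these steps might look as follows with pandas and scikit-learn. This is a minimal sketch, not the authors' exact pipeline: the file name is illustrative, and the column names follow the dataset schema in Table 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the Kaggle dataset (file name is illustrative).
df = pd.read_csv("malware_dataset.csv")

# Duplicate value checking: remove redundant or repeated records.
df = df.drop_duplicates()

# Encode the target label: 1 for malware, 0 for benign (value strings assumed).
y = (df["classification"] == "malware").astype(int)

# Drop the identifier and target columns; 'hash' is not a predictive feature.
X = df.drop(columns=["hash", "classification"])

# Normalization: rescale each feature to a common [0, 1] range.
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
```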
Machine learning algorithms
We experimented with four machine learning techniques for cyber threat detection: Support Vector Machine, Decision Tree, K-nearest neighbors, and Random Forest. These algorithms were chosen based on their ability to handle binary classification problems and their aptitude for spotting patterns in complicated datasets, particularly in malware detection and cyber threat analysis. Their effectiveness in handling complex, high-dimensional data has been demonstrated in numerous studies (Almazroi and Ayub 2024; Shabtai et al 2014; Jang et al 2015; Coronado-De-Alba et al 2016; Milosevic et al 2017; Damshenas et al 2015; Taheri et al 2020; Millar et al 2020; Cai et al 2021; Basheer et al 2024), making them reliable choices for our experiment.
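For reference, the four classifiers can be instantiated in scikit-learn as follows. This is a minimal sketch, with hyper-parameter values taken from the tuned settings reported later in Table 3; it is not presented as the authors' exact code.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# The four classifiers compared in this study.
# probability=True enables predict_proba, which LIME and SHAP need later.
models = {
    "SVM": SVC(C=10, kernel="rbf", gamma=1e-4, probability=True),
    "DT": DecisionTreeClassifier(max_depth=20, min_samples_split=5,
                                 min_samples_leaf=2, criterion="gini"),
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
    "RF": RandomForestClassifier(n_estimators=400, max_features="sqrt",
                                 criterion="gini"),
}
```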
Support vector machine
Support Vector Machine is a supervised Machine Learning algorithm for classification and regression tasks. It works by finding the hyperplane that best separates data into distinct classes. In a classification problem, SVM aims to create the optimal boundary or hyperplane that maximally separates different classes. It does this by selecting a small subset of training points, known as support vectors, that define the hyperplane’s position and orientation. SVM is effective in high-dimensional spaces and is known for its robustness against overfitting, especially with techniques like kernel tricks that enable non-linear classification. Standard kernels include linear, polynomial, and radial basis function (RBF). These allow SVM to handle complex patterns in the data, making it a versatile tool for various applications, including image recognition, text classification, and malware detection (Widodo and Yang 2007).
Decision tree
A Decision Tree is a supervised machine-learning algorithm for classification and regression tasks. It divides the data into subsets based on specific features, following a tree-like structure of decision nodes and branches. Each node in a decision tree represents a test or decision based on a particular attribute, while the branches represent the possible outcomes of that decision. The leaves at the end of the tree represent the final class or value prediction. In a Decision Tree, building the tree involves selecting the best attribute to split the data at each node. This selection is often based on metrics like Gini impurity or information gain (derived from entropy), which measure how well the chosen attribute separates the data into distinct classes. The goal is to create a structure that effectively partitions the data, leading to the most accurate predictions. Decision Trees are commonly used in various applications, from credit scoring to medical diagnosis, and are the foundation for more complex ensemble methods like Random Forests and Gradient Boosting (Niu et al 2024).
K-nearest neighbors
The core concept behind KNN is that similar data points tend to be located near each other. It operates by finding the “k” nearest data points to a given test instance and then using these nearest neighbors to determine the class or value of the test instance. In a KNN classification task, a new data point class is determined by a majority vote among its “k” closest neighbors. In other words, the test instance is assigned the most common class among its neighbors. The value is typically calculated for regression tasks as the average of the “k” nearest neighbors. The algorithm relies on a distance metric to measure the closeness of data points. Typical distance metrics include Euclidean distance, Manhattan distance, and cosine similarity. The choice of “k” (the number of neighbors to consider) and the distance metric can significantly impact the algorithm’s performance. KNN is valued for its simplicity and ease of implementation, but it can be computationally intensive, especially for large datasets, because it requires calculating distances to all training data points. Additionally, KNN is sensitive to the scale of features, so feature normalization or standardization is typically necessary to ensure accurate results. KNN is used in various applications, including pattern recognition, data mining, and recommender systems (Ma et al 2024). Despite its simplicity, KNN can be highly effective, especially when correctly tuned and applied to well-understood datasets.
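Because KNN's distance computations are scale-sensitive, as noted above, a standardization step is typically chained before the classifier. A minimal scikit-learn sketch (one possible arrangement, not the authors' exact setup):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Standardize features before the neighbor search so no single feature
# dominates the Euclidean distance purely by its numeric range.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
```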
Random forest
Random Forest is a powerful ensemble learning method for classification and regression tasks. It is built on combining multiple decision trees to create a more robust and accurate predictive model. The “forest” in Random Forest consists of many decision trees, each trained on a random subset of the training data, a technique known as “bootstrapping.” Additionally, each tree considers only a random subset of features when making splits, which reduces the correlation among trees and helps prevent overfitting. In a Random Forest classification task, the final prediction is determined through a majority vote among the individual decision trees. For regression, the prediction is the average of the outputs from the individual trees. This ensemble approach enhances the model’s generalization ability and stability, reducing the likelihood of overfitting that might occur with a single decision tree. Random Forest is widely used in various domains, including bioinformatics, finance, and cybersecurity, where its ability to handle high-dimensional data and provide reliable predictions is highly valued (Sun et al 2024; Vinayakumar et al 2019).
Model training and testing
The labeled and pre-processed dataset is split into 80% for training and 20% for testing, with a fivefold cross-validation approach applied to the training portion. Each fold maintains the class distribution of the original dataset. This method reduces bias and variance by repeatedly dividing the training data into five subsets. In each iteration, one subset serves as the validation set, while the remaining four are used for training. This process rotates through all five subsets, allowing every subset to act as a validation set exactly once. The training sets are used to build and refine the machine learning models, while the validation sets are used to tune hyperparameters and assess intermediate performance. By doing so, the cross-validation technique helps optimize the models while avoiding overfitting. Once this iterative process is complete, the final model is evaluated on an independent test set comprising 20% of the original data to determine its effectiveness in cyber threat detection tasks. This final evaluation provides an unbiased measure of the model's performance in a real-world scenario, offering insight into its generalization capabilities.
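A minimal sketch of this splitting scheme, assuming the pre-processed `X` and `y` and the `models` dictionary from the earlier sketches:

```python
from sklearn.model_selection import (train_test_split, StratifiedKFold,
                                     cross_val_score)

# Hold out 20% as an independent test set, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Stratified fivefold cross-validation on the 80% training portion.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(models["RF"], X_train, y_train,
                         cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```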
In this study, we also implemented hyperparameter tuning to mitigate overfitting. The hyper-parameters of each algorithm (SVM, Decision Tree, KNN, and Random Forest) were manually tuned using a grid search approach. This involved systematically evaluating the performance of different combinations of hyper-parameters to identify the optimal settings that minimize overfitting while maximizing model performance. Table 2 presents the hyper-parameter values for each algorithm used in this study. These careful adjustments ensure that our models do not merely memorize the training data but instead capture the underlying patterns, leading to more reliable predictions in real-world applications.
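A sketch of how such a grid search might be expressed with scikit-learn, using the Random Forest ranges listed in Table 2 below (the specific grid values are illustrative, not the authors' exact grid):

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Candidate values drawn from the ranges in Table 2.
param_grid = {
    "n_estimators": [100, 200, 400, 700, 1000],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "criterion": ["gini", "entropy"],
}

# Evaluate every combination with the same stratified fivefold CV.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",
    n_jobs=-1)
search.fit(X_train, y_train)   # X_train/y_train from the split above
print(search.best_params_)
```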
Table 2. Hyper-parameters and ranges
Algorithm | Hyper-parameter | Range/value |
|---|---|---|
SVM | C | 0.1–1000 |
Kernel | Linear, polynomial, RBF | |
Gamma | 1e-4–1e-1 | |
DT | Max depth | 1–32 |
Min samples split | 2–10 | |
Min samples leaf | 1–5 | |
Criterion | [“gini,” “entropy”] | |
KNN | K | 1–20 |
Distance metric | Euclidean, Manhattan | |
RF | Number of trees | 100–1000 |
Max features | sqrt(n_features), log2(n_features) | |
Min samples split/leaf | 2–10/1–5 | |
Criterion | [“gini,” “entropy”] |
Performance metrics
This study used a variety of performance metrics to assess the effectiveness of the machine learning models, including the correlation matrix, confusion matrix, ROC curve, precision, recall, specificity, False Positive Rate, False Negative Rate, Negative Predictive Value, Matthews Correlation Coefficient, F1 score, Balanced Accuracy, accuracy, Precision-Recall curve, and F1 Score vs. Threshold plot.
Correlation matrix
A correlation matrix is a table that displays the correlation coefficients between pairs of variables in a dataset. The correlation coefficient measures the linear relationship between two variables, typically ranging from −1 to 1.
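As an illustration, such a matrix can be computed and visualized directly from the feature DataFrame. A minimal sketch, assuming the pre-processed features `X` from the earlier sketches (the viridis colormap maps strong positive correlations to yellow and strong negative ones to dark blue):

```python
import matplotlib.pyplot as plt

# Pairwise Pearson correlation coefficients between all numerical features.
corr = X.corr()

plt.figure(figsize=(12, 10))
plt.imshow(corr, cmap="viridis", vmin=-1, vmax=1)
plt.colorbar(label="correlation coefficient")
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.tight_layout()
plt.show()
```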
Confusion matrix
This table visually represents a classification algorithm’s performance by comparing predicted classes against actual classes. It provides detailed insights into the model’s performance, including counts of True Positives (TPs), False Positives (FPs), True Negatives (TNs), and False Negatives (FNs) for each class.
Receiver operating characteristic curve
The ROC curve shows the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) across various threshold levels. It helps evaluate a classifier’s performance by highlighting its behavior under different classification thresholds.
Precision
Precision indicates the proportion of correct positive predictions among all positive predictions made by the classifier. It measures the classifier's accuracy in identifying positive instances without misclassifying negative ones. The formula for precision is:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$
Recall
Also known as sensitivity, recall measures the proportion of actual positive instances the classifier correctly identifies. It assesses the model’s ability to detect all positive instances, calculated using the formula:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
Specificity
Measures the proportion of actual negatives that are correctly identified. It can be measured using the following formula:
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3}$$
False positive rate
FPR measures the proportion of actual negatives that are incorrectly identified as positives using the formula:
$$\text{FPR} = \frac{FP}{FP + TN} \tag{4}$$
False negative rate
FNR measures the proportion of actual positives that are incorrectly identified as negatives. It can be calculated using Eq. (5)
$$\text{FNR} = \frac{FN}{FN + TP} \tag{5}$$
Negative predictive value
NPV measures the proportion of negative identifications that are correct using the following formula:
$$\text{NPV} = \frac{TN}{TN + FN} \tag{6}$$
Matthews correlation coefficient
A balanced measure that can be used even if the classes are of very different sizes.
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{7}$$
F1 score
The F1 score is the harmonic mean of precision and recall, offering a balanced measure that combines both metrics into a single value. The F1 score provides a more comprehensive view of a model’s performance, as it considers the classifier’s accuracy in predicting positive instances (precision) and its ability to detect all actual positive instances (recall). The formula to calculate the F1 score is:
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$
Balanced accuracy
It is an accuracy metric that balances the class imbalance.
$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2} \tag{9}$$
Accuracy
Accuracy measures the proportion of correctly classified instances (both true positives and true negatives) out of the total instances. It can be calculated using Eq. (10):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$
Precision-recall curve
This plots precision (positive predictive value) against recall (sensitivity). It is particularly useful for imbalanced datasets, where the positive class is rare. It helps evaluate a model’s performance in terms of its ability to identify the positive class.
F1 score vs. threshold plot
This plot shows how the F1 score changes as one adjusts the decision threshold. It combines precision and recall into a single metric, balancing them.
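For reference, all of the scalar metrics above can be derived from the four confusion-matrix counts. A minimal sketch (assuming binary labels with 1 as the positive class):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def classification_metrics(y_true, y_pred):
    """Compute the scalar metrics of Eqs. (1)-(10) from the confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity / TPR
    specificity = tn / (tn + fp)     # TNR
    mcc_den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "NPV": tn / (tn + fn),
        "MCC": (tp * tn - fp * fn) / mcc_den,
        "F1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```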
Comparative analysis
This study used a set of evaluation metrics collected during model assessment to compare the performance of four machine learning models—Support Vector Machine, Decision Tree, K-Nearest Neighbors, and Random Forest. By analyzing these metrics, we aimed to identify which algorithm performs best for cyber threat identification, gaining insights into the strengths and limitations of each technique. This comparison allows for a deeper understanding of the efficacy of different algorithms in terms of accuracy, precision, recall, F1 score, and other key performance indicators. It also highlights how each model handles the challenges associated with cyber threat detection, such as imbalanced datasets, overfitting, and scalability, providing guidance on which techniques are best suited for different scenarios in cybersecurity.
Explainable AI in malware detection
Explainable AI is a critical field that seeks to make the decision-making processes of machine learning models transparent and interpretable, especially in high-stakes domains like cybersecurity. In the context of malware detection, where machine learning models are often used to identify potentially harmful software, understanding why a model makes a particular prediction is as important as the prediction itself. This transparency not only builds trust in the model's decisions but also aids in refining and improving the model by providing insights into its behavior. In this study, two prominent XAI techniques have been employed: LIME and SHAP.
Local interpretable model-agnostic explanations
LIME is a popular technique for interpreting the predictions of any machine learning model, regardless of its complexity or underlying architecture. The core idea behind LIME is to approximate the behavior of a complex, “black-box” model with a simpler, more interpretable model, but only in the vicinity of a specific prediction. This localized approach allows a more intuitive understanding of why the model made a particular decision. The LIME algorithm operates through the following steps:
Generate perturbations: LIME begins by creating slight variations (or perturbations) of the input data around the point of interest. These perturbations help to explore the model’s behavior in the neighborhood of the original input.
Obtain predictions: The perturbed samples are then passed through the black-box model to obtain predictions. This provides a dataset of inputs and corresponding outputs reflecting the model’s local behavior.
Fit a simple model: LIME then fits a simple, interpretable model (such as a linear model) to this dataset of perturbed samples and their predictions. This simple model is intended to approximate the decision boundary of the complex model within the local region around the original input.
Explain the prediction: The simple model explains the original prediction, highlighting which features most influenced the black-box model’s decision-making process.
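In practice, these four steps are carried out internally by the lime library's tabular explainer. The following is a minimal sketch, assuming a fitted Random Forest classifier `rf` and the pre-processed DataFrames `X_train`/`X_test` from the earlier sketches (all names illustrative):

```python
from lime.lime_tabular import LimeTabularExplainer

# The explainer learns feature statistics from the training data, which it
# uses to generate perturbations around each instance to be explained.
explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["benign", "malware"],
    mode="classification")

# Steps 1-3 (perturbation, black-box prediction, surrogate fitting) happen
# inside explain_instance; step 4 is the returned explanation object.
exp = explainer.explain_instance(
    X_test.values[0], rf.predict_proba, num_features=6)
print(exp.as_list())   # feature conditions with their signed local weights
```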
Mathematically, for a given input x and a complex model f, LIME aims to find a simpler model g using Eq. (11):
$$\xi(x) = \operatorname*{arg\,min}_{g \in G} \; \mathcal{L}(f, g, \pi_{x}) + \Omega(g) \tag{11}$$
where $z'$ represents the perturbed samples in a local neighborhood around $x$, $\pi_{x}$ weights those samples by their proximity to $x$, $\mathcal{L}$ measures how poorly the interpretable model $g$ (drawn from a class $G$ of simple models) approximates $f$ on them, and $\Omega(g)$ penalizes the complexity of $g$.

SHapley additive explanations
SHAP is a widely used method for interpreting the predictions of machine learning models. It is grounded in cooperative game theory and aims to fairly allocate the contribution of each feature to the model’s prediction. The core concept behind SHAP is to treat each feature as a “player” in a game where the prediction is the “payout.” The SHAP value for a feature represents the average contribution of that feature across all possible combinations of features, ensuring a fair distribution of the prediction’s impact among the features. The SHAP algorithm follows these steps:
Compute the contribution of each feature: For a given input, SHAP calculates the contribution of each feature by considering the prediction difference when that feature is included or excluded from the model.
Aggregate contributions: SHAP averages the contributions over all possible subsets of features, ensuring that each feature’s importance is fairly assessed in different contexts.
Assign SHAP values: The final SHAP value assigned to each feature reflects its average contribution to the prediction, providing a clear and theoretically grounded measure of feature importance.
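For tree ensembles such as Random Forest, these steps are implemented efficiently by the shap library's TreeExplainer. A minimal sketch, again assuming a fitted classifier `rf` and test features `X_test` (names illustrative):

```python
import shap

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# For binary classifiers, older shap versions return one array per class;
# index 1 corresponds to the malware class in this encoding.
shap.summary_plot(shap_values[1], X_test)
```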
Mathematically, the Shapley value for a feature i is given by Eq. (12):
$$\phi_{i}(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (|N| - |S| - 1)!}{|N|!} \left[ v(S \cup \{i\}) - v(S) \right] \tag{12}$$
where:
N is the set of all features.
S is a subset of features that does not include i.
v(S) is the value function (e.g., the model’s prediction) for the feature subset S.
The SHAP values ensure that the contributions of each feature add up to the difference between the prediction for the instance and the average prediction over all the cases using Eq. (13).
$$f(x) = \phi_{0} + \sum_{i \in N} \phi_{i} \tag{13}$$
where $\phi_{0}$ is the average prediction over the dataset and $\phi_{i}$ are the Shapley values for each feature.

Advanced interaction-aware techniques
Explainable AI, while effective in providing explanations for model predictions, may not fully capture complex interactions between features due to its reliance on simpler surrogate models. To address this limitation, our study has incorporated interaction-aware techniques, such as Partial Dependence Plots, within the surrogate model to enhance the understanding of feature interplay and their combined impact on model predictions. Partial Dependence Plots are valuable tools that illustrate the relationship between a specific feature (or pair of features) and the predicted outcome while averaging out the effects of other features. PDPs offer insights into the influence of individual features on the model’s predictions, which is particularly important for ensuring the interpretability and transparency of complex models.
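As an illustration, scikit-learn can generate one-way and two-way PDPs directly from the trained model. A minimal sketch, assuming the fitted Random Forest `rf` and feature DataFrame `X_test` from the earlier sketches (the chosen feature names are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# One-way PDPs for two influential features, plus their two-way
# interaction (the tuple requests a 2D partial dependence surface).
PartialDependenceDisplay.from_estimator(
    rf, X_test,
    features=["static_prio", "nivcsw", ("static_prio", "nivcsw")])
plt.tight_layout()
plt.show()
```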
Performance evaluation and optimization of explainable AI techniques
This study evaluated the performance of LIME and SHAP in malware detection, focusing on calculating and minimizing latency. We employed early stopping and optimized vital parameters such as the Number of Perturbations, Kernel Width, and Number of Background Samples to achieve this. These optimizations aimed to reduce the latency of the explanations generated by these techniques, thereby enhancing their practical applicability in real-time malware detection systems.
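A sketch of how such latency measurements and parameter reductions might look in code; the parameter values here are illustrative, not the tuned settings:

```python
import time
from lime.lime_tabular import LimeTabularExplainer

# A narrower kernel width focuses the surrogate on the local region;
# fewer perturbations reduce latency at some cost in explanation fidelity.
explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["benign", "malware"],
    kernel_width=3.0,
    mode="classification")

start = time.perf_counter()
exp = explainer.explain_instance(
    X_test.values[0], rf.predict_proba,
    num_features=6, num_samples=1000)   # the lime default is 5000 samples
print(f"LIME latency: {time.perf_counter() - start:.2f} s")

# For SHAP's KernelExplainer, summarizing the background set, e.g. with
# shap.kmeans(X_train, 50), similarly reduces the number of model calls.
```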
Chi-square test in validating explanations of XAI
The Chi-Square test is a vital statistical tool for validating explanations in Explainable AI. By comparing observed and expected frequencies, it helps assess the significance of features highlighted by XAI methods like LIME and SHAP. This test checks that the features deemed important by these models are genuinely impactful in predicting outcomes, thus reinforcing the reliability of the explanations. High Chi-Square values and low p-values indicate strong associations between features and model predictions, enhancing the transparency of the model's decision-making process. Furthermore, the Chi-Square test enables the evaluation of consistency across different XAI techniques and provides a quantitative basis for interpreting the importance of features. This integration of statistical validation with XAI enhances the credibility and robustness of the model explanations, offering a clearer understanding of how features influence predictions.
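As an illustration, the association between an XAI-highlighted feature and the model's predictions can be tested with scipy. A minimal sketch, assuming the fitted model `rf` and test DataFrame `X_test` from the earlier sketches (the quartile binning is one illustrative choice):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Bin a top-ranked continuous feature into quartiles, cross-tabulate the
# bins against the predicted labels, and test the association.
binned = pd.qcut(X_test["static_prio"], q=4, duplicates="drop")
table = pd.crosstab(binned, rf.predict(X_test))
chi2, p, dof, _ = chi2_contingency(table)
print(f"Chi-square = {chi2:.2f}, p-value = {p:.4g}")
# A small p-value supports the claim that the feature genuinely drives
# the model's predictions, corroborating the LIME/SHAP explanation.
```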
Results and discussions
This section presents detailed results for all the Machine Learning algorithms employed in the proposed malware detection system, focusing on Explainable AI techniques. To ensure robust and efficient computations, the experiments used the premium version of Google Colab with GPU support, utilizing the Python programming language.
Correlation matrix for malware detection
Figure 5 displays the correlation matrix for the features in the dataset used for malware detection. The correlation matrix is a visual representation that shows the pairwise correlation coefficients between the different features. These coefficients range from −1 to 1:
1 indicates a perfect positive correlation between two features.
0 indicates no correlation.
-1 indicates a perfect negative correlation.
Fig. 5 [Images not available. See PDF.]
Correlation matrix for the features
Interpretation of correlation matrix
The diagonal elements represent the correlation of each feature with itself, which is always 1. The color bar on the right indicates the strength of the correlation: yellow represents a strong positive correlation (close to 1), green and blue shades represent weaker or negative correlations, and dark blue represents a strong negative correlation (close to −1).
Features like millisecond, state, total_vm, and stime may have different levels of correlations with other features. A high positive correlation between these features suggests that they move together, which could imply redundancy. For example, if total_vm and exec_vm show a high correlation, they might capture similar information. Conversely, features with low or negative correlations, such as mm_users and min_flt, are likely independent and provide unique information to the model.
Performance of ML techniques
This study employed a systematic approach to hyper-parameter tuning to reduce the risk of overfitting and enhance the performance of each algorithm—SVM, Decision Tree, KNN, and Random Forest. A range of hyper-parameter values was initially explored for each algorithm to determine the most effective configuration. The process involved evaluating the performance of different combinations of these hyper-parameters on the training dataset through cross-validation. After extensive testing, the hyper-parameters that resulted in the best balance between bias and variance were selected as the final values. Table 3 presents the range of hyper-parameters considered for each algorithm and the final values chosen for this study.
Table 3. Optimized hyper-parameters considered for each algorithm
Algorithm | Hyper-parameter | Range explored | Optimized value |
|---|---|---|---|
SVM | C | 0.1–1000 | 10 |
Kernel | Linear, polynomial, RBF | RBF | |
Gamma | 1e-4–1e-1 | 1e-4 | |
DT | Max depth | 1–32 | 20 |
Min samples split | 2–10 | 5 | |
Min samples leaf | 1–5 | 2 | |
Criterion | [“gini,” “entropy”] | gini | |
KNN | K | 1–20 | 5 |
Distance metric | Euclidean, Manhattan | Euclidean | |
RF | Number of trees | 100–1000 | 400 |
Max features | sqrt(n_features), log2(n_features) | sqrt(n_features) | |
Min samples split/leaf | 2–10/1–5 | 10 | |
Criterion | [“gini,” “entropy”] | gini |
The proposed malware detection model’s performance evaluation is depicted through the confusion matrix analysis in Fig. 6. This confusion matrix provides a detailed breakdown of the model’s predictions, illustrating its accuracy in classifying malware and benign samples.
Fig. 6 [Images not available. See PDF.]
Confusion matrix of the ML models
The confusion matrix for the Support Vector Machine model, focusing on the benign class, shows that the model accurately classified 50.32% of the benign samples as true positives and had zero false positives, meaning no malware samples were wrongly identified as benign. However, the model had a high rate of false negatives, with 23.99% of the benign samples misclassified as malware. The true negative rate, where the model correctly identified the malware as not benign, was 25.69%. These results suggest that while the SVM model effectively avoids false positives, it struggles with misclassifications, leading to a high rate of false negatives.
The confusion matrix for the Decision Tree model, focusing on the benign class, shows that the model correctly identified 50.31% of the benign samples as true positives and had an extremely low false positive rate of 0.01%, indicating only a tiny proportion of malware samples were misclassified as benign. The model had no false negatives, meaning it did not misclassify any benign samples as malware, demonstrating high accuracy in detecting benign instances. The true negative rate, where malware samples are correctly identified as not benign, stood at 49.68%. These results suggest that while the Decision Tree model effectively identifies benign samples, there’s a small risk of false positives and a moderate ability to detect malware.
The confusion matrix for the K-Nearest Neighbors model reveals the following for the benign class: the model achieved a true positive rate of 50.06%, correctly identifying benign samples. However, the false positive rate was 0.26%, indicating that a small fraction of malware was mistakenly classified as benign. The false negative rate was also 0.31%, showing that a few benign samples were incorrectly labeled as malware. The true negative rate, which reflects the correct classification of malware as not benign, was 49.37%. These results suggest that while KNN has a solid ability to recognize benign samples, its false positive and false negative rates point to areas for improvement.
The confusion matrix for the Random Forest model reveals its performance on the benign class. The model achieved a True Positive rate of 50.32%, indicating that this percentage of benign samples was correctly identified as benign. It demonstrated zero False Positives, suggesting that no malware samples were incorrectly classified as benign. The False Negatives were also zero, meaning no benign samples were misclassified as malware. The True Negatives stood at 49.68%, indicating that this percentage of malware samples was accurately identified as non-benign. The Random Forest model has a high rate of accuracy when identifying benign samples (100% in this case), with no errors in false positives or false negatives. This suggests that the model effectively differentiates between benign and malware samples, leading to high reliability in benign detection.
The ROC curve, depicted in Fig. 7, illustrates the classification performance of the machine learning models used in this study. The AUC values for the evaluated models are as follows: Support Vector Machine achieved 0.7586, Decision Tree and Random Forest scored a perfect 1.0, and K-Nearest Neighbors scored 0.9943. An AUC of 1.0 represents ideal classification, while a value closer to 0.5 indicates random guessing. These results suggest that the Decision Tree, Random Forest, and KNN models demonstrate excellent classification capabilities, with Random Forest and Decision Tree achieving perfect scores. In contrast, SVM has a lower but reasonable AUC, indicating room for improvement.
Fig. 7 [Images not available. See PDF.]
ROC curve of the ML models
Figure 8 represents the Precision-Recall Curve of the ML models. The curve for SVM shows that the model maintains moderate precision while recall varies across thresholds. The curve indicates that as recall increases, precision slightly decreases. The SVM model tries to balance precision and recall, but there is a trade-off. It performs decently, but there are some challenges in maintaining high precision and high recall, reflected in the relatively moderate AUC. The precision-recall curve for the Decision Tree is close to ideal, with precision and recall being very high across various thresholds. The DT model performs exceptionally well, maintaining high precision and recall, indicating that it makes few mistakes in false positives and negatives. The KNN model’s curve is similar to the Decision Tree’s, with high precision and recall maintained across various thresholds. The KNN algorithm performs strongly with high precision and recall, making it a reliable model for this classification task. The curve suggests the model effectively distinguishes between classes with few false positives or false negatives. The Random Forest model displays an almost perfect precision-recall curve, with precision and recall near 100% across thresholds. The Random Forest algorithm excels at this classification task, showing high precision and recall. This suggests the model is highly accurate and reliable, with minimal false positives and negatives.
Fig. 8 [Images not available. See PDF.]
Precision-recall curve of the ML models
Figure 9 represents the F1 Score vs. Threshold Plots for various machine learning models. The F1 score for SVM peaks at a particular threshold, representing the optimal balance between precision and recall. As the threshold moves away from this point, the F1 score decreases. This behavior suggests that the SVM model has a specific threshold at which it performs best. Beyond this optimal threshold, the model either becomes more conservative, prioritizing precision, or more aggressive, prioritizing recall, which results in a decline in the F1 score. The F1 score for the Decision Tree remains consistently high across a wide range of thresholds, indicating robust performance. This model does not rely heavily on threshold adjustments to achieve optimal results. The F1 score remains near its maximum regardless of the threshold, highlighting the model’s stable and reliable classification performance. Like the Decision Tree, the KNN model’s F1 score remains consistently high across a broad range of thresholds. This stability indicates that the model’s classification quality does not significantly degrade as the threshold changes, making it a reliable option with robust performance across different thresholds. The Random Forest model achieves nearly perfect F1 scores across all thresholds, indicating optimal and stable performance. This consistency suggests that the Random Forest model is the most reliable among the four, as it consistently balances precision and recall to maintain peak performance across all thresholds.
Fig. 9 [Images not available. See PDF.]
F1 Score vs. threshold plots of the ML models
Comparative analysis
Table 4 comprehensively compares different machine learning classifiers used for malware detection. It assesses their effectiveness across various performance metrics, including precision, recall, specificity, FPR, FNR, NPV, MCC, F1 score, balanced accuracy, and overall accuracy.
Table 4. Comparative analysis of ML algorithms
Model | Precision (%) | Recall (%) | Specificity (%) | FPR (%) | FNR (%) | NPV (%) | MCC | F1 Score (%) | Balanced Accuracy (%) | Accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|---|
SVM | 84 | 76 | 87 | 0 | 48.29 | 67.72 | 0.5917 | 74 | 75.86 | 75.85 |
Decision tree | 100 | 99 | 99.98 | 0.02 | 0 | 100 | 0.9998 | 99 | 99.99 | 99.99 |
KNN | 99 | 99 | 99.48 | 0.52 | 0.62 | 99.38 | 0.9886 | 99 | 99.43 | 99.42 |
Random forest | 100 | 100 | 100 | 0 | 0 | 100 | 1 | 100 | 100 | 100 |
In malware detection, the SVM model exhibits a precision of 84% and a recall of 76%. This means that while the model is relatively good at identifying true malware instances, it misses some, as reflected in the F1 score of 74%. The specificity of 87% indicates the model’s effectiveness in correctly identifying benign software. However, the high false negative rate of 48.29% suggests it may struggle to detect certain malware types, potentially allowing them to pass as benign. The overall accuracy of 75.85% and MCC of 0.5917 highlight the SVM’s moderate performance in distinguishing between malware and benign files, making it somewhat reliable but not optimal.
The Decision Tree model shows near-perfect performance, with a precision of 100% and a recall of 99%, ensuring almost all detected malware instances are accurate. The F1 score of 99% further supports its effectiveness. With a specificity of 99.98% and a minimal FPR of 0.02%, the model is highly reliable in identifying benign software and minimizing false alarms. The balanced accuracy of 99.99% and the MCC of 0.9998 reflect the model’s robust capability to accurately classify malware and benign files, making it highly effective for malware detection.
The KNN model also performs exceptionally well in malware detection, with a precision and recall of 99%, indicating its strong ability to identify malware correctly. The specificity of 99.48% and a low FPR of 0.52% suggest that KNN is highly accurate in distinguishing benign files from malware, though slightly less so than the Decision Tree. The model’s F1 score of 99%, balanced accuracy of 99.43%, and MCC of 0.9886 indicate robust and consistent performance, making it a reliable choice for malware detection.
The Random Forest model achieves perfect scores across all metrics, with 100% precision, recall, specificity, and accuracy. This suggests that the model is flawless in detecting malware and benign files, with no false positives or negatives. The MCC of 1 highlights the model’s perfect balance between sensitivity and specificity, ensuring the highest level of reliability in malware detection.
In summary, while all models show varying degrees of effectiveness in malware detection, the Decision Tree and Random Forest classifiers stand out for their near-perfect or perfect performance. While moderately effective, SVM may require further tuning or additional features to improve its detection capabilities, especially in reducing the false negative rate. KNN also demonstrates strong performance, making it a viable option for malware detection tasks.
Explainable AI in malware detection
The SHAP and LIME Explainable AI techniques provide deep insights into how various features contribute to classifying a process as malware. We applied LIME and SHAP to the highest-accuracy ML model, Random Forest. These visualizations help explain the decision-making process of the Random Forest model by highlighting the importance and influence of specific features on the prediction.
LIME explanation for malware detection
This LIME plot in Fig. 10 provides a local interpretation of how a machine-learning model classifies a particular instance as malware with a probability of 1.00. The plot highlights the most influential features in the decision-making process, showing the feature values and their contributions to the prediction.
Fig. 10 [Images not available. See PDF.]
The explanation of the random forest model using the LIME algorithm
General structure
Prediction probabilities: The plot shows the model's predicted probabilities for the instance: benign (0.00) and malware (1.00). The orange bar indicates the likelihood of the instance being classified as malware, which is 1.00 (100%).
Bar chart (right side): The bar chart on the right visualizes the impact of each feature on the model’s prediction.
Orange bars: Features that push the prediction towards malware.
Blue bars: Features that push the prediction towards benign.
Feature list (bottom left): The feature list displays the names and values of the most important features contributing to the prediction.
Explanation of each feature with domain insights
static_prio <= 14352.00 (Orange, Malware)
Explanation
The static_prio feature refers to a process’s static priority. Malware often manipulates process priorities to gain higher access to CPU resources, enabling it to execute malicious activities more efficiently without being preempted by other processes.
Domain insight
In cybersecurity, elevated or altered process priorities are a common tactic malware uses to ensure that it can perform its operations without interruption. This is particularly relevant when malware seeks to hijack system resources for activities such as cryptojacking or DDoS attacks.
Importance
High. The orange bar shows a significant positive contribution toward the malware classification. A lower static_prio value (as seen here) could suggest that the process has been prioritized in a way that is consistent with malware behavior.
maj_flt <= 114.00 (Blue, Benign)
Explanation
Maj_flt refers to the number of major page faults that occur when a process accesses non-resident memory pages. A lower number of major faults is typically associated with benign processes that efficiently manage memory.
Domain insight
Benign processes are usually well-optimized, and memory usage is managed effectively, resulting in fewer page faults. Malware, on the other hand, may not be optimized for efficiency, leading to higher page faults.
Importance
Moderate. The blue bar indicates that the low number of maj_flt pushes the prediction slightly toward benign. However, this influence is outweighed by other features pushing the classification towards malware.
shared_vm <= 114.00 (Blue, Benign)
Explanation
The shared_vm feature measures the number of shared memory pages. Malware may often use shared memory to communicate between processes or inject code into legitimate processes.
Domain insight
Low values in shared memory usage could indicate typical benign behavior. Malware often exploits shared memory to hide its activities or to escalate privileges, but in this instance, the low shared_vm suggests less suspicious behavior.
Importance
Moderate. In this case, the blue bar shows that the lower value of shared_vm supports a benign classification, suggesting typical memory usage. However, stronger malware indicators override this effect.
end_data <= 114.00 (Blue, Benign)
Explanation
End_data represents the code segment’s end address. In benign processes, this value typically falls within a specific range.
Domain insight
Consistent memory allocation patterns are often a hallmark of benign processes. Malware may manipulate or obfuscate memory allocations to avoid detection, but the regularity here indicates benign behavior.
Importance
Moderate. The blue bar suggests that the process’s memory allocation (as indicated by end_data) aligns with benign behavior, but it is not enough to counter the features suggesting malware.
nivcsw > 45.00 (Orange, Malware)
Explanation
Nivcsw refers to the number of involuntary context switches. A higher number of context switches might indicate suspicious process behavior, such as frequent CPU relinquishing, which can signify process manipulation by malware.
Domain insight
Frequent context switches can occur when malware runs covert processes or interacts with other processes to avoid detection. This behavior indicates sophisticated malware that tries to blend in with regular system activity.
Importance
Low to moderate. The orange bar indicates a contribution towards malware classification, but its influence is less pronounced than other features.
free_area_cache > 4.00 (Orange, Malware)
Explanation
Free_area_cache represents the size of the first available memory hole. A larger cache size might indicate suspicious memory allocation practices, often associated with malware.
Domain insight
Irregularities in memory allocation, such as unusually large or small holes, can be a sign of memory corruption or manipulation by malware. This feature is essential in detecting advanced threats that exploit memory vulnerabilities.
Importance
Low. The orange bar shows a minor contribution to the malware classification. While not a significant factor, it supports the overall prediction of malware.
vm_truncate_count <= 12648.00 (Orange, Malware)
Explanation
This feature is associated with managing virtual memory areas. A low count might indicate that the process has less typical memory management, which malware could exploit.
Domain insight
Malware might manipulate virtual memory management to avoid detection or to execute payloads. Anomalies in this area, such as an unusually low vm_truncate_count, could be a red flag for malicious activity.
Importance
Low. The contribution towards malware is relatively minor, indicating that while the process’s memory management is somewhat suspicious, it is not the most influential factor.
utime <= 378137.00 (Blue, Benign)
Explanation
Utime represents the user time for a process. A lower utime might suggest that the process has not utilized much CPU time, which is more common in benign processes.
Domain insight
Processes with low CPU usage are often benign, performing minimal background tasks or being idle. Malware might also appear idle to avoid detection, but combined with other features, a low utime suggests benign behavior.
Importance
Low. The blue bar slightly pushes toward benign, indicating that the process’s CPU usage aligns with typical benign behavior. However, this influence is minimal in the final prediction.
nivcsw <= 341974.00 (Orange, Malware)
Explanation
The second occurrence of nivcsw here with a higher threshold suggests that excessive context switching could indicate suspicious process activity.
Domain insight
High context switching might indicate a process trying to evade detection or perform malicious activities while maintaining a low profile. This can be a technique used by rootkits or other stealthy malware.
Importance
Very Low. The contribution is minimal, adding only a slight push towards malware, showing that context switching alone is not a strong indicator.
stime <= 4.00 (Orange, Malware)
Explanation
Stime refers to the system time used by a process. A lower value might suggest minimal system engagement, typical of certain types of malware that aim to avoid detection by staying idle.
Domain insight
Malware that stays dormant or performs tasks without heavily engaging the system can be harder to detect. The low stime might suggest that this process attempts to remain under the radar, a common tactic in advanced persistent threats (APTs).
Importance
Very Low. The contribution towards malware is minimal, suggesting that this feature alone is not a strong indicator but supports the overall malware classification.
This LIME explanation plot provides a detailed breakdown of the factors influencing the model’s decision, showing the impact of individual features and how they combine to form the final prediction of the instance as malware. Integrating domain-specific insights helps to connect the model’s behavior to real-world cybersecurity practices, enhancing the interpretability of the results.
SHAP force plot explanation for malware detection
This SHAP force plot in Fig. 11 visualizes the contributions of individual features to a specific prediction made by the Random Forest model in malware detection. Below is a plot breakdown, including domain-specific insights into the significance of each feature.
Fig. 11 [Images not available. See PDF.]
The explanation of the random forest model using the SHAP algorithm
General structure
Base value: The base value is the average model output across the training dataset, representing the starting point before considering the specific features of this instance. It reflects the model’s baseline confidence in predicting malware across all the cases.
Red arrows represent features that increase the prediction probability towards 1 (indicating malware).
Blue arrows represent features that decrease the prediction probability towards 0 (indicating benign).
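A comparable hedged sketch for generating a force plot like Fig. 11 with the shap library is given below; `rf` and `X_test` are assumed to be the fitted Random Forest and the test matrix, and the per-class indexing reflects older shap releases, which return one array of SHAP values per class.

```python
import shap

explainer = shap.TreeExplainer(rf)           # exact SHAP values for tree models
shap_values = explainer.shap_values(X_test)  # list with one array per class

i = 0                                        # index of the instance to explain
shap.force_plot(
    explainer.expected_value[1],  # base value: average output for the malware class
    shap_values[1][i],            # per-feature contributions for instance i
    X_test.iloc[i],               # the instance's feature values
    matplotlib=True,              # render as a static matplotlib figure
)
```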
Explanation of each feature with domain-specific insights
shared_vm = 0.732 (Red, Positive Contribution)
Explanation
The shared_vm feature represents the number of memory pages shared between different processes. Malware might utilize shared memory with legitimate processes to hijack or interfere with their execution, thereby hiding its activities.
Domain insight
Malware often exploits memory sharing to inject code into legitimate processes or communicate between malicious components. High shared memory usage could indicate attempts to subvert normal process execution, a hallmark of sophisticated malware.
Importance
High. This feature significantly contributes to the model’s decision to classify the instance as malware, indicating suspicious memory-sharing activity.
millisecond = 0.7081 (Red, Positive Contribution)
Explanation
This feature likely represents the timing of the process analysis. Malware might be executed at specific times to avoid detection or to align with the system’s vulnerabilities.
Domain insight
Timing-based attacks, such as those that trigger during specific system states or low system monitoring periods, are common strategies used by malware to evade detection. Aligning the process’s timing with known malicious patterns significantly raises suspicion.
Importance
High. The significant positive contribution suggests that the process’s timing correlates with known patterns of malicious activity.
static_prio = 0.3745 (Red, Positive Contribution)
Explanation
Static_prio indicates a process’s static priority. Malware might manipulate process priority to control system resources, ensuring the process can execute without interference.
Domain insight
Malware often exhibits elevated or altered process priorities as it attempts to prioritize its execution over other system processes. This tactic is crucial when the malware needs consistent access to CPU resources, such as during data exfiltration or crypto mining operations.
Importance
Moderate. The positive contribution suggests that the process’s priority is suspicious and aligns with behaviors often associated with malware.
nivcsw = 0.156 (Red, Positive Contribution)
Explanation
Nivcsw refers to the number of involuntary context switches. Malware might manipulate context switching to control CPU usage and avoid detection.
Domain insight
High numbers of involuntary context switches can indicate that a process is frequently relinquishing control of the CPU. Malware might use this tactic to perform tasks sporadically and avoid detection by security systems that monitor continuous CPU usage.
Importance
Moderate. Although its contribution is smaller, a higher nivcsw could signal unusual process behavior indicative of malware.
end_data = 0.5987 (Red, Positive Contribution)
Explanation
End_data represents the end address of the code segment. Malware might manipulate this to control how memory is used and allocated, possibly to hide its presence.
Domain insight
Manipulating memory addresses, especially the end of code segments, is a common tactic in advanced malware that seeks to control or obfuscate memory usage patterns. This can be particularly important in avoiding detection by memory analysis tools.
Importance
Moderate. The positive contribution indicates that the process’s memory usage pattern is suspicious, pointing to potential malware activity.
vm_truncate_count = 0.8662 (Red, Positive Contribution)
Explanation
This feature is associated with managing virtual memory areas (VMAs). Malware might manipulate memory management to evade detection by security mechanisms.
Domain insight
Memory management, particularly involving the handling of virtual memory areas, is a critical target for malware. Malware can attempt to evade detection or exploit system vulnerabilities by controlling or truncating VMAs. This feature’s high value suggests that the process is engaging in abnormal memory management.
Importance
Very High. The substantial positive contribution makes this one of the most critical features indicating malware, highlighting significant abnormal behavior in memory management.
utime = 0.05808 (Red, Positive Contribution)
Explanation
Utime represents the user time consumed by the process. Malware might manipulate or limit user time to minimize its footprint and avoid drawing attention.
Domain insight
Processes that exhibit very low user time could indicate malware designed to execute quickly and quietly. This low activity level can be a deliberate strategy to avoid detection by systems that monitor prolonged CPU usage.
Importance
Low. While the positive contribution raises some suspicion, its relatively minor impact suggests it is not a dominant factor in malware classification.
free_area_cache = 0.156 (Blue, Negative Contribution)
Explanation
Free_area_cache represents the size of the first available memory hole in the address space. A typical value might suggest that memory usage patterns are consistent with benign processes.
Domain insight
Normal memory allocation patterns, such as typical memory holes, usually indicate benign processes. Malware might exploit unusually large or small memory holes to inject itself into memory, but in this case, the free_area_cache value aligns with expected benign behavior.
Importance
Moderate. The negative contribution suggests that this feature is more aligned with benign behavior, pushing against the malware classification.
maj_flt = 0.9507 (Blue, Negative Contribution)
Explanation
Maj_flt refers to the number of major page faults that occur when a process accesses non-resident memory pages. A high number of page faults could indicate typical benign processes engaging in normal memory management.
Domain insight
Benign processes might experience high page faults when accessing non-resident memory pages, especially during heavy computational tasks. This is usually considered normal behavior and not necessarily indicative of malicious activity.
Importance
High. The substantial negative contribution suggests that the process’s memory access patterns are typical of benign activity, making this feature a key argument against malware classification.
fs_excl_counter = 0.9507 (Blue, Negative Contribution)
Explanation
This feature tracks the file system’s use of exclusive resources. Malware might try to monopolize resources to avoid detection or to prevent other processes from interfering with its activities.
Domain insight
Benign processes share resources efficiently and do not monopolize them. A high fs_excl_counter could indicate regular file system activity, countering the suspicion of malware.
Importance
High. The strong negative contribution indicates that the resource usage aligns with benign processes, effectively counterbalancing other features and pushing the prediction toward benign.
This SHAP force plot effectively demonstrates how various features influence the model’s final prediction, providing transparency in the decision-making process for malware detection. By integrating domain-specific insights, the explanation highlights which features are most influential and clarifies why these features impact the model’s decisions, connecting the model’s behavior to established cybersecurity knowledge.
Decision-making of LIME and SHAP
The LIME explanation breaks down the decision process by showing the most influential features for this specific prediction. In this case, the most significant features pushing the prediction toward malware are static_prio, nivcsw, and vm_truncate_count. These features align with known malware behaviors, such as process prioritization, context switching, and memory management manipulation. Features like maj_flt, shared_vm, and end_data suggest benign behavior but are not strong enough to counter the overall malware prediction. Features such as utime and stime have minimal influence on the final decision but collectively support the overall prediction.
The SHAP force plot illustrates how each feature pushes the prediction toward malware or benign classification. Critical features like shared_vm, vm_truncate_count, and millisecond have a significant positive contribution, indicating suspicious behavior in memory usage, process priority, and timing—common malware indicators. Conversely, features like maj_flt and fs_excl_counter show substantial negative contributions, suggesting patterns typical of benign processes. This plot provides a comprehensive view of how individual features interact to affect the model’s final decision, with shared_vm and vm_truncate_count being critical in pushing the prediction toward malware.
In both visualizations, domain-specific knowledge enhances the interpretation. For instance, high shared_vm values may indicate malicious memory-sharing practices, while unusual vm_truncate_count suggests potential memory management manipulation by malware. Conversely, typical maj_flt values may indicate benign process behavior, providing a solid counterpoint against the malware classification. These insights bridge the gap between raw feature importance and real-world implications in cybersecurity.
Interaction-aware techniques in malware detection
LIME creates local surrogate models to explain individual predictions. However, LIME might not fully capture the interactions between features because it relies on simpler, linear surrogate models that assume feature independence. PDPs help visualize how each feature independently influences the model’s predictions. In real-world scenarios, however, features often interact in complex ways that LIME’s local linear approximations may not fully capture. By incorporating interaction-aware techniques such as PDPs, we can better understand how features interact. For instance, if there is an interaction between static_prio and shared_vm, a LIME model might miss it, but a PDP could reveal the nature of this interaction, indicating whether certain combinations of feature values lead to a higher or lower likelihood of being classified as malware. The Partial Dependence Plots in Fig. 12 illustrate the marginal effect of three specific features (static_prio, maj_flt, and shared_vm) on the predicted probability of the Random Forest model identifying a sample as malware. These PDPs provide insights into how each feature influences the model’s prediction independently of other features, which is particularly useful alongside explainable AI techniques like LIME.
Static priority (static_prio)
Fig. 12 [Images not available. See PDF.]
Partial dependence plots of LIME in malware detection
This feature represents the static priority assigned to a process. In malware detection, a high static_prio might indicate a process that has been given undue priority, possibly because the malware is trying to ensure its execution over other processes. The plot shows that the predicted probability of the instance being malware is higher when static_prio is lower, particularly when it is below 20,000. This suggests that lower priority values are more strongly associated with malware, potentially because these might be less critical, less scrutinized processes that malware might exploit. As static_prio increases beyond 20,000, there is a steep drop in the probability of the instance being malware, indicating that higher priority values are associated with benign processes, likely because such processes are essential and less likely to be compromised by malware.
Major page faults (maj_flt)
This feature represents the number of major page faults that occur when a process needs to access a part of memory that is not currently in physical RAM, requiring loading from disk. Frequent major page faults could indicate abnormal memory usage patterns. The PDP shows a linear decrease in the predicted probability of malware as the value of maj_flt increases. This suggests that more major page faults are associated with benign behavior, possibly because normal, legitimate processes are more likely to cause page faults during their operation. The negative slope indicates that fewer major page faults might indicate more efficient or stealthy (potentially malicious) memory usage, hence a higher probability of being classified as malware.
Shared virtual memory (shared_vm)
This feature refers to the amount of virtual memory shared between processes. Malware might share memory with other processes to either manipulate them or hide its operations. Similar to maj_flt, the PDP for shared_vm also shows a decreasing trend, where higher shared memory values correspond to a lower likelihood of the instance being malware. This trend suggests that extensive memory sharing is more characteristic of benign processes, which might frequently share memory for legitimate reasons like inter-process communication. On the other hand, lower memory sharing might be more typical of malware trying to operate independently to avoid detection.
Incorporating PDPs alongside LIME explanations provides a more comprehensive view of feature importance and interactions, particularly in complex models used for malware detection. While LIME gives us valuable local explanations, PDPs complement this by offering insights into the global behavior of the model, helping to ensure that the model’s predictions are interpretable, reliable, and aligned with domain-specific knowledge in cybersecurity.
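As a brief illustration, PDPs like those in Fig. 12 can be reproduced with scikit-learn's model inspection utilities, assuming the fitted Random Forest `rf` and a training DataFrame `X_train` whose columns include the three features; this is a sketch rather than the authors' exact plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Marginal effect of each feature on the predicted malware probability
PartialDependenceDisplay.from_estimator(
    rf,
    X_train,
    features=["static_prio", "maj_flt", "shared_vm"],
)
plt.show()
```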
Performance evaluation and optimization of explainable AI techniques
We have evaluated the time taken to generate explanations using LIME and SHAP in the context of malware detection. Our findings indicate that the time taken for LIME to create an explanation is approximately 0.1807 s, while SHAP takes approximately 0.3109 s. We have taken significant steps to minimize latency while ensuring the system remains efficient for real-time threat detection. Specifically, we optimized both methods as follows (a configuration sketch appears after the list):
Number of perturbations: We reduced the number of perturbations generated by LIME to 500 samples. This reduction was carefully chosen to balance the trade-off between speed and explanation accuracy. Through empirical testing, we found that 500 samples provided a reliable explanation with significantly reduced latency.
Kernel width: The kernel width was also adjusted to fine-tune the locality of the surrogate model, further enhancing the efficiency without sacrificing the interpretability of the results.
Number of background samples: For SHAP, we optimized the number of background samples used to calculate the Shapley values, setting it to 332. This configuration was selected after extensive testing, demonstrating that it provides accurate explanations with lower latency, making the approach suitable for real-time applications.
Early stopping: Early stopping criteria were implemented to halt computations once sufficient explanation accuracy was achieved, preventing unnecessary processing and further reducing latency.
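The configuration sketch below reflects these optimizations. The 500 perturbations and 332 background samples come from the text; the kernel width shown and all variable names are illustrative assumptions, and the early-stopping criterion is left out because it would be custom logic rather than a library feature.

```python
import shap
from lime.lime_tabular import LimeTabularExplainer

# LIME: fewer perturbations and a tuned kernel width to reduce latency.
lime_explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=feature_names,
    class_names=["benign", "malware"],
    kernel_width=3.0,                # illustrative value, tuned empirically
)
lime_exp = lime_explainer.explain_instance(
    x_instance,
    rf.predict_proba,
    num_samples=500,                 # reduced from the lime default of 5000
)

# SHAP: summarize the background set to 332 samples before computing values.
background = shap.sample(X_train, 332, random_state=42)
shap_explainer = shap.KernelExplainer(rf.predict_proba, background)
shap_values = shap_explainer.shap_values(x_instance)
```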
By applying these optimizations, we have successfully minimized the latency of both LIME and SHAP, thereby enhancing the practicality and applicability of our Explainable AI approach in malware detection. This approach ensures that the explanations are generated quickly enough to be usable in real-time while maintaining the high level of interpretability necessary for effective decision-making.
Real-life testing of the proposed model
After developing the proposed system, we conducted real-life testing to evaluate the effectiveness of our Random Forest and Decision Tree models in detecting malware. Initially, the models showed exceptional performance during the training phase, with the Random Forest model achieving 100% accuracy and the Decision Tree model achieving 99.99% accuracy.
We collected a dataset specifically designed for malware classification to further validate our model. This dataset was provided by the cybersecurity experts at the ICT Cell of Shanto-Mariam University of Creative Technology, Dhaka, Bangladesh. The data collection was ethically approved by the IT Director of the ICT Cell, ensuring compliance with ethical standards.
The dataset comprised 20 records, each with 35 features relevant to malware detection. Out of these 20 records, five samples are illustrated in Fig. 13, demonstrating the diversity and complexity of the dataset. We then tested our top-performing models—Random Forest and Decision Tree—on this unseen data. The Random Forest model correctly classified 18 out of the 20 samples, achieving a classification accuracy of 90%. On the other hand, the Decision Tree model correctly classified 15 out of the 20 samples, resulting in a classification accuracy of 75%. These results demonstrate the robustness of the Random Forest model, which maintained high accuracy even when tested on new, unseen data. While slightly less accurate, the Decision Tree model still showed considerable efficacy in malware classification. This real-life testing highlights the practical applicability of our models in real-world cybersecurity scenarios.
Fig. 13 [Images not available. See PDF.]
Samples of the real-life testing dataset
We have implemented the Chi-Squared test to validate the results we obtained from LIME and SHAP in this new dataset. Table 5 highlights the top 10 features determined by the Chi-Squared test, which is essential for understanding their relationship with classifying malware versus benign files. Each feature listed has been evaluated based on its Chi-Squared statistic and corresponding P-value, indicating the strength of its association with the target variable. The Chi-Squared statistic measures how a feature’s observed frequency of occurrences relates to the expected frequency if there is no association between the feature and the target variable. A higher Chi-Squared value indicates a stronger association between the feature and the target variable. The P-value is the probability value associated with the Chi-Squared statistic. It tells the likelihood of observing the data (or something more extreme) if the null hypothesis is true (i.e. if there is no association between the feature and the target). A very small P-value (typically < 0.05 or < 0.01) indicates strong evidence against the null hypothesis, suggesting that the feature is significantly associated with the target variable.
static_prio: With a Chi-Squared value of 1237.12 and an exceedingly low P-value (5.215309e-271), this feature demonstrates a very strong association with file classification. A process’s static priority may influence its scheduling and execution behavior, making it a crucial factor for distinguishing between benign and malicious processes.
vm_truncate_count: This feature has a Chi-Squared value of 669.57 and a P-value of 1.237111e-147. A high truncate count in virtual memory could indicate potential malicious behavior, as malware may frequently manipulate memory allocations to evade detection.
maj_flt: The number of major page faults is significant, with a Chi-Squared value of 422.48 and a P-value of 7.040616e-94. Major faults occur when a process accesses a page not currently in memory, and a high frequency may suggest resource-heavy behavior typical of malware.
hiwater_rss: This feature shows a Chi-Squared value of 254.55 and a P-value of 2.643911e-57. The high-water mark of resident set size (RSS) reflects the maximum amount of memory used, which may reveal unusual memory usage patterns associated with malware.
nivcsw: With a Chi-Squared value of 240.31 and a P-value of 3.361705e-54, the number of involuntary context switches is significant. High context switch rates can indicate heavy resource usage or multitasking behaviors commonly seen in malicious processes.
total_vm: This feature has a Chi-Squared value of 216.63 and a P-value of 4.904289e-49. The total virtual memory allocated to a process is crucial for identifying potentially malicious activity, as excessive memory usage may be characteristic of certain types of malware.
end_data: With a Chi-Squared value of 206.47 and a P-value of 8.093773e-47, the end of the data segment can provide insights into memory segmentation, which malware might exploit to execute its payload.
shared_vm: This feature has a Chi-Squared value of 187.55 and a P-value of 1.090264e-42. The amount of shared virtual memory can help identify suspicious inter-process communication, often a hallmark of malware trying to evade detection.
millisecond: With a Chi-Squared value of 142.71 and a P-value of 6.787172e-33, the running time in milliseconds is significant. Processes that exhibit unusual execution times can be indicative of malicious activity.
usage_counter: This feature shows a Chi-Squared value of 131.17 and a P-value of 2.273138e-30. A high usage counter may suggest excessive resource utilization by a process, potentially pointing towards malware attempting to exploit system resources.
Table 5. Chi-square test
| Features | Chi-squared value | P-value |
|---|---|---|
| static_prio | 1237.12 | 5.215309e-271 |
| vm_truncate_count | 669.57 | 1.237111e-147 |
| maj_flt | 422.48 | 7.040616e-94 |
| hiwater_rss | 254.55 | 2.643911e-57 |
| nivcsw | 240.31 | 3.361705e-54 |
| total_vm | 216.63 | 4.904289e-49 |
| end_data | 206.47 | 8.093773e-47 |
| shared_vm | 187.55 | 1.090264e-42 |
| millisecond | 142.71 | 6.787172e-33 |
| usage_counter | 131.17 | 2.273138e-30 |
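As a rough guide, a ranking like Table 5 can be produced along the following lines with scikit-learn; `X` and `y` stand for the feature DataFrame and binary labels, and the Min-Max scaling step is an assumption made because chi2 requires non-negative inputs.

```python
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(X)   # chi2 requires non-negative input
chi2_stats, p_values = chi2(X_scaled, y)     # y: 1 = malware, 0 = benign

ranking = (
    pd.DataFrame({"feature": X.columns, "chi2": chi2_stats, "p_value": p_values})
    .sort_values("chi2", ascending=False)
    .head(10)                                # top-10 features, as in Table 5
)
print(ranking)
```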
The results indicate that the top features identified by the Chi-Squared test are highly significant in predicting the classification of malware versus benign files. The extremely low P-values for all features provide strong evidence against the null hypothesis, indicating that these features are strongly related to the classification outcome. By implementing the Chi-Squared test, we have validated our results from LIME and SHAP, reinforcing the importance of these features in our analysis. Furthermore, consultations with cybersecurity experts confirmed that these features are the most significant for malware detection. Incorporating these significant features into machine learning models for malware detection could therefore enhance model performance and accuracy. By focusing on these features, analysts and data scientists can better understand the underlying behaviors associated with malware and develop more effective detection strategies.
Discussion
In this study, we employed four machine learning algorithms (Support Vector Machine, Decision Tree, K-Nearest Neighbors, and Random Forest) to detect malware using a comprehensive dataset sourced from Kaggle. Among these, the Random Forest model demonstrated superior performance, achieving 100% accuracy across all evaluation metrics, including precision, recall, and specificity. This high level of performance suggests that RF is particularly effective in distinguishing between benign and malware activities within the dataset. To mitigate the potential for overfitting, we implemented five-fold cross-validation during model training and tuned the hyperparameters for each algorithm, as detailed in Table 3. We manually adjusted these hyperparameters, such as the number of trees in the Random Forest, the kernel type and regularization parameter in SVM, the number of neighbors in KNN, and the maximum depth of the Decision Tree. These adjustments were based on iterative experimentation, in which different combinations of parameters were evaluated to find the optimal settings that balance bias and variance. For instance, in the Random Forest model, the number of trees was carefully selected to provide sufficient ensemble learning without leading to an overly complex model.
Similarly, in the SVM, the choice of the kernel (linear, polynomial, radial basis function) and the regularization parameter (C) were fine-tuned to ensure that the model generalized well to unseen data. The KNN model’s performance was optimized by selecting the appropriate number of neighbors (k) and balancing the trade-off between bias (underfitting) and variance (overfitting). The Decision Tree model was pruned to prevent it from becoming too deep, which would otherwise lead to overfitting on the training data. These hyper-parameter tuning efforts, combined with the five-fold cross-validation technique, aimed to reduce overfitting and ensure that the models perform well on the training data and new, unseen data.
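The sketch below illustrates the flavor of this manual tuning loop for the Random Forest; the candidate tree counts are hypothetical search values, not the settings reported in Table 3.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

candidate_n_trees = [50, 100, 200, 500]      # hypothetical search values
best_score, best_n = 0.0, None
for n in candidate_n_trees:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    score = cross_val_score(rf, X, y, cv=5, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_n = score, n        # keep the best five-fold CV result

print(f"Best n_estimators = {best_n} (CV accuracy {best_score:.4f})")
```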
In our study, the dataset was carefully curated from the Kaggle library, consisting of 100,000 records evenly split between malware and benign data. This balanced dataset ensured that our models could learn to distinguish between the two classes without bias. We applied several preprocessing techniques to enhance the quality and reliability of the data. These steps included label encoding to convert categorical features into numerical values, checking for and handling null values to avoid inaccuracies, normalizing features to ensure consistency in scale, and checking for duplicate values to prevent redundancy in the dataset.
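A minimal sketch of these preprocessing steps is shown below, assuming the Kaggle data is loaded into a pandas DataFrame with a 'classification' label column; the file and column names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.read_csv("malware_dataset.csv")      # hypothetical file name

df = df.drop_duplicates()                    # remove duplicate records
if df.isnull().values.any():                 # check for null values
    df = df.dropna()                         # one simple handling strategy

y = LabelEncoder().fit_transform(df["classification"])  # benign/malware -> 0/1
X = df.drop(columns=["classification"])
X[X.columns] = MinMaxScaler().fit_transform(X)           # normalize feature scales
```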
Our study not only emphasizes the positive outcomes, such as high accuracy and performance metrics, but also gives due consideration to false positives and false negatives. These factors are critical in evaluating model performance in a practical cybersecurity context. False positives occur when benign instances are mistakenly classified as malware, leading to unnecessary alerts and operational disruptions. In our study, the Random Forest model achieved a False Positive Rate of 0%, indicating that it did not produce any false positives. This result suggests that the model effectively distinguished benign activities from malware without generating false alerts. False negatives, on the other hand, occur when the model does not detect actual malware. This is a significant concern as it can allow malicious activities to go undetected, posing serious security risks. In our study, the Random Forest model achieved a False Negative Rate of 0%, demonstrating its effectiveness in detecting all malware instances in the dataset. This outcome highlights the model’s capability to identify malware accurately without missing any instances. Our study provides a balanced evaluation of the model’s performance by addressing false positives and false negatives. High performance metrics alone do not fully capture the model’s practical efficacy. Therefore, we have included a detailed analysis of these critical factors to ensure that our results are not only accurate but also relevant and reliable for real-world cybersecurity applications.
Our machine-learning models also have certain limitations. The SVM model exhibited significant drawbacks, as reflected in its metrics. Although it achieved a 0% False Positive Rate, the model had a notably high False Negative Rate of 48.29%. This elevated FNR suggests that nearly half of the malware instances went undetected, posing a severe risk in cybersecurity contexts where such oversights could allow malicious activities to go unnoticed. The model’s lower precision and recall further underscore its challenges in effectively identifying malware while minimizing false alarms. While the Decision Tree performed better in controlled conditions, it demonstrated only 75% accuracy when tested on real-life data, indicating potential overfitting or limitations in generalizing to unseen data.
Similarly, the KNN model, despite its high precision and recall of 99%, exhibited an FPR of 0.52% and an FNR of 0.62%. Although these rates are relatively low, they still indicate that the model is imperfect and could produce errors in critical situations. The Random Forest model delivered outstanding performance with 100% precision, recall, accuracy, and 0% FPR and FNR. However, when applied to real-life testing data, its accuracy dropped to 90%, suggesting that while the model is robust, there are challenges in maintaining its performance outside of the controlled testing environment.
Comparative analysis with existing works
Table 6 compares the proposed approach with existing studies, illustrating performance metrics and outcomes. The comparison indicates that our proposed work consistently surpasses other studies, demonstrating higher accuracy, precision, recall, and F1 score across the evaluation criteria. By outperforming existing studies, the proposed approach establishes a new benchmark for malware detection, reinforcing its potential for broader applications in cybersecurity and cyber threat analysis, and validating the robustness and reliability of our method for real-world implementation in detecting and mitigating cyber threats.
Table 6. Comparative analysis of the proposed work with existing studies
| Author | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| Taha and Barukab (2022) | 94.1% | 94.1% | 94.1% | 94.1% |
| Vinayakumar et al. (2019) | 89.9% | 88.6% | 89.2% | 89.5% |
| Our study (RF) | 100% | 100% | 100% | 100% |
Impact of the proposed study on cybersecurity
The proposed study significantly impacts cybersecurity by offering a highly effective approach to malware detection and cyber threat analysis. Its superior performance in accuracy, precision, recall, and F1 score contributes to a more reliable detection system, reducing false positives and negatives. This improved detection rate helps to minimize the risk of successful cyberattacks, thereby preventing data breaches and system disruptions.
The LIME and SHAP analyses provide confidence in the model’s ability to distinguish between malware and benign instances effectively. The LIME explanation highlights that while some features (such as maj_flt, shared_vm, end_data, and utime) slightly indicate benign behavior, their influence is minimal compared to the strong malware indicators. The critical features, notably static_prio, nivcsw, free_area_cache, vm_truncate_count, and stime, overwhelmingly contribute to the malware classification. In the SHAP analysis, key features such as shared_vm, vm_truncate_count, and millisecond make significant positive contributions, highlighting suspicious activities related to memory usage, process priority, and timing, factors often associated with malware. On the other hand, features like maj_flt and fs_excl_counter exhibit strong negative contributions, indicating behaviors typically observed in benign processes. This detailed breakdown of feature contributions aids in validating the model’s performance and offers insights for potential improvements. For instance, understanding which features are most influential can help refine feature engineering processes, ensuring that the most critical data points are captured and utilized.
LIME may not fully capture feature interactions due to its reliance on simpler, linear surrogate models. To address this, we utilized Partial Dependence Plots in LIME explanations for three key features: static_prio, maj_flt, and shared_vm. In malware detection, a low static_prio (below 20,000) is associated with a higher likelihood of the instance being malware, as malware may manipulate process priorities to ensure its execution. The PDP for maj_flt shows that an increase in major page faults correlates with benign behavior, indicating that normal processes are more likely to cause such faults. Conversely, efficient or stealthy memory usage, which may result in fewer page faults, suggests a higher probability of malware. For shared_vm, higher memory sharing typically aligns with benign processes, while lower sharing may indicate malware, as it often operates independently to evade detection.
Moreover, the study enhances cybersecurity response strategies, allowing organizations to respond more effectively to threats. Its scalability ensures that it can be integrated across various security frameworks, from individual devices to large-scale enterprise networks. This flexibility also means it can adapt to evolving cyber threats. Beyond its technical contributions, the findings of this study have the potential to significantly influence practical applications, particularly commercial cybersecurity products. By incorporating explainability through LIME and SHAP, the model enhances threat detection and empowers engineers, primarily licensed professionals, to interpret and justify their decisions effectively. This capability is crucial for improving productivity and ensuring that cybersecurity measures are transparent and understandable within organizations and to external stakeholders.
The study’s approach, therefore, sets a new benchmark for malware detection and positions itself as a valuable tool that could be commercialized to support engineers in their daily tasks. By providing clear explanations of model predictions, this research advances cybersecurity and enhances professionals’ ability to communicate their recommendations and actions confidently. This study is critical in advancing cybersecurity research and practical application, providing a solid foundation for future studies, and contributing to a safer digital environment.
Conclusion
The growing frequency and complexity of malware in today’s digital landscape threaten the security and stability of computer networks and electronic devices. Malicious software, crafted with the intent to harm, can cause severe disruption by compromising sensitive data, corrupting essential operations, and leading to system downtime. This rapidly evolving threat landscape underscores the need for advanced cyber threat detection systems to identify and mitigate risks proactively.
Our study explored Machine Learning’s potential in cyber threat detection, specifically classifying dangerous and benign entities within digital ecosystems. By leveraging the power of machine learning, we aimed to uncover patterns and trends that could lead to more accurate predictions and robust cybersecurity. To achieve this, we tested four distinct Machine Learning algorithms: Support Vector Machine, Decision Tree, K-Nearest Neighbors, and Random Forest, using a dataset sourced from Kaggle. This dataset was thoroughly preprocessed, including normalization and label assignment, to ensure reliable malware and benign data categorization. We employed a robust k-fold cross-validation approach to split the dataset into training and validation sets, allowing us to refine model performance while reducing bias and variation.
Additionally, we manually tuned hyperparameters for each model to reduce the risk of overfitting further, ensuring that our models could generalize effectively to new, unseen data. Our results revealed varying levels of performance across the models. Random Forest emerged as the top performer, achieving a remarkable accuracy rate of 100%, indicating its exceptional capability to distinguish between benign and malicious entities. This high level of accuracy reinforces the importance of machine learning in enhancing cyber defense systems, providing a more effective means to detect and respond to cyber threats.
This study has utilized the Explainable AI techniques LIME and SHAP, which provide comprehensive and interpretable explanations for the predictions made by machine learning models, enhancing transparency and trust in the decision-making process. The LIME explanation reveals that although certain features (such as maj_flt, shared_vm, end_data, and utime) slightly suggest benign behavior, their impact is relatively minor when compared to the strong indicators of malware. Critical features like static_prio, nivcsw, free_area_cache, vm_truncate_count, and stime predominantly drive the classification toward identifying malware. In contrast, the SHAP analysis shows that features like shared_vm, vm_truncate_count, and millisecond significantly contribute to the likelihood of an instance being classified as malware. These features reflect suspicious activities related to memory usage, process priority, and timing, common red flags for malware. Conversely, maj_flt and fs_excl_counter exhibit strong negative contributions, indicating patterns more typical of benign processes. This detailed understanding of feature importance ensures the model’s decisions are transparent and logically based.
LIME’s reliance on simpler, linear surrogate models may limit its ability to fully capture intricate interactions between features. To overcome this limitation, we employed Partial Dependence Plots alongside LIME explanations, specifically focusing on three critical features: static_prio, maj_flt, and shared_vm. In malware detection, the static_prio feature plays a crucial role, with lower values (below 20,000) being more strongly associated with malware. This is because malware often manipulates process priorities to ensure it runs uninterrupted, elevating its chances of evading detection. The PDP for this feature clearly illustrates how a decrease in priority correlates with an increased likelihood of an instance being classified as malware, providing valuable insights into how malicious entities exploit process prioritization. The maj_flt feature, which tracks major page faults, shows a different pattern. Our analysis reveals that an increase in major page faults is typically linked to benign processes, as legitimate operations are more likely to cause such faults due to regular memory management activities.
Conversely, a lower occurrence of major page faults might indicate more efficient or stealthy memory usage, a characteristic often seen in malware attempting to minimize its footprint and avoid triggering alerts. The shared_vm feature, which measures the extent of virtual memory shared between processes, provides additional context. High levels of shared memory usage generally point to benign processes that frequently share resources for legitimate purposes, such as inter-process communication. On the other hand, malware tends to operate in isolation to evade detection, resulting in lower shared memory usage. The PDP for shared_vm underscores this trend, showing a decreasing likelihood of malware as shared memory usage increases. By integrating these PDP insights with LIME explanations, we address the limitations of LIME’s linear approach and gain a deeper understanding of how specific features contribute to malware classification. This combined analysis allows for a more nuanced interpretation of model behavior, ultimately leading to better-informed decisions in refining feature engineering and enhancing overall model performance.
We evaluated the time required to generate explanations using LIME and SHAP in the context of malware detection, finding that LIME takes approximately 0.1807 s, while SHAP takes around 0.3109 s. We optimized both methods to reduce latency and ensure real-time efficiency by reducing LIME’s perturbations to 500 samples, fine-tuning the kernel width, setting SHAP’s background samples to 332, and implementing early stopping criteria. These optimizations successfully minimized latency, making our Explainable AI approach more practical for real-time malware detection while maintaining the necessary level of interpretability for effective decision-making.
After developing our system, we conducted real-life testing to assess the effectiveness of our Random Forest and Decision Tree models in malware detection. Initially, these models performed exceptionally well during training, with Random Forest achieving 100% accuracy and Decision Tree 99.99%. To validate these results, we tested the models on a dataset provided by the ICT Cell of Shanto-Mariam University, comprising 20 records with 35 features. Random Forest maintained high accuracy, correctly classifying 18 out of 20 samples (90% accuracy), while the Decision Tree classified 15 out of 20 (75% accuracy). We also used the Chi-Squared test to validate the features identified by LIME and SHAP, confirming their significance in distinguishing between malware and benign files. The top 10 features, such as static_prio, vm_truncate_count, and maj_flt, showed very low P-values, indicating a strong association with malware detection. This validation supports our models’ robustness and highlights critical features for improving malware detection strategies.
These findings have significant implications for cybersecurity. They demonstrate that by integrating Machine Learning into cyber threat detection, we can improve the resilience of computer networks and protect digital assets from harm. The study also highlights the importance of ongoing research to stay ahead of evolving cyber threats, as cybercriminals continuously develop new tactics and malware variants. Ultimately, our research contributes to the broader goal of strengthening cybersecurity infrastructure and ensuring the safety and integrity of digital ecosystems. The successful application of machine learning in this context provides a robust defense mechanism and lays the groundwork for future advancements in the field. Our results suggest that continued investment in machine learning-based cyber threat detection will be vital to safeguarding modern information systems against increasingly sophisticated attacks. Looking ahead, we plan to implement additional machine learning algorithms to enhance our threat detection capabilities further, exploring a wider range of algorithms and techniques to continuously improve the accuracy and efficiency of cyber threat detection systems.
Author contribution
This work was carried out in collaboration between all authors. Farida Siddiqi Prity, Md Shahidul Islam, and Emran Hossain Fahim designed the study and wrote the first draft of the manuscript. Md. Maruf Hossain managed the analyses of the study. Sazzad Hossain Bhuiyan, Md. Ariful Islam, and Mirza Raquib managed the literature searches and helped in coding during the review process. All authors read and approved the final manuscript.
Data availability
The datasets generated during the current study are available from the corresponding author on reasonable request.
Declarations
Ethics approval
This article does not contain any studies with human participants and animals performed by any of the authors.
Conflict of interest
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Aamir, M; Iqbal, MW; Nosheen, M; Ashraf, MU; Shaf, A; Almarhabi, KA; Alghamdi, AM; Bahaddad, AA. AMDDLmodel: android smartphones malware detection using deep learning model. PLoS One; 2024; 19,
Alam, S; Qu, Z; Riley, R; Chen, Y; Rastogi, V. DroidNative: automating and optimizing detection of android native code malware variants. Comput Secur; 2017; 65, pp. 230-246. [DOI: https://dx.doi.org/10.1016/j.cose.2016.11.011]
Almazroi, AA; Ayub, N. Deep learning hybridization for improved malware detection in smart Internet of Things. Sci Rep; 2024; 14,
Alzaylaee, MK; Yerima, SY; Sezer, S. DL-Droid: deep learning based android malware detection using real devices. Comput Secur; 2020; 89, 101663. [DOI: https://dx.doi.org/10.1016/j.cose.2019.101663]
Basheer N, Pranggono B, Islam S, Papastergiou S, Mouratidis H (2024) Enhancing malware detection through machine learning using XAI with SHAP framework. In IFIP international conference on artificial intelligence applications and innovations. Springer Nature Switzerland, Cham, pp 316-329. https://doi.org/10.1007/978-3-031-63211-2_24
Bensaoud, A; Kalita, J; Bensaoud, M. A survey of malware detection using deep learning. Mach Learn Appl; 2024; 16, 100546.
Bostani, H; Moonsamy, V. Evadedroid: a practical evasion attack on machine learning for black-box android malware detection. Comput Secur; 2024; 139, 103676. [DOI: https://dx.doi.org/10.1016/j.cose.2023.103676]
Cai, L; Li, Y; Xiong, Z. JOWMDroid: android malware detection based on feature weighting with joint optimization of weight-mapping and classifier parameters. Comput Secur; 2021; 100, 102086. [DOI: https://dx.doi.org/10.1016/j.cose.2020.102086]
Coronado-De-Alba LD, Rodríguez-Mota A, Escamilla-Ambrosio PJ (2016) Feature selection and ensemble of classifiers for android malware detection. In 2016 8th IEEE Latin-American conference on communications (LATINCOM). IEEE, pp 1-6. https://doi.org/10.1109/LATINCOM.2016.7811605
Damshenas, M; Dehghantanha, A; Choo, KKR; Mahmud, R. M0droid: an android behavioral-based malware detection model. J Inform Privacy Secur; 2015; 11,
Deng, X; Cen, M; Jiang, M; Lu, M. Ransomware early detection using deep reinforcement learning on portable executable header. Clust Comput; 2024; 27,
Fernando, DW; Komninos, N. FeSAD ransomware detection framework with machine learning using adaption to concept drift. Comput Secur; 2024; 137, 103629. [DOI: https://dx.doi.org/10.1016/j.cose.2023.103629]
Gulmez, S; Kakisim, AG; Sogukpinar, I. XRan: explainable deep learning-based ransomware detection using dynamic analysis. Comput Secur; 2024; 139, 103703. [DOI: https://dx.doi.org/10.1016/j.cose.2024.103703]
Jang, JW; Kang, H; Woo, J; Mohaisen, A; Kim, HK. Andro-AutoPsy: anti-malware system based on similarity matching of malware and malware creator-centric information. Digit Investig; 2015; 14, pp. 17-35. [DOI: https://dx.doi.org/10.1016/j.diin.2015.06.002]
Kaggle (2024) “Malware Detection,” [Online]. Available: https://www.kaggle.com/datasets/nsaravana/malware-detection. Accessed 2 Jan 2024
Kouliaridis V, Barmpatsalou K, Kambourakis G, Wang G (2018) Mal-warehouse: a data collection-as-a-service of mobile malware behavioral patterns. In 2018 IEEE SmartWorld, ubiquitous intelligence & computing, advanced & trusted computing, scalable computing & communications, cloud & big data computing, internet of people and smart city innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). IEEE, pp 1503-1508. https://doi.org/10.1109/SmartWorld.2018.00260
Ksibi, A; Zakariah, M; Almuqren, L; Alluhaidan, AS. Efficient android malware identification with limited training data utilizing multiple convolution neural network techniques. Eng Appl Artif Intell; 2024; 127, 107390. [DOI: https://dx.doi.org/10.1016/j.engappai.2023.107390]
Ma, X; Han, X; Zhang, L. An improved k-nearest neighbor algorithm for recognition and classification of thyroid nodules. J Ultrasound Med; 2024; 43, 1025. [DOI: https://dx.doi.org/10.1002/jum.16429]
Millar S, McLaughlin N, Martinez del Rincon J, Miller P, Zhao Z (2020) DANdroid: a multi-view discriminative adversarial network for obfuscated android malware detection. In proceedings of the tenth ACM conference on data and application security and privacy, pp 353-364. https://doi.org/10.1145/3374664.3375746
Milosevic, N; Dehghantanha, A; Choo, KKR. Machine learning aided android malware classification. Comput Electr Eng; 2017; 61, pp. 266-274. [DOI: https://dx.doi.org/10.1016/j.compeleceng.2017.02.013]
Nasser, AR; Hasan, AM; Humaidi, AJ. DL-AMDet: deep learning-based malware detector for android. Intell Syst Appl; 2024; 21, 200318.
Niu, W; Feng, Y; Xu, S; Wilson, A; Jin, Y; Ma, Z; Wang, Y. Revealing suicide risk of young adults based on comprehensive measurements using decision tree classification. Comput Human Behav; 2024; 158, 108272. [DOI: https://dx.doi.org/10.1016/j.chb.2024.108272]
Nobakht, M; Javidan, R; Pourebrahimi, A. SIM-FED: secure IoT malware detection model with federated learning. Comput Electr Eng; 2024; 116, 109139. [DOI: https://dx.doi.org/10.1016/j.compeleceng.2024.109139]
Poornima, S; Mahalakshmi, R. Automated malware detection using machine learning and deep learning approaches for android applications. Measurement: Sensors; 2024; 32, 100955.
Potha, N; Kouliaridis, V; Kambourakis, G. An extrinsic random-based ensemble approach for android malware detection. Connect Sci; 2021; 33,
Seyfari, Y; Meimandi, A. A new approach to android malware detection using fuzzy logic-based simulated annealing and feature selection. Multimed Tools Appl; 2024; 83,
Shabtai, A; Tenenboim-Chekina, L; Mimran, D; Rokach, L; Shapira, B; Elovici, Y. Mobile malware detection through analysis of deviations in application network behavior. Comput Secur; 2014; 43, pp. 1-18. [DOI: https://dx.doi.org/10.1016/j.cose.2014.02.009]
Smmarwar, SK; Gupta, GP; Kumar, S. Android malware detection and identification frameworks by leveraging the machine and deep learning techniques: a comprehensive review. Telematics Inform Rep; 2024; 14, 100130. [DOI: https://dx.doi.org/10.1016/j.teler.2024.100130]
Sun, Z; Wang, G; Li, P; Wang, H; Zhang, M; Liang, X. An improved random forest based on the classification accuracy and correlation measurement of decision trees. Expert Syst Appl; 2024; 237, 121549. [DOI: https://dx.doi.org/10.1016/j.eswa.2023.121549]
Taha, A; Barukab, O. Android malware classification using optimized ensemble learning based on genetic algorithms. Sustainability; 2022; 14,
Taheri, R; Ghahramani, M; Javidan, R; Shojafar, M; Pooranian, Z; Conti, M. Similarity-based android malware detection using Hamming distance of static binary features. Futur Gener Comput Syst; 2020; 105, pp. 230-247. [DOI: https://dx.doi.org/10.1016/j.future.2019.11.034]
Thakur, P; Kansal, V; Rishiwal, V. Hybrid deep learning approach based on lstm and cnn for malware detection. Wirel Pers Commun; 2024; 136,
Vinayakumar, R; Alazab, M; Soman, KP; Poornachandran, P; Venkatraman, S. Robust intelligent malware detection using deep learning. IEEE Access; 2019; 7, pp. 46717-46738. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2906934]
Widodo, A; Yang, BS. Support vector machine in machine condition monitoring and fault diagnosis. Mech Syst Signal Process; 2007; 21,
Yerima, SY; Sezer, S. Droidfusion: a novel multilevel classifier fusion approach for android malware detection. IEEE Trans Cybernet; 2018; 49,