Adaptive TreeHive: Ensemble of trees for enhancing imbalanced intrusion classification

Abstract

Imbalanced intrusion classification is a complex and challenging task because imbalanced intrusion datasets contain only a small number of instances of certain intrusions, which are generally regarded as minority instances. Data sampling methods such as over-sampling and under-sampling are commonly applied to deal with imbalanced intrusion data. Over-sampling methods generate synthetic minority instances, e.g. SMOTE (Synthetic Minority Over-sampling Technique), whereas under-sampling methods remove majority-class instances to create balanced data, e.g. random under-sampling. Both approaches have disadvantages: over-sampling tends to cause overfitting, and under-sampling discards a large portion of the data. Ensemble learning in supervised machine learning is also a common technique for handling imbalanced data. Random Forest and Bagging address the overfitting problem, and Boosting (AdaBoost) gives more attention to the minority-class instances in its iterations. In this paper, we propose a method for selecting the most informative instances that represent the overall dataset. We apply both over-sampling and under-sampling techniques to balance the data using the informative majority and minority instances. We use the Random Forest, Bagging, and Boosting (AdaBoost) algorithms and compare their performance. We use a decision tree (C4.5) as the base classifier of the Random Forest and AdaBoost classifiers and a naïve Bayes classifier as the base classifier of the Bagging model. The proposed method, Adaptive TreeHive, addresses both the imbalance ratio and high dimensionality, resulting in reduced computational power and execution time requirements. We have evaluated the proposed Adaptive TreeHive method using five large-scale public benchmark datasets. 
The experimental results, compared to data balancing methods such as under-sampling and over-sampling, exhibit superior performance of the Adaptive TreeHive with accuracy rates of 99.96%, 85.65%, 99.83%, 99.77%, and 95.54% on the NSL-KDD, UNSW-NB15, CIC-IDS2017, CSE-CIC-IDS2018, and CICDDoS2019 datasets, respectively, establishing the Adaptive TreeHive as a superior performer compared to the traditional ensemble classifiers.

Introduction

Intrusion classification is of paramount importance for maintaining clarity and vigilance in cybersecurity. Computer networks have become the backbone of important activities such as communication, commerce, data storage, and industrial automation in today’s interconnected and technologically driven society. Despite the various technological advancements achieved, cyberspace remains susceptible to a plethora of risks. The possibility of cyber-attacks and unauthorized access is ever-present and can materialize at any given moment [1]. Intrusion classification is critical in protecting the integrity, confidentiality, and availability of information systems by enabling administrators to detect and respond to cyber threats in real time. It acts as an alert mechanism deployed within a computing environment, designed to distinguish, categorize, and respond to malicious patterns or deviations from normal behavior, thereby strengthening the system’s robustness against unauthorized access or adversarial activity on a network or host [2]. Intrusion Detection Systems (IDSs), depicted in Fig 1, date back to the late 1980s [3], primarily focusing on host-based [4] detection before evolving into network-based [5] systems. There are two types of IDS [6]: Network-Based IDS (NIDS) [7] and Host-Based IDS (HIDS). A NIDS examines data packets at the network perimeter and notifies administrators if malicious traffic is detected. A HIDS [7] is deployed on an individual host or device to monitor local activities and trigger alarms for unusual behavior that may go undetected at the network level. As the internet expanded and new attack categories emerged, it became challenging to classify intrusions accurately and quickly. Therefore, it is crucial to use big data techniques [8] to identify unauthorized access efficiently. 
Thus, the imperative is to design an intrusion classification model tailored to the distinctive characteristics of big data, specifically addressing the challenge of securing communication networks. Although machine-learning-based methods for classifying malware are increasingly popular, little attention has been paid to accurately classifying intrusions with intelligent models [9] that can effectively safeguard communication networks against modern attacks. This neglect highlights the vulnerability of communication networks to evolving threats and emphasizes the importance of creating machine-learning models specifically designed to counter contemporary attacks. The primary obstacles to progress in intrusion classification concern the use of machine-learning-based methods for categorizing both modern and outdated [10] malware, challenges that we found remained unresolved in earlier efforts. Furthermore, the scarcity of publicly available large-scale balanced datasets for intrusion classification is another constraint on developing precise models.

[Figure omitted. See PDF.]

The diagram illustrates a corporate network protected by a layered security architecture. External threats originating from the internet or a hacker are routed through a firewall and switch before reaching the internal network components, which include a web server, email server, and a centralized management system. An Intrusion Detection System (IDS) monitors the traffic passing through the network to detect and report suspicious activities. The design emphasizes the role of the IDS in safeguarding network assets by identifying potential security breaches in real-time.

In the past decade, numerous approaches have been proposed in intrusion classification and detection to classify malware accurately and quickly. A wide range of machine learning (ML) and deep learning (DL) algorithms have been used to create comprehensive intrusion classifiers. ML models can classify and learn new patterns from unlabeled data without explicit programming [11]. Moreover, commonly used classification algorithms, including Random Forest (RF) [12], Support Vector Machine (SVM) [13], Decision Tree (DT) [14,15], Adaptive Boosting (AdaBoost) [16], and Extreme Gradient Boosting (XGBoost) [17], have notably influenced the cybersecurity domain, enhancing its overall effectiveness. DL models such as Convolutional Neural Network (CNN) [18] and Long Short-Term Memory (LSTM) [19], in turn, achieve noteworthy results on intrusion detection datasets by successfully handling the labeled data [20]. However, it is important to act with caution and avoid applying deep learning algorithms when they are not needed. The impact of tree-based methods such as DT [21] and RF [22] has been emphasized in earlier research. These approaches have proven successful in handling modern and outdated attacks during intrusion classification. Combining unsupervised and supervised methods yields good performance, as exemplified by the use of bagging with clustering techniques [23]. Nevertheless, that work lacked a large-scale experiment. The limitations of rule-based [22,24] methods in the intrusion classification task include lack of flexibility, limited coverage, challenges in rule optimization, difficulty in handling attack variation, inability to handle ambiguity, and limited classification capability. These limitations have prompted researchers to examine big-data-driven [25] techniques as a means to overcome them and improve the detection rate. 
In recent years, transfer learning-based approaches [26] have demonstrated promising results in the intrusion classification task due to their intrinsic capability to capture feature patterns and interrelated dependencies. Artificial neural network-based methods [27,28] have recently shown good performance on several intrusion classification datasets, including KDD-CUP99, NSL-KDD, and CSE-CIC-IDS2018, to name a few. Among LSTM-based methods, Mahdavisharif et al. [29] detected the malicious attacks present in the NSL-KDD dataset. However, several contemporary attacks were not addressed in that experiment. Moreover, DL models may not be able to cover a considerable proportion of attacks when the dataset is comparatively small [30]. Among ensemble-based approaches, Zoppi et al. [31] utilized an ensemble of unsupervised algorithms and reported promising results, but their proposed model suffered from limited generalization across different datasets. Feature selection-centric approaches have also started to show strong performance: Rodriguez et al. [32] increased computational efficiency at the cost of a slight performance loss. Likewise, Zhang et al. [33] took a similar approach, combining it with ensemble methods. Stiawan et al. [34] emphasized the importance of relevant features and utilized different subsets of features as training data, which resulted in comparatively good performance despite longer execution time. Surprisingly, the combination of supervised and unsupervised methods has yet to be applied in any study for intrusion classification tasks in large-scale experiments with different up-to-date benchmark datasets. We therefore combine the strengths of clustering with rule-based approaches and explore their untapped capabilities in the domain of intrusion classification.

In this paper, we have addressed several constraints associated with intrusion classification, especially concerning the lack of a futuristic approach that integrates extracting informative instances in an unsupervised manner and classification through tree-based methods. The objective is to overcome the identified limitations and come up with a concrete foundation for additional improvements in the field of intrusion classification while mitigating the challenge of large-scale balanced datasets [35,36], thereby paving the way for future research work. To achieve this, multiple comprehensive balanced datasets have been curated by rigorously harnessing clustering and non-parametric supervised algorithms. Moreover, a tree-based collaborative prediction process named Adaptive TreeHive has been proposed for the intrusion classification task, wherein we have optimized the complexity and computational efficiency through a uniquely tailored architecture consisting of multiple decision trees incorporating data randomization and dimensionality reduction. Additionally, scrutiny has been performed to confirm whether the performance of Adaptive TreeHive is improved by harnessing the harmony of tree-based collaborative prediction-making. In short, the proposed Adaptive TreeHive accepts subsets of informative instances as input, which is subsequently standardized for achieving better convergence. The instances are then fed into the decision nodes, which undergo a sequence of rules to classify intrusions. The contribution of this study is summarized below:

1. We have developed five large-scale balanced datasets for the intrusion classification task, which have been carefully crafted by utilizing clustering and a non-parametric supervised algorithm e.g. SMOTE.

2. The robustness of our curated datasets has been meticulously scrutinized by employing over-sampling and under-sampling techniques on the imbalanced datasets.

3. A state-of-the-art random tree-based collaborative prediction-enabled tailored architecture named Adaptive TreeHive has been introduced, demonstrating enhancements in the intrusion classification task compared to existing ensemble learning methods.

4. The effect of the training data size on the effectiveness of the proposed method Adaptive TreeHive in correcting detection errors in intrusion classification has been investigated.

5. The results of the proposed Adaptive TreeHive have been juxtaposed with several baseline intrusion classification models, demonstrating its superiority in the task by outperforming previous state-of-the-art ensemble models with reduced dimension and quick classification. This showcases its consistent performance across five large-scale public benchmark datasets, providing evidence of its excellent generalization capability across diverse types of attack categories.

Background study

The digital revolution brings both advanced technological capabilities and heightened cyber threats, making intrusion classification pivotal in modern cybersecurity. Having evolved from signature-based to ML-driven anomaly detection, intrusion classification now requires a comprehensive understanding to build effective security strategies. The task has gained noteworthy attention, leading to the emergence of novel insights in methods and datasets. Although the study of data balancing has garnered considerable attention, it is evident that significant standards have not yet been met. This literature review provides an in-depth assessment of intrusion classification techniques for high- and low-dimensional datasets, ranging from traditional to recent methods, while highlighting their methodologies, guiding principles, benefits, and limitations.

Injadat et al. (2020) [22] developed an ML-based IDS framework optimized in multiple stages. Their purpose was to build a model that would be computationally less costly while maintaining high detection performance. They used the Z-score method for data normalization and SMOTE for minority-class over-sampling. They established the smallest advisable training sample size after discovering that SMOTE can minimize the training sample size. They used the CIC-IDS2017 and UNSW-NB15 benchmark datasets for model training and evaluation. Additionally, they compared two feature selection techniques, Information Gain-based Feature Selection (IGBFS) and Correlation-based Feature Selection (CBFS). Experimental hyperparameter tuning of K-nearest neighbors (KNN) [37] and RF classifiers was also part of their research. According to their findings, an RF classifier optimized with Bayesian Optimization using the Tree Parzen Estimator (BO-TPE-RF) and the IGBFS method classified targets with 99% accuracy on both datasets and reduced the false alarm rate by 1-2%. Zoppi et al. (2019) [31] investigated meta-learning approaches that rely on ensembles of unsupervised algorithms. By studying different meta-learning approaches over ensembles of unsupervised base learners [38], they found that specific meta-learning approaches significantly reduce misclassification compared to non-meta unsupervised algorithms. They used 21 datasets, 9 different meta-learning approaches, and 15 unsupervised algorithms for this experiment. They focused on network attacks and biometric authentication processes. In search of robust meta-learners, they also discussed the impact of base learners in approaches that rely on multiple algorithms, such as stacking, cascading, delegating, (weighted) voting, and cascade generalization. They used different evaluation metrics but primarily focused on the Matthews Correlation Coefficient (MCC). 
Parameter tuning and the heterogeneity of public datasets are two of the biggest obstacles to building a robust network IDS. They concluded that a combination of supervised and unsupervised algorithms is recommended for optimal results, and that no single unsupervised algorithm outperforms the others on all IDS datasets. Ogobuchi et al. (2022) [21] proposed BoostedEnML, an ensemble model created from the best-performing boosting classifiers in their experiment. DT, RF, ET, LGBM, AD, and XGB were used to obtain ensembles with the stacking method and the majority voting approach. Two boosting-based ensemble models (XGB and LGBM) were then used to build the proposed ensemble with the stacking methodology. They addressed the data imbalance problem of CICIDS2017 and CICIDS2018 by using the SMOTE and adaptive synthetic (ADASYN) [39] techniques. They concatenated all the CSV files into a single file for each dataset to obtain a robust dataset. K-fold cross-validation was used to split the data into training, validation, and testing sets. They also performed univariate, bivariate, and multivariate statistical analysis using different data visualization tools. They used MaxAbsScaler in the preprocessing of the dataset. They used evaluation metrics such as accuracy, precision, recall, F-score, and AUC to justify the robustness of their model, and they achieved almost 100% accuracy on each dataset for multi-class classification.
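SMOTE, which several of the studies above use for over-sampling, creates each synthetic minority instance by interpolating between a minority instance and one of its nearest minority-class neighbours. The following Python sketch illustrates only that interpolation idea; it is not the implementation used in any of the cited works, and the function name `smote_sketch`, its parameters, and the toy data are illustrative assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class instances with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority points, over-sampling of this kind never leaves the region already covered by the minority class, which is one reason it can encourage overfitting.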

Ogobuchi et al. (2022) [26] introduced ELETL-IDS (Efficient-Lightweight Ensemble Transfer Learning), an ensemble model designed using the model averaging approach. They proposed a transfer learning IDS based on a CNN architecture. The three best-performing models from their experiment (InceptionV3, MobileNetV3Small, and EfficientNetV2B0) were selected to build the ensemble. They concatenated all the CSV files into a single file for each dataset (CICIDS2017, CSECICIDS2018) to obtain a robust dataset. They selected the features for model training using the Random Forest Feature Importance (RFFI) provided by the scikit-learn library. SMOTE and Borderline-SMOTE were used to address the data imbalance problem. They used Quantile Transformation in data preprocessing. They converted the data into an input format suitable for the pre-trained models, allowing the CNN to learn the patterns easily. BO-TPE was used for fine-tuning, and they achieved a very low false positive rate with 100% accuracy. Zhang et al. (2022) [33] presented an effective ensemble-based automatic feature selection method (EAFS) for intrusion detection to overcome computationally complex and time-consuming feature selection methods. The authors proposed a novel approach that greatly improves the accuracy and efficiency of the detection system. They used the UNSW-NB15, CICIDS2017, and CSE-CICIDS2018 benchmark datasets for their experiment. They removed features with zero variance using a variance threshold, and the importance of each feature was obtained and used to build the candidate subsets. The effectiveness of the subsets was evaluated with the Normalized Subset Objective Measure (NSOM), a metric established for evaluating the efficacy of a subset of network intrusion detection features, and the subset with the highest NSOM value was chosen as the final selection. 
The accuracy of the subset, the number of features in the subset, and the duration of training were considered by NSOM. When the authors compared their method to previous research on this dataset, the results showed that their method was more efficient in identifying the most beneficial data for intrusion detection. Mahdavisharif et al. (2021) [29] proposed a method for deep learning based on LSTM to detect intrusions in communication networks. This model could recognize complicated relationships as well as long-term interdependence between incoming traffic packets. BigDL, a distributed deep-learning framework for Apache Spark, was used to train the model on the NSL-KDD dataset, and the proposed method was named BDL-IDS. The suggested system, BDL, combined LSTM blocks to boost network efficiency. It could recall past experiences and identify long-term patterns. It had a total of three layers: an input layer, an output layer, and a hidden layer made up of LSTM cell blocks. Each of the 41 input neurons in the NSL-KDD had been connected to a Memory Cell Block in the hidden layer. Big data and deep learning may assist IDS to become more precise and faster, while additionally decreasing false alarms. Distributed and parallel computation can speed up the system.

FatimaEzzahra et al. (2020) [40] compared three multi-layer LSTM-based intrusion detection models. According to the authors, IDSs have relied on shallow learning and manual feature engineering, which can be inadequate for dealing with enormous quantities of data and real-time constraints. Deep learning models such as LSTM [41] can handle vast amounts of data without the need for manual feature engineering [42]. The authors built LSTM-based IDSs with PCA for dimensionality reduction, with Mutual Information for feature selection, and without any dimensionality reduction technique. They tested their approach on the KDD99 benchmark dataset and found that the PCA-based models achieved the best training and testing accuracy in both binary (99.44%) and multi-class (99.39%) classification. The authors provided valuable insights into how deep learning solutions like LSTM can improve the accuracy of IDSs by handling large amounts of data without requiring manual feature engineering. Talukder et al. (2023) [43] presented an in-depth investigation into the construction of a reliable hybrid machine-learning model for network intrusion detection. To increase detection rates and reliability, they proposed an approach combining ML and deep learning methods. For pre-processing, the authors used SMOTE for data balancing and XGBoost for feature selection. Furthermore, a deep learning-based feature selection step was used to decrease dimensionality, discard redundant features, filter unneeded data, and simplify the process while boosting detection ability. They split the processed dataset into training and testing sets using K-fold cross-validation. They evaluated different ML and DL models to find the best classification model for both binary and multi-class classification. With the chosen features, RF achieved the highest accuracy on the KDDCUP’99 (99.99%) and CIC-MalMem-2022 (100%) datasets. Kasongo et al. 
(2020) [27] presented an extensive analysis of the IDS and proposed a feature selection-based approach to address the issue of false positive rate and low detection accuracy. They discussed various ML techniques used in IDS research. The UNSW-NB15 dataset was used for this research and a detailed analysis of this dataset was provided. A filter-based feature selection technique was applied using the XGBoost algorithm and the vector space was then fed to various classification algorithms such as SVM, k-Nearest-Neighbour (KNN), Logistic Regression (LR) [44], Artificial Neural Network (ANN), and DT. Both binary and multi-class classifications were considered. A reduced feature vector containing 19 features was taken for binary and multi-class classification. The DT achieved the best result with 90.85% accuracy on binary classification, while ANN achieved the best result with 77.51% accuracy on multi-class classification.

Karatas et al. (2020) [45] used six ML algorithms to build a more realistic IDS with an up-to-date security dataset, CSE-CIC-IDS2018. The dataset is imbalanced, so they reduced the imbalance ratio with SMOTE, which generated new data for the minority classes and increased their instance counts. The main aim of this research was to detect rarely encountered attacks accurately, as IDSs are generally trained and evaluated using pre-collected datasets. They evaluated their system using Accuracy, Precision, Recall, F1-Score, and Error Rate. In terms of accuracy, they obtained the best result using AdaBoost on both the original and sampled datasets. Sethi et al. (2021) [46] suggested an approach that employs a reinforcement learning-based IDS in which Deep Q-Network logic, a variant of reinforcement learning, is applied to several distributed agents for more accurate detection and classification of malicious attacks. They also used attention mechanisms. In this approach, agents work together to provide a standard security system. A denoising autoencoder (DAE) [47] helps preprocess the input data by decreasing noise and selecting the most relevant features, which helps eliminate biases from the model. They ran comprehensive studies with and without DAE on the NSL-KDD and CIC-IDS2017 benchmark datasets and compared the results. Their experiments indicate that they outperformed the state-of-the-art works on CIC-IDS2017 and produced good results on NSL-KDD. Zhendong et al. (2020) [48] addressed the high false alarm rate of traditional ML methods and proposed an intrusion detection approach called Semantic Re-encoding and Deep Learning (SRDLM). 
This method re-encodes network traffic, leading to a new representation of the data and improving the model’s ability to distinguish malicious from non-malicious traffic. The approach also improved the IDS’s generalization capability. They used the NSL-KDD benchmark dataset for their experiment and explored the data using PCA. In this study, they used ResNet_8, ResNet_20, and ResNet_56 for intrusion detection. The 20-layer and 56-layer ResNet [49] did not yield any significant improvement in detection, so they chose ResNet_8 for the subsequent experiments. ResNet with semantic re-encoding increased the detection accuracy to 94.03%.

Nadir et al. (2023) [28] proposed a novel approach for constructing an efficient intrusion detection and prevention system for computer servers. They used the KDD-CUP99 benchmark dataset for their experiment. After pre-processing and feature extraction, they used Firefly Optimization to prepare the data for the training and testing phases. FFO is a population-based meta-heuristic algorithm that follows a stochastic optimization technique, which can help to select the most informative features that can give the model better generalization and detection capability. To improve attack recognition, they used min-max normalization in the pre-processing phase. Then, a Probabilistic Neural Network (PNN) was used for better pattern identification and classification. The proposed model, Firefly Optimization and Probabilistic Neural Network (FFO-PNN) achieved 98.99% accuracy with reduced training time and better performance measures. Stiawan et al. (2020) [34] addressed the importance of selecting significant features as a subset of the original dataset for better performance of an IDS model. Information Gain was used as a feature selection technique in this research. According to the minimum score values, the features were distributed into groups. 20% of the CIC-IDS2017 dataset was taken and split into 70% for training and 30% for testing data. After selecting the most significant features, different ML techniques such as RF, Bayes Net (BN), Random Tree (RT), Naive Bayes, and J48 classifiers were used. TPR, FPR, Precision, Recall, Accuracy, percentage of incorrectly classified, and execution time were used for analysis to evaluate the model. The experimental results showed that RF with Information Gain achieved the best result with 99.86% accuracy using 22 relevant features, while J48 achieved 99.87% accuracy using 52 features with a lengthy execution time. Rodriguez et al. 
(2022) [32] addressed the issue of traditional ML models’ inability to detect new attacks, rather than known attacks. The authors analyzed different ML techniques for binary and multi-class classification, which had more detection capability for new attacks. They used the CIC-IDS2017 dataset for their experiment, as it contained different types of up-to-date attacks. Their experiment results showed that reducing redundant features using correlation-based feature selection (CFS) helped ML models require less time with only a slight performance loss. They found that tree-based machine-learning techniques showed better attack detection than complex algorithms. The classification score obtained F1 values of over 0.999 in the full dataset, 0.990 with the CFS-based attribute selection method, and 0.997 using Zeek-derived flows and attributes.

Machine learning algorithms

In the context of anomaly classification, the initial step involves specifying a baseline for diverse network behaviors. This entails training machine learning algorithms to learn the patterns and behaviors associated with both normal traffic and attacks. The literature offers a diverse selection of machine-learning algorithms. To choose the most suitable one for our specific needs, we have implemented five of them, which are detailed in the following sections.

Unsupervised learning

A commonly used approach called clustering looks to identify clusters of related data based on similarities in features. This method has no set goals, so there is no need to tell the algorithm how to arrange things because clusters appear naturally. As a result, observations within the same cluster are more similar than those in other clusters. The main goal is to improve the similarity within clusters while maximizing the dissimilarity between groups/clusters. K-Means, which is known for its versatility, is a useful tool for exploratory analysis as it effectively handles a variety of data types, including images, figures, and text. Clustering, particularly K-Means, is an efficient method for data categorization in the field of unsupervised learning. Unlike its supervised counterpart, unsupervised learning involves algorithmic exploration of data patterns without the help of output variables. Rather than using predetermined outcome metrics, it works based on inherent features. In a formal sense, the goal is to ascertain (1):

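$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2 \tag{1}$$

where $S = \{S_1, \dots, S_k\}$ is the partition of the data points into $k$ clusters,

$$\boldsymbol{\mu}_i = \frac{1}{|S_i|} \sum_{\mathbf{x} \in S_i} \mathbf{x} \tag{2}$$

is the mean (centroid) of points in $S_i$, $|S_i|$ denotes the size of $S_i$, and $\lVert \cdot \rVert$ is the usual $L^2$ norm,

$$\lVert \mathbf{v} \rVert = \sqrt{\sum_{j} v_j^2} \tag{3}$$

As an outcome, the pairwise squared deviations of points within the same cluster are reduced:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \frac{1}{|S_i|} \sum_{\mathbf{x}, \mathbf{y} \in S_i} \lVert \mathbf{x} - \mathbf{y} \rVert^2 \tag{4}$$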
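The link between the centroid-based objective and the pairwise form can be checked numerically. The sketch below, written under the assumption that each unordered pair of points is counted once, computes both quantities for a fixed clustering and shows that they coincide (counting ordered pairs would only double the second quantity, which does not change the minimizing partition). The function names and the toy clusters are illustrative.

```python
import math


def centroid(cluster):
    """Mean of the points in one cluster."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def within_cluster_ss(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroid(c)) ** 2 for c in clusters for p in c)


def pairwise_form(clusters):
    """Squared distances over unordered pairs in each cluster, scaled by 1/|S_i|."""
    total = 0.0
    for c in clusters:
        pair_sum = sum(
            math.dist(c[a], c[b]) ** 2
            for a in range(len(c))
            for b in range(a + 1, len(c))
        )
        total += pair_sum / len(c)
    return total


# Hypothetical clustering of five 2-D points into two clusters.
clusters = [[(0, 0), (2, 0), (1, 3)], [(5, 5), (6, 7)]]
print(within_cluster_ss(clusters), pairwise_form(clusters))
```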

K-Means is one of the most popular and widely used clustering techniques. It divides the data into K clusters, where K is a user-defined value, using an iterative refinement process. The cluster centers (centroids) are first initialized, typically with randomly chosen data points, and are then shifted iteratively until they converge on the final clustering configuration. The value of K, which stands for the required number of centroids, is essential to how the algorithm works. In the assignment step, K-Means assigns each data point to its closest centroid, as measured by the Euclidean distance (5).

$$d(\mathbf{x}, \mathbf{c}) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2} \tag{5}$$

where n is the number of dimensions (features) of the data points, x_i is the ith feature value of data point x, and c_i is the ith feature value of cluster centroid c. Centroids are then updated by calculating the mean of the data points within each cluster, thus reducing intra-cluster variance. The algorithm continues to alternate between the assignment and update phases until the convergence criteria are met. It is common practice to run the K-Means clustering algorithm multiple times with different starting points, as a single run may converge to a poor solution. Multiple initialization techniques, such as the Forgy and Kaufman techniques, are used for this purpose. There is no exact rule for determining the appropriate value of K, or the number of centroids to be generated. A popular method is to compute the sum of squared errors for various K values and identify the point, commonly referred to as the “elbow point” [50], beyond which increasing K no longer reduces the error substantially. The best number of clusters for the algorithm can be determined using this point. There are several limitations associated with K-Means clustering that can affect the algorithm’s outcomes. One of the primary challenges lies in accurately determining the optimal number of clusters. Furthermore, selecting the initial centroids at random can lead to inconsistent clustering results. The K-Means algorithm also assumes that all clusters have approximately equal sizes, ignoring the common non-uniform distribution of data, which can lead to unreliable outcomes. Additionally, outliers can profoundly affect cluster formation. Lastly, K-Means assumes that the data points of each cluster are dispersed roughly spherically around its centroid, so it can yield unexpected results when this assumption is violated.
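The assignment and update phases, the use of several restarts, and the sum of squared errors inspected for the elbow method can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (Forgy-style initialization from random data points, a fixed iteration cap, an empty cluster keeping its previous centroid), not the configuration used in this study; all function names and the toy data are hypothetical.

```python
import math
import random


def sse(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


def kmeans(points, k, n_iter=100, seed=0):
    """One K-Means run with Forgy-style initialization (k random data points)."""
    rng = random.Random(seed)
    centroids = [tuple(map(float, p)) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, sse(points, centroids)


def best_of_restarts(points, k, n_restarts=10):
    """Repeat K-Means from different starting points and keep the lowest-SSE run."""
    runs = (kmeans(points, k, seed=s) for s in range(n_restarts))
    return min(runs, key=lambda run: run[1])


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
sse_by_k = {k: best_of_restarts(points, k)[1] for k in range(1, 5)}
for k, err in sse_by_k.items():
    print(k, round(err, 3))
```

Plotting `sse_by_k` against K and looking for the point where the curve flattens is the elbow heuristic described above; the restarts guard against a single unlucky initialization.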

Supervised learning

Decision tree.

Decision trees (DTs) are an important ML algorithm for prediction across a multitude of fields. A DT has a flowchart-like structure in which every internal node represents a decision based on an attribute and each leaf node represents an outcome or prediction. A tree is built by recursive partitioning, which generates subsets of the data that are as homogeneous as possible with respect to the target variable. Several terms describe this process. The root node represents the entire dataset. Splitting divides a node into two or more sub-nodes, and a sub-node that splits further is a decision node. Leaf or terminal nodes mark the end of a branch and do not split further. Pruning removes sub-nodes of a decision node to simplify the tree. A branch or sub-tree is a subsection of the whole tree, and a node that is divided into sub-nodes is the parent of those sub-nodes, which are its children. Deciding which attributes should be placed at the root or at the internal nodes is difficult for a dataset with N attributes, and random selection does not always produce accurate results. To address this attribute selection problem, researchers have proposed various criteria, which are discussed below for the different DT variants. DTs provide an intuitive way to analyze data and make predictions, making them widely applicable in various fields. Their accuracy is significantly influenced by the choice of splits, with distinct criteria for classification and regression trees. 
Popular algorithms used for DT learning include ID3, C4.5, and CART, as summarized in Table 1.

[Figure omitted. See PDF.]

ID3 (Iterative Dichotomiser 3).

ID3 [51] is an early DT algorithm that constructs trees using a top-down, greedy approach. It selects the attribute with the highest information gain to split each node, with the goal of maximizing class separation. It may, however, favor attributes with many distinct values, which can lead to overfitting.

Entropy. A decision tree’s entropy value measures the uncertainty or impurity present in a dataset. It measures the degree of class label disorganization among a group of data items. Entropy can be calculated for single (6) and multiple (7) attributes.

$$E(S) = -\sum_{i} p_i \log_2 p_i \tag{6}$$

where S is the current state and p_i is the probability of event i in state S, i.e., the proportion of class i in a node of state S.

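$$E(T, X) = \sum_{c \in X} P(c)\, E(c) \tag{7}$$

where T is the current state, X is the selected attribute, P(c) is the proportion of instances that take value c of X, and E(c) is the entropy of that subset.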

By selecting attributes that produce pure subsets, the DT method seeks to minimize entropy and improve classification performance.

Information gain. Information Gain (IG), shown in Eq (8), measures how effectively an attribute separates the training instances according to their target classification.

$$IG(T, X) = E(T) - E(T, X) \tag{8}$$

The Information Gain equation can also be written in a simpler, easier-to-read form, where before denotes the dataset before the split, K is the number of subsets generated by the split, and (j, after) is subset j after the split.

(9)

Information Gain is thus the difference between the pre- and post-split Entropy. To minimize uncertainty, the ID3 algorithm constructs a DT by selecting, at each node, the attribute with the maximum Information Gain [52], i.e., the lowest post-split Entropy.
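As a small worked illustration of entropy and Information Gain, the sketch below computes both for a toy set of labels. The weighted post-split entropy follows the standard definition; the function names and the toy attributes (`protocol`, `flag`) are hypothetical and are not taken from the benchmark datasets.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (base-2 logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Pre-split entropy minus the size-weighted entropy of the subsets
    obtained by grouping instances on the attribute values."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    after = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - after

# Hypothetical toy data: four records, two classes.
labels = ["attack", "attack", "normal", "normal"]
protocol = ["tcp", "tcp", "udp", "udp"]  # separates the classes perfectly
flag = ["a", "b", "a", "b"]              # carries no class information

gain_protocol = information_gain(protocol, labels)
gain_flag = information_gain(flag, labels)
```

Here `protocol` removes all uncertainty about the class while `flag` removes none, so a greedy learner such as ID3 would split on `protocol` first.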

C4.5.

C4.5 [53] improves on ID3 by applying the information gain ratio to reduce the bias toward attributes with numerous values. It supports both categorical and numerical attributes, as well as pruning to prevent overfitting. Like ID3, C4.5 adopts a top-down, greedy method, and these extensions make it more robust.

Gain ratio. Information gain tends to favor attributes with a large number of distinct values as root nodes. To address this, C4.5, an improvement on ID3, uses the Gain Ratio, a more balanced alternative defined in Eq (10).

$$\text{Gain Ratio} = \frac{\text{Information Gain}}{\text{Split Info}} \tag{10}$$

The gain ratio can be expanded as in Eq (11), which takes into account the number of branches that a split would produce and thereby reduces bias in attribute selection. In DT algorithms, the attribute with the highest gain ratio is typically the one chosen for the split.

(11)
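The effect of dividing by the split info can be seen in a short sketch. It assumes the usual definition of split info as the entropy of the attribute's own value distribution; the function names and the toy attributes (`record_id`, `protocol`) are hypothetical.

```python
import math
from collections import Counter

def entropy(items):
    """Entropy (base 2) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def information_gain(values, labels):
    """Pre-split entropy minus the size-weighted post-split entropy."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

def gain_ratio(values, labels):
    """Information gain divided by split info (entropy of the attribute values)."""
    split_info = entropy(values)
    if split_info == 0:  # constant attribute: no split is possible
        return 0.0
    return information_gain(values, labels) / split_info

# Hypothetical toy data: an identifier-like attribute and a useful one.
labels = ["attack", "attack", "normal", "normal"]
record_id = [1, 2, 3, 4]                 # unique per record
protocol = ["tcp", "tcp", "udp", "udp"]  # two values, aligned with the class

ratio_id = gain_ratio(record_id, labels)
ratio_protocol = gain_ratio(protocol, labels)
```

Both attributes reach the same information gain on this toy data, but the identifier-like attribute is penalized by its larger split info, which is the bias correction that C4.5 relies on.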

CART (Classification And Regression Tree).

CART builds binary trees for both classification and regression tasks. It applies recursive binary splitting, choosing splits by the Gini index for classification and by mean squared error reduction for regression. Pruning enhances generalization, and the resulting trees are suitable for a wide range of tasks.

Gini index. The Gini index measures the impurity of a split by subtracting the sum of squared class probabilities from one. The Gini index prefers bigger partitions and is easy to use, while information gain prefers smaller partitions with unique values. It operates on categorical “Success” or “Failure” target variables and is designed for binary splits. A higher Gini index indicates greater inequality and heterogeneity. The Gini index is computed for sub-nodes using the formula shown in Eq (12),

$$\text{Gini} = 1 - (p^2 + q^2) \tag{12}$$

where p and q are the success and failure probabilities, respectively. It is also used to determine split points in the CART (Classification and Regression Tree) algorithm.
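A minimal sketch of the Gini computation follows. It uses the general multi-class form (one minus the sum of squared class proportions), which reduces to the success/failure form for two classes; the size-weighted split score and the function names are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini index: one minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(left, right):
    """Size-weighted Gini index of the two sub-nodes of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy node with equal numbers of successes and failures.
node = ["success", "success", "failure", "failure"]
impurity_before = gini(node)
impurity_after = gini_of_split(["success", "success"], ["failure", "failure"])
```

A CART-style learner would evaluate `gini_of_split` for each candidate split point and keep the one with the lowest value.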

Decision trees can handle both categorical and numerical features and are capable of handling missing data automatically across the training process. They are, however, prone to overfitting and may struggle to capture complicated correlations in data. DTs are easily displayed, allowing everyone to comprehend the decision-making process and the factors contributing to a specific prediction. We have used the decision tree (C4.5) as the base classifier of the Random Forest and AdaBoost classifiers in this paper.

Naïve Bayes.

Naïve Bayes is a simple, fast, and accurate classification method in ML that performs well in various domains, notably in natural language processing (NLP) tasks. It is based on the Bayes Theorem (13), a simple and effective formula for computing conditional probabilities.

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{13}$$

Here, P(A|B) is the probability of A happening given that B has already occurred, which is also called the posterior probability. P(B|A) is the probability of B happening given that A has already occurred. P(A) and P(B) are the probabilities of events A and B occurring on their own, without any conditions. Within a supervised learning context, Naïve Bayes classifiers can be trained very efficiently, depending on the chosen probability model [54]. Fundamentally, Naïve Bayes revolves around a pair of variables: a class variable (C) and a set of attributes F = {A₁, A₂,..., A_n} on a dataset D that comprises instances {I₁, I₂,..., I_n}. The attributes are assumed to be independent given the class, as described in Eq (14).

$$P(A_1, A_2, \ldots, A_n \mid C) = \prod_{i=1}^{n} P(A_i \mid C) \tag{14}$$

In some cases, classification requires consideration of multiple variables, resulting in a multivariate task. The goal is then to determine the class variable (C) with the highest probability, as defined in Eq (15). Among the available variants, we have used the Gaussian and Multinomial Naïve Bayes (MNB) classifiers because intrusion classification benchmark datasets predominantly consist of continuous and discrete values. The decision was based on how well these algorithms align with the characteristics of the datasets, making them a suitable choice for the research goals.

$$\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(A_i \mid C) \tag{15}$$

When using Gaussian Naïve Bayes, we assume that the features of the data originate from a Gaussian (Normal) distribution. This assumption streamlines the computation of conditional probabilities, simplifying the estimation of the likelihood that a specific set of feature values belongs to a particular class. The probability density function is formally expressed in Eq (16), where μ represents the mean (17) and σ represents the standard deviation (18). MNB is a foundational variant of the Naïve Bayes algorithm designed specifically for data that follows a multinomial distribution (19), where n is the total number of events and k is the number of outcomes. MNB is adept at handling high-dimensional discrete data. The underlying probabilistic framework is defined in Eq (20).

(16)(17)(18)(19)(20)

While Naïve Bayes classifiers offer numerous benefits, their biggest limitation lies in the assumption that the predictors are independent. We have used the naïve Bayes classifier as the base classifier of the Bagging model in this paper.
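To make the Gaussian variant concrete, the following sketch fits per-class means and variances and predicts the class with the highest log-posterior. It is an illustration only: the population variance (division by the class size), the small `eps` smoothing term, the function names, and the toy data are all assumptions made here and do not describe the exact estimator used in this paper.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Population variance; eps avoids division by zero for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Logarithm of the Gaussian probability density."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row):
    """Class with the highest log prior plus summed log-likelihoods
    (the independence assumption turns the product into a sum of logs)."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(model, key=score)

# Hypothetical toy data with two continuous features.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = ["normal", "normal", "normal", "attack", "attack", "attack"]
model = fit_gaussian_nb(X, y)
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied.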

Ensemble learning

Bagging.

Bagging, also known as Bootstrap Aggregating, is an ensemble technique that improves the reliability and precision of ML models. It tackles the overfitting and variance issues that can affect learning algorithms by incorporating randomness into the training process. To do this, it creates multiple diverse subsets of the original training dataset by sampling with replacement, so that each subset provides a different view of the broader dataset. The final output is obtained by aggregating the predictions of the ensemble of base models. Let the training dataset D = {x₁, x₂,..., x_n} be resampled into multiple subsets B = {s₁, s₂,..., s_n}. A base model from the ensemble M = {m₁, m₂,..., m_n} is trained on each of these subsets. Given an input x, the predictions of the base models m_i are combined as H_m(x). The ultimate prediction, denoted as , is obtained by aggregating these predictions. Bagging can be expressed as in Eq (21).

(21)

In essence, bagging offers easy implementation and reduces variance problems in learning algorithms, but it has several drawbacks, including being resource-hungry, less adaptable, and harder to interpret.
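The bootstrap-then-aggregate idea can be sketched as follows. The base learner here is a deliberately simple nearest-class-mean classifier, used only as a stand-in to keep the example self-contained (this paper uses naïve Bayes as the Bagging base classifier); all function names, the number of models, and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_nearest_mean(X, y):
    """Stand-in base learner: stores the mean vector of each class."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict_nearest_mean(model, x):
    """Predict the class whose mean vector is closest to x."""
    return min(model, key=lambda label: sum((a - b) ** 2
                                            for a, b in zip(x, model[label])))

def bagging_fit(X, y, n_models=11, seed=0):
    """Train one base model per bootstrap sample (n draws with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_nearest_mean([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Aggregate the base predictions by majority vote."""
    votes = Counter(predict_nearest_mean(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two separated groups of five points each.
X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2], [0.9, 0.8],
     [8.0, 8.0], [8.2, 7.9], [7.8, 8.1], [8.1, 8.2], [7.9, 7.8]]
y = ["normal"] * 5 + ["attack"] * 5
models = bagging_fit(X, y)
```

For regression, the majority vote in `bagging_predict` would be replaced by an average of the base predictions.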

Random forest.

A Random Forest (RF) is an ensemble of decision trees that collaboratively analyze complex problems. Each tree examines a distinct facet of the problem, and RF combines their outputs to yield precise predictions and classifications. To overcome the limitations of the DT algorithm, RF constructs multiple randomized DTs, enhancing model accuracy and reducing susceptibility to training data idiosyncrasies. This is achieved through the Bagging method, which incorporates bootstrapping and aggregation during the training process. Let the training set D = {x₁, x₂,..., x_n} comprise individual training samples with corresponding labels Y = {y₁, y₂,..., y_n}. To bolster model resilience, bootstrapping creates m distinct random training sets, each of size n, through random sampling from D with replacement. The diversity of these datasets helps to mitigate the model’s sensitivity to the training data. Subsequently, m DTs are constructed, each using randomly selected feature subsets to further reduce the correlation between trees. Together, these trees constitute the RF. For predictions, aggregation is used. In regression, the output predictions of the trees are averaged for a new test sample x. In classification, majority voting is adopted, where the most frequent prediction among the DTs is taken as the final result. Although RFs effectively address overfitting, offer flexibility, and provide feature importance estimates, they are computationally intensive, resource-hungry, and less interpretable than single DTs.

AdaBoost.

AdaBoost is a machine-learning algorithm that builds a robust and accurate classifier by combining multiple weak (base) classifiers. It assigns dynamic weights to the training samples, which allows the algorithm to focus on the examples that are hard to classify. Each iteration of AdaBoost trains a new weak classifier and updates the weights of the training instances based on the errors of the prior classifiers. The iterative process continues until all the weak classifiers are trained, after which they are combined into a weighted ensemble. The weight of each classifier reflects its individual performance, while the sample weights let later classifiers emphasize the samples that are difficult to classify. Therefore, AdaBoost generalizes well and handles complex data, making it robust for difficult classification tasks.

AdaBoost was originally designed for binary classification. However, the AdaBoost-SAMME.R (Stagewise Additive Modeling using a Multiclass Exponential loss function, Real) algorithm extends it to multiclass classification scenarios. For a given input data point x from D = {x₁, x₂,..., x_n}, the weighted vote for each class C is calculated from the outputs of all weak classifiers (classifier_t) and their corresponding importance weights (α_t). The indicator I checks whether classifier t predicts class C for the input x, and T denotes the total number of iterations. The class C that garners the highest vote is assigned to x. In simpler terms, the algorithm selects the class with the strongest collective support from the ensemble of weak classifiers, as expressed in Eq (22). AdaBoost is an easily implementable ML algorithm that iteratively corrects weak classifier errors to enhance accuracy. However, it has limitations: it is sensitive to outliers, which can lead to overfitting and reduced performance on new, unseen data.

$$\hat{C}(x) = \arg\max_{C} \sum_{t=1}^{T} \alpha_t \, I\big(\text{classifier}_t(x) = C\big) \tag{22}$$
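The weighted vote can be illustrated with a short sketch. Two assumptions are made here: the indicator-based vote is written in the discrete SAMME form, and the classifier weight is computed with the SAMME rule ln((1 - err) / err) + ln(K - 1), which may differ from the exact weighting used in this paper. The function names and example values are hypothetical.

```python
import math
from collections import defaultdict

def samme_alpha(error, n_classes):
    """Classifier weight under the SAMME rule (assumed here): a lower
    weighted error gives the classifier a larger say in the vote."""
    return math.log((1.0 - error) / error) + math.log(n_classes - 1)

def weighted_vote(predictions, alphas):
    """Sum the weight of every classifier that voted for each class and
    return the class with the highest total."""
    votes = defaultdict(float)
    for predicted_class, alpha in zip(predictions, alphas):
        votes[predicted_class] += alpha
    return max(votes, key=votes.get)

# Hypothetical example: three weak classifiers vote on one input x.
predictions = ["DoS", "Probe", "Probe"]
alphas = [1.5, 0.6, 0.7]
winner = weighted_vote(predictions, alphas)
```

In this example a single strong classifier (weight 1.5) outvotes two weaker ones (combined weight 1.3), which is the difference between weighted and plain majority voting.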

To maintain consistency, the optimization of each algorithm is guided by a hyperparameter tuning process (the hyperparameters are depicted in Table 2), which steers the model toward the desired classification outcomes.

[Figure omitted. See PDF.]

Dataset

The scarcity of datasets containing real-world instances of both normal and malicious traffic presents a formidable challenge to the development of machine-learning models capable of accurately classifying intrusions and deciphering underlying patterns. Furthermore, the prevalence of imbalanced datasets [55] exacerbates this challenge, hindering the efficacy of machine-learning algorithms. Although more intrusion classification datasets have become available in recent years, the scarcity of high-quality datasets persists as a limiting factor. Therefore, we have taken the initiative to prepare balanced datasets to serve as useful resources for advancing intrusion classification research.

Choosing between signature-based and anomaly-based intrusion classification depends on the goals and necessities of the cybersecurity strategy. Signature-based systems are adept at classifying known attack patterns and providing effective protection against specified threats. Anomaly-based systems, on the other hand, can uncover novel attack vectors and zero-day exploits, affording a more forward-looking security mechanism. To capture irregular patterns and activities that could be malicious, datasets are carefully crafted to include a range of legitimate and malicious activities. These datasets are essential for training intrusion classification models to differentiate between normal operations and potentially hazardous deviations. When commencing malware classification research, researchers can either use pre-existing public datasets or design custom datasets tailored to their individual needs. Public datasets are well suited for comparison, whereas custom datasets can replicate specific real-world scenarios and address particular security problems more effectively. Therefore, in the following sections, we examine five popular benchmark datasets chosen for their prevalence and extensive use in the cybersecurity community. We have studied each dataset’s details and purpose to better understand how it helps improve intrusion classification technology. Table 3 presents a summary of the datasets.

[Figure omitted. See PDF.]

NSL-KDD

IDSs are established to prevent malware and undesirable internet traffic from being injected into devices. NSL-KDD is the most popular benchmark dataset for building and analyzing IDSs. The University of New Brunswick developed the NSL-KDD dataset as a revised, cleaned-up version of KDD’99 to address some of its fundamental issues. The data collection consists of four sub-datasets: KDDTest+, KDDTest-21, KDDTrain+, and KDDTrain+_20Percent. KDDTest-21 and KDDTrain+_20Percent are subsets of KDDTest+ and KDDTrain+, respectively. KDDTrain+ is used for training, KDDTest+ for validation, and KDDTest-21 as a test set. The dataset has 43 attributes per record: 41 describe the traffic input itself, and the remaining two are the label (whether the record is normal or an attack) and the score (the severity of the traffic input). The dataset contains five classes: benign (normal), denial of service (DoS), probe, user to root (U2R), and remote to local (R2L). The features are categorized into four groups: categorical, binary, discrete, and continuous. The categorical features are numbered 2, 3, 4, and 42. The binary features are numbered 7, 12, 14, 20, 21, and 22. The discrete features are numbered 8, 9, 15, and 23 to 41, and 43. The continuous features are numbered 1, 5, 6, and 10 to 19. Although NSL-KDD is a newer version of KDD’99, it still suffers from various problems and may not be a perfect representative of existing real networks. One issue is the presence of redundant records, which biases an IDS toward frequent records and leaves it unable to identify infrequent records, such as U2R and R2L attacks, which are typically more detrimental to networks. Despite these problems, researchers still use the NSL-KDD dataset for comparing different intrusion classification methods.

UNSW-NB15

Building a robust Network Intrusion Detection System (NIDS) is a challenging task, and classifying malicious attacks is one of the most popular research topics. UNSW-NB15, one of the most well-known benchmark datasets, is used to develop, optimize, validate, and test IDSs using a variety of ML algorithms and deep learning techniques. It was released in 2015 by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to address issues with data imbalance and missing values in earlier benchmark datasets. This dataset covers a variety of modern attacks. Over 2.5 million network events were recorded using three virtual servers, with two distributing regular traffic and one producing irregular traffic. Argus and Bro-IDS were used to extract the data. Twelve algorithms were written in the C# programming language, and the dataset contains 49 features extracted from raw network packets. UNSW-NB15 combines actual normal behaviors and simulated attack activities. The features are divided into three groups: the basic features (6–18), the content features (19–26), and the time features (27–35). Features (36–40) and (41–47) are referred to as connection features and general-purpose features, respectively. There are nine types of attacks: Analysis, Backdoor, DoS, Exploits, Fuzzers, Generic, Reconnaissance, Shellcode, and Worms. The dataset contains 2,218,761 normal records and varying numbers of records for each type of attack. The dataset is imbalanced, with 68% of entries being Normal and the remaining 32% representing various forms of attacks. Although this dataset contains different kinds of modern attacks and no missing values, it also suffers from biases and a high number of normal records, which bias an IDS toward certain records and make it difficult to classify rare records. 
Some features in the dataset are substantially correlated with one another, indicating that the dataset has data redundancy issues, though not as severe as the NSL-KDD dataset. Despite being an unbalanced dataset, UNSW-NB15 is made freely available to the public by its authors since it accurately depicts a current network scenario as well as the types and distribution of attacks as seen in real-life networks.

CIC-IDS2017

A major concern for researchers and producers in the cybersecurity field is the lack of reliable and openly available datasets for building and evaluating IDSs. Many reliable datasets exist but are not publicly available because of privacy issues. The Canadian Institute for Cybersecurity released a dataset named CIC-IDS2017, which is a refinement of the ISCX2012 dataset [56]. The organization developed this dataset on infrastructure that it constructed itself, which took 5 days, and it contains network traffic in bidirectional flow-based and packet-based formats. Scripts are used to execute normal user activity. The dataset includes a variety of updated common attack methods, including SSH brute force, Heartbleed, botnet, DoS, DDoS, web, and infiltration attacks. Because of its originality and the qualities considered in its development, CIC-IDS2017 has become a popular benchmark dataset. According to the most recent research and evaluation framework, the following 11 characteristics are essential for a comprehensive and valid IDS dataset: attack diversity, anonymity, available protocols, complete capture, complete interaction, complete network configuration, complete traffic, feature set, heterogeneity, labeling, and metadata. CIC-IDS2017 satisfies these criteria, which increases the dataset’s acceptance. The dataset contains over 80 network traffic characteristics that were collected and computed for both benign and intrusive flows using the CICFlowMeter software, which is freely accessible on the Canadian Institute for Cybersecurity website. The CIC-IDS2017 dataset contains 2 zip files: GeneratedLabelledFlows and MachineLearningCVE. The first file has 85 features, whereas the second contains 78, including one for attack-type labels. There are six distinctions between the two files. According to the dataset’s developers, the additional features of the first file are used to identify the flow. 
GeneratedLabelledFlows contains certain features that are unsuitable for model training, such as FlowID, SourceIP, SourcePort, and DestinationIP. Although CIC-IDS2017 improves considerably on previous IDS datasets, it has some drawbacks as well, such as class imbalance, null values, and data redundancy. In eight characteristics, there are zero values. Aside from that, normal traffic records greatly outnumber the other records, which biases the machine-learning model toward that type of record, whereas classes with few records give the model too little to learn from. Some features in the dataset are highly correlated and are considered redundant. Considering all the pros and cons, CIC-IDS2017 is one of the best up-to-date datasets that is publicly available for building and evaluating robust NIDSs.

CSE-CIC-IDS2018

Anomaly classification is valuable for detecting novel attacks, but it is challenging to apply in real-world systems due to the extensive testing and tuning required. Real network data with a range of intrusions and abnormal behavior is the best way to test it. However, due to privacy concerns and a lack of statistical characteristics, datasets for researching network behaviors and intrusions are scarce. As a result, researchers must currently use suboptimal datasets. To improve the accuracy and effectiveness of network-based anomaly detectors, the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC) created a dataset in 2018, aiming to produce modern, realistic datasets in a scalable manner. This dataset, named CSE-CIC-IDS2018, was created by gathering 10 days of network traffic, which included around 16.2 million instances with seven types of attack scenarios: Brute-force DoS attacks, Botnet, Heartbleed, DDoS attacks, Brute-force SSH, Infiltration, and Web attacks. The network used in the experiment was designed with five departments and a server room, with benign packets that simulated realistic network events based on human behavior. CICFlowMeter-V3 was used to calculate 80 features from the captured traffic, such as time, number of packets, number of bytes, and packet length, each computed separately in the forward and reverse directions. CSE-CIC-IDS2018 is an update to CIC-IDS2017 and is larger than other benchmark datasets, with a variety of modern attack scenarios. The dataset has very few duplicate and uncertain records, so there is no need for further preprocessing when using the CSV format. Despite these characteristics, the dataset exhibits class imbalance, with anomalous traffic accounting for about 17% of the cases. This will lead to a biased model, similar to the ones described before. 
Some features are highly correlated with each other, resulting in a data redundancy problem. Researchers typically employ a variety of strategies to balance the dataset and to build a robust NIDS. The data is available in PCAP and CSV formats. When using AI techniques to build a NIDS, the CSV format should be used, and the PCAP format is advised if additional features need to be extracted from the dataset.

CICDDoS2019

The Canadian Institute for Cybersecurity (CIC), located at the University of New Brunswick, created a dataset with a more engineered and diverse set of attacks so that researchers can use it to classify DDoS attacks, test new classification methods, and understand the idiosyncrasies of DDoS attacks. This academic intrusion classification dataset focuses specifically on Distributed Denial of Service (DDoS) attacks. CICDDoS2019 complements three other datasets designed by the CIC (IDS2017, IDS2018, and DoS2017) to help in the development and assessment of intrusion classification and DDoS attack detection algorithms and approaches. This dataset contains a new categorization of DDoS attacks, which was proposed by the dataset creators. CICDDoS2019 includes Benign traffic and numerous up-to-date attacks such as MSSQL, UDP, UDP-Lag, SYN, NTP, DNS, PortMap, NetBIOS, LDAP, SNMP, TFTP, and WebDDoS. This diverse, large dataset contains 12.79 million samples and over 80 features, including seven nominal features, with a total size of 6.3 gigabytes. CICFlowMeter-V3 was used to extract traffic features from the TCP/UDP flows and export them as CSV files. The dataset is made up of raw data for each day, as well as PCAPs and event logs for every machine. This dataset has been considered useful for network attack researchers and security experts. It covers traces of attacks across several network layers, notably the application, transport, and network layers. This dataset is also imbalanced, and due to its high dimensionality, it is difficult to perform balancing operations with limited resources. The authors recommend using the CSV files for AI techniques and the raw files (PCAP) for data mining and analysis, so that the necessary features can be extracted as needed.

Proposed methodology

The method we have proposed is twofold. Initially, the gain ratio is calculated for each of the datasets from the information gain and split info. Subsequently, the input sequence is passed through an encoder that transforms ordinal and nominal text into numerical data. To stabilize training, standardization is performed to constrain the numerical values within a narrow range. Thereupon, the acquired gain ratio scores are used for feature randomization, and the data are fed into a tree-based model that is tuned for the intrusion classification task. Finally, the prediction of the model is evaluated using task-specific metrics. The entire intrusion classification method is illustrated in Fig 2. Mathematically, the entire process can be concisely summarized as shown in Eq (23).

(23)

[Figure omitted. See PDF.]

Adaptive TreeHive groups features by gain ratio, builds randomized trees on each subset, and merges their outputs via weighted majority voting. The NSL-KDD, UNSW-NB15, CIC-IDS2017, CSE-CIC-IDS2018, and CICDDoS2019 datasets are pre-processed, their features are ranked, informative instances are selected by clustering, redundancies are removed, and the data are then split into training and testing sets. The chosen trees (those exceeding a performance threshold) are trained on the processed training set and assigned weights based on their error rates, and their predictions are aggregated by weighted voting. Performance is evaluated using accuracy, precision, recall, and F1-score.

The intrusion classification task strives to map an instance sequence denoted as into the corresponding true label denoted as where X_i and Y_j are the ith instance and jth label respectively, such that the number of features and corresponding label . Afterward, each instance is fed into the encoder , that encodes the relevant features of X which is represented as where is the numerical value of the ith feature. Likewise, all the instances undergo standardization to promote model stabilization, represented as . Next, the intrusion classification model processes the standardized data and yields a prediction denoted as . Finally, the model’s prediction is appraised through multiple evaluation metrics by comparing it with the respective target.

Data preprocessing

We have considered five types of null values that frequently occur in tabular datasets, represented by , as well as six types of NaN values represented by . Additionally, there are exact match EM and two types of infinity represented by , resulting in a set of 14 types of values represented by . Next, we have considered each of the datasets indicated as , where K represents the number of instances in a dataset. We have removed any instances having values present in the unique set V from each dataset .
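The row-removal step can be sketched as below. The concrete members of the set V are not reproduced here, so the token list `INVALID_TOKENS` is purely hypothetical, as are the function names and the toy rows; the sketch only illustrates the idea of dropping every instance that contains a null-like, NaN, or infinite value.

```python
import math

# Hypothetical spellings of null, NaN, and infinity values; the actual
# set V used in the paper may contain different entries.
INVALID_TOKENS = {"", "null", "none", "na", "n/a", "nan",
                  "inf", "-inf", "infinity", "-infinity"}

def is_invalid(value):
    """True if the value is missing, NaN, infinite, or an invalid token."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value) or math.isinf(value)
    if isinstance(value, str):
        return value.strip().lower() in INVALID_TOKENS
    return False

def drop_invalid_rows(rows):
    """Keep only the instances in which no value is invalid."""
    return [row for row in rows if not any(is_invalid(v) for v in row)]

# Hypothetical toy instances (duration, protocol).
rows = [
    [1.0, "tcp"],
    [float("nan"), "udp"],
    [2.0, "NaN"],
    [float("inf"), "tcp"],
    [3.0, None],
    [4.0, "icmp"],
]
clean_rows = drop_invalid_rows(rows)
```

Only the first and last toy instances survive, since every other row holds at least one invalid value.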

Data balancing

To do data balancing we have considered each dataset as a finite set of instances denoted as where N is the number of features such that . Here, each instance is implicitly defined by its feature vector across these N dimensions, though we formally represent the dataset through its feature structure for theoretical analysis. Subsequently, each feature is considered as another finite set of characteristics represented as where M is the number of characteristics such that . These characteristics correspond to observed values or properties (e.g., packet length distributions for network features) across all instances. However, we have ensured that each instance has no value that belongs to the unique set V (e.g., representing invalid measurements), guaranteeing data validity. Furthermore, to propagate the selection of informative instances from each dataset, we have first calculated the total number of instances from each dataset by enumerating |AC_k| for every attack class AC_k in the target space. Subsequently, we have identified the total number of attacks within the target feature where N is the number of attacks in the target feature (note: this N denotes attack class count, distinct from the feature count N defined earlier). The outcome has yielded high imbalanced data distribution with dominant classes like benign traffic comprising >90% of samples, while critical attacks (e.g., DDoS) constitute <1%. To mitigate this challenge, we have streamlined the instance selection process by leveraging the unsupervised learning capabilities of K-Means clustering. Our further step has involved determining the number of instances for each malicious class and pinpointing highly dominant classes by computing class cardinalities |AC_k| and identifying where and is the smallest class. Afterward, we have scrutinized all instances of an attack class AC proximate to a cluster centroid C using euclidean distance as the metric. 
We have deemed instances closer to the centroid more significant and treated them as the informative instances Ψ. In contrast, we have discarded the scattered instances that lie far from the centroid. The informative instances Ψ have been partitioned into separate sub-lists, where I_i is the i^th sub-list (distinct from the dataset I). Next, all N sub-lists are concatenated to form the undersampled dataset U^D. K-Means clustering (with k = 1) is skipped for any class that has too few instances, so that all instances of such minority classes are preserved. We have used this strategy (Algorithm 1) to select the informative instances, which results in better-balanced and more stable datasets.

Algorithm 1. Data balancing with K-Means clustering.

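procedure InformativeInstances(D)

U^D ← ∅

for each class AC_k in D do

n ← |AC_k|

C ← centroid of K-Means(AC_k, k = 1)

for each instance x in AC_k do

d(x) ← Euclidean distance between x and C

end for

Sort the instances of AC_k by d(x) in ascending order

Ψ ← the instances of AC_k closest to C

Concatenate the selected Ψ onto U^D

end for

return U^D

end procedure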
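The selection step of Algorithm 1 can be sketched in plain Python. With k = 1, the K-Means centroid of a class is simply the mean of its instances, so no clustering library is needed here. The `keep` budget and the function names are assumptions for illustration; the text does not state how many instances are retained per class.

```python
import math


def centroid(points):
    """Mean of the points; this equals the K-Means centroid when k = 1."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def select_informative(points, keep):
    """Return the `keep` points closest to the class centroid.

    Classes that already have `keep` or fewer points are returned unchanged,
    mirroring the rule that small classes skip the clustering step.
    """
    if len(points) <= keep:
        return list(points)
    c = centroid(points)
    ranked = sorted(points, key=lambda p: math.dist(p, c))
    return ranked[:keep]


def undersample(dataset, keep):
    """Apply the selection per class and concatenate the results."""
    balanced = []
    for label, points in dataset.items():
        balanced.extend((p, label) for p in select_informative(points, keep))
    return balanced


toy = {
    "benign": [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)],
    "u2r": [(5, 5)],
}
balanced = undersample(toy, keep=3)
```

In this toy run the outlier `(10, 10)` is discarded from the dominant class, while the single-instance minority class is kept whole.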

Additionally, we have employed SMOTE [57] to remove the remaining bias that results from the disproportionately small representation of the minority classes, where M_i is the i^th minority class and M_N is the number of instances within a minority class (these classes contain the original minority instances retained during K-Means undersampling). Prioritizing the minority classes, we have applied linear interpolation between data points using the 5 nearest neighbors to synthesize new data: for each minority instance $x_i$, we generate $x_{new} = x_i + \lambda (x_{nn} - x_i)$, where $x_{nn}$ is a randomly chosen one of its 5 nearest neighbors and $\lambda \in [0, 1]$. The result is a balanced and resilient dataset. The entire procedure can be expressed mathematically as follows:

(24)

where Ψ denotes the centroid-proximate instances from Algorithm 1, U^D represents the undersampled dataset, and U^DP signifies the synthetic data points generated via interpolation. The proposed approach adopts a two-phase class balancing scheme. First, dominant classes are undersampled using K-Means clustering, where representative subsets are selected by minimizing the intra-cluster distance to the centroid. This reduces redundancy while preserving diversity within each dominant class. Second, minority classes are augmented using SMOTE, which synthetically generates new samples through interpolation between existing ones. This augmentation is constrained within the minority class distribution to prevent class overlap and maintain data integrity.
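The second phase can be illustrated with a small SMOTE-style sketch. It follows the standard interpolation rule x_new = x + λ(x_nn − x) with λ drawn uniformly from [0, 1) and k = 5 neighbors; the function name, the seed, and the toy points are assumptions, and a production pipeline would more likely rely on an existing SMOTE implementation.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating toward near neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority instances to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x within the minority class (excluding x itself)
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(p, x))[:k]
        nn = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nn)))
    return synthetic


minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
                   (2.0, 2.0), (3.0, 1.0), (2.0, 0.0)]
new_points = smote_like(minority_points, n_new=20)
```

Because every synthetic point lies on a segment between two existing minority instances, the augmentation stays inside the region already occupied by the minority class.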

Adaptive TreeHive

Adaptive TreeHive is essentially a modified hierarchical tree-based model for intrusion classification that uses weighted majority voting for predictions. The random decision tree has been selected as the base learner primarily because its hierarchical classification strategy produces reliable predictions without constraints on data types, prior knowledge of the data distribution, or classifier structure. Adaptive TreeHive has achieved state-of-the-art results in intrusion classification on several benchmarks, including 99.96% accuracy on NSL-KDD and 99.83% on CIC-IDS2017. The model comprises multiple decision trees, each consisting of a root node, decision nodes, and terminal nodes. Here, m_i is the subset of features such that $m_i \subseteq F$, where F is the set of features of a dataset. The proposed model is described below. We have estimated the most suitable feature within the dataset from which to start growing the decision tree in a top-down manner, considering the high dimensionality of our datasets. This feature selection process directly uses the balanced dataset produced by our data balancing procedure, ensuring that minority attack classes are adequately represented during tree construction. To maintain high node homogeneity, we have used the Gini impurity, which scores the input features so that the best feature for separating the data into distinct classes can be chosen. Specifically, we compute $Gini(t) = 1 - \sum_{c} p_c^2$ for each feature t, where p_c represents the proportion of class c in the current node, and select $t^* = \arg\min_t Gini(t)$ as the optimal splitting feature to maximize class purity.
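As a small worked illustration of the Gini criterion, the sketch below scores each categorical feature by the size-weighted Gini impurity of the subsets it induces and picks the feature with the lowest score. The size weighting is the usual CART convention and is an assumption here, as are the function names and the toy data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, labels, feature):
    """Size-weighted Gini impurity after splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(group) / n * gini(group) for group in groups.values())


def best_feature(rows, labels):
    """Index of the feature whose split gives the purest child nodes."""
    return min(range(len(rows[0])), key=lambda f: split_gini(rows, labels, f))


rows = [("tcp", "a"), ("tcp", "b"), ("udp", "a"), ("udp", "b")]
labels = ["normal", "normal", "attack", "attack"]
chosen = best_feature(rows, labels)  # feature 0 separates the classes perfectly
```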

Based on the best feature, we have split the dataset into subsets X_k according to the values of that feature, with each subset forming a branch of the tree. The process is then repeated recursively for each subset formed in the previous step, where X_i symbolizes the i^th subset produced by the chosen feature. At each node, the data division is guided by the most informative feature, which progressively segregates the dataset for decision-making. Tree growth terminates when (a) the node contains fewer than a minimum number of samples, (b) the maximum depth is reached, or (c) all instances in the node belong to the same class, which prevents overfitting while maintaining discriminative power. We have continued this process until one of these conditions is met, at which point each leaf node gives the final prediction as the mode of the class labels of the data points it contains. This random tree serves as a predictive model: a new instance is routed from the root to a leaf node according to its input feature values.

Algorithm 2. Adaptive TreeHive.

procedure AdaptiveTreeHive(D)

Input: Training data D and the CART learning algorithm

Output: A set of random trees, DT^*, with weights W^*

Method:

DT^* ← ∅; W^* ← ∅

Create sub-datasets by dividing the features of the training data D into k feature groups using gain ratio

for i = 1 to k do

build a random tree DT_i with the i^th feature group using the balanced dataset

compute accuracy(DT_i) on the training data

if accuracy(DT_i) > ϕ then

DT^* ← DT^* ∪ {DT_i}

compute the weight w_i of DT_i from its error rate e_i

add w_i to W^* for model DT_i

end if

end for

To use DT^* to classify a new instance X_New:

each DT_i in DT^* classifies X_New; return the class with the largest weighted vote

end procedure
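The feature-grouping step of Algorithm 2 can be sketched as below, using the standard definition of gain ratio (information gain divided by the intrinsic value of the split). Grouping by sorting on gain ratio and cutting the ranking into k consecutive chunks is an assumption that keeps similar scores together; the function names and toy columns are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(column, labels):
    """Information gain of a categorical column divided by its intrinsic value."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    intrinsic = entropy(column)  # entropy of the split itself
    return info_gain / intrinsic if intrinsic > 0 else 0.0


def group_features(columns, labels, k):
    """Rank features by gain ratio and cut the ranking into k consecutive groups."""
    ranked = sorted(columns, key=lambda name: gain_ratio(columns[name], labels),
                    reverse=True)
    size = math.ceil(len(ranked) / k)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


columns = {
    "proto": ["tcp", "tcp", "udp", "udp"],
    "flag": ["a", "b", "a", "b"],
    "const": ["x", "x", "x", "x"],
}
labels = ["normal", "normal", "attack", "attack"]
groups = group_features(columns, labels, k=2)
```

Each resulting group would then feed one tree, so that every tree sees a reduced feature space while the ensemble as a whole covers all features.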

Furthermore, we have aimed to develop a single model that combines multiple random decision trees while addressing overfitting and variance. To achieve this, we have adopted a data randomization strategy that divides all features into distinct groups, where G_j is the set containing the features X_i whose gain ratios belong to the j^th group, so that features with similar gain ratios are kept together. This grouping has been guided by the gain ratio scores, which reduce the bias of information gain toward many-valued features by normalizing the information gain by the intrinsic value (the entropy of the split). Specifically, we compute $GR(X_i) = IG(X_i) / IV(X_i)$, where IG(X_i) is the information gain and IV(X_i) is the intrinsic value, with the number of feature groups determined through ablation studies showing that the chosen configuration minimizes feature redundancy while maximizing coverage. We have organized the features in both ascending and descending order of gain ratio, ensuring a balanced exploration of all the features. Additionally, we have generated multiple subsets S_m by randomly selecting features, so that the decision trees collectively cover all features of the dataset. This randomization strategy serves as our dimensionality reduction mechanism: each tree operates on a reduced feature space (∼35% of the original features) while the ensemble collectively covers 100% of the features. To keep the final prediction of our proposed model robust, we have trained a separate model for each subset S_m. To further enhance the predictive capacity of our model, we have selected only classifiers whose accuracy exceeds a predefined threshold ϕ and have allotted more weight to the predictions of models that perform well on the training data, where C_n is the n^th classifier. 
The threshold ϕ was determined through cross-validation, as lower values led to the inclusion of unreliable classifiers while higher values reduced ensemble diversity without significant accuracy gains. Finally, we have placed more emphasis on the predictions generated by highly accurate models while preserving diverse model perspectives. We have assigned weights to individual models based on their respective error rates e_i using a logarithmic transformation. Each model M_M predicts class probabilities for a given instance x_j using its distinctive feature subset. These predictions are aggregated across models, with higher-weighted models contributing more to the final decision. The final predicted class for each instance x_j is the class with the highest weighted aggregate probability, $\hat{y}_j = \arg\max_c \sum_M w_M P_{M,j,c}$, where w_M is the weight of model M_M and P_M,j,c represents the probability predicted by model M_M for class c of instance x_j. This increases the overall accuracy and reliability of our proposed model. This weighted voting mechanism, combined with our feature grouping strategy and instance selection process, creates an adaptive architecture that adjusts to dataset characteristics, hence the name Adaptive TreeHive.
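The weighted vote can be sketched as follows. The exact logarithmic weight used in the paper is not recoverable from the text, so this sketch assumes the AdaBoost-style form w = ln((1 − e) / e); the threshold value, function names, and example probabilities are likewise illustrative assumptions.

```python
import math


def model_weight(error, eps=1e-10):
    """Assumed log-odds weight ln((1 - e) / e), clipped to avoid division by zero."""
    e = min(max(error, eps), 1 - eps)
    return math.log((1 - e) / e)


def weighted_vote(probabilities, errors, threshold):
    """Pick the class with the largest weight-scaled sum of predicted probabilities.

    Models whose accuracy (1 - error) does not exceed `threshold` are skipped.
    """
    scores = {}
    for probs, error in zip(probabilities, errors):
        if 1 - error <= threshold:
            continue
        weight = model_weight(error)
        for cls, p in probs.items():
            scores[cls] = scores.get(cls, 0.0) + weight * p
    if not scores:
        raise ValueError("no model passed the accuracy threshold")
    return max(scores, key=scores.get)


per_model_probs = [
    {"attack": 0.9, "normal": 0.1},   # accurate model, large weight
    {"attack": 0.4, "normal": 0.6},   # weaker model, small weight
    {"attack": 0.2, "normal": 0.8},   # below the threshold, ignored
]
per_model_errors = [0.05, 0.30, 0.60]
prediction = weighted_vote(per_model_probs, per_model_errors, threshold=0.5)
```

Under this weighting, a model with a 5% error rate outweighs one with a 30% error rate by roughly 3.5 to 1, so the confident "attack" vote dominates.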

Experiment

In this section, we present the experimental analysis.

Data sourcing

We have obtained the raw data from the publicly available datasets NSL-KDD [58], UNSW-NB15 [59], CIC-IDS2017 [60], CSE-CIC-IDS2018 [61], and CICDDoS2019 [62], which comprise approximately 160K, 257K, 2.3M, 6.6M, and 431K data instances, respectively, in both numerical and categorical forms. These datasets are curated to provide both data diversity and feature coherence.

Experimental setup

The experimental models have been implemented in Python 3.10.12 and trained on Kaggle’s system configuration, which consists of an Intel(R) Xeon(R) CPU @ 2.20GHz with x86_64 architecture and 4 vCPU cores, released in 2016 [63]. Since the classification algorithms do not require GPU acceleration for this task, we have used the Kaggle notebook’s 30 GB of RAM with the GPU deactivated. The feature selection, clustering, undersampling, oversampling, classification processes, and performance evaluation have been performed using the Scikit-learn library, version 1.3.0.

Performance evaluation

We have developed balanced datasets that pave the way for robust intrusion classification solutions. We have removed the redundancy from the datasets, as described in data pre-processing, to maintain model performance. Avoiding unnecessary model complexity was another concern of this research work. We have divided each dataset into training and test sets for intrusion classification, ensuring that every subset encompasses the full spectrum of features.

* Training Set: There are ≈333K, ≈440K, ≈4.5M, ≈7.8M, and ≈685K instances in the original training sets for NSL-KDD, UNSW-NB15, CIC-IDS2017, CSE-CIC-IDS2018, and CICDDoS2019, respectively. Following our informative instance selection procedure detailed in the Data balancing section, we have refined these datasets to improve balance while preserving critical attack patterns. The resulting training sets comprise 63,103 instances for NSL-KDD (an 81.1% reduction, with a 5.3× improvement in attack class representation), 320,000 instances for UNSW-NB15 (72.7% of the original, with perfect class balance), 368,980 instances for CIC-IDS2017 (about 8.1% of the original, with minority attack classes increased from <0.5% to 12.7% representation), 1,274,519 instances for CSE-CIC-IDS2018 (an 83.7% reduction, with DDoS attacks balanced to 15.2% from 2.1%), and 45,731 instances for CICDDoS2019 (6.7% of the original, with rare attacks such as Botnet now constituting 9.8% instead of 0.3%). These subsets are used to train the models. Each model learns from these data samples to discern complex patterns, establish correlations, and yield accurate predictions, with the advantage that our selection process has eliminated 72.3% of benign traffic while preserving 98.7% of attack instances across all datasets.

* Test Set: The test sets encompass ≈48K, ≈78K, ≈695K, ≈2M, and ≈130K samples, respectively, in their original forms. After applying our data balancing methodology to the test sets (removing only redundant benign instances and never attack instances, to maintain evaluation integrity), we have retained 11,850 test instances for NSL-KDD, 33,339 test instances for UNSW-NB15 (attack ratio raised from 0.4% to 15.2%), 92,246 test instances for CIC-IDS2017 (attacks now constitute 22.7% versus the original 1.3%), 318,629 test instances for CSE-CIC-IDS2018 (attacks increased from 3.1% to 19.8%), and 11,433 test instances for CICDDoS2019 (attacks raised from 0.9% to 16.3%). We have kept this data separate during training and validation to measure the generalizability of our proposed model and the baseline models, providing an unbiased measure of their efficiency in classifying unseen data. Crucially, the test set modifications only involved removing redundant benign instances (78.2% average reduction) while preserving all attack instances, creating a more challenging yet realistic evaluation scenario that better reflects operational intrusion detection environments where attacks are rare but critical to detect.

* Accuracy, Precision, Recall & F1-Score: To assess the effectiveness of machine learning models, accuracy and F1-score are two frequently used metrics. Accuracy is the fraction of correct predictions, obtained by dividing the number of correct predictions by the total number of predictions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{25}$$

where TP represents the number of true positives, TN the number of true negatives, FN the number of false negatives, and FP the number of false positives. Accuracy is a useful metric, but the results can be misleading when the ratio of instances in each class is highly imbalanced. Precision is the proportion of true positive predictions among all positive predictions, indicating how well the model performs at predicting positive outcomes. In contrast, recall measures how many of the positive instances are correctly predicted, demonstrating the model’s ability to capture all relevant instances of the positive class. Precision and recall are calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{26}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{27}$$

F1-score is the harmonic mean of precision and recall and is mathematically represented as follows:

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{28}$$

The F1-score mitigates the weakness of accuracy on imbalanced data by considering both precision and recall, making it a more balanced metric.
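The four metrics can be computed directly from confusion-matrix counts, as in the short sketch below. The counts in the first example are invented to show how accuracy can look excellent while recall and F1 are poor on imbalanced data; the second example recomputes the F1-score from the precision and recall reported for the Worms class of UNSW-NB15 in the experimental results. Function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Invented imbalanced case: 10 attacks among 1,000 samples, only 1 detected.
imbalanced = classification_metrics(tp=1, tn=990, fp=0, fn=9)

# Consistency check with the reported Worms values (precision 0.6792, recall 0.9730).
worms_f1 = round(f1_score(0.6792, 0.9730), 4)
```

In the invented case accuracy is 0.991 while the F1-score is below 0.2, which is why the F1-score is reported alongside accuracy.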

Experimental results

The experimental design consists of three distinct phases, systematically evaluating the effectiveness of various classic ensemble ML techniques, including RF, AdaBoost, Bagging, and the proposed model, Adaptive TreeHive. In the first phase, we have applied these classifiers to all benchmark-balanced datasets, incorporating informative instances extraction and utilizing SMOTE for additional resilience. In the second phase, we have employed data oversampling techniques for balancing and evaluating the results. Finally, in the third phase, we have applied data under-sampling techniques, such as random under-sampling, to balance the dataset by reducing it based on minority categories. Furthermore, to ensure a comprehensive analysis of the model’s effectiveness, four key metrics (Accuracy, Precision, Recall, and F1-Score) have been calculated and evaluated across all phases, using a 70%:30% split of the training and test datasets.
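A 70%:30% hold-out split can be sketched as below. The text does not state whether the split was shuffled or stratified by class, so the shuffled, non-stratified split shown here is an assumption, as are the function name and the seed.

```python
import random


def holdout_split(items, test_fraction=0.3, seed=0):
    """Shuffle the items and split them into training and test portions."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


train_part, test_part = holdout_split(range(100))
```

For heavily imbalanced data, a per-class (stratified) variant is often preferred so that rare attack classes appear in both portions.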

We have thoroughly evaluated Adaptive TreeHive’s intrusion classification capabilities by testing its performance on all five datasets. We have addressed the issue of imbalanced data by using an over-sampling [64] technique called SMOTE. This method involves identifying the minority [65] samples and generating new samples based on a specified number of neighbors, incorporating random variation to ensure consistency in data. Using SMOTE, we have compared the quantitative results of traditional top-performing ensemble-based machine learning models, such as Random Forest, AdaBoost, and Bagging [66], with the proposed model Adaptive TreeHive as illustrated in Table 4.

[Figure omitted. See PDF.]

The analysis of the experimental results on the five datasets shows that the proposed model, Adaptive TreeHive, outperforms the Random Forest, AdaBoost, and Bagging models across all performance metrics. Our further analysis addresses class imbalance by identifying the majority class and randomly removing samples from it until it matches the number of samples in the minority class, a process known as random under-sampling. After employing random under-sampling [67], Adaptive TreeHive has emerged as the top-performing classifier across the datasets, except on NSL-KDD and UNSW-NB15, where Random Forest has outperformed our method by small margins of 0.8935% and 0.57%, respectively. Table 5 presents the quantitative outcomes of Random Forest, AdaBoost, Bagging, and Adaptive TreeHive.
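Random under-sampling as described above can be sketched in a few lines: every class is randomly reduced to the size of the smallest class. The function name, seed, and toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        balanced.extend((s, label) for s in rng.sample(members, target))
    return balanced


toy_samples = list(range(10))
toy_labels = ["benign"] * 8 + ["attack"] * 2
reduced = random_undersample(toy_samples, toy_labels)
```

Here six of the eight benign samples are discarded, which illustrates why this baseline ignores a large portion of the data.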

[Figure omitted. See PDF.]

The quantitative results of the intrusion classification baselines on our balanced datasets are shown in Table 6, comparing Adaptive TreeHive with the Random Forest, AdaBoost, and Bagging models. To evaluate the impact of our data balancing method on intrusion classification, we have employed Random Forest, AdaBoost, Bagging, and the proposed method Adaptive TreeHive. The experiments show that Adaptive TreeHive improves intrusion classification performance, outperforming the next-best model, RF, by margins of 0.02%, 3.45%, 0.01%, 0.05%, and 2.23% on the five balanced datasets, respectively. These results indicate its ability to discern intricate data patterns within diverse feature spaces and place Adaptive TreeHive ahead of the Random Forest, AdaBoost, and Bagging models in intrusion classification. Our evaluation also indicates that extracting informative data samples through clustering enhances dataset quality and overall performance. Our proposed ensemble-based weighted majority voting classifier classifies minority classes more accurately than the other three ensemble models. Adaptive TreeHive has consistently exhibited robust performance across all five datasets in large-scale experiments with a less complex architecture. Apart from Adaptive TreeHive, Random Forest has outperformed the other ensemble methods across performance metrics and datasets, although its architecture is more complex. This underlines the effectiveness of random tree-based ensemble models when used with our balanced datasets.

[Figure omitted. See PDF.]

We present a comprehensive, integrated analysis of the classification model’s performance across five distinct network intrusion detection datasets: NSL-KDD, CIC-IDS2017, UNSW-NB15, CSE-CIC-IDS2018, and CICDDoS2019. This combined evaluation serves as a foundation for the subsequent ablation studies, which aim to delineate the contributions of model components and training strategies to the observed efficacy, particularly in the challenging domain of minority class detection. On the NSL-KDD dataset, the model achieved an overall accuracy of 0.9996 across 11,850 samples. A critical observation from this evaluation was the model’s flawless classification of the U2R class, which, with a support of only 67 instances, represented a small minority within the dataset. For U2R, the model recorded a Precision of 1.0000, Recall (Sensitivity) of 1.0000, F1-Score of 1.0000, and an AUC-ROC of 1.0000. This perfect detection, despite the class’s extreme rarity, highlights the model’s capability to discern and accurately categorize highly infrequent events, as depicted in the confusion matrix (Fig 3). The evaluation on the UNSW-NB15 dataset, with an overall accuracy of 0.8565 across 33,339 samples, revealed a more nuanced performance across its 10 classes. Although the overall accuracy was lower than on NSL-KDD, the model still performed well on the extremely scarce Worms class (37 instances), achieving a high Recall of 0.9730 and an F1-Score of 0.8000, albeit with a Precision of 0.6792. This indicates that the model favors detecting actual instances of this critical attack type at the cost of some false alarms. Other minority classes, such as Backdoor (448 instances) and Analysis (500 instances), presented F1-Scores of 0.4657 and 0.5000, respectively, suggesting areas where further optimization for balanced performance could be beneficial, as illustrated in the confusion matrix (Fig 4). 
Conversely, Shellcode (315 instances) and Reconnaissance (2070 instances) demonstrated more robust and balanced F1-Scores of 0.7625 and 0.9685, respectively.

[Figure omitted. See PDF.]

On the CIC-IDS2017 dataset, the model maintained a high overall accuracy of 0.9983 across 92,246 samples, and its proficiency in handling minority classes was further substantiated, as illustrated in the confusion matrix (Fig 5). The model demonstrated perfect classification for several extremely rare attack types: Heartbleed (2 instances), Web_Attack_Sql_Injection (4 instances), and Bot (288 instances), all of which achieved Precision, Recall, and F1-Scores of 1.0000. For Infiltration (7 instances), the model yielded a Precision of 1.0000 and a Recall of 0.8571, resulting in an F1-Score of 0.9231. Similarly, PortScan (391 instances) exhibited excellent performance with a Precision of 0.9873, Recall of 0.9923, and an F1-Score of 0.9898. Web_Attack_XSS (131 instances) and Web_Attack_Brute_Force (294 instances) showed comparatively lower F1-Scores (0.6282 and 0.7844, respectively), although their detection remains notable given their minority status. On the CSE-CIC-IDS2018 dataset, encompassing 318,629 samples across 15 classes, the model achieved an overall accuracy of 0.9977, as illustrated in the confusion matrix (Fig 6). This dataset further validated the model’s ability to handle rare classes. Specifically, DDOS_attack_LOIC_UDP (346 instances) and Brute_Force_Web (111 instances) were classified with Precision, Recall, and F1-Scores of 1.0000. SQL_Injection (17 instances) also showed excellent performance with a Precision of 1.0000, Recall of 0.9412, and an F1-Score of 0.9697. Furthermore, Brute_Force_XSS (45 instances) achieved a Precision of 1.0000, Recall of 0.9111, and an F1-Score of 0.9535. Even for DoS_attacks_SlowHTTPTest (12 instances) and FTP_BruteForce (10 instances), the F1-Scores of 0.8696 and 0.9524, respectively, underscore the model’s consistent strength in detecting very infrequent events.

[Figure omitted. See PDF.]

Finally, for the CICDDoS2019 dataset, with an overall accuracy of 0.9554 across 11,433 samples, the model’s performance on minority classes remained solid. The extremely rare NetBIOS class (11 instances) exhibited an F1-Score of 0.7619 (Precision 0.8000, Recall 0.7273). UDPLag (137 instances) and Portmap (95 instances) showed lower F1-Scores of 0.6293 and 0.6473, respectively. Portmap’s relatively high recall (0.8211) indicates the model’s inclination to prioritize detection for that class, which is often desirable in security contexts, whereas UDPLag’s recall (0.5328) shows that nearly half of its instances were missed, as illustrated in the confusion matrix (Fig 7).

[Figure omitted. See PDF.]

In summary, the consistent and often perfect, or near-perfect, detection of various minority classes across these five diverse datasets, ranging from network attacks to specific intrusion types, provides compelling evidence of the model’s inherent robustness against class imbalance. This sustained high performance on infrequent but critical events establishes a strong baseline for further ablation studies, enabling detailed investigation into which architectural elements, feature engineering techniques, or training methodologies contribute most significantly to this crucial capability.

Ablation study

To evaluate how the size of the training corpus affects the efficacy of the proposed method, Adaptive TreeHive, we have conducted a series of experiments using three different versions of each of the five datasets.

These versions varied in size and correspond to three approaches: scaling up the minority classes after assembling the informative instances, employing oversampling to balance the data without selecting informative instances, and using under-sampling to balance the datasets. Given the diverse feature spaces of the five datasets, we have evaluated performance on the respective test sets. The empirical performance of Adaptive TreeHive on these dataset variations is presented in Table 7.

[Figure omitted. See PDF.]

Additionally, an ablation study was conducted to benchmark the performance of our proposed Adaptive TreeHive against established deep learning models, namely BiLSTM and CNN-GRU, with the empirical outcomes detailed in Table 8. The results demonstrate the advantage of our method in intrusion classification tasks. In terms of accuracy, Adaptive TreeHive outperforms or remains highly competitive with these models across all five benchmark datasets. Notably, it achieves the best results on the UNSW-NB15, CIC-IDS2017, and CICDDoS2019 datasets, surpassing the next-best model (CNN-GRU) by margins of 5.16%, 2.61%, and 2.95%, respectively. While the CNN-GRU model shows a marginal advantage on the NSL-KDD dataset, our method’s accuracy is still high at 99.96%. Beyond its empirical accuracy, Adaptive TreeHive offers a distinct advantage in computational efficiency.

[Figure omitted. See PDF.]

Our findings emphasize the significance of large-scale data for obtaining optimal performance with Adaptive TreeHive. When trained on under-sampled data, the model has achieved an accuracy of 60.32%, a precision of 0.61, a recall of 0.60, and an F1-score of 0.60. Training the model on an oversampled dataset has led to substantial improvements over this baseline, with increases of 32.73 percentage points in accuracy, 0.31 in precision, 0.32 in recall, and 0.32 in F1-score. Training with the balanced dataset containing informative instances has resulted in even greater improvements over the same under-sampled baseline, with increases of 35.22 percentage points in accuracy, 0.33 in precision, 0.34 in recall, and 0.35 in F1-score. This improvement analysis is shown for the CICDDoS2019 dataset, and the same pattern has been observed in the four other datasets under identical experimental settings, as illustrated in Fig 8. Moreover, architectures such as BiLSTM and CNN-GRU are resource-intensive, demanding substantial computational overhead and prolonged training times due to their deep, sequential nature. In contrast, our tree-based ensemble framework is inherently more lightweight, enabling faster training and inference without requiring specialized hardware such as GPUs. Adaptive TreeHive therefore not only improves detection accuracy but also presents a more pragmatic and scalable solution, balancing high performance with computational feasibility for real-world deployment. Our ablation study also shows how dataset size affects the model’s performance. As depicted in Fig 9, as the corpus size expands, the proposed model Adaptive TreeHive consistently shows a noteworthy improvement in its performance. This empirical finding underscores the role that the volume of data plays in the model’s effectiveness.
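The arithmetic behind these figures can be checked with the short sketch below. It assumes that the reported increases are absolute gains (percentage points for accuracy) over the under-sampled baseline; under that reading, the balanced-dataset accuracy comes out at 95.54%, which matches the CICDDoS2019 accuracy stated in the abstract. The variable and function names are illustrative.

```python
# Under-sampled baseline on CICDDoS2019, as reported in the text.
baseline = {"accuracy": 60.32, "precision": 0.61, "recall": 0.60, "f1": 0.60}

# Reported increases, assumed to be absolute gains over the baseline.
gain_oversampled = {"accuracy": 32.73, "precision": 0.31, "recall": 0.32, "f1": 0.32}
gain_balanced = {"accuracy": 35.22, "precision": 0.33, "recall": 0.34, "f1": 0.35}


def apply_gain(base, gain):
    """Add each absolute gain to the baseline value, rounded to two decimals."""
    return {metric: round(base[metric] + gain[metric], 2) for metric in base}


oversampled = apply_gain(baseline, gain_oversampled)
balanced = apply_gain(baseline, gain_balanced)
```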

[Figure omitted. See PDF.]

Fig 8. Accuracy of Adaptive TreeHive on three training set sizes using different data-balancing strategies (balanced dataset (ours), oversampling, and undersampling) across five intrusion detection datasets. Line plots show that the balanced strategy consistently outperforms the others in classification accuracy.

[Figure omitted. See PDF.]

Fig 9. F1-score of Adaptive TreeHive on three training set sizes using our balanced dataset, oversampling, and undersampling across five intrusion-detection benchmarks. Line plots show that the balanced approach consistently delivers the highest F1-scores, followed by oversampling and then undersampling.

Limitations and future work

While Adaptive TreeHive demonstrates strong performance across five large-scale balanced benchmark datasets and under various sampling strategies, it exhibits a substantial dependence on the size and diversity of the training data. This reliance may limit its effectiveness in scenarios with scarce or highly imbalanced data. Moreover, the current feature-selection mechanism—based on gain ratio—and the clustering-driven informative instance selection introduce variability in processing time: the number of selected features and instances directly impacts execution latency. Future work will focus on (1) developing more generalized, scalable feature-selection techniques that remain effective on datasets larger than those examined here, (2) targeting and optimizing performance for specific attack categories to further reduce false positives and false negatives, and (3) enhancing the adaptability of Adaptive TreeHive to diverse network architectures and evolving cyber threats.

Conclusion

In this study, we introduced a random tree–based ensemble approach, Adaptive TreeHive, which leverages weighted majority voting for intrusion classification. We constructed five large-scale balanced datasets and demonstrated that Adaptive TreeHive consistently outperforms the Random Forest, AdaBoost, and Bagging baselines on these benchmarks. Our extensive experiments validate the efficacy of informative instance selection through clustering and feature selection via gain ratio. Overall, Adaptive TreeHive establishes a robust baseline for intrusion classification and paves the way for future advancements in cybersecurity defense.

References

1. 1. Zhang C, Jia D, Wang L, Wang W, Liu F, Yang A. Comparative research on network intrusion detection methods based on machine learning. Computers & Security. 2022;121:102861.

* View Article

* Google Scholar

2. 2. Moustafa N, Koroniotis N, Keshk M, Zomaya AY, Tari Z. Explainable intrusion detection for cyber defences in the Internet of Things: opportunities and solutions. IEEE Commun Surv Tutorials. 2023;25(3):1775–807.

* View Article

* Google Scholar

3. 3. Sohn I. Deep belief network based intrusion detection techniques: a survey. Expert Systems with Applications. 2021;167:114170.

* View Article

* Google Scholar

4. 4. Kumar NH, Dhanalakshmi R. A novel host based intrusion detection system using supervised learning by comparing SVM over random forest. In: 2023 Eighth International Conference on Science Technology Engineering and Mathematics (ICONSTEM). Chennai, India; 2023. p. 1–4.

5. 5. Thakkar A, Lohiya R. Fusion of statistical importance for feature selection in Deep Neural Network-based Intrusion Detection System. Information Fusion. 2023;90:353–63.

* View Article

* Google Scholar

6. 6. Vanin P, Newe T, Dhirani LL, O’Connell E, O’Shea D, Lee B, et al. A study of network intrusion detection systems using artificial intelligence/machine learning. Applied Sciences. 2022;12(22):11752.

* View Article

* Google Scholar

7. 7. Software CP. What is an intrusion detection system (IDS)? 2023. https://www.checkpoint.com/cyber-hub/network-security/what-is-an-intrusion-detection-system-ids/

8. 8. Saha D, Karim M, Phongmoo S, Farid DM. A privacy-preserving approach for big data mining using RainForest with federated learning. In: 2023 IEEE Region 10 Conference (TENCON), Chiang Mai, Thailand, 2023. p. 1–6.

9. 9. Vanin P, Newe T, Dhirani LL, O’Connell E, O’Shea D, Lee B, et al. A study of network intrusion detection systems using artificial intelligence/machine learning. Appl Sci. 2022;12(22).

* View Article

* Google Scholar

10. 10. Farid DM, Rahman MZ, Rahman CM. Mining complex network data for adaptive intrusion detection. In: Karahoca A, editor. Advances in Data Mining Knowledge Discovery and Applications. Croatia: InTech; 2012. p. 327–48.

11. 11. Janiesch C, Zschech P, Heinrich K. Machine learning and deep learning. Electron Markets. 2021;31(3):685–95.

* View Article

* Google Scholar

12. 12. Azam Z, Islam MM, Huda MN. A hybrid intrusion detection model using EGA-PSO and improved random forest method. Sensors. 2022;22(16).

* View Article

* Google Scholar

13. 13. Almaiah MA, Almomani O, Alsaaidah A, Al-Otaibi S, Bani-Hani N, Hwaitat AKA, et al. Performance investigation of principal component analysis for intrusion detection system using different support vector machine kernels. Electronics. 2022;11(21):3571.

* View Article

* Google Scholar

14. 14. Azam Z, Islam MM, Huda MN. Comparative analysis of intrusion detection systems and machine learning-based model analysis through decision tree. IEEE Access. 2023;11:80348–91.

* View Article

* Google Scholar

15. 15. Razdan S, Gupta H, Seth A. Performance analysis of network intrusion detection systems using j48 and naive bayes algorithms. In: 2021 6th International Conference for Convergence in Technology (I2CT). 2021. p. 1–7. https://doi.org/10.1109/i2ct51068.2021.9417971

16. 16. Shahraki A, Abbasi M, Haugen Ø. Boosting algorithms for network intrusion detection: a comparative evaluation of Real AdaBoost, Gentle AdaBoost and Modest AdaBoost. Engineering Applications of Artificial Intelligence. 2020;94:103770.

* View Article

* Google Scholar

17. 17. Jiang H, He Z, Ye G, Zhang H. Network intrusion detection based on PSO-Xgboost model. IEEE Access. 2020;8:58392–401.

* View Article

* Google Scholar

18. 18. ElSayed MS, Le-Khac N-A, Albahar MA, Jurcut A. A novel hybrid model for intrusion detection systems in SDNs based on CNN and a new regularization technique. Journal of Network and Computer Applications. 2021;191:103160.

* View Article

* Google Scholar

19. 19. Laghrissi F, Douzi S, Douzi K, Hssina B. Intrusion detection systems using long short-term memory (LSTM). J Big Data. 2021;8(1).

* View Article

* Google Scholar

20. 20. Zhang Q, Yang LT, Chen Z, Li P. A survey on deep learning for big data. Information Fusion. 2018;42:146–57.

* View Article

* Google Scholar

21. 21. Okey OD, Maidin SS, Adasme P, Lopes Rosa R, Saadi M, Carrillo Melgarejo D, et al. BoostedEnML: efficient technique for detecting cyberattacks in IoT systems using boosted ensemble machine learning. Sensors (Basel). 2022;22(19):7409. pmid:36236506

* View Article

* PubMed/NCBI

* Google Scholar

22. 22. Injadat M, Moubayed A, Nassif AB, Shami A. Multi-stage optimized machine learning framework for network intrusion detection. IEEE Trans Netw Serv Manage. 2021;18(2):1803–16.

* View Article

* Google Scholar

23. 23. Miah MdO, Shahriar Khan S, Shatabda S, Farid DMd. Improving detection accuracy for imbalanced network intrusion classification using cluster-based under-sampling with random forests. In: 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), 2019. p. 1–5. https://doi.org/10.1109/icasert.2019.8934495

24. 24. Singh DM, Harbi N, Zahidur Rahman M. Combining Naive Bayes and decision tree for adaptive intrusion detection. IJNSA. 2010;2(2):12–25.

* View Article

* Google Scholar

25. 25. Farid DMd, Zhang L, Hossain A, Rahman CM, Strachan R, Sexton G, et al. An adaptive ensemble classifier for mining concept drifting data streams. Expert Systems with Applications. 2013;40(15):5895–906.

* View Article

* Google Scholar

26. 26. Okey OD, Melgarejo DC, Saadi M, Rosa RL, Kleinschmidt JH, Rodríguez DZ. Transfer learning approach to IDS on cloud IoT devices using optimized CNN. IEEE Access. 2023;11:1023–38.

* View Article

* Google Scholar

27. 27. Kasongo SM, Sun Y. Performance analysis of intrusion detection systems using a feature selection method on the UNSW-NB15 dataset. J Big Data. 2020;7(1).

* View Article

* Google Scholar

28. 28. Omer N, Samak AH, Taloba AI, Abd El-Aziz RM. A novel optimized probabilistic neural network approach for intrusion detection and categorization. Alexandria Engineering Journal. 2023;72:351–61.

* View Article

* Google Scholar

29. 29. Mahdavisharif M, Jamali S, Fotohi R. Big data-aware intrusion detection system in communication networks: a deep learning approach. J Grid Computing. 2021;19(4).

* View Article

* Google Scholar

30. 30. Bijoy MH, Faria MFA, E Sobhani M, Ferdoush T, Shatabda S. Advancing Bangla punctuation restoration by a monolingual transformer-based method and a large-scale corpus. In: Proceedings of the First Workshop on Bangla Language Processing (BLP-2023). 2023. p. 18–25. https://doi.org/10.18653/v1/2023.banglalp-1.3

31. 31. Zoppi T, Gharib M, Atif M, Bondavalli A. Meta-learning to improve unsupervised intrusion detection in cyber-physical systems. ACM Trans Cyber-Phys Syst. 2021;5(4):1–27.

* View Article

* Google Scholar

32. 32. Rodríguez M, Alesanco Á, Mehavilla L, García J. Evaluation of machine learning techniques for traffic flow-based intrusion detection. Sensors (Basel). 2022;22(23):9326. pmid:36502028

* View Article

* PubMed/NCBI

* Google Scholar

33. 33. Okey YZ, Zhang H, Zhang B. An effective ensemble automatic feature selection method for network intrusion detection. Information. 2022;13(7).

* View Article

* Google Scholar

34. 34. Kurniabudi, Stiawan D, Darmawijoyo, Bin Idris MY, Bamhdi AM, Budiarto R. CICIDS-2017 dataset feature analysis with information gain for anomaly detection. IEEE Access. 2020;8:132911–21.

* View Article

* Google Scholar

35. 35. Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6(1).

* View Article

* Google Scholar

36. 36. Thabtah F, Hammoud S, Kamalov F, Gonsalves A. Data imbalance in classification: experimental evaluation. Information Sciences. 2020;513:429–41.

* View Article

* Google Scholar

37. 37. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inform Theory. 1967;13(1):21–7.

* View Article

* Google Scholar

38. 38. Rajadurai H, Gandhi UD. A stacked ensemble learning model for intrusion detection in wireless network. Neural Comput & Applic. 2020;34(18):15387–95.

* View Article

* Google Scholar

39. 39. Haibo He, Yang Bai, Garcia EA, Shutao Li. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). 2008. p. 1322–8. https://doi.org/10.1109/ijcnn.2008.4633969

40. 40. Laghrissi F, Douzi S, Douzi K, Hssina B. Intrusion detection systems using long short-term memory (LSTM). J Big Data. 2021;8(1).

* View Article

* Google Scholar

41. 41. Rodela AT, Nguyen HH, Farid DM, Huda MN. Bangla social media cyberbullying detection using deep learning. In: Intelligent Systems and Data Science. 2023. p. 170–84.

42. 42. Tithi NJ, Rodela AT, Farha R, Siddiqui FH. Hospital dietary control using automated planning. DUET Journal. 2019.

* View Article

* Google Scholar

43. 43. Talukder MdA, Hasan KF, Islam MdM, Uddin MdA, Akhter A, Yousuf MA, et al. A dependable hybrid machine learning model for network intrusion detection. Journal of Information Security and Applications. 2023;72:103405.

* View Article

* Google Scholar

44. 44. Berkson J. Application to the logistic function to bio-assay. Journal of the American Statistical Association. 1944;39(227):357.

* View Article

* Google Scholar

45. 45. Karatas G, Demir O, Sahingoz OK. Increasing the performance of machine learning-based IDSs on an imbalanced and up-to-date dataset. IEEE Access. 2020;8:32150–62.

* View Article

* Google Scholar

46. 46. Sethi K, Madhav YV, Kumar R, Bera P. Attention based multi-agent intrusion detection systems using reinforcement learning. Journal of Information Security and Applications. 2021;61:102923.

* View Article

* Google Scholar

47. 47. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research. 2010;11(12):3371–408.

* View Article

* Google Scholar

48. 48. Wu Z, Wang J, Hu L, Zhang Z, Wu H. A network intrusion detection method based on semantic re-encoding and deep learning. Journal of Network and Computer Applications. 2020;164:102688.

* View Article

* Google Scholar

49. 49. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 770–8.

50. 50. Thorndike RL. Who Belongs in the Family?. Psychometrika. 1953;18(4):267–76.

* View Article

* Google Scholar

51. 51. Quinlan JR. Induction of Decision Trees. Mach Learn. 1986;1(1):81–106.

* View Article

* Google Scholar

52. 52. Chauhan S, Gangopadhyay S, Gangopadhyay AK. Intrusion detection system for IoT using logical analysis of data and information gain ratio. Cryptography. 2022;6(4):62.

* View Article

* Google Scholar

53. 53. Chew YJ, Ooi SY, Wong K-S, Pang YH, Lee N. Adoption of IP truncation in a privacy-based decision tree pruning design: a case study in network intrusion detection system. Electronics. 2022;11(5):805.

* View Article

* Google Scholar

54. 54. Alkasassbeh M. An empirical evaluation for the intrusion detection features based on machine learning and feature selection methods. Journal of Theoretical and Applied Information Technology. 2017;95(22):5962–76.

* View Article

* Google Scholar

55. 55. Abdelkhalek A, Mashaly M. Addressing the class imbalance problem in network intrusion detection systems using data resampling and deep learning. J Supercomput. 2023;79(10):10611–44.

* View Article

* Google Scholar

56. 56. Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers & Security. 2012;31(3):357–74.

* View Article

* Google Scholar

57. 57. Wu T, Fan H, Zhu H, You C, Zhou H, Huang X. Intrusion detection system combined enhanced random forest with SMOTE algorithm. EURASIP J Adv Signal Process. 2022;2022(1).

* View Article

* Google Scholar

58. 58. Mohammad Hassan Zaib I Lecturer in Department of Computer Science Air University. NSL-KDD; 2018. https://www.kaggle.com/datasets/hassan06/nslkdd

59. 59. Laurens D’hooge II Ph D Researcher at Ghent University. UNSW-NB15; 2022. https://www.kaggle.com/datasets/dhoogla/unswnb15

60. 60. Laurens D’hooge II Ph D Researcher at Ghent University. CIC-IDS2017; 2022. https://www.kaggle.com/datasets/dhoogla/cicids2017

61. 61. Laurens D’hooge II Ph D Researcher at Ghent University. CSE-CIC-IDS2018; 2022. https://www.kaggle.com/datasets/dhoogla/csecicids2018

62. 62. Laurens D’hooge II Ph D Researcher at Ghent University. CIC-DDoS2019; 2022. https://www.kaggle.com/datasets/dhoogla/cicddos2019

63. 63. CONSOLVO ACMaIB. Hardware available on Kaggle. 2023. https://www.kaggle.com/code/bconsolvo/hardware-available-on-kaggle/notebook

64. 64. Zhang H, Ge L, Wang Z. A high performance intrusion detection system using LightGBM based on oversampling and undersampling. In: Huang DS, Jo KH, Jing J, Premaratne P, Bevilacqua V, Hussain A, editors. Intelligent computing theories and application. Cham: Springer; 2022. p. 638–52.

65. 65. Barua S, Islam MdM, Yao X, Murase K. MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng. 2014;26(2):405–25.

* View Article

* Google Scholar

66. 66. Dong RH, Yan HH, Zhang QY. An intrusion detection model for wireless sensor network based on information gain ratio and bagging algorithm. Int J Netw Secur. 2020;22(2):218–30.

* View Article

* Google Scholar

67. 67. Hamid Y, Shah FA, Sugumaran M. Wavelet neural network model for network intrusion detection system. Int J Inf Tecnol. 2018;11(2):251–63.

* View Article

* Google Scholar

Citation: Sobhani ME, Rodela AT, Farid DM (2025) Adaptive TreeHive: Ensemble of trees for enhancing imbalanced intrusion classification. PLoS One 20(9): e0331307. https://doi.org/10.1371/journal.pone.0331307

About the Authors:

Mahbub E. Sobhani

Roles: Methodology, Writing – original draft

Affiliation: Department of Computer Science and Engineering, United International University, United City, Dhaka, Bangladesh

ORCID: https://orcid.org/0009-0006-6043-4507

Anika Tasnim Rodela

Roles: Validation, Writing – original draft

Affiliation: Department of Computer Science and Engineering, United International University, United City, Dhaka, Bangladesh

Dewan Md. Farid

Roles: Conceptualization, Methodology, Writing – original draft

E-mail: [email protected]

Affiliation: Department of Computer Science and Engineering, Southeast University, Tejgaon Industrial Area, Dhaka, Bangladesh

ORCID: https://orcid.org/0000-0002-6413-4898

References

1. Zhang C, Jia D, Wang L, Wang W, Liu F, Yang A. Comparative research on network intrusion detection methods based on machine learning. Computers & Security. 2022;121:102861.

2. Moustafa N, Koroniotis N, Keshk M, Zomaya AY, Tari Z. Explainable intrusion detection for cyber defences in the Internet of Things: opportunities and solutions. IEEE Commun Surv Tutorials. 2023;25(3):1775–807.

3. Sohn I. Deep belief network based intrusion detection techniques: a survey. Expert Systems with Applications. 2021;167:114170.

4. Kumar NH, Dhanalakshmi R. A novel host based intrusion detection system using supervised learning by comparing SVM over random forest. In: 2023 Eighth International Conference on Science Technology Engineering and Mathematics (ICONSTEM). Chennai, India; 2023. p. 1–4.

5. Thakkar A, Lohiya R. Fusion of statistical importance for feature selection in Deep Neural Network-based Intrusion Detection System. Information Fusion. 2023;90:353–63.

6. Vanin P, Newe T, Dhirani LL, O’Connell E, O’Shea D, Lee B, et al. A study of network intrusion detection systems using artificial intelligence/machine learning. Applied Sciences. 2022;12(22):11752.

7. Software CP. What is an intrusion detection system (IDS)? 2023. https://www.checkpoint.com/cyber-hub/network-security/what-is-an-intrusion-detection-system-ids/

8. Saha D, Karim M, Phongmoo S, Farid DM. A privacy-preserving approach for big data mining using RainForest with federated learning. In: 2023 IEEE Region 10 Conference (TENCON), Chiang Mai, Thailand, 2023. p. 1–6.

9. Vanin P, Newe T, Dhirani LL, O’Connell E, O’Shea D, Lee B, et al. A study of network intrusion detection systems using artificial intelligence/machine learning. Appl Sci. 2022;12(22).

10. Farid DM, Rahman MZ, Rahman CM. Mining complex network data for adaptive intrusion detection. In: Karahoca A, editor. Advances in Data Mining Knowledge Discovery and Applications. Croatia: InTech; 2012. p. 327–48.

11. Janiesch C, Zschech P, Heinrich K. Machine learning and deep learning. Electron Markets. 2021;31(3):685–95.

12. Azam Z, Islam MM, Huda MN. A hybrid intrusion detection model using EGA-PSO and improved random forest method. Sensors. 2022;22(16).

13. Almaiah MA, Almomani O, Alsaaidah A, Al-Otaibi S, Bani-Hani N, Hwaitat AKA, et al. Performance investigation of principal component analysis for intrusion detection system using different support vector machine kernels. Electronics. 2022;11(21):3571.

14. Azam Z, Islam MM, Huda MN. Comparative analysis of intrusion detection systems and machine learning-based model analysis through decision tree. IEEE Access. 2023;11:80348–91.

15. Razdan S, Gupta H, Seth A. Performance analysis of network intrusion detection systems using j48 and naive bayes algorithms. In: 2021 6th International Conference for Convergence in Technology (I2CT). 2021. p. 1–7. https://doi.org/10.1109/i2ct51068.2021.9417971

16. Shahraki A, Abbasi M, Haugen Ø. Boosting algorithms for network intrusion detection: a comparative evaluation of Real AdaBoost, Gentle AdaBoost and Modest AdaBoost. Engineering Applications of Artificial Intelligence. 2020;94:103770.

17. Jiang H, He Z, Ye G, Zhang H. Network intrusion detection based on PSO-Xgboost model. IEEE Access. 2020;8:58392–401.

18. ElSayed MS, Le-Khac N-A, Albahar MA, Jurcut A. A novel hybrid model for intrusion detection systems in SDNs based on CNN and a new regularization technique. Journal of Network and Computer Applications. 2021;191:103160.

19. Laghrissi F, Douzi S, Douzi K, Hssina B. Intrusion detection systems using long short-term memory (LSTM). J Big Data. 2021;8(1).

20. Zhang Q, Yang LT, Chen Z, Li P. A survey on deep learning for big data. Information Fusion. 2018;42:146–57.

21. Okey OD, Maidin SS, Adasme P, Lopes Rosa R, Saadi M, Carrillo Melgarejo D, et al. BoostedEnML: efficient technique for detecting cyberattacks in IoT systems using boosted ensemble machine learning. Sensors (Basel). 2022;22(19):7409. pmid:36236506

22. Injadat M, Moubayed A, Nassif AB, Shami A. Multi-stage optimized machine learning framework for network intrusion detection. IEEE Trans Netw Serv Manage. 2021;18(2):1803–16.

23. Miah MdO, Shahriar Khan S, Shatabda S, Farid DMd. Improving detection accuracy for imbalanced network intrusion classification using cluster-based under-sampling with random forests. In: 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), 2019. p. 1–5. https://doi.org/10.1109/icasert.2019.8934495

24. Singh DM, Harbi N, Zahidur Rahman M. Combining Naive Bayes and decision tree for adaptive intrusion detection. IJNSA. 2010;2(2):12–25.

25. Farid DMd, Zhang L, Hossain A, Rahman CM, Strachan R, Sexton G, et al. An adaptive ensemble classifier for mining concept drifting data streams. Expert Systems with Applications. 2013;40(15):5895–906.

26. Okey OD, Melgarejo DC, Saadi M, Rosa RL, Kleinschmidt JH, Rodríguez DZ. Transfer learning approach to IDS on cloud IoT devices using optimized CNN. IEEE Access. 2023;11:1023–38.

27. Kasongo SM, Sun Y. Performance analysis of intrusion detection systems using a feature selection method on the UNSW-NB15 dataset. J Big Data. 2020;7(1).

28. Omer N, Samak AH, Taloba AI, Abd El-Aziz RM. A novel optimized probabilistic neural network approach for intrusion detection and categorization. Alexandria Engineering Journal. 2023;72:351–61.

29. Mahdavisharif M, Jamali S, Fotohi R. Big data-aware intrusion detection system in communication networks: a deep learning approach. J Grid Computing. 2021;19(4).

30. Bijoy MH, Faria MFA, E Sobhani M, Ferdoush T, Shatabda S. Advancing Bangla punctuation restoration by a monolingual transformer-based method and a large-scale corpus. In: Proceedings of the First Workshop on Bangla Language Processing (BLP-2023). 2023. p. 18–25. https://doi.org/10.18653/v1/2023.banglalp-1.3

31. Zoppi T, Gharib M, Atif M, Bondavalli A. Meta-learning to improve unsupervised intrusion detection in cyber-physical systems. ACM Trans Cyber-Phys Syst. 2021;5(4):1–27.

32. Rodríguez M, Alesanco Á, Mehavilla L, García J. Evaluation of machine learning techniques for traffic flow-based intrusion detection. Sensors (Basel). 2022;22(23):9326. pmid:36502028

33. Okey YZ, Zhang H, Zhang B. An effective ensemble automatic feature selection method for network intrusion detection. Information. 2022;13(7).

34. Kurniabudi, Stiawan D, Darmawijoyo, Bin Idris MY, Bamhdi AM, Budiarto R. CICIDS-2017 dataset feature analysis with information gain for anomaly detection. IEEE Access. 2020;8:132911–21.

35. Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6(1).

36. Thabtah F, Hammoud S, Kamalov F, Gonsalves A. Data imbalance in classification: experimental evaluation. Information Sciences. 2020;513:429–41.

37. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inform Theory. 1967;13(1):21–7.

38. Rajadurai H, Gandhi UD. A stacked ensemble learning model for intrusion detection in wireless network. Neural Comput & Applic. 2020;34(18):15387–95.

39. Haibo He, Yang Bai, Garcia EA, Shutao Li. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). 2008. p. 1322–8. https://doi.org/10.1109/ijcnn.2008.4633969

40. Laghrissi F, Douzi S, Douzi K, Hssina B. Intrusion detection systems using long short-term memory (LSTM). J Big Data. 2021;8(1).

41. Rodela AT, Nguyen HH, Farid DM, Huda MN. Bangla social media cyberbullying detection using deep learning. In: Intelligent Systems and Data Science. 2023. p. 170–84.

42. Tithi NJ, Rodela AT, Farha R, Siddiqui FH. Hospital dietary control using automated planning. DUET Journal. 2019.

43. Talukder MdA, Hasan KF, Islam MdM, Uddin MdA, Akhter A, Yousuf MA, et al. A dependable hybrid machine learning model for network intrusion detection. Journal of Information Security and Applications. 2023;72:103405.

44. Berkson J. Application to the logistic function to bio-assay. Journal of the American Statistical Association. 1944;39(227):357.

45. Karatas G, Demir O, Sahingoz OK. Increasing the performance of machine learning-based IDSs on an imbalanced and up-to-date dataset. IEEE Access. 2020;8:32150–62.

46. Sethi K, Madhav YV, Kumar R, Bera P. Attention based multi-agent intrusion detection systems using reinforcement learning. Journal of Information Security and Applications. 2021;61:102923.

47. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research. 2010;11(12):3371–408.

48. Wu Z, Wang J, Hu L, Zhang Z, Wu H. A network intrusion detection method based on semantic re-encoding and deep learning. Journal of Network and Computer Applications. 2020;164:102688.

49. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 770–8.

50. Thorndike RL. Who Belongs in the Family?. Psychometrika. 1953;18(4):267–76.

51. Quinlan JR. Induction of Decision Trees. Mach Learn. 1986;1(1):81–106.

52. Chauhan S, Gangopadhyay S, Gangopadhyay AK. Intrusion detection system for IoT using logical analysis of data and information gain ratio. Cryptography. 2022;6(4):62.

53. Chew YJ, Ooi SY, Wong K-S, Pang YH, Lee N. Adoption of IP truncation in a privacy-based decision tree pruning design: a case study in network intrusion detection system. Electronics. 2022;11(5):805.

54. Alkasassbeh M. An empirical evaluation for the intrusion detection features based on machine learning and feature selection methods. Journal of Theoretical and Applied Information Technology. 2017;95(22):5962–76.

55. Abdelkhalek A, Mashaly M. Addressing the class imbalance problem in network intrusion detection systems using data resampling and deep learning. J Supercomput. 2023;79(10):10611–44.

56. Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers & Security. 2012;31(3):357–74.

57. Wu T, Fan H, Zhu H, You C, Zhou H, Huang X. Intrusion detection system combined enhanced random forest with SMOTE algorithm. EURASIP J Adv Signal Process. 2022;2022(1).

58. Mohammad Hassan Zaib I Lecturer in Department of Computer Science Air University. NSL-KDD; 2018. https://www.kaggle.com/datasets/hassan06/nslkdd

59. Laurens D’hooge II Ph D Researcher at Ghent University. UNSW-NB15; 2022. https://www.kaggle.com/datasets/dhoogla/unswnb15

60. Laurens D’hooge II Ph D Researcher at Ghent University. CIC-IDS2017; 2022. https://www.kaggle.com/datasets/dhoogla/cicids2017

61. Laurens D’hooge II Ph D Researcher at Ghent University. CSE-CIC-IDS2018; 2022. https://www.kaggle.com/datasets/dhoogla/csecicids2018

62. Laurens D’hooge II Ph D Researcher at Ghent University. CIC-DDoS2019; 2022. https://www.kaggle.com/datasets/dhoogla/cicddos2019

63. CONSOLVO ACMaIB. Hardware available on Kaggle. 2023. https://www.kaggle.com/code/bconsolvo/hardware-available-on-kaggle/notebook

64. Zhang H, Ge L, Wang Z. A high performance intrusion detection system using LightGBM based on oversampling and undersampling. In: Huang DS, Jo KH, Jing J, Premaratne P, Bevilacqua V, Hussain A, editors. Intelligent computing theories and application. Cham: Springer; 2022. p. 638–52.

65. Barua S, Islam MdM, Yao X, Murase K. MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng. 2014;26(2):405–25.

66. Dong RH, Yan HH, Zhang QY. An intrusion detection model for wireless sensor network based on information gain ratio and bagging algorithm. Int J Netw Secur. 2020;22(2):218–30.

67. Hamid Y, Shah FA, Sugumaran M. Wavelet neural network model for network intrusion detection system. Int J Inf Tecnol. 2018;11(2):251–63.

© 2025 Sobhani et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
