1. Introduction
Software plays a critical role in the functionality of engineered systems, and software quality has become increasingly important in day-to-day operations [1]. Object-oriented (OO) metrics, such as inheritance, polymorphism, cohesion, and coupling, are identified as key factors for evaluating fault proneness in software. Leveraging these metrics allows for a more nuanced understanding of the software’s potential for defects, ultimately saving testing labor and time [2].
Machine learning (ML) methods have gained prominence in software fault prediction (SFP), and ensemble learning (EL) is highlighted as a valuable addition to the ML toolkit. EL involves the collaborative use of multiple classifiers to learn from a given dataset, leading to improved classification outcomes compared to individual classifiers. This approach is particularly relevant in addressing class imbalance (CI) issues within datasets, a common challenge for classifier learning algorithms [2, 3].
CI, where the distribution of instances among different classes is uneven, is acknowledged as a significant obstacle to the effectiveness of SFP models [4]. The study proposed the incorporation of EL to specifically tackle the challenges posed by imbalanced data, offering a more resilient and effective solution. Modifications to classifiers, including random undersampling (RUS) and random oversampling (ROS), are introduced to enhance fault prediction in OO systems.
The study’s main objectives further emphasize the practical application of EL models in handling CI issues in SFP for OO systems. The aim is not only to address the challenges but also to enhance the overall performance of EL for better fault prediction models. The study also sets out to measure the performance of the proposed models, providing a comprehensive assessment of their efficiency on imbalanced datasets with varying imbalance ratios. These objectives collectively contribute to advancing the understanding and application of EL in the context of SFP, particularly in OO systems facing CI challenges.
2. Background and Literature Review
2.1. ML Overview
ML refers to the techniques that enable artificial intelligence (AI) systems to perform various tasks such as recognition, diagnosis, planning, robot control, and prediction. ML involves acquiring theoretical knowledge autonomously from data through inference, model fitting, or learning from examples. Its primary goal is the automated extraction of valuable insights from extensive datasets, facilitated by the creation of robust probabilistic models. ML techniques are particularly suitable for applications relying on vast amounts of data and are effective when a general theoretical framework is not readily available. Learning involves acquiring knowledge, comprehension, or proficiency through experience or study, with the ultimate objective of generalization, reflected in low generalization or prediction errors [5, 6].
Zhou [5] categorized ML algorithms based on their intended outcomes. The common types include supervised, unsupervised, semisupervised, reinforcement, and transduction learning.
2.1.1. EL
EL, or multiple classifier systems, is an ML technique that integrates multiple models to enhance overall performance. It has gained increasing interest for its proven effectiveness and versatility in addressing diverse problem domains. Unlike standard learning approaches constructing a single learner, ensemble methods focus on training multiple learners and combining their outputs for superior performance. Early instances date back to Dasarathy and Sheela in 1979 [7], using multiple classifiers to partition feature space. Hansen and Salamon [8] later demonstrated the effectiveness of ensemble neural networks, leading to the development of AdaBoost algorithms by Freund and Schapire in 1997 [9].
Figure 1 illustrates a commonly used ensemble architecture comprising base learners generated from training data by a base learning algorithm. Examples include decision trees [3], neural networks [8], and similar techniques. Ensemble methods can combine homogeneous base learners produced by a single learning algorithm or heterogeneous base learners of different types produced by multiple learning algorithms [5]. The study delves into EL to address CI issues in SFP for OO systems.
[figure(s) omitted; refer to PDF]
Ensemble methods are recognized for their superior generalization ability compared to individual learners. They enhance the performance of weak learners, which are only marginally better than random guessers, transforming them into robust learners capable of making highly accurate predictions. In the ensemble literature, these constituent models are commonly referred to as base learners or weak learners [5].
There are two primary types of EL: bagging and boosting [3]. Bagging, short for “bootstrap aggregating,” involves training multiple models independently on different subsets of the training data and then combining their predictions [11]. Boosting, on the other hand, is a sequential ensemble method where a sequence of models is iteratively trained, with each model attempting to correct the errors of the previous ones [9].
2.1.1.1. Bagging
Bagging combines two main components: bootstrap and aggregation [5, 11]. This technique assigns equal weight to models generated during EL, reducing variance associated with classification and ultimately enhancing the overall classification process [12].
Bootstrap sampling [13] is employed in bagging to create diverse base learners. The process involves randomly sampling examples with replacements from the original dataset, resulting in some data being selected multiple times and others not at all. The steps in this process, as demonstrated by Zhou [5], include obtaining a sample of training examples through bootstrap sampling, repeating the process N times to generate N samples, training a base learner from each sample, and combining the predictions of all base models for the final prediction.
Bagging adopts well-established strategies to aggregate base learners’ outputs, such as voting for classification tasks and averaging for regression tasks [5]. Figure 2 illustrates how bagging operates based on these steps.
[figure(s) omitted; refer to PDF]
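As a concrete illustration of these steps, the following minimal sketch (illustrative parameters only, not the study’s configuration) uses scikit-learn’s BaggingClassifier to train decision-tree base learners on bootstrap samples and aggregate their votes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data stands in for a fault dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: N bootstrap samples, one decision tree per sample, majority vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                            bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
print("bagging test accuracy:", bagging.score(X_test, y_test))
```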
2.1.1.2. Boosting
Boosting is an ML technique characterized by iteratively training a sequence of models, where each subsequent model aims to correct the errors of its predecessors. The fundamental concept is to transform a set of weak classifiers, slightly better than random guessing, into a robust classifier capable of making accurate predictions [3].
Zhou [5] outlines the general boosting procedure, which involves the following steps:
1. Randomly select a subset of the original dataset without replacement.
2. Apply the weak classifier to this subset of data.
3. Adjust the weights of misclassified instances, increasing their weight and decreasing the weight of the correctly classified instances.
4. Train a new weak classifier using the updated weights.
5. Repeat steps 1–4 for a specified number of iterations or until a stopping criterion is met.
6. Combine all the weak classifiers to obtain a final boosted classifier.
Figure 3 illustrates the general flowchart of the boosting algorithm, which can be adapted based on the specific boosting method.
[figure(s) omitted; refer to PDF]
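One way to instantiate this loop in Python is the AdaBoost-style sketch below; it reweights training examples rather than resampling subsets, and the function name, round count, and stump base learner are illustrative assumptions rather than the study’s setup:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    """Toy AdaBoost-style boosting; y must be encoded as -1/+1."""
    y = np.asarray(y)
    n = len(y)
    weights = np.full(n, 1.0 / n)                 # start with uniform example weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)    # weak learner on the weighted data
        pred = stump.predict(X)
        err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)   # learner's vote in the final ensemble
        weights *= np.exp(-alpha * y * pred)      # raise weights of misclassified examples
        weights /= weights.sum()
        learners.append(stump)
        alphas.append(alpha)
    # Final boosted classifier: weighted vote over all weak learners.
    return lambda X_new: np.sign(
        sum(a * l.predict(X_new) for a, l in zip(alphas, learners)))
```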
There are three primary types of boosting algorithms: AdaBoost, Gradient Boost, and XGBoost.
• AdaBoost: AdaBoost (adaptive boosting) stands out as a popular algorithm for boosting, introduced by Freund and Schapire in 1997 [9]. It is a robust EL technique that repetitively trains a sequence of models, with each model aimed at correcting the errors of its predecessors. AdaBoost assigns higher weights to misclassified data points in each iteration, training a new model on the updated dataset. The final model is a weighted combination of all models trained during the iterations. AdaBoost finds extensive applications in diverse fields, including face detection, medical diagnosis, and software fault detection.
• Gradient Boost: Gradient boosting is a versatile ML technique for regression and classification tasks. This ensemble method uses a series of weak learners, often decision trees, to iteratively refine the model’s predictions by minimizing the loss function of the previous iteration. Gradient boosting differs from traditional boosting algorithms because it utilizes gradient descent optimization. It fits the new learner to the residual errors of the previous learner instead of adjusting the weights of each data point [14].
• XGBoost: XGBoost (extreme gradient boosting) is an open-source software library developed by Chen and Guestrin in 2016 [15]. It leverages gradient boosting algorithms to implement boosted trees, as shown in Figure 4. XGBoost has gained widespread popularity in ML competitions and real-world applications due to its scalability and efficiency. The algorithm employs gradient boosting, training multiple weak models sequentially, where each model aims to improve upon the errors of the preceding one. XGBoost enhances this approach by introducing additional features such as regularization, parallel processing, and support for missing values. As a result, XGBoost is a powerful tool, well-suited to large-scale data processing and predictive modeling compared with other classifiers [16].
XGBoost has gained significant acceptance in the ML community and has proven effective in achieving cutting-edge results across diverse applications. This includes applications such as image classification [17] and natural language processing [18].
[figure(s) omitted; refer to PDF]
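For reference, all three algorithms are available as off-the-shelf implementations; the brief sketch below (default, illustrative settings, assuming the xgboost package is installed alongside scikit-learn) fits each of them on the same synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Gradient Boost": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "training accuracy:", model.score(X, y))
```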
2.2. OO Systems
The OO methodology revolves around objects comprising a set of data and predetermined functions, known as methods, applicable to that data. This approach effectively separates data and control. Over the past decades, the OO approach has emerged as a paramount software engineering (SE) design paradigm and has gained widespread use [19]. At the same time, OO software systems are susceptible to various faults. Modifying one component within an OO system may impact other components, given the high interdependence of software components. This interdependence, often measured by software coupling, is heightened by OO features such as polymorphism, inheritance, and encapsulation [2]. Achieving high system maintainability and readability is contingent on low coupling and high cohesion. Various software metrics have been developed for OO concepts such as abstraction, classes, objects, and inheritance to measure properties such as software quality, coupling, and cohesion [20].
2.2.1. OO Metrics
As OO systems take center stage in SE design, predicting defects in the early stages has become a focal point for SE researchers [19, 21]. Predicting software defects often relies on code metrics, such as OO CK metrics [22], illustrating the system’s static properties. These six metrics (RFC, WMC, CBO, LCOM, NOC, and DIT) aim to evaluate the OO system’s design rather than its implementation [20]. Random forest (RF), a primary EL method, has been employed to predict software fault occurrence. Researchers like Kaur and Malhotra [23] applied the RF approach to jEdit data from the PROMISE repository, showcasing its good predictive performance. In a study by Yu et al. [24], novel process metrics based on package fault rates and class variation degrees were proposed, using logistic regression and Naïve Bayes ML prediction methods for comparison. The results indicated that the proposed models significantly enhanced prediction performance. However, a significant challenge lies in the dataset, which can strongly influence outcomes.
This study uses software metrics to classify OO software into faulty and nonfaulty systems. OO software system quality is assessed by evaluating OO design metrics derived from static analysis data. Numerous researchers have proposed OO metrics for calculating design metrics in the initial stages of a software system [20]. The most common six metrics, known as the Chidamber and Kemerer (CK) metrics [22], together with the metrics described in [25] and the QMOOD metrics by Bansiya and Davis [26], evaluate high-level quality features in OO designs. Table 1, following Jureczko [27], provides more details on the 20 metrics utilized in this research as training parameters.
Table 1
Definition of OO metrics.
Metric | Definition |
Weighted methods per class (WMC) | WMC is the total number of methods within the class |
Depth of inheritance tree (DIT) | DIT is the number of levels of inheritance between a given class and the highest point in the object hierarchy |
Number of children (NOC) | NOC metric counts the class’s direct descendants |
Coupling between object classes (CBO) | It represents the degree of coupling between a class and other classes |
Response for a class (RFC) | RFC counts the number of methods invoked in response to a message or method call received by a class |
Lack of cohesion in methods (LCOM) | It calculates the number of method pairs within a class that do not share any instance variables, indicating a potential lack of coherence or organization in the class design |
Lack of cohesion in methods, version 3 (LCOM3) | LCOM3 is an enhancement of LCOM that addresses some of the limitations of the original metric |
Afferent couplings (Ca) | Ca denotes the count of classes that rely on the measured class |
Efferent couplings (Ce) | Ce indicates the number of classes on which the measured class depends |
Number of public methods (NPM) | It is a count of all public methods declared within a class |
Data access metric (DAM) | It is calculated as the proportion of private (or protected) attributes to the overall number of attributes declared within the class |
The measure of aggregation (MOA) | It assesses the degree to which attributes implement the part–whole relationship by counting the number of class fields with user-defined class types |
The measure of functional abstraction (MFA) | It calculates the ratio of inherited methods to the overall number of methods that member methods of a class can access |
Cohesion among methods of class (CAM) | This metric measures the similarity between methods of a class, considering their parameter lists |
Inheritance coupling (IC) | This metric determines the coupling between a given class and its parent classes. It counts the number of parent classes to which the given class is functionally dependent due to newly defined or redefined methods |
Coupling between methods (CBM) | It counts the number of new or redefined methods coupled to all inherited methods. Coupling occurs when at least one of the conditions specified in the IC metric definition is satisfied |
Average method complexity (AMC) | It calculates the average method size per class, where the size of a method is measured by the number of Java bytecode instructions it contains |
Lines of code (LOC) | It is calculated by adding the number of fields, methods, and instructions in each method of the analyzed class |
McCabe’s cyclomatic complexity (CC) | It can be calculated by adding one to the number of unique paths in a method or function |
2.3. SFP
SFP is the task of identifying software modules that are likely to contain faults. A recent systematic review by Pandey et al. examined studies conducted over the past three decades, focusing on the use of ML techniques for SFP [21, 28–30]. EL models and hybrid models were found to outperform conventional SFP models. For instance, support vector machine (SVM), a classic model, has been enhanced and used with a genetic algorithm (GA) for fault prediction [31]. Bayesian classifiers, cited by Malhotra [32], represent a significant portion of software defect prediction (SDP) studies. Recent advancements include weighted Naïve Bayes with information diffusion (WNB-ID), an enhanced model combining Naïve Bayes and information diffusion, showing promising results [33].
2.4. CI Problem
The CI issue poses a significant challenge in predicting software problems. In SFP, datasets are often imbalanced, with more instances in the “nondefective” class than in the “defective” class. This skewed distribution can bias classifier results toward the majority class, diminishing the importance of minority class cases [34, 35]. Imbalanced learning methods address this challenge at the data, algorithm, and ensemble levels [34].
• Data-level methods involve resampling the imbalanced dataset to achieve a more balanced distribution. Techniques such as the synthetic minority oversampling technique (SMOTE), ROS, and RUS are commonly used [36]. This study uses ROS and RUS as sampling techniques, as shown in Figure 5; a brief usage sketch follows the figure.
1. RUS: Instances of the majority class are randomly removed until the class distribution is balanced.
2. ROS: Instances from the minority class are randomly duplicated until the class distribution is balanced.
• Algorithm-level methods: These approaches modify the learning mechanism to handle unbalanced data, often employing cost-sensitive learning methods that assign a higher cost to the minority class. These methods may not guarantee the optimal model selection for the given dataset [37].
• EL methods: Ensemble techniques combine multiple weak classifiers to create a more robust model. Boosting and bagging are common approaches to integrating basic classifiers into a robust classifier [38]. Recent studies indicate that EL outperforms single classifiers, mainly when dealing with imbalanced datasets [38, 39].
While various methods aim to improve the ML model’s performance in SFP, the CI problem remains challenging, impacting prediction accuracy. Krawczyk’s study highlights the need to address challenges in imbalance learning for the reliability of systems, with SFP discussed as a prominent branch of imbalanced learning [40]. EL, mainly through boosting and bagging, is a powerful approach for handling difficult data in imbalanced learning.
[figure(s) omitted; refer to PDF]
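The following minimal sketch (synthetic data and illustrative class weights, using imbalanced-learn’s samplers) shows the effect of the two resampling operations on the class distribution:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# 90%/10% synthetic class distribution mimicking a skewed fault dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:", Counter(y))

X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)   # duplicate minority instances
print("after ROS:", Counter(y_ros))

X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)  # drop majority instances
print("after RUS:", Counter(y_rus))
```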
2.4.1. Ensemble Imbalance Classification
EL has emerged as a valuable tool for boosting prediction model performance in recent ML advancements. The fundamental concept behind EL is to amalgamate decisions from a set of learning models to enhance the classification accuracy of defect prediction models [1, 41].
Various ensemble classifiers designed for imbalanced datasets, crucial to this study, are detailed below:
2.4.1.1. BalancedBagging (BB) Classifier
BB effectively addresses imbalanced classification challenges by incorporating bagging and sampling techniques to balance class distribution in training data. This method involves creating new subsets by sampling from the original dataset, ensuring an equal number of instances for both minority and majority classes. The balance is achieved through RUS, eliminating examples from the majority class until equilibrium is reached [42].
The implementation of bagging in this model closely resembles the Scikit-learn implementation. Additionally, this model includes a step to balance the training set at fit time using a specified sampler. The ensemble module of the Python imblearn library implements the BB classifier [43] utilized in this research.
Following the steps outlined in [42], the BB classifier operates as follows:
1. Generate multiple subsamples of the original dataset by randomly selecting instances from the majority class, ensuring that the size of each subsample is equivalent to that of the minority class.
2. Fit a base classifier on each generated subset.
3. Predict class labels for new instances using each trained base classifier.
4. Aggregate predictions from each base classifier using a voting scheme yielding a final prediction for each instance.
The BB classifier’s ability to create balanced data subsets for training base classifiers ensures equal consideration for both minority and majority classes, resulting in enhanced classification performance on imbalanced datasets, as shown in Figure 6.
[figure(s) omitted; refer to PDF]
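A minimal usage sketch of the imbalanced-learn implementation referenced above (default base learner and illustrative parameters, not the study’s exact configuration):

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Each of the 10 bags is balanced by random undersampling before a tree is fit.
bb = BalancedBaggingClassifier(n_estimators=10, random_state=42)
bb.fit(X_train, y_train)
print("BalancedBagging test accuracy:", bb.score(X_test, y_test))
```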
2.4.1.2. RUSBoost Classifier
Seiffert et al. introduced the RUSBoost classifier, a hybrid algorithm combining data sampling and boosting techniques to improve models’ performance on imbalanced data. RUSBoost integrates RUS data sampling into the AdaBoost algorithm to address the challenges posed by skewed data [44]. Unlike methods employing intelligent mechanisms for eliminating examples, RUS randomly discards instances from the majority class until the desired class distribution is achieved. RUS is applied at each iteration of the boosting algorithm, effectively mitigating CI during the learning process and resulting in superior classification performance compared to either component alone [45].
The framework of the RUSBoost classifier involves the following steps (Figure 7):
Step 1: Initialize the weights of each example to 1/m, where m is the number of examples in the training dataset.
Step 2: Iterate T times, and for each iteration:
• Create subsamples of the original dataset through RUS sampling on the majority class, ensuring that the size of each subsample matches that of the minority class.
• Pass the subsample and its weight distribution to the base learner.
• Update the weight distribution for the next iteration.
Step 3: Combine predictions from the ensemble of classifiers using a weighted voting scheme to derive the final prediction for each instance.
[figure(s) omitted; refer to PDF]
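A corresponding minimal sketch with the imbalanced-learn implementation (illustrative parameters only):

```python
from imblearn.ensemble import RUSBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# RUS is re-applied to the majority class at every boosting round.
rusboost = RUSBoostClassifier(n_estimators=50, random_state=42)
rusboost.fit(X, y)
print("RUSBoost training accuracy:", rusboost.score(X, y))
```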
2.4.1.3. EasyEnsemble Classifier
EasyEnsemble (EE), introduced by Liu, Wu, and Zhou in 2009, is a classification ensemble method designed to address CI. This approach involves the creation of multiple balanced subsets of the majority class combined with the minority class, resulting in several training datasets. Individual weak classifiers are trained on these datasets, and their predictions are amalgamated to form the final prediction. Although EE produces a single ensemble, it appears as an “ensemble of ensembles” because it utilizes ensemble methods to generate balanced subsets of the majority class, creating an ensemble of classifiers for each subset. It functions as an ensemble of AdaBoost classifiers, each an ensemble of weak classifiers trained on a balanced subset of the majority class. The ultimate prediction is derived by combining the outputs of all AdaBoost classifiers. Notably, boosting reduces bias, while bagging reduces variance [46].
Liu et al. outlined the steps of the EE classifier. In Step 1, the EE technique generates M-balanced subsamples of the majority class through the bootstrap method. In Step 2, each subsample is paired with the minority class to create M-balanced training datasets. Step 3 involves training an ensemble classifier on each of the M-balanced datasets. Finally, in Step 4, the predictions from individual classifiers are aggregated to produce the final prediction.
EE finds applications in various domains where imbalanced datasets are prevalent. For instance, it has been employed to enhance extremely unbalanced data in predicting weld ultrasonic inspection results, where it works in conjunction with the extreme gradient boosting (XGBoost) algorithm [47]. Additionally, EE has been utilized alongside RF in the context of fake reviews [48] (Figure 8).
[figure(s) omitted; refer to PDF]
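A minimal usage sketch of imbalanced-learn’s EasyEnsembleClassifier (default AdaBoost base learners; parameters are illustrative):

```python
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# An ensemble of AdaBoost learners, each trained on a balanced bootstrap subset.
ee = EasyEnsembleClassifier(n_estimators=10, random_state=42)
ee.fit(X, y)
print("EasyEnsemble training accuracy:", ee.score(X, y))
```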
3. Methodology
The methodology employed in research plays a crucial role in achieving the objectives of any system. This study utilizes EL to address CI issues and identify the most effective ML model for predicting software faults in OO systems. Figure 9 illustrates the SFP framework for OO systems utilizing historical datasets.
[figure(s) omitted; refer to PDF]
3.1. Dataset
The benchmark datasets employed in our research have undergone standard data preprocessing steps, outlined as follows:
3.1.1. Data Collection
For this study, we gathered data from 19 open-source projects of a similar type sourced from the PROMISE SE Repository [49]. These datasets comprise the most recent releases of their respective versions and encompass the 19 OO metrics detailed in Table 1 (Section 2.2.1) as independent variables, with defect proneness as the dependent feature. These metrics are used as training parameters.
Table 2 provides a comprehensive overview of all the datasets, including the number of faulty and nonfaulty samples. To highlight the imbalance issue, the ratio between defective and nondefective instances has been computed using the following formula [50]:

Imbalance ratio (%) = (number of defective instances / number of nondefective instances) × 100.
Table 2
Dataset information.
Dataset | #Nondefected | #Defected | Nondefected (%) | Defected (%) | Imbalance ratio (%) |
ant-1.7 | 579 | 166 | 78 | 22 | 28.7 |
camel-1.0 | 326 | 13 | 96 | 4 | 3.9 |
camel-1.6 | 777 | 188 | 81 | 19 | 24.2 |
data_arc | 196 | 29 | 87 | 13 | 14.8 |
data_ivy-2.0 | 312 | 40 | 89 | 11 | 12.8 |
data_prop-6 | 583 | 61 | 91 | 9 | 10.5 |
data_redaktor | 148 | 27 | 85 | 15 | 18.2 |
jedit-3.2 | 182 | 90 | 67 | 33 | 49.5 |
jedit-4.2 | 319 | 48 | 87 | 13 | 15.0 |
log4j-1.1 | 72 | 37 | 66 | 34 | 51.4 |
lucene-2.0 | 104 | 91 | 53 | 47 | 87.5 |
poi-2.0 | 277 | 37 | 88 | 12 | 13.4 |
synapse-1.0 | 141 | 16 | 90 | 10 | 11.3 |
synapse-1.2 | 170 | 86 | 66 | 34 | 50.6 |
velocity-1.6 | 151 | 78 | 66 | 34 | 51.7 |
xalan-2.4 | 613 | 110 | 85 | 15 | 17.9 |
xerces-1.2 | 369 | 71 | 84 | 16 | 19.2 |
xerces-1.3 | 384 | 69 | 85 | 15 | 18.0 |
xerces-1.4 | 437 | 151 | 74 | 26 | 34.6 |
3.1.2. Data Cleaning
Ensuring data quality is crucial, and data cleaning, as proposed by Rahm and Do [51], plays a vital role in addressing missing values, inconsistencies, and errors. In this research, the datasets were meticulously examined for missing values and errors using a Python 3.7 tool, revealing no missing values or errors. An attribute conversion process was also applied to facilitate the binary classification approach adopted for SFP, where the dependent attribute involves continuous values. This conversion transformed the continuous values into binary digits, specifically 0 for nonfaulty and 1 for faulty instances. Additionally, a data cleaning step was performed to remove three out of the 24 attributes: ‘name,’ ‘version,’ and a duplicate ‘name’ attribute, as they contained redundant and irrelevant information for the classification process.
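A minimal pandas sketch of this cleaning step is shown below; the file name, the defect-count column name (‘bug’), and the mangled duplicate column name (‘name.1’) are assumptions about the PROMISE CSV layout rather than details taken from the paper:

```python
import pandas as pd

# Hypothetical file/column names: PROMISE CSVs typically expose a defect-count
# column (assumed here to be "bug") plus "name"/"version" identifier columns.
df = pd.read_csv("ant-1.7.csv")

# Drop identifier columns that carry no predictive information.
df = df.drop(columns=["name", "version", "name.1"], errors="ignore")

# Binarize the dependent attribute: 0 = non-faulty, 1 = faulty.
df["bug"] = (df["bug"] > 0).astype(int)
```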
3.1.3. Data Transformation
Data transformation converts input data into a more suitable form for mining and analysis [52]. As part of this process, normalization is employed to handle out-of-bounds features in the dataset [53, 54]. This study utilizes Z-score normalization, also known as standardization, based on the findings of Raju et al. [54]. StandardScaler, a technique for normalizing data, scales each feature so that its distribution is centered around 0 with a standard deviation of 1. The mean and standard deviation of each feature are calculated, and the feature is then scaled according to the following formula:

z = (x − μ) / σ,

where μ and σ denote the feature’s mean and standard deviation, respectively.
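A minimal sketch of this standardization step with scikit-learn’s StandardScaler (toy values, not the study’s data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0, 200.0], [12.0, 180.0], [14.0, 220.0]])  # toy feature matrix

# Each column is centered on its mean and divided by its standard deviation.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # approximately 0 per feature
print(X_scaled.std(axis=0))   # approximately 1 per feature
```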
3.1.4. Data Reduction
Data reduction is crucial to obtaining a condensed and streamlined dataset representation while maintaining comparable analytical results [55]. Dimensionality and numerosity reduction can be applied to achieve data reduction.
In dimensionality reduction, encoding techniques are commonly employed to obtain a compressed version of the initial data. Principal component analysis (PCA), a data compression strategy, was utilized in this study to reduce the dimensionality of the original data. The application of PCA resulted in a significant enhancement in the performance of the models.
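A minimal sketch of the PCA step with scikit-learn (the 95% retained-variance threshold is an illustrative assumption; the paper does not state the number of components used):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_std = StandardScaler().fit_transform(X)

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)
print("reduced shape:", X_reduced.shape)
```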
3.2. Experiment Classifiers
SFP encounters a significant challenge known as CI, which can lead to suboptimal performance in detecting faulty instances. As discussed in Section 2.4, EL methods have proven effective in addressing the CI issue in imbalanced data, particularly through additional sampling techniques, mainly RUS. This study proposes three ensemble classifiers: Enhanced_BalancedBagging (E_BB), ROSBoost, and Enhanced_EasyEnsemble (E_EE).
In their research, Barandela et al. explored the impact of CI on ML algorithm performance for classification tasks. They highlighted that a significantly imbalanced ratio between majority and minority classes could lead to low classification accuracy for the minority class. To tackle this, they recommended oversampling techniques, generating additional samples for the minority class to balance the dataset. This study proposes modifications to the BB, RUSBoost, and EE classifiers by substituting RUS with ROS. ROS is favored because, unlike RUS, it does not discard potentially informative majority-class instances, which makes it better suited to small or highly imbalanced datasets.
3.2.1. E_BB Classifier
E_BB addresses CI by combining a bagging classifier with ROS. E_BB builds multiple subsamples from the training data and trains a base classifier on each, improving generalization. The ROS step replicates minority-class samples, increasing their representation and making the ensemble more resilient when the minority class is highly underrepresented. The modification involves replacing the RandomUnderSampler method with RandomOverSampler in the imblearn.ensemble.BalancedBaggingClassifier implementation.
3.2.2. ROSBoost Classifier
ROSBoost integrates ROS into the AdaBoost learning process to overcome imbalanced class distribution challenges. ROSBoost improves classification accuracy by oversampling the minority class during each boosting iteration, providing a more balanced dataset. Unlike RUSBoost, which uses RUS, ROSBoost replicates minority-class samples. This modification, substituting RandomUnderSampler with RandomOverSampler, is implemented on top of the imblearn.ensemble.RUSBoostClassifier.
3.2.3. E_EE Classifier
E_EE is an ensemble of XGBoost learners trained on diverse, balanced bootstrap samples. In contrast to the original EE, E_EE substitutes RUS with ROS to generate balanced samples. ROS improves the training set’s diversity by replicating samples for the minority class. The modification replaces RandomUnderSampler with RandomOverSampler and employs XGBoost as the base learner instead of AdaBoost. This choice is based on XGBoost’s computational efficiency, ability to handle missing data, and flexibility in hyperparameter tuning.
Table 3 summarizes the three proposed methods, outlining their composition, ensemble type, and enhancements made in this study.
Table 3
Summary of the proposed ensemble classifier.
Classifier | Composite of sampler + base learner | Type of ensemble | Enhanced from | |
Enhanced BalancedBagging | ROS | Bagging classifier | Bagging | BalancedBagging |
ROSBoost | ROS | AdaBoost classifier | Boosting | RUSBoost |
Enhanced EasyEnsemble | ROS | Bagging classifier + XGBoost | Hybrid (bagging + boosting) | EasyEnsemble |
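To make these substitutions concrete, the sketch below shows one possible realization rather than the authors’ exact implementation. It assumes a recent imbalanced-learn release whose BalancedBaggingClassifier exposes a sampler parameter and an installed xgboost package; ROSBoost is only approximated here by oversampling once before AdaBoost, whereas the proposed classifier applies ROS inside each boosting iteration.

```python
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier  # assumed to be installed

# E_BB: bagging whose per-bag balancing uses ROS instead of the default RUS.
e_bb = BalancedBaggingClassifier(sampler=RandomOverSampler(random_state=42),
                                 random_state=42)

# ROSBoost (coarse stand-in): oversample once, then boost; the proposed
# classifier instead substitutes ROS for RUS inside every boosting round.
ros_boost = make_pipeline(RandomOverSampler(random_state=42),
                          AdaBoostClassifier(random_state=42))

# E_EE: ROS-balanced bags with XGBoost as the base learner.
e_ee = BalancedBaggingClassifier(XGBClassifier(eval_metric="logloss"),
                                 sampler=RandomOverSampler(random_state=42),
                                 random_state=42)
```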
3.3. Evaluation Measures
Our objective is to conduct binary classification on modules to determine their faultiness. The confusion matrix for this binary classification task is presented in Table 4. In this dataset, positive labels are assigned to faulty modules, while negative labels are assigned to nonfaulty modules. The measures of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are defined. TP represents the number of faulty modules accurately identified as such. TN indicates the number of nonfaulty modules correctly identified as nonfaulty. FP represents the number of nonfaulty modules incorrectly classified as faulty. FN refers to the number of faulty modules inaccurately classified as nonfaulty.
Table 4
Confusion matrix.
Predicted | | |
Actual | Positive | Negative |
Positive | TP | FN |
Negative | FP | TN |
The evaluation metrics that are considered in our study are defined as follows:
3.3.1. Balanced Accuracy
The balanced accuracy metric is employed in binary and multiclass classification tasks to address imbalanced datasets. This metric ranges from a best value of 1 to a worst value of 0. It assesses the model’s performance on both classes equally, regardless of differences in sample sizes [56, 57].
The formula for balanced accuracy is the average of sensitivity (true-positive rate) and specificity (true-negative rate), where

Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP).

Combining these two quantities, balanced accuracy is computed as

Balanced accuracy = (Sensitivity + Specificity) / 2.
3.3.2. F-Measure
The F-measure, also called the balanced F-score or F1 score, is a metric for evaluating classification models. It can be defined as the harmonic mean of precision and recall, as outlined in Ref. [34]. A higher F1 score indicates better model performance, with the best value being 1 and the worst 0. The formula for the F-measure is expressed as

F-measure = 2 × (Precision × Recall) / (Precision + Recall),

where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
The F-measure is an effective metric for assessing models in the presence of imbalanced data as it considers both precision and recall, which can be influenced by class distribution. In scenarios with imbalanced datasets, a classifier might achieve a high accuracy level but display a poor ability to identify the minority class, resulting in low recall. In such instances, the F-measure offers a well-rounded evaluation of the model’s performance by harmonizing precision and recall into a single score. This holistic approach allows for assessing the model’s capability to accurately classify both majority and minority classes, providing a more comprehensive evaluation of its overall performance [58].
3.3.3. Specificity
Specificity is a frequently used performance metric in binary classification tasks, quantifying the proportion of TN results among all actual negative cases. Essentially, it indicates the classifier’s ability to accurately identify negative labels, a crucial aspect in situations where FPs have negative consequences [59]. Specificity is computed as TN / (TN + FP), as given above.
3.3.4. Area Under the Curve (AUC)
The AUC, introduced in 1982 by Hanley and McNeil [60], is a performance metric that evaluates a binary model’s ability to distinguish between positive and negative instances. AUC comprehensively summarizes a model’s performance across all potential classification thresholds. The receiver operating characteristic (ROC) curve is plotted to calculate the AUC. This curve graphically represents the true-positive rate (TPR) against the false-positive rate (FPR) at different classification thresholds, as illustrated in Figure 8 from Ref. [61].
In the work of Davis and Goadrich [62], the TPR is synonymous with recall and sensitivity, as defined above, and the FPR is defined as

FPR = FP / (FP + TN).
The AUC has a range of values from 0 to 1, with higher values indicating superior performance. An AUC score of 0.5 denotes that the model performs no better than random guessing, while an AUC score of 1.0 indicates that the model can ideally differentiate between positive and negative samples.
Table 5 summarizes the metrics used to evaluate the performance of the models and the interpretation of their results and respective focus.
Table 5
Summary of evaluation measures.
Measure | Evaluation focus |
Balanced accuracy | Overall performance of imbalance classification |
F-measure | Provide a balanced assessment of the classifier’s performance |
Specificity | How well a classifier recognizes and assigns negative labels |
AUC | How well a classifier avoids a false classification |
In this research, using Python 3.7, model performance was evaluated across 19 datasets from the PROMISE repository using metrics including F-measure, balanced accuracy, specificity, and AUC.
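A minimal sketch of how these four measures can be computed with scikit-learn (synthetic data and an off-the-shelf classifier stand in for the study’s setup):

```python
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from imblearn.ensemble import BalancedBaggingClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = BalancedBaggingClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]

print("balanced accuracy:", balanced_accuracy_score(y_te, y_pred))
print("F-measure:", f1_score(y_te, y_pred))
print("specificity:", recall_score(y_te, y_pred, pos_label=0))  # TN / (TN + FP)
print("AUC:", roc_auc_score(y_te, y_score))
```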
3.4. Statistical Analysis
The Wilcoxon signed-rank (WSR) test, a nonparametric statistical test frequently used to examine two related measures on a single sample [63], is employed in this research to assess the comparative statistical performance of the datasets in question.
Consider a hypothetical scenario involving the predicted outputs of a performance metric from two distinct approaches, denoted as A = {a1, a2, a3, … ai} and B = {b1, b2, b3, … bi}, respectively, with “i” representing the number of predictions. The differences between the corresponding predicted outputs are computed as Diff = {a1 − b1, a2 − b2, a3 − b3, … ai − bi} and are subsequently ranked based on their absolute values. Any differences equal to zero are disregarded for further analysis. Let “n” denote the count of nonzero differences, with the sums of positive and negative ranks denoted as W+ and W−, respectively.
To quantify the effect size for the WSR test, the matched-pairs rank biserial correlation coefficient (r) is employed in this study [34, 64]. It can be expressed as

r = 1 − 2T / (W+ + W−),

where T is the minimum of the positive and negative rank sums W+ and W−.
Here is how the WSR test works:
1. Calculate the differences between the pairs of observations.
2. Rank the absolute values of the differences from the smallest to the largest, ignoring the signs.
3. Assign positive ranks to the positive differences and negative ranks to the negative differences.
4. Calculate the sum of the ranks of the positive differences and the sum of the ranks of the negative differences.
5. Calculate the test statistic, which is the smaller of the two sums of ranks.
6. Compare the test statistic to a critical value in the WSR distribution table.
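A minimal computational sketch of such a comparison (hypothetical per-dataset scores; the effect-size line follows the rank-biserial form described above):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-dataset F-measure values for a baseline and an enhanced model.
baseline = np.array([0.85, 0.88, 0.86, 0.89, 0.87, 0.90, 0.87, 0.93])
enhanced = np.array([0.92, 0.99, 0.94, 0.93, 0.98, 0.93, 0.90, 0.93])

stat, p_value = wilcoxon(baseline, enhanced)   # two-sided test; stat = smaller rank sum (T)
n = np.count_nonzero(enhanced - baseline)      # zero differences are dropped
total_rank_sum = n * (n + 1) / 2               # W+ + W-
effect_size = 1 - 2 * stat / total_rank_sum    # matched-pairs rank-biserial r

print(f"T={stat}, p={p_value:.4f}, r={effect_size:.2f}")
```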
4. Results and Discussion
In this section, we conducted experiments using Spyder IDE as a Python 3.7 tool. We employed the main classifiers from the Scikit-learn and imbalanced-learn libraries, presenting the performance results of our enhanced imbalanced ensemble algorithm compared to the standard imbalanced ensemble models. The evaluation metrics used to measure the performance of the models include balanced accuracy, specificity, F-measure, and AUC. Furthermore, we conducted the WSR statistical comparison to evaluate the significance of the differences between the proposed algorithms and the other models.
The experimental results and statistical analysis are presented in the following sections.
4.1. Experiment Results and Evaluation
We analyzed the results presented in Figure 10 and compared the performance of each algorithm:
[figure(s) omitted; refer to PDF]
As shown in Figure 10(a), the E_BB and E_EE classifiers outperform the other classifiers in terms of F-measure, each with a mean value of 0.93, while ROSBoost offers a slightly lower value of 0.91. In contrast, the EE, RUSBoost, and BB classifiers had the lowest F-measure values, with 0.80, 0.86, and 0.88, respectively. This suggests that all enhanced classifiers excel in predicting the positive class.
In Figure 10(b), the enhanced imbalanced learning methods, E_BB and E_EE classifiers, achieved the highest mean AUC values of 0.95 and 0.94, respectively, outperforming all other tested classifiers. The average AUC values for BB and EE classifiers were 0.93, while RUSBoost showed the weakest performance with a mean AUC of 0.86, which improved to 0.91 with ROSBoost. This indicates an improved ability of the models to distinguish between faulty and nonfaulty classes with enhanced imbalanced learning, namely, E_BB, E_EE, and ROSBoost.
Concerning the specificity metric, which measures the ability of the models to classify nonfaulty samples accurately, the ROSBoost classifier outperformed the RUSBoost classifier with a mean of 0.83 compared to 0.77. However, the E_BB and E_EE classifiers obtained lower values than the standard imbalanced learning models BB and EE, with means of 0.85 and 0.81, respectively. The best classifier for predicting the nonfaulty class is BB, with a mean of 0.87, as shown in Figure 10(c).
Figure 10(d) presents the balanced accuracy metric for evaluating the classifiers’ ability to predict correct classes. As depicted in the figure, the proposed classifiers E_BB, E_EE, and ROSBoost outperformed all other models, with a mean value of 0.87 for each. The BB classifier’s average balanced accuracy value of 0.86 is slightly lower than all the proposed methods. However, the lowest values were observed for RUSBoost (0.81) and EE (0.82) classifiers.
Figure 11 provides insight into how the proposed methods have improved the prediction process on the combined datasets. The F-measure values of E_BB (0.89), ROSBoost (0.69), and E_EE (0.87) were found to be better than those of their standard counterparts, BB (0.84), RUSBoost (0.66), and EE (0.78), respectively. Regarding the AUC measure, the E_EE classifier scored the highest value of 0.92, followed by E_BB with a value of 0.91, outperforming all other models. Moreover, the E_EE and E_BB models achieved the highest balanced accuracy scores compared to all other classifiers.
[figure(s) omitted; refer to PDF]
The proposed EL methods E_BB, ROSBoost, and E_EE outperform their respective base models BB, RUSBoost, and EE regarding balanced accuracy, AUC, and F-measure. However, regarding specificity, E_BB and ROSBoost outperformed their base models, but BB performed better than E_BB. Overall, the proposed enhancement to EL methods for imbalanced data proved effective with this study’s historical dataset.
4.2. Statistical Analysis
To conduct a comprehensive analysis of the experimental results and determine whether there are statistically significant differences between the proposed EL methods and the other models, the statistical test called the WSR test is utilized. The WSR test compares the performance results of the three enhanced classifiers with each standard classifier. This test determines if the enhanced models significantly outperform the standard models regarding their F-measure and AUC. The WSR test allows us to evaluate how much the suggested strategies enhance classification performance and decide whether or not the variations are statistically significant. This allows us to draw meaningful conclusions about the effectiveness of the proposed models. Furthermore, the insights obtained from this analysis can contribute to the advancement of EL as a potential solution for addressing CI problems.
Tables 6, 7, and 8 present the results of the WSR test over the F-measure for the classifiers BB vs. E_BB, RUSBoost vs. ROSBoost, and EE vs. E_EE, respectively. Initially, we find the differences between each pair being compared. The observations that show no difference (Diff) are excluded from the subsequent ranking step. Then, we assign ranks to the absolute differences |Diff|. After that, the ranks are multiplied by the sign of the difference and summed separately to obtain the negative and positive rank sums (W− and W+).
Table 6
WSR test of F-measure between BB and E_BB.
Datasets | BB | E_BB | Diff | Sign | |Diff| | Rank | Signed rank |
ant-1.7 | 0.85 | 0.92 | 0.07 | 1 | 0.07 | 11 | 11 |
camel-1.0 | 0.88 | 0.99 | 0.11 | 1 | 0.11 | 16 | 16 |
camel-1.6 | 0.86 | 0.94 | 0.08 | 1 | 0.08 | 12.5 | 12.5 |
data_arc | 0.89 | 0.93 | 0.04 | 1 | 0.04 | 7 | 7 |
data_ivy-2.0 | 0.87 | 0.98 | 0.11 | 1 | 0.11 | 16 | 16 |
data_prop-6 | 0.9 | 0.93 | 0.03 | 1 | 0.03 | 3.5 | 3.5 |
data_redaktor | 0.87 | 0.9 | 0.03 | 1 | 0.03 | 3.5 | 3.5 |
jedit-3.2 | 0.93 | 0.93 | 0 | — | — | — | — |
jedit-4.2 | 0.83 | 0.95 | 0.12 | 1 | 0.12 | 18 | 18 |
log4j-1.1 | 0.82 | 0.9 | 0.08 | 1 | 0.08 | 12.5 | 12.5 |
lucene-2.0 | 0.86 | 0.86 | 0 | — | — | — | — |
poi-2.0 | 0.83 | 0.94 | 0.11 | 1 | 0.11 | 16 | 16 |
synapse-1.0 | 0.86 | 0.96 | 0.1 | 1 | 0.1 | 14 | 14 |
synapse-1.2 | 0.91 | 0.92 | 0.01 | 1 | 0.01 | 1 | 1 |
velocity-1.6 | 0.87 | 0.91 | 0.04 | 1 | 0.04 | 7 | 7 |
xalan-2.4 | 0.87 | 0.9 | 0.03 | 1 | 0.03 | 3.5 | 3.5 |
xerces-1.2 | 0.88 | 0.92 | 0.04 | 1 | 0.04 | 7 | 7 |
xerces-1.3 | 0.91 | 0.97 | 0.06 | 1 | 0.06 | 10 | 10 |
xerces-1.4 | 0.94 | 0.97 | 0.03 | 1 | 0.03 | 3.5 | 3.5 |
combined data | 0.84 | 0.89 | 0.05 | 1 | 0.05 | 9 | 9 |
Table 7
WSR test of F-measure between RUSBoost and ROSBoost.
Datasets | RUSBoost | ROSBoost | Diff | Sign | |Diff| | Rank | Signed rank |
ant-1.7 | 0.81 | 0.88 | 0.07 | 1 | 0.07 | 12 | 12 |
camel-1.0 | 0.87 | 0.98 | 0.11 | 1 | 0.11 | 17 | 17 |
camel-1.6 | 0.73 | 0.8 | 0.07 | 1 | 0.07 | 12 | 12 |
data_arc | 0.87 | 0.95 | 0.08 | 1 | 0.08 | 14 | 14 |
data_ivy-2.0 | 0.93 | 0.96 | 0.03 | 1 | 0.03 | 4.5 | 4.5 |
data_prop-6 | 0.78 | 0.91 | 0.13 | 1 | 0.13 | 18 | 18 |
data_redaktor | 0.88 | 0.88 | 0 | — | — | — | — |
jedit-3.2 | 0.91 | 0.93 | 0.02 | 1 | 0.02 | 1.5 | 1.5 |
jedit-4.2 | 0.87 | 0.92 | 0.05 | 1 | 0.05 | 9 | 9 |
log4j-1.1 | 0.9 | 0.9 | 0 | — | — | — | — |
lucene-2.0 | 0.88 | 0.92 | 0.04 | 1 | 0.04 | 7.5 | 7.5 |
poi-2.0 | 0.88 | 0.92 | 0.04 | 1 | 0.04 | 7.5 | 7.5 |
synapse-1.0 | 0.93 | 0.96 | 0.03 | 1 | 0.03 | 4.5 | 4.5 |
synapse-1.2 | 0.82 | 0.89 | 0.07 | 1 | 0.07 | 12 | 12 |
velocity-1.6 | 0.91 | 0.94 | 0.03 | 1 | 0.03 | 4.5 | 4.5 |
xalan-2.4 | 0.75 | 0.84 | 0.09 | 1 | 0.09 | 15.5 | 15.5 |
xerces-1.2 | 0.82 | 0.88 | 0.06 | 1 | 0.06 | 10 | 10 |
xerces-1.3 | 0.87 | 0.96 | 0.09 | 1 | 0.09 | 15.5 | 15.5 |
xerces-1.4 | 0.92 | 0.94 | 0.02 | 1 | 0.02 | 1.5 | 1.5 |
combined data | 0.66 | 0.69 | 0.03 | 1 | 0.03 | 4.5 | 4.5 |
Table 8
WSR test of F-measure between EE and E_EE.
Datasets | EE | E_EE | Diff | Sign | |Diff| | Rank | Signed rank |
ant-1.7 | 0.84 | 0.93 | 0.09 | 1 | 0.09 | 4.5 | 4.5 |
camel-1.0 | 0.76 | 0.98 | 0.22 | 1 | 0.22 | 18 | 18 |
camel-1.6 | 0.75 | 0.93 | 0.18 | 1 | 0.18 | 16 | 16 |
data_arc | 0.67 | 0.93 | 0.26 | 1 | 0.26 | 19 | 19 |
data_ivy-2.0 | 0.81 | 0.95 | 0.14 | 1 | 0.14 | 9 | 9 |
data_prop-6 | 0.77 | 0.93 | 0.16 | 1 | 0.16 | 12 | 12 |
data_redaktor | 0.8 | 0.9 | 0.1 | 1 | 0.1 | 6.5 | 6.5 |
jedit-3.2 | 0.94 | 0.94 | 0 | — | — | — | — |
jedit-4.2 | 0.78 | 0.92 | 0.14 | 1 | 0.14 | 9 | 9 |
log4j-1.1 | 0.76 | 0.9 | 0.14 | 1 | 0.14 | 9 | 9 |
lucene-2.0 | 0.86 | 0.85 | −0.01 | −1 | 0.01 | 1 | −1 |
poi-2.0 | 0.76 | 0.93 | 0.17 | 1 | 0.17 | 14 | 14 |
synapse-1.0 | 0.77 | 0.96 | 0.19 | 1 | 0.19 | 17 | 17 |
synapse-1.2 | 0.89 | 0.95 | 0.06 | 1 | 0.06 | 3 | 3 |
velocity-1.6 | 0.8 | 0.9 | 0.1 | 1 | 0.1 | 6.5 | 6.5 |
xalan-2.4 | 0.74 | 0.91 | 0.17 | 1 | 0.17 | 14 | 14 |
xerces-1.2 | 0.76 | 0.91 | 0.15 | 1 | 0.15 | 11 | 11 |
xerces-1.3 | 0.8 | 0.97 | 0.17 | 1 | 0.17 | 14 | 14 |
xerces-1.4 | 0.92 | 0.97 | 0.05 | 1 | 0.05 | 2 | 2 |
combined data | 0.78 | 0.87 | 0.09 | 1 | 0.09 | 4.5 | 4.5 |
Table 8 displays the F-measure results for EE versus E_EE. The null hypothesis is rejected based on the values of the WSR test statistics, where the sum of the positive ranks (W+ = 189) greatly exceeds the sum of the negative ranks (W− = 1), indicating that E_EE significantly outperforms EE in terms of F-measure.
Tables 9, 10, and 11 provide the outcomes of the WSR test conducted on the AUC metric for the following classifier comparisons: BB versus E_BB, RUSBoost versus ROSBoost, and EE versus E_EE, respectively. AUC serves as a measure of overall model performance. The findings indicate that E_BB statistically outperforms BB, with a marked difference between the positive and negative rank sums (W+ = 158 vs. W− = 19).
Table 9
WSR test of AUC measure between BB and E_BB.
Datasets | BB | E_BB | Diff | Sign | |Diff| | Rank | Signed rank |
ant-1.7 | 0.91 | 0.93 | 0.02 | 1 | 0.02 | 11 | 11 |
camel-1.0 | 0.99 | 1 | 0.01 | 1 | 0.01 | 5 | 5 |
camel-1.6 | 0.94 | 0.96 | 0.02 | 1 | 0.02 | 11 | 11 |
data_arc | 0.98 | 0.99 | 0.01 | 1 | 0.01 | 5 | 5 |
data_ivy-2.0 | 0.99 | 0.99 | 0 | — | — | — | — |
data_prop-6 | 0.88 | 0.89 | 0.01 | 1 | 0.01 | 5 | 5 |
data_redaktor | 0.86 | 0.83 | −0.03 | −1 | 0.03 | 14 | −14 |
jedit-3.2 | 0.98 | 0.98 | 0 | — | — | — | — |
jedit-4.2 | 0.88 | 0.93 | 0.05 | 1 | 0.05 | 17.5 | 17.5 |
log4j-1.1 | 0.87 | 0.88 | 0.01 | 1 | 0.01 | 5 | 5 |
lucene-2.0 | 0.93 | 0.94 | 0.01 | 1 | 0.01 | 5 | 5 |
poi-2.0 | 0.9 | 0.93 | 0.03 | 1 | 0.03 | 14 | 14 |
synapse-1.0 | 0.94 | 0.99 | 0.05 | 1 | 0.05 | 17.5 | 17.5 |
synapse-1.2 | 0.97 | 0.98 | 0.01 | 1 | 0.01 | 5 | 5 |
velocity-1.6 | 0.92 | 0.95 | 0.03 | 1 | 0.03 | 14 | 14 |
xalan-2.4 | 0.87 | 0.91 | 0.04 | 1 | 0.04 | 16 | 16 |
xerces-1.2 | 0.93 | 0.94 | 0.01 | 1 | 0.01 | 5 | 5 |
xerces-1.3 | 0.96 | 0.97 | 0.01 | 1 | 0.01 | 5 | 5 |
xerces-1.4 | 0.99 | 0.98 | −0.01 | −1 | 0.01 | 5 | −5 |
combined data | 0.89 | 0.91 | 0.02 | 1 | 0.02 | 11 | 11 |
Table 10
WSR test of AUC measure between RUSBoost and ROSBoost.
Datasets | RUSBoost | ROSBoost | Diff | Sign | |Diff| | Rank | Signed rank |
ant-1.7 | 0.86 | 0.88 | 0.02 | 1 | 0.02 | 5 | 5 |
camel-1.0 | 0.95 | 1 | 0.05 | 1 | 0.05 | 10.5 | 10.5 |
camel-1.6 | 0.79 | 0.86 | 0.07 | 1 | 0.07 | 15 | 15 |
data_arc | 0.81 | 0.95 | 0.14 | 1 | 0.14 | 18 | 18 |
data_ivy-2.0 | 0.94 | 0.98 | 0.04 | 1 | 0.04 | 8 | 8 |
data_prop-6 | 0.8 | 0.86 | 0.06 | 1 | 0.06 | 13.5 | 13.5 |
data_redaktor | 0.76 | 0.81 | 0.05 | 1 | 0.05 | 10.5 | 10.5 |
jedit-3.2 | 0.98 | 0.97 | −0.01 | −1 | 0.01 | 2 | −2 |
jedit-4.2 | 0.79 | 0.88 | 0.09 | 1 | 0.09 | 16 | 16 |
log4j-1.1 | 0.89 | 0.89 | 0 | — | — | — | — |
lucene-2.0 | 0.92 | 0.94 | 0.02 | 1 | 0.02 | 5 | 5 |
poi-2.0 | 0.85 | 0.84 | −0.01 | −1 | 0.01 | 2 | −2 |
synapse-1.0 | 0.81 | 0.99 | 0.18 | 1 | 0.18 | 19 | 19 |
synapse-1.2 | 0.89 | 0.94 | 0.05 | 1 | 0.05 | 10.5 | 10.5 |
velocity-1.6 | 0.86 | 0.92 | 0.06 | 1 | 0.06 | 13.5 | 13.5 |
xalan-2.4 | 0.71 | 0.84 | 0.13 | 1 | 0.13 | 17 | 17 |
xerces-1.2 | 0.87 | 0.92 | 0.05 | 1 | 0.05 | 10.5 | 10.5 |
xerces-1.3 | 0.89 | 0.92 | 0.03 | 1 | 0.03 | 7 | 7 |
xerces-1.4 | 0.95 | 0.96 | 0.01 | 1 | 0.01 | 2 | 2 |
combined data | 0.67 | 0.69 | 0.02 | 1 | 0.02 | 5 | 5 |
Table 11
WSR test of AUC measure between EE and E_EE.
Datasets | EE | E_EE | Diff | Sign | |Diff| | Rank | Signed rank |
ant-1.7 | 0.92 | 0.94 | 0.02 | 1 | 0.02 | 9 | 9 |
camel-1.0 | 0.98 | 1 | 0.02 | 1 | 0.02 | 9 | 9 |
camel-1.6 | 0.93 | 0.96 | 0.03 | 1 | 0.03 | 14.5 | 14.5 |
data_arc | 0.95 | 0.95 | 0 | — | — | — | — |
data_ivy-2.0 | 0.99 | 0.99 | 0 | — | — | — | — |
data_prop-6 | 0.89 | 0.86 | −0.03 | −1 | 0.03 | 14.5 | −14.5 |
data_redaktor | 0.8 | 0.82 | 0.02 | 1 | 0.02 | 9 | 9 |
jedit-3.2 | 0.97 | 0.98 | 0.01 | 1 | 0.01 | 3 | 3 |
jedit-4.2 | 0.89 | 0.91 | 0.02 | 1 | 0.02 | 9 | 9 |
log4j-1.1 | 0.9 | 0.89 | −0.01 | −1 | 0.01 | 3 | −3 |
lucene-2.0 | 0.93 | 0.94 | 0.01 | 1 | 0.01 | 3 | 3 |
poi-2.0 | 0.93 | 0.95 | 0.02 | 1 | 0.02 | 9 | 9 |
synapse-1.0 | 0.94 | 0.96 | 0.02 | 1 | 0.02 | 9 | 9 |
synapse-1.2 | 0.97 | 0.96 | −0.01 | −1 | 0.01 | 3 | −3 |
velocity-1.6 | 0.91 | 0.94 | 0.03 | 1 | 0.03 | 14.5 | 14.5 |
xalan-2.4 | 0.9 | 0.91 | 0.01 | 1 | 0.01 | 3 | 3 |
xerces-1.2 | 0.93 | 0.95 | 0.02 | 1 | 0.02 | 9 | 9 |
xerces-1.3 | 0.96 | 0.99 | 0.03 | 1 | 0.03 | 14.5 | 14.5 |
xerces-1.4 | 0.99 | 0.99 | 0 | — | — | — | — |
combined data | 0.88 | 0.92 | 0.04 | 1 | 0.04 | 17 | 17 |
Similarly, Table 10 presents the WSR test results for the AUC metric in the ROSBoost versus RUSBoost comparison. The null hypothesis is rejected due to the significantly higher sum of the positive ranks (W+ = 186) compared to the negative ranks (W− = 4).
Finally, Table 11 displays the AUC results for EE versus E_EE, where the null hypothesis is rejected based on the WSR test statistics, with W+ = 132.5 and W− = 20.5.
The results presented in Table 12 compare the various methods in terms of their effect sizes and rank sums. Specifically, for every performance metric, the table shows the effect size (r) together with the positive and negative rank sums, reported in the format r (W+/W−).
• Based on F-measure, the high effect sizes show that the proposed models, namely, E_BB, ROSBoost, and E_EE, outperform their corresponding classes.
• Similarly, in the AUC analysis, the proposed models, E_BB, ROSBoost, and E_EE, outperform their respective methods with high effect size values of 0.78, 0.94, and 0.97, respectively.
Table 12
Statistical comparison using effect size (r), reported as r (W+/W−), in terms of F-measure and AUC.
Classifiers | F-measure | AUC |
BB vs. E_BB | 1 (171/0) | 0.78 (158/19) |
RUSBoost vs. ROSBoost | 1 (171/0) | 0.94 (186/4) |
EE vs. E_EE | 0.99 (189/1) | 0.97 (132.5/20.5) |
The statistical analysis conducted in this study aimed to assess the impact of the proposed methods across all 19 datasets and the combined data. The performance of the classifiers was evaluated using different measures, such as the F-measure and AUC. Through the application of the WSR test and consideration of the ranks and effect sizes, it was observed that the E_BB, ROSBoost, and E_EE classifiers achieved significantly better performance in terms of the F-measure, with somewhat smaller, but still large, effect sizes for AUC. Therefore, it can be concluded that the proposed ensemble methods for imbalanced data showed superior performance in comparison to the standard methods in this study.
5. Conclusion
SFP is vital in SE for early defect detection and enhancing software quality and reliability. A major challenge in SFP is managing CI. EL is an effective strategy for improving SFP models in OO systems, particularly in handling imbalanced data and increasing sensitivity to minority classes.
The conclusion of the study emphasizes the effectiveness of EL in enhancing SFP models for OO systems with imbalanced data. The researchers proposed three enhanced classifiers: E_BB, ROSBoost, and E_EE, which were improvements over BB, RUSBoost, and EE, respectively. The enhancements involved replacing RUS with ROS methods to address the CI and AdaBoost with XGBoost as a base learner in the E_EE classifier.
The experiment, conducted on 19 datasets from the PROMISE repository using Python 3.7, evaluated the performance of the models using measures such as F-measure, balanced accuracy, specificity, and AUC. The results showed that E_BB, ROSBoost, and E_EE classifiers outperformed their corresponding base models regarding F-measure, balanced accuracy, and AUC. The WSR test confirmed the statistical significance of these improvements, with the rejection of the null hypothesis and a notable effect size indicating the practical significance of the proposed models.
Future research recommends developing a software tool for predicting faults in OO systems. This tool should incorporate ensemble classifiers designed for imbalanced datasets and allow customization of parameters for sampling techniques and base classifiers, allowing researchers to adapt the models to specific dataset characteristics.
Overall, this study contributes valuable knowledge to SFP, offering practical solutions for improving model accuracy in the presence of imbalanced data.
5.1. Research Limitations
While this study aimed to investigate the effectiveness of ensemble classifiers in addressing CI in SFP, several limitations should be considered. First, the sample size primarily consisted of OO examples, limiting the generalizability of findings to all system types. Second, some datasets had small sample sizes; for instance, the log4j-1.1 dataset included 72 OO samples, and Lucene-2.0 included 104 samples. This could limit the findings’ applicability to other OO datasets individually. However, the proposed methodology showed an observable improvement when considering the combined set of 19 datasets comprising over 6000 samples. Future efforts could be directed toward developing a tool that expands the research scope to include all system types and simplifies tuning classifier parameters.
Ethics Statement
The paper is an original research contribution that has not been published elsewhere in any form or language. In addition, this research is part of a master’s degree and has been approved by the University of Jordan Graduate School with number 16308/2023/10, Date 12/07/2023.
Author Contributions
Hanan Sharif Alsorory, a programmer and a master’s student, contributed to writing certain document sections. Mohammad is responsible for writing and supervising the entire process.
Funding
The authors did not receive support from any organization for the submitted work.
[1] S. Rathore, S. Kumar, Fault Prediction Modeling for the Prediction of Number of Software Faults, SpringerBriefs in Computer Science, 2019.
[2] A. Singh, R. Bhatia, A. Sighrova, "Taxonomy of Machine Learning Algorithms in Software Fault Prediction Using Object-Oriented Metrics," Procedia Computer Science, vol. 132, pp. 993-1001, DOI: 10.1016/j.procs.2018.05.115, 2018.
[3] P. Chujai, K. Chomboon, P. Teerarassamee, N. Kerdprasop, K. Kerdprasop, "Ensemble Learning For Imbalanced Data Classification Problem," Proceedings of the 3rd International Conference on Industrial Application Engineering, vol. 467, pp. 449-456, DOI: 10.12792/iciae2015.079, .
[4] I. Arora, V. Tetarwal, A. Saha, "Open Issues in Software Defect Prediction," Procedia Computer Science, vol. 46, pp. 906-912, DOI: 10.1016/j.procs.2015.02.161, 2015.
[5] Z.-H. Zhou, Ensemble Methods, 2012.
[6] N. Shaar, M. Alshraideh, L. Shboul, I. AlDajani, "Decision Support System (DSS) for Traffic Prediction and Building a Dynamic Internet Community Using Netnography Technology in the City of Amman," Journal of Experimental & Theoretical Artificial Intelligence,DOI: 10.1080/0952813X.2023.2165716, 2023.
[7] B. V. Dasarathy, B. V. Sheela, "A Composite Classifier System Design: Concepts and Methodology," Proceedings of the IEEE, vol. 67 no. 5, pp. 708-713, DOI: 10.1109/PROC.1979.11321, 1979.
[8] L. K. Hansen, P. Salamon, "Neural Network Ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12 no. 10, pp. 993-1001, DOI: 10.1109/34.58871, 1990.
[9] Y. Freund, R. E. Schapire, "A Decision-Theoretic Generalization of Online Learning and an Application to Boosting," Journal of Computer and System Sciences, vol. 55 no. 1, pp. 119-139, DOI: 10.1006/jcss.1997.1504, 1997.
[10] A. Petrakova, M. Affenzeller, G. Merkurjeva, "Heterogeneous Versus Homogeneous Machine Learning Ensembles," Information Technology and Management Science, vol. 18 no. 1, DOI: 10.1515/itms-2015-0021, 2015.
[11] L. Breiman, "Bagging Predictors," Machine Learning, vol. 24 no. 2, pp. 123-140, DOI: 10.1007/bf00058655, 1996.
[12] H. I. Aljamaan, M. O. Elish, "An Empirical Study of Bagging and Boosting Ensembles for Identifying Faulty Classes in Object-Oriented Software," 2009 IEEE Symposium on Computational Intelligence and Data Mining, vol. 8, pp. 187-194, DOI: 10.1109/CIDM.2009.4938648, 2009.
[13] B. J. Worton, J. S. U. Hjorth, "Computer Intensive Statistical Methods: Validation Model Selection and Bootstrap," Journal of the Royal Statistical Society: Series A, vol. 157 no. 3, DOI: 10.2307/2983538, 1994.
[14] J. H. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics, vol. 29 no. 5, pp. 1189-1232, DOI: 10.1214/aos/1013203451, 2001.
[15] T. Chen, C. Guestrin, "XGBoost: A Scalable Tree Boosting System," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 11, pp. 785-794, DOI: 10.1145/2939672.2939785, 2016.
[16] R. Guo, Z. Zhao, T. Wang, G. Liu, J. Zhao, D. Gao, "Degradation State Recognition of Piston Pump Based on ICEEMDAN and XGBoost," Applied Sciences, vol. 10 no. 18, pp. 6593-6617, DOI: 10.3390/APP10186593, 2020.
[17] X. Ren, H. Guo, S. Li, S. Wang, J. Li, "A Novel Image Classification Method With CNN-XGBoost Model," Lecture Notes in Computer Science, vol. 10431, pp. 378-390, DOI: 10.1007/978-3-319-64185-0_28, 2017.
[18] Z. Li, Q. Zhang, Y. Wang, S. Wang, "Social Media Rumor Refuter Feature Analysis and Crowd Identification Based on XGBoost and NLP," Applied Sciences, vol. 10 no. 14, DOI: 10.3390/app10144711, 2020.
[19] D. L. Gupta, K. Saxena, "Software Bug Prediction Using Object-Oriented Metrics," Sādhanā, vol. 42 no. 5, pp. 655-669, DOI: 10.1007/s12046-017-0629-5, 2017.
[20] R. Ponnala, C. R. K. Reddy, "Object Oriented Dynamic Metrics in Software Development: A Literature Review," International Journal of Applied Engineering Research, vol. 14 no. 22, pp. 4161-4172, 2019. https://www.ripublication.com
[21] S. K. Pandey, R. B. Mishra, A. K. Tripathi, "Machine Learning Based Methods for Software Fault Prediction: A Survey," Expert Systems with Applications, vol. 172, DOI: 10.1016/j.eswa.2021.114595, 2021.
[22] S. R. Chidamber, C. F. Kemerer, "A Metrics Suite for Object Oriented Design," IEEE Transactions on Software Engineering, vol. 20 no. 6, pp. 476-493, DOI: 10.1109/32.295895, 1994.
[23] A. Kaur, R. Malhotra, "Application of Random Forest in Predicting Fault-Prone Classes," 2008 International Conference on Advanced Computer Theory and Engineering, pp. 37-43, DOI: 10.1109/ICACTE.2008.204, 2008.
[24] Q. Yu, S. Jiang, J. Qian, L. Bo, L. Jiang, G. Zhang, "Process Metrics for Software Defect Prediction in Object-Oriented Programs," IET Software, vol. 14 no. 3, pp. 283-292, DOI: 10.1049/iet-sen.2018.5439, 2020.
[25] M. Lanza, R. Marinescu, Object-Oriented Metrics in Practice, 2006.
[26] J. Bansiya, C. G. Davis, "A Hierarchical Model for Object-Oriented Design Quality Assessment," IEEE Transactions on Software Engineering, vol. 28 no. 1, DOI: 10.1109/32.979986, 2002.
[27] M. Jureczko, D. Spinellis, "Using Object-Oriented Design Metrics to Predict Software Defects," International Conference on Dependability of Computer Systems DepCoS, Monographs of System Dependability, pp. 69-81, 2010.
[28] M. Alshraideh, L. Bottaci, "Search‐Based Software Test Data Generation for String Data Using Program‐Specific Search Operators," Software Testing, Verification and Reliability, vol. 16 no. 3, pp. 175-203, DOI: 10.1002/stvr.354, 2006.
[29] M. Alshraideh, B. A. Mahafzah, S. Al-Sharaeh, "A Multiple-Population Genetic Algorithm for Branch Coverage Test Data Generation," Software Quality Journal, vol. 19 no. 3, pp. 489-513, DOI: 10.1007/s11219-010-9117-4, 2011.
[30] M. Alshraideh, L. Bottaci, B. A. Mahafzah, "Using Program Data-State Scarcity to Guide Automatic Test Data Generation," Software Quality Journal, vol. 18 no. 1, pp. 109-144, DOI: 10.1007/s11219-009-9083-x, 2010.
[31] S. Di Martino, F. Ferrucci, C. Gravino, F. Sarro, "A Genetic Algorithm to Configure Support Vector Machines for Predicting Fault-Prone Components," Lecture Notes in Computer Science, vol. 6759, pp. 247-261, DOI: 10.1007/978-3-642-21843-9_20, 2011.
[32] R. Malhotra, "A Systematic Review of Machine Learning Techniques for Software Fault Prediction," Applied Soft Computing, vol. 27, pp. 504-518, DOI: 10.1016/j.asoc.2014.11.023, 2015.
[33] H. Ji, S. Huang, Y. Wu, Z. Hui, C. Zheng, "A New Weighted Naive Bayes Method Based on Information Diffusion for Software Defect Prediction," Software Quality Journal, vol. 27 no. 3, pp. 923-968, DOI: 10.1007/s11219-018-9436-4, 2019.
[34] P. Manchala, M. Bisi, "Diversity Based Imbalance Learning Approach for Software Fault Prediction Using Machine Learning Models," Applied Soft Computing, vol. 124, DOI: 10.1016/j.asoc.2022.109069, 2022.
[35] S. Huda, K. Liu, M. Abdelrazek, "An Ensemble Oversampling Model for Class Imbalance Problem in Software Defect Prediction," IEEE Access, vol. 6, pp. 24184-24195, DOI: 10.1109/ACCESS.2018.2817572, 2018.
[36] N. Junsomboon, T. Phienthrakul, "Combining Over-Sampling and Under-Sampling Techniques for Imbalance Dataset," pp. 243-247, DOI: 10.1145/3055635.3056643.
[37] B. Zadrozny, J. Langford, N. Abe, "Cost-Sensitive Learning by Cost-Proportionate Example Weighting," Third IEEE International Conference on Data Mining, pp. 435-442, DOI: 10.1109/icdm.2003.1250950, 2003.
[38] L. Liu, X. Wu, S. Li, Y. Li, S. Tan, Y. Bai, "Solving the Class Imbalance Problem Using Ensemble Algorithm: Application of Screening for Aortic Dissection," BMC Medical Informatics and Decision Making, vol. 22 no. 1, pp. 82-16, DOI: 10.1186/s12911-022-01821-w, 2022.
[39] M. Hosni, I. Abnane, A. Idri, J. M. Carrillo de Gea, J. L. Fernández Alemán, "Reviewing Ensemble Classification Methods in Breast Cancer," Computer Methods and Programs in Biomedicine, vol. 177, pp. 89-112, DOI: 10.1016/j.cmpb.2019.05.019, 2019.
[40] B. Krawczyk, "Learning From Imbalanced Data: Open Challenges and Future Directions," Progress in Artificial Intelligence, vol. 5 no. 4, pp. 221-232, DOI: 10.1007/s13748-016-0094-0, 2016.
[41] R. Li, L. Zhou, S. Zhang, H. Liu, X. Huang, Z. Sun, "Software Defect Prediction Based on Ensemble Learning," Proceedings of the 2019 2nd International Conference on Data Science and Information Technology, DOI: 10.1145/3352411.3352412, 2019.
[42] T. M. Barros, P. A. Souza Neto, I. Silva, L. A. Guedes, "Predictive Models for Imbalanced Data: A School Dropout Perspective," Education Sciences, vol. 9 no. 4, DOI: 10.3390/educsci9040275, 2019.
[43] G. Lemaître, F. Nogueira, C. K. Aridas, "Imbalanced-Learn: A python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning," Journal of Machine Learning Research, vol. 18, 2017.
[44] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, A. Napolitano, "RUSBoost: Improving Classification Performance when Training Data Is Skewed," 2008 19th International Conference on Pattern Recognition, DOI: 10.1109/icpr.2008.4761297, 2008.
[45] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, A. Napolitano, "RUSBoost: A Hybrid Approach to Alleviating Class Imbalance," IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 40 no. 1, pp. 185-197, DOI: 10.1109/TSMCA.2009.2029559, 2010.
[46] X. Y. Liu, J. Wu, Z. H. Zhou, "Exploratory Undersampling for Class-Imbalance Learning," IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, vol. 39 no. 2, pp. 539-550, DOI: 10.1109/TSMCB.2008.2007853, 2009.
[47] Y. Chen, L. Chen, Y. Wang, Y. Zheng, H. Su, "Application Research on Prediction of Weld Ultrasonic Inspection Results Based on EasyEnsemble and XGBoost Algorithm," 2021 11th International Conference on Intelligent Control and Information Processing (ICICIP), vol. 22, pp. 341-345, DOI: 10.1109/ICICIP53388.2021.9642193, 2021.
[48] X. Ren, Z. Yuan, J. Huang, "Research on Fake Reviews Detection Based on Feature Construction and EasyEnsemble-RF," 2021 2nd International Conference on Artificial Intelligence and Computer Engineering (ICAICE), vol. 35 no. 7, pp. 478-482, DOI: 10.1109/ICAICE54393.2021.00098, 2021.
[49] D. Aggarwal, "Software Defect Prediction Dataset," 2021. https://figshare.com/articles/dataset/Software_Defect_Prediction_Dataset/13536506
[50] N. Cordón, "Imbalance/man/imbalance Ratio," 2018. https://github.com/ncordon/imbalance/blob/master/man/imbalanceRatio.Rd
[51] E. Rahm, H. Do, "Data Cleaning: Problems and Current Approaches," IEEE Data Engineering Bulletin, vol. 23 no. 4, 2000.
[52] V. Sathya Durga, T. Jeyaprakash, "An Effective Data Normalization Strategy for Academic Datasets Using Log Values," 2019 International Conference on Communication and Electronics Systems (ICCES), vol. 5, pp. 610-612, DOI: 10.1109/ICCES45898.2019.9002089, 2019.
[53] S. Jain, S. Shukla, R. Wadhvani, "Dynamic Selection of Normalization Techniques Using Data Complexity Measures," Expert Systems with Applications, vol. 106, pp. 252-262, DOI: 10.1016/j.eswa.2018.04.008, 2018.
[54] V. N. G. Raju, K. P. Lakshmi, V. M. Jain, A. Kalidindi, V. Padma, "Study the Influence of Normalization/Transformation Process on the Accuracy of Supervised Classification," 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), pp. 729-735, DOI: 10.1109/ICSSIT48917.2020.9214160, 2020.
[55] P. Wlodarczak, "Data Pre-Processing," Machine Learning With Applications, pp. 43-51, DOI: 10.1201/9780429448782-3, 2019.
[56] R. Powers, M. Goldszmidt, I. Cohen, "Short-Term Performance Forecasting in Enterprise Systems," Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 801-807, DOI: 10.1145/1081870.1081976, 2005.
[57] D. Chicco, N. Tötsch, G. Jurman, "The Matthews Correlation Coefficient (MCC) Is More Reliable Than Balanced Accuracy, Bookmaker Informedness, and Markedness in Two-Class Confusion Matrix Evaluation," BioData Mining, vol. 14, pp. 13-22, DOI: 10.1186/s13040-021-00244-z, 2021.
[58] H. He, E. A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, vol. 21 no. 9, pp. 1263-1284, DOI: 10.1109/TKDE.2008.239, 2009.
[59] M. Sokolova, G. Lapalme, "A Systematic Analysis of Performance Measures for Classification Tasks," Information Processing & Management, vol. 45 no. 4, pp. 427-437, DOI: 10.1016/j.ipm.2009.03.002, 2009.
[60] J. A. Hanley, B. J. McNeil, "The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve," Radiology, vol. 143 no. 1, pp. 29-36, DOI: 10.1148/radiology.143.1.7063747, 1982.
[61] Zach, "How to Interpret a ROC Curve (With Examples)," Statology, 2021. https://www.statology.org/interpret-roc-curve/
[62] J. Davis, M. Goadrich, "The Relationship between Precision-Recall and ROC Curves," Proceedings of the 23rd International Conference on Machine Learning-ICML'06, vol. 148, pp. 233-240, DOI: 10.1145/1143844.1143874, 2006.
[63] F. Wilcoxon, "Individual Comparisons of Grouped Data by Ranking Methods," Journal of Economic Entomology, vol. 39 no. 2, pp. 269-270, DOI: 10.1093/jee/39.2.269, 1946.
[64] M. Tomczak, E. Tomczak, "The Need to Report Effect Size Estimates Revisited. An Overview of Some Recommended Measures of Effect Size," TRENDS in Sport Sciences, vol. 1 no. 21, pp. 19-25, 2014. https://www.wbc.poznan.pl/Content/325867/5_Trends_Vol21_2014_no1_20.pdf
[65] A. S. Tarawneh, A. B. A. Hassanat, K. Almohammadi, D. Chetverikov, C. Bellinger, "SMOTEFUNA: Synthetic Minority Over-Sampling Technique Based on Furthest Neighbour Algorithm," IEEE Access, vol. 8, pp. 59069-59082, DOI: 10.1109/ACCESS.2020.2983003, 2020.
Copyright © 2024 Hanan Sharif Alsorory and Mohammad Alshraideh. Distributed under the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/).
Abstract
Software fault prediction (SFP) is a crucial aspect of software engineering, aiding in the early identification of potential defects. This proactive approach significantly contributes to enhancing software quality and reliability. However, a common challenge in SFP is class imbalance (CI). Ensemble learning (EL) is a powerful strategy for refining SFP models in object-oriented systems with imbalanced data and improving sensitivity to minority classes. This study aimed to improve the effectiveness of ensemble classifiers in SFP within object-oriented systems, tackling the challenges associated with imbalanced data. It focuses on enhancing the performance of three ensemble classifiers specifically designed for imbalanced datasets: BalancedBagging, RUSBoost, and EasyEnsemble. In Enhanced_BalancedBagging (E_BB) and ROSBoost, random undersampling (RUS) is substituted with random oversampling (ROS). Meanwhile, Enhanced_EasyEnsemble (E_EE) replaces RUS with ROS and AdaBoost with XGBoost. The experimental results demonstrate the superior performance of E_BB, ROSBoost, and E_EE over their base models, achieving the highest F-measure, balanced accuracy, and AUC. Statistical tests, such as the Wilcoxon signed-rank test, provide robust support for the enhanced models, highlighting their practical significance through substantial improvements in F-measure and AUC, as indicated by low negative rank sums and large effect sizes.
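For readers who wish to experiment with the ideas summarized above, the sketch below shows one plausible way to approximate the enhanced models and the paired comparison using the imbalanced-learn, XGBoost, scikit-learn, and SciPy libraries. It is an illustration under those assumptions, not the authors' implementation; in particular, per-iteration oversampling inside boosting (ROSBoost) has no off-the-shelf equivalent here, and the variable names are hypothetical.

# Hedged sketch: a baseline EasyEnsemble (internal RUS + AdaBoost) versus an
# E_EE-style variant built from oversampled bags with XGBoost base learners.
from imblearn.ensemble import BalancedBaggingClassifier, EasyEnsembleClassifier
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import f1_score, balanced_accuracy_score, roc_auc_score
from scipy.stats import wilcoxon
from xgboost import XGBClassifier

baseline = EasyEnsembleClassifier(n_estimators=10, random_state=42)
enhanced = BalancedBaggingClassifier(
    estimator=XGBClassifier(eval_metric="logloss"),
    sampler=RandomOverSampler(),  # ROS in place of the default random undersampling
    n_estimators=10,
    random_state=42,
)

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model and return the three measures reported in the study."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    return (f1_score(y_test, pred),
            balanced_accuracy_score(y_test, pred),
            roc_auc_score(y_test, proba))

# With per-dataset F-measures collected across datasets, a paired Wilcoxon
# signed-rank test checks whether the enhanced model's gains are significant:
# stat, p_value = wilcoxon(enhanced_f1_scores, baseline_f1_scores)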