ABSTRACT
Software maintenance has attracted considerable research attention because it is where defects discovered during testing are fixed. This work relies on bug reports (BRs), which record detailed information such as the description, status, reporter, assignee, priority, and severity of each bug. The main problem is how to analyze these BRs to discover all defects in the system, a tedious and time-consuming task when done manually because the number of BRs grows rapidly. An automated solution is therefore preferable. Most current research automates individual aspects of this process, such as detecting the severity or priority of a bug, but does not consider the nature of the bug, which is a multi-class classification problem. This paper addresses that gap by proposing a new prediction model that analyzes BRs and predicts the nature of the bug. The proposed model builds an ensemble machine learning algorithm using natural language processing (NLP) and machine learning techniques. We evaluate the model on a publicly available dataset from two online software bug repositories (Mozilla and Eclipse), which includes six classes: Program Anomaly, GUI, Network or Security, Configuration, Performance, and Test-Code. The results show that the proposed model achieves better accuracy than most existing models: 90.42% without text augmentation and 96.72% with text augmentation.
Keywords: BRs, Machine Learning, NLP, Multi-Class Classification, Text Augmentation.
1. INTRODUCTION:
Bug reports are crucial for identifying, tracking, and resolving issues in software development, and they directly affect software quality and user satisfaction. Effective prediction and classification of bug reports streamlines debugging by prioritizing and categorizing them accurately. Traditionally, Support Vector Machines (SVM) have been widely employed for bug report prediction because they handle high-dimensional datasets and binary classification tasks effectively. However, SVM often faces challenges such as poor scalability, sensitivity to noise, and limited performance on highly imbalanced datasets. To address these limitations, we propose a Nature-Based Prediction Model of Bug Reports that uses XGBoost (Extreme Gradient Boosting), an advanced ensemble machine learning technique. XGBoost combines multiple decision trees and offers better handling of large-scale datasets, imbalanced classes, and missing data. Because its hyperparameters can be fine-tuned, it can improve prediction accuracy and computational efficiency compared to traditional methods. The proposed model integrates domain-specific features and textual data from bug reports to improve classification outcomes. Through the boosting mechanism, our approach concentrates on hard-to-classify cases, with the aim of achieving high recall and precision. This nature-based methodology advances the predictive capabilities of bug report classification and promotes automation in software quality assurance.
2. LITERATURE SURVEY:
A. F. Otoom et al. [11] target the problem of software bug report classification. Their main aim is to build a classifier that assigns newly incoming bug reports to one of two predefined classes: corrective (defect fixing) reports and perfective (major maintenance) reports. This helps maintainers quickly understand these bug reports and allocate resources for each category. For this purpose, they propose a distinctive feature set based on the occurrences of certain keywords. The feature set is then fed into several classification algorithms to build a classification model. It achieved high classification accuracy, with the SVM classification algorithm reporting an average accuracy of 93.1% on three different open-source projects.
M. Alenezi et al. observe that most bug assignment approaches use text classification and information retrieval techniques, building recommendation models from the textual contents of bug reports. These contents are usually a high-dimensional and noisy source of information, so such approaches suffer from low accuracy and high computational needs. The authors investigate whether categorical fields of bug reports, such as the component to which the bug belongs, are appropriate for representing bug reports instead of the textual description. They build a classification model that uses the categorical features as the representation of the bug report. The experimental evaluation is conducted on three projects, namely NetBeans, Freedesktop, and Firefox, and the approach is compared with two machine-learning-based bug assignment approaches. The evaluation shows that using the textual contents of bug reports is important. In addition, it shows that the categorical features can improve the classification accuracy.
3. SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
The current system for predicting and classifying bug reports often relies on the Support Vector Machine (SVM) algorithm, a popular machine learning technique known for its effectiveness in binary classification tasks. SVM finds an optimal hyperplane that separates data points into distinct classes, which makes it suitable for the high-dimensional datasets commonly found in software bug prediction. In bug report prediction, SVM processes structured features such as metadata (e.g., severity, priority) and unstructured features such as textual descriptions. It uses kernel functions to map data that is not linearly separable into higher-dimensional spaces for better classification accuracy. Despite its strengths, the SVM-based system has several limitations. It struggles with large-scale and imbalanced datasets, which are typical in bug reporting scenarios, leading to biased predictions for minority classes. Moreover, SVM's performance depends heavily on the choice of kernel and hyperparameters, so it requires extensive tuning for optimal results. It also scales poorly: its computational cost increases significantly with the size of the dataset, making it less suitable for modern software systems that generate vast numbers of bug reports. The existing system also has difficulty handling missing or noisy data, which can reduce prediction accuracy. While SVM has been a foundational approach for bug report classification, these limitations in handling the dynamic and complex nature of real-world bug data motivate the exploration of more advanced and adaptable machine learning models.
LIMITATIONS OF EXISTING SYSTEM
* SVM struggles with large-scale and imbalanced datasets, which leads to biased predictions for minority classes.
* Its performance depends heavily on the choice of kernel and hyperparameters, and its computational cost grows significantly with the size of the dataset.
3.2 PROPOSED SYSTEM
A nature-based prediction model for bug reports using an ensemble machine learning approach, specifically leveraging the XGBoost algorithm, offers a promising avenue for improving software quality. This proposed system aims to predict the likelihood of a bug being reopened or its severity based on features extracted from bug reports, such as textual descriptions, assigned components, reported time, and submitter experience. The model employs a nature-inspired optimization technique, like Genetic Algorithm or Particle Swarm Optimization, to fine-tune the hyperparameters of the XGBoost model, ensuring optimal performance. This optimization process mimics natural selection or swarm behaviour to find the best combination of parameters. The ensemble approach combines predictions from multiple XGBoost models, each trained on different subsets of the data or with varying parameters, further enhancing the model's robustness and accuracy.
4. SYSTEM ARCHITECTURE
5. METHODOLOGY
The proposed methodology employs an ensemble machine learning approach using the XGBoost algorithm to enhance bug report classification accuracy. Initially, a publicly available dataset from Mozilla and Eclipse bug repositories is preprocessed, including text cleaning, stopword removal, and stemming. Textual features are extracted using Natural Language Processing (NLP) techniques, such as Term Frequency-Inverse Document Frequency (TF-IDF) and word embeddings. These extracted features, combined with structured metadata like severity and priority, form the input dataset for classification.
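The preprocessing and TF-IDF steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stopword list, the `crude_stem` suffix stripper, the function names, and the three sample reports are all hypothetical, and the weighting shown is the basic tf * log(N / df) form, which may differ from the variant used in the experiments.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "on", "in", "when", "to", "of", "and"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical bug report summaries, not taken from the dataset.
reports = [
    "Button is misaligned on the settings dialog",
    "Crashes when loading the settings file",
    "Page loading is slow on startup",
]
vectors = tfidf(reports)
```

Terms that occur in fewer reports (such as "button") receive a higher weight than terms shared across reports (such as "setting"), which is the property that makes TF-IDF useful for separating bug categories.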
To improve model performance, a nature-inspired optimization technique, such as Genetic Algorithm (GA) or Particle Swarm Optimization (PSO), is employed to fine-tune hyperparameters. This ensures optimal learning rates, tree depth, and regularization parameters for XGBoost. The ensemble learning technique integrates multiple XGBoost models trained on different feature subsets, enhancing generalization.
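To make the optimization step concrete, the following is a minimal sketch of a global-best Particle Swarm Optimization loop searching over two hyperparameters. It is not the configuration used in the experiments: the `validation_error` function is a hypothetical stand-in for a cross-validated error (a real run would train and score a model at each point), and the swarm size, iteration count, coefficients, and bounds are assumed values.

```python
import random


def validation_error(learning_rate, max_depth):
    """Hypothetical stand-in for a cross-validated error; lowest at (0.1, 6)."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6.0) ** 2


def pso(objective, bounds, n_particles=20, n_iters=60, seed=0):
    """Minimise `objective` inside the box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    inertia, c_personal, c_global = 0.7, 1.5, 1.5  # assumed coefficients
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(n_particles)]
    velocities = [[0.0] * len(bounds) for _ in range(n_particles)]
    best_pos = [p[:] for p in positions]
    best_val = [objective(*p) for p in positions]
    g = min(range(n_particles), key=best_val.__getitem__)
    global_pos, global_val = best_pos[g][:], best_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Pull towards the particle's own best and the swarm's best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (best_pos[i][d] - positions[i][d])
                    + c_global * r2 * (global_pos[d] - positions[i][d])
                )
                # Move, then clamp to the search box.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            value = objective(*positions[i])
            if value < best_val[i]:
                best_pos[i], best_val[i] = positions[i][:], value
                if value < global_val:
                    global_pos, global_val = positions[i][:], value
    return global_pos, global_val


# Search learning rate in [0.01, 0.5] and tree depth in [2, 12].
best_params, best_error = pso(validation_error, bounds=[(0.01, 0.5), (2.0, 12.0)])
```

Tree depth is an integer in practice, so the continuous value returned here would need to be rounded before being passed to the model.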
The proposed system achieves high classification accuracy by leveraging ensemble learning and hyperparameter optimization, ensuring reliable bug classification. The experimental results demonstrate superior performance compared to traditional models.
6. MODULES
A Nature-Based Prediction Model for bug reports using Ensemble Machine Learning (ML) combines multiple learning algorithms to enhance the accuracy and reliability of predicting software defects. The model is typically divided into several key implementation modules:
Data Collection & Preprocessing: This module gathers historical bug report data, including descriptions, severity, modules affected, and resolution status. Data cleaning, normalization, and feature extraction are performed to make the data suitable for machine learning.
Feature Engineering & Selection: Essential features such as bug attributes, code changes, and developer activities are identified. This step may involve transforming raw data into meaningful features and selecting the most relevant ones to improve model performance.
Model Development: Different base models are trained using processed data. Each model learns to predict bug reports from different perspectives, capturing varying patterns and relationships.
Ensemble Learning: Various models are combined using techniques like Bagging, Boosting, or Stacking to create an ensemble model. The ensemble approach aggregates the predictions of multiple models to improve generalization and reduce overfitting.
Model Evaluation & Tuning: The ensemble model is evaluated using performance metrics such as accuracy, precision, recall, and F1-score. Hyperparameter tuning and cross-validation are used to optimize the model's performance.
Prediction & Deployment: Once trained and fine-tuned, the model predicts future bug reports based on new data. The final model can be deployed in a production environment to assist in proactive bug management and software quality improvement.
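The aggregation step in the Ensemble Learning module can be illustrated with hard majority voting, one simple way of combining base-model outputs. The text does not state which combination rule is used, so this is a sketch under that assumption; the function name and the sample predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions from several models by hard voting.

    `predictions_per_model` holds one list of labels per base model, all of
    equal length. On a tie, the label predicted by the earliest model in the
    list wins, because Counter keeps first-seen order among equal counts.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions from four base models for three bug reports.
model_a = ["GUI", "Performance", "Configuration"]
model_b = ["GUI", "Program Anomaly", "Configuration"]
model_c = ["Program Anomaly", "Performance", "Test-Code"]
model_d = ["GUI", "Performance", "Configuration"]

ensemble = majority_vote([model_a, model_b, model_c, model_d])
```

Hard voting needs only predicted labels, which makes it easy to combine heterogeneous models. Weighted or probability-based (soft) voting is a common alternative when the base models expose class probabilities.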
7. RESULTS:
The results of the Nature-Based Prediction Model for bug reports, using Ensemble Machine Learning techniques with Support Vector Machines (SVM) and XGBoost, show promising performance in predicting software defects. Both SVM and XGBoost were used as base learners in an ensemble framework, each bringing unique strengths to the prediction task. SVM is known for its robust performance in high-dimensional spaces and works well with complex, non-linear decision boundaries. On the other hand, XGBoost, a gradient boosting algorithm, excels in handling large datasets with noisy and missing values, offering superior accuracy and faster convergence.
When evaluated on the test dataset, the ensemble model that combined SVM and XGBoost outperformed individual models in terms of overall accuracy and robustness.
The comparison graph shows the performance of SVM, Random Forest, and Regression in terms of accuracy, F1 score, precision, and recall. Clicking 'Predict Bug Type from Test Data' then runs the prediction on the test data.
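For reference, the four metrics named above can be computed for a multi-class problem as in the sketch below. It assumes macro averaging (the unweighted mean over classes), which the text does not specify, and the function name and sample labels are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return accuracy and macro-averaged (precision, recall, F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append((precision, recall, f1))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Macro average: every class counts equally, regardless of its size.
    macro = tuple(sum(m[i] for m in per_class) / len(labels) for i in range(3))
    return accuracy, macro


# Hypothetical true and predicted labels for four bug reports.
y_true = ["GUI", "GUI", "Performance", "Configuration"]
y_pred = ["GUI", "Performance", "Performance", "Configuration"]
accuracy, (precision, recall, f1) = evaluate(y_true, y_pred)
```

Macro averaging gives minority classes the same weight as majority classes, which matters when the bug categories are imbalanced.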
The model is tested using publicly available datasets from Mozilla and Eclipse, categorizing bugs into six classes: Program Anomaly, GUI, Network or Security, Configuration, Performance, and Test-Code. The output also shows bug labels such as General, Releng, Cdt-Core, and Client. The proposed model achieves a significant accuracy improvement, with 90.42% accuracy without text augmentation and 96.72% accuracy with text augmentation, surpassing existing models. Specifically, the ensemble model exhibited a significant reduction in overfitting compared to the single-model approaches, providing more reliable predictions for bug reports.
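The text does not say which text augmentation technique produced the reported gain. As an illustration only, the sketch below uses random word swapping, one simple augmentation method, to create extra training examples that keep their original label. The function names and the sample data are hypothetical.

```python
import random


def random_swap(text, n_swaps=1, seed=None):
    """Return a copy of `text` with `n_swaps` random pairs of words exchanged."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def augment(dataset, copies=1, seed=0):
    """Append `copies` perturbed variants of every (text, label) pair."""
    augmented = list(dataset)
    for index, (text, label) in enumerate(dataset):
        for k in range(copies):
            variant = random_swap(text, seed=seed + index * copies + k)
            augmented.append((variant, label))
    return augmented


# Hypothetical labelled bug report summaries.
data = [
    ("toolbar icons overlap after window resize", "GUI"),
    ("startup takes two minutes on large workspace", "Performance"),
]
bigger = augment(data, copies=2)
```

Augmentation of this kind should be applied to the training split only, so that perturbed copies of a report do not leak into the test data.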
8. CONCLUSION
This paper proposed a nature-based bug prediction component using an ensemble machine learning algorithm that consists of four base machine learning algorithms: Random Forest, Support Vector Classification, Logistic Regression, and Multinomial Naïve Bayes. The accuracy of the model is 90.42%. With a text augmentation technique applied, the highest accuracy achieved by the proposed model increased to 96.72%. The proposed model predicts the nature of the bug from six bug categories: Program Anomaly, GUI, Network or Security, Configuration, Performance, and Test-Code.
FUTURE WORK
Future work will enhance this model by increasing the number of bug categories and recommending possible solutions for predicted bugs to reduce the maintenance time.
REFERENCES
[1] M. A. Jamil, M. Arif, N. S. A. Abubakar, and A. Ahmad, "Software testing techniques: A literature review," in 2016 6th international conference on information and communication technology for the Muslim world (ICT4M), pp. 177-182, IEEE, 2016.
[2] W. Y. Ramay, Q. Umer, X. C. Yin, C. Zhu, and I. Illahi, "Deep neural network-based severity prediction of bug reports," IEEE Access, vol. 7, pp. 46846-46857, 2019.
[3] W. Wen, "Using natural language processing and machine learning techniques to characterize configuration bug reports: A study," 2017.
[4] J. Polpinij, "A method of non-bug report identification from bug report repository," Artificial Life and Robotics, vol. 26, no. 3, pp. 318-328, 2021.
[5] S. Adhikarla, "Automated bug classification: Bug report routing," 2020.
[6] K. C. Youm, J. Ahn, and E. Lee, "Improved bug localization based on code change histories and bug reports," Information and Software Technology, vol. 82, pp. 177-192, 2017.
[7] N. Safdari, H. Alrubaye, W. Aljedaani, B. B. Baez, A. DiStasi, and M. W. Mkaouer, "Learning to rank faulty source files for dependent bug reports," in Big data: learning, analytics, and applications, vol. 10989, p. 109890B, International Society for Optics and Photonics, 2019.
[8] A. Kukkar, R. Mohana, A. Nayyar, J. Kim, B.-G. Kang, and N. Chilamkurti, "A novel deep-learning-based bug severity classification technique using convolutional neural networks and random forest with boosting," Sensors, vol. 19, no. 13, p. 2964, 2019.
[9] A. Aggarwal, "Types of bugs in software testing: 3 classifications with examples." https://www.scnsoft.com/software-testing/types-of-bugs, May 2020.
[10] A. Kukkar and R. Mohana, "A supervised bug report classification with incorporate and textual field knowledge," Procedia computer science, vol. 132, pp. 352-361, 2018.
[11] A. F. Otoom, S. Al-jdaeh, and M. Hammad, "Automated classification of software bug reports," in Proceedings of the 9th International Conference on Information Communication and Management, pp. 17-21, 2019.
[12] P. J. Morrison, R. Pandita, X. Xiao, R. Chillarege, and L. Williams, "Are vulnerabilities discovered and resolved like other defects," Empirical Software Engineering, vol. 23, no. 3, pp. 1383-1421, 2018.
[13] F. Lopes, J. Agnelo, C. A. Teixeira, N. Laranjeiro, and J. Bernardino, "Automating orthogonal defect classification using machine learning algorithms," Future Generation Computer Systems, vol. 102, pp. 932-947, 2020.
[14] T. Hirsch and B. Hofer, "Root cause prediction based on bug reports," in 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pp. 171-176, IEEE, 2020.
[15] Q. Umer, H. Liu, and I. Illahi, "Cnn-based automatic prioritization of bug reports," IEEE Transactions on Reliability, vol. 69, no. 4, pp. 1341-1354, 2019.
[16] H. Bani-Salameh, M. Sallam, et al., "A deep-learning-based bug priority prediction using rnn-lstm neural networks," e-Informatica Software Engineering Journal, vol. 15, no. 1, 2021.
[17] Köksal and B. Tekinerdogan, "Automated classification of unstructured bilingual software bug reports: An industrial case study research," Applied Sciences, vol. 12, no. 1, p. 338, 2022.
Copyright Kohat University of Science and Technology (KUST) 2025