Abstract
In many industrial applications, classification tasks are associated with imbalanced class distributions in the training data. Imbalanced datasets can severely degrade the accuracy of class predictions and therefore need to be handled by appropriate preprocessing before the data are analyzed. The skewness between class labels can be managed either by oversampling the minority class or by downsampling the majority class. In this research, we seek a better way of downsampling: selecting the most informative samples from the given imbalanced dataset through an active learning strategy to balance the class labels. The data selection for downsampling is driven by a criterion used in optimal experimental design, which sequentially minimizes the generalization error of the trained model, with penalized logistic regression as the classification model. Downsampling implemented through active learning outperforms other downsampling techniques and performs comparably to a hybrid sampling method while offering much better data efficiency. In addition, a pre-selective strategy is adopted to reduce the computational complexity, a known drawback of sequential learning. This strategy substantially reduces the time required for optimal data selection compared with selection without pre-selection, while still outperforming other downsampling techniques.
Keywords
active learning, imbalanced data, downsampling, penalized logistic regression
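The selection scheme described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it fits an L2-penalized logistic regression by plain gradient descent and sequentially keeps the majority-class samples that maximize a D-optimality-style predictive-variance score, x'(X'WX + lambda*I)^-1 x, used here as a stand-in for the paper's optimal-design criterion. All function names and parameters below are hypothetical.

```python
import numpy as np

def fit_penalized_logreg(X, y, lam=1.0, iters=200, lr=0.1):
    """Fit an L2-penalized logistic regression by gradient descent (sketch)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))           # predicted probabilities
        grad = X.T @ (p - y) / len(y) + lam * w    # gradient + ridge penalty
        w -= lr * grad
    return w

def active_downsample(X_maj, X_min, n_keep=None, lam=1.0):
    """Sequentially select majority-class rows to keep.

    At each step, refit the penalized model on the minority class plus the
    majority samples chosen so far, then pick the remaining majority sample
    with the largest variance score x' (X'WX + lam*I)^-1 x — a D-optimality-
    style criterion standing in for the paper's optimal-design criterion.
    """
    n_keep = n_keep if n_keep is not None else len(X_min)
    chosen, remaining = [], list(range(len(X_maj)))
    for _ in range(n_keep):
        # Current balanced-in-progress training set: minority = 1, majority = 0
        Xs = np.vstack([X_min, X_maj[chosen]]) if chosen else X_min
        ys = np.concatenate([np.ones(len(X_min)), np.zeros(len(chosen))])
        w = fit_penalized_logreg(Xs, ys, lam)
        p = 1.0 / (1.0 + np.exp(-Xs @ w))
        W = p * (1.0 - p)                          # logistic weights
        info_inv = np.linalg.inv(Xs.T @ (Xs * W[:, None])
                                 + lam * np.eye(Xs.shape[1]))
        scores = [X_maj[i] @ info_inv @ X_maj[i] for i in remaining]
        chosen.append(remaining.pop(int(np.argmax(scores))))
    return chosen
```

A usage example: with 30 majority and 5 minority samples, `active_downsample(X_maj, X_min)` returns the indices of the 5 most informative majority rows, yielding a balanced training set without discarding samples at random.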
1. Introduction
It is easy to find imbalanced datasets in many real-world classification problems, such as prognostics of manufacturing equipment failures, spam email detection, fraudulent credit card transaction identification, or medical disease diagnosis. Since most standard machine learning algorithms assume a balanced class distribution or equal misclassification costs [1], imbalanced datasets are likely to affect classification performance negatively. It is therefore critical to address this problem by balancing the skewed class distribution, besides significant modifications of the learning algorithms, for efficient learning and unbiased decision-making. One traditional approach to balancing binary class labels is to take a random sample from the instances with the majority class label, namely random downsampling. However, this naive approach can jeopardize the classification model by learning from noninformative instances [2]. In contrast, a more careful selection of instances that show representativeness of the data...