Appl Intell (2012) 36:664–684 DOI 10.1007/s10489-011-0287-y
DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique
Chumphol Bunkhumpornpat · Krung Sinapiromsaran · Chidchanok Lursinsap
Published online: 14 April 2011 © Springer Science+Business Media, LLC 2011
Abstract A dataset exhibits the class imbalance problem when a target class has a very small number of instances relative to other classes. A trivial classifier typically fails to detect a minority class due to its extremely low incidence rate. In this paper, a new over-sampling technique called DBSMOTE is proposed. Our technique relies on a density-based notion of clusters and is designed to over-sample an arbitrarily shaped cluster discovered by DBSCAN. DBSMOTE generates synthetic instances along a shortest path from each positive instance to a pseudo-centroid of a minority-class cluster. Consequently, these synthetic instances are dense near this centroid and sparse far from it. Our experimental results show that DBSMOTE improves precision, F-value, and AUC more effectively than SMOTE, Borderline-SMOTE, and Safe-Level-SMOTE for imbalanced datasets.
Keywords Classification · Class imbalance · Over-sampling · Density-based
1 Introduction
Classification [21] is a data mining process that generates a model called a classifier, which describes and distinguishes classes of instances. A derived classifier requires the analysis of a training set, defined as a group of identified instances.
C. Bunkhumpornpat (✉) · K. Sinapiromsaran · C. Lursinsap
Department of Mathematics, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
e-mail: [email protected]
K. Sinapiromsaran
e-mail: [email protected]
C. Lursinsap
e-mail: [email protected]
A dataset is considered to be imbalanced if a target class has a very small number of instances compared to other classes. The class imbalance problem [9, 18, 19] has attracted the attention of researchers from various fields. Many applications only consider the two-class case [23, 24, 31]. In this context, the smaller class is called the minority class (the positive class), while the larger class is called the majority class (the negative class). If a dataset has more than two classes, a target class is chosen to be the minority class, and the remaining classes are merged into a single majority class.
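The reduction of a multi-class dataset to the two-class case can be illustrated with a minimal sketch. The function names `binarize_labels` and `imbalance_ratio`, the class names, and the class counts below are hypothetical and chosen only for illustration; they do not come from the datasets used in this paper.

```python
from collections import Counter


def binarize_labels(labels, target):
    """Map the target class to 1 (minority/positive) and every other class to 0 (majority/negative)."""
    return [1 if y == target else 0 for y in labels]


def imbalance_ratio(binary_labels):
    """Return the number of negative instances per positive instance."""
    counts = Counter(binary_labels)
    if counts[1] == 0:
        raise ValueError("no positive instances in the dataset")
    return counts[0] / counts[1]


# Hypothetical three-class labels; "attack" is chosen as the target class.
labels = ["normal"] * 90 + ["probe"] * 6 + ["attack"] * 4

# The two non-target classes are merged into a single majority class.
binary = binarize_labels(labels, target="attack")
ratio = imbalance_ratio(binary)

print(sum(binary), len(binary) - sum(binary), ratio)  # 4 96 24.0
```

In this illustrative case, 4 positive instances face 96 negative instances, so a classifier that always predicts the majority class would reach 96% accuracy while detecting no minority instance.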
Analysts encounter the class imbalance problem in many real-world applications, such as a medical decision support system for colon polyp screening [10], retail bank customer attrition analysis [17], network intrusion detection of rare attack categories [22], automotive engineering diagnosis...