Abstract

The data distribution is often associated with an a priori known probability, and events of interest occur with small probability, so large amounts of imbalanced data arise in sociology, economics, engineering, and many other fields. Over-sampling and under-sampling methods are widely used in imbalanced data classification, but over-sampling can lead to overfitting, and under-sampling can discard useful information. We propose a new sampling design algorithm, the neighbor grid of boundary mixed-sampling (NGBM), which focuses on boundary information. The classification boundary information is obtained through grid boundary domain identification, which in turn determines the importance of the samples. On this basis, the synthetic minority oversampling technique (SMOTE) is applied to the boundary grids, and random under-sampling is applied to the other grids. This mixed sampling strategy extracts more of the important classification boundary information, especially the information needed to identify positive samples. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of NGBM and to illustrate its advantages on imbalanced data and in practical applications.
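
The mixed sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' NGBM implementation: the function name `ngbm_like_resample`, its parameters (`bins`, `n_synthetic`, `keep_frac`), the equal-width grid, and the rule that a boundary cell is any cell containing both classes are all assumptions made for illustration. The over-sampling step is also a simplification of the synthetic minority oversampling technique: it interpolates between random pairs of minority points in the same cell rather than between nearest neighbors.

```python
import numpy as np

def ngbm_like_resample(X, y, bins=4, n_synthetic=2, keep_frac=0.5, seed=0):
    """Toy mixed sampling (hypothetical helper, not the paper's algorithm):
    over-sample the minority class (label 1) in boundary cells and
    randomly under-sample the majority class (label 0) in all other cells."""
    rng = np.random.default_rng(seed)

    # Assign every sample to a grid cell using equal-width bins per feature.
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum(((X - lo) / span * bins).astype(int), bins - 1)
    cell_ids = np.ravel_multi_index(tuple(cells.T), (bins,) * X.shape[1])

    X_out, y_out = [], []
    for cid in np.unique(cell_ids):
        idx = np.flatnonzero(cell_ids == cid)
        pos, neg = idx[y[idx] == 1], idx[y[idx] == 0]
        if len(pos) > 0 and len(neg) > 0:
            # Assumed boundary rule: the cell holds both classes.
            # Keep every original sample in the cell.
            X_out.append(X[idx])
            y_out.append(y[idx])
            if len(pos) >= 2:
                # Simplified SMOTE-style step: interpolate between two
                # distinct minority points drawn from the same cell.
                for _ in range(n_synthetic * len(pos)):
                    a, b = rng.choice(pos, size=2, replace=False)
                    lam = rng.random()
                    X_out.append((X[a] + lam * (X[b] - X[a]))[None, :])
                    y_out.append(np.array([1]))
        else:
            # Non-boundary cell: keep all minority points,
            # keep only a random fraction of the majority points.
            X_out.append(X[pos])
            y_out.append(y[pos])
            if len(neg) > 0:
                n_keep = max(1, int(round(keep_frac * len(neg))))
                kept = rng.choice(neg, size=n_keep, replace=False)
                X_out.append(X[kept])
                y_out.append(y[kept])
    return np.vstack(X_out), np.concatenate(y_out)

# Example on synthetic data: 200 majority points and 20 minority points.
demo_rng = np.random.default_rng(1)
X_demo = np.vstack([demo_rng.normal(0.0, 1.0, (200, 2)),
                    demo_rng.normal(1.5, 0.5, (20, 2))])
y_demo = np.array([0] * 200 + [1] * 20)
X_res, y_res = ngbm_like_resample(X_demo, y_demo)
print("before:", np.bincount(y_demo), "after:", np.bincount(y_res))
```

In this sketch no minority sample is ever dropped and no majority sample is ever duplicated, so the class ratio can only move toward balance. How strongly it moves depends on the assumed `bins`, `n_synthetic`, and `keep_frac` values, which stand in for the parameter-setting strategy the paper discusses.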

Details

Title
Imbalanced data sampling design based on grid boundary domain for big data
Publication title
Volume
40
Issue
1
Pages
27-64
Publication year
2025
Publication date
Jan 2025
Publisher
Springer Nature B.V.
Place of publication
Heidelberg
Country of publication
Netherlands
Publication subject
ISSN
0943-4062
e-ISSN
1613-9658
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2024-03-08
Milestone dates
2024-01-30 (Registration); 2023-04-10 (Received); 2024-01-29 (Accepted)
First posting date
08 Mar 2024
ProQuest document ID
3165215246
Document URL
https://www.proquest.com/scholarly-journals/imbalanced-data-sampling-design-based-on-grid/docview/3165215246/se-2?accountid=208611
Copyright
Copyright Springer Nature B.V. Jan 2025
Last updated
2025-02-18
Database
ProQuest One Academic