Synthetic Tabular Data Augmentation for Imbalanced Regression

Abstract

Imbalanced target distributions remain a critical challenge in regression, particularly for tabular data, where sparsely populated regions of the target space can degrade predictive accuracy. Data-level approaches, such as random sampling and SMOTE-inspired techniques, have attempted to tackle this by extending classification strategies to regression tasks. However, these methods often rely on strict, manually set thresholds for the target variable, inherited from imbalanced classification, which can lead to arbitrary and unintuitive problem definitions. More recent solutions using generative models such as Generative Adversarial Networks and Variational Autoencoders allow for more adaptive sample generation but often suffer from high computational demands and reduced interpretability. In this work, we present a modified version of a CART-based synthetic data generation method, specifically designed for imbalanced regression problems. Our approach incorporates both relevance and density information to focus sampling on underrepresented areas of the target space. It replaces fixed thresholds with a feature-guided, threshold-free generation process, which makes it suitable for heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. We validate our method through experiments targeting the prediction of rare or extreme values across benchmark datasets. The findings show that our method performs competitively with other resampling and generative approaches, while offering notable advantages in speed and interpretability. Overall, the proposed technique emerges as a transparent and scalable data-level solution for enhancing regression performance in imbalanced settings.
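
To make the idea of density-guided, threshold-free sampling more concrete, the following is a minimal sketch and not the method developed in the thesis. It assumes a simple histogram estimate of the target density and plain resampling with replacement in place of the CART-based generator, and it omits the relevance component altogether. The names inverse_density_weights, resample_rare, and the bins parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def inverse_density_weights(y, bins=20):
    """Sampling probabilities that favour sparsely populated target regions.

    Assumption: density is approximated by a histogram over the target;
    the thesis may use a different estimator.
    """
    y = np.asarray(y, dtype=float)
    edges = np.histogram_bin_edges(y, bins=bins)
    # Interior edges map every value to a bin index in 0..bins-1.
    idx = np.digitize(y, edges[1:-1])
    counts = np.bincount(idx, minlength=bins)
    density = counts[idx] / len(y)  # always > 0 for observed values
    weights = 1.0 / density
    return weights / weights.sum()


def resample_rare(X, y, n_samples, bins=20, seed=0):
    """Draw rows with replacement, weighted towards rare target values.

    Assumption: this stands in for synthetic generation; it only
    duplicates existing rows and creates no new ones.
    """
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    p = inverse_density_weights(y, bins=bins)
    chosen = rng.choice(len(y), size=n_samples, replace=True, p=p)
    return X[chosen], y[chosen]


# Toy data: 95 common targets at 0 and 5 rare targets at 10.
y_demo = np.concatenate([np.zeros(95), np.full(5, 10.0)])
X_demo = np.arange(100).reshape(-1, 1)
p_demo = inverse_density_weights(y_demo)
X_new, y_new = resample_rare(X_demo, y_demo, n_samples=200)
```

No cut-off separating "rare" from "normal" targets is set anywhere in this sketch: each occupied bin receives the same total sampling mass, so sparse regions are favoured automatically. The bin count is still a tuning choice, which is one reason a smoother density estimate may be preferable in practice.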

Details

Business indexing term

Subject:

Machine learning;
Artificial intelligence

Subject

Forgery;
Machine learning;
Statistics;
Deep learning;
Artificial intelligence;
Cybersecurity;
Adaptation;
Taxonomy;
Computer science;
Criminology

Classification

0463: Statistics
0800: Artificial intelligence
0984: Computer science
0627: Criminology

Title

Synthetic Tabular Data Augmentation for Imbalanced Regression

Author

da Silva Pinheiro, António Pedro Correia

Number of pages

113

Publication year

2025

Degree date

2025

School code

5896

Source

MAI 87/5(E), Masters Abstracts International

ISBN

9798265423139

Advisor

Ribeiro, Rita P.

University/institution

Universidade do Porto (Portugal)

University location

Portugal

Degree

M.C.S.

Source type

Dissertation or Thesis

Language

English

Document type

Dissertation/Thesis

Dissertation/thesis number

32306567

ProQuest document ID

3275479854

Document URL

https://www.proquest.com/dissertations-theses/synthetic-tabular-data-augmentation-imbalanced/docview/3275479854/se-2?accountid=208611

Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.

Database

ProQuest One Academic
