
Abstract

Imbalanced target distributions remain a critical challenge in regression, particularly for tabular data, where sparsely populated regions of the target space can degrade predictive accuracy. Data-level approaches, such as random sampling and SMOTE-inspired techniques, tackle the problem by extending classification strategies to regression tasks. However, these methods often rely on strict, manually set thresholds on the target variable, inherited from imbalanced classification, which can lead to arbitrary and unintuitive problem definitions. More recent solutions based on generative models, such as Generative Adversarial Networks and Variational Autoencoders, allow more adaptive sample generation but often suffer from high computational demands and reduced interpretability. In this work, we present a modified version of a CART-based synthetic data generation method designed specifically for imbalanced regression problems. Our approach incorporates both relevance and density information to focus sampling on underrepresented areas of the target space. It replaces fixed thresholds with a feature-guided, threshold-free generation process, which makes it suitable for heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. We validate the method through experiments targeting the prediction of rare or extreme values across benchmark datasets. The findings show that it performs competitively with other resampling and generative approaches while offering notable advantages in speed and interpretability. Overall, the proposed technique emerges as a transparent and scalable data-level solution for improving regression performance in imbalanced settings.
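
As a rough illustration of what focusing sampling on underrepresented areas of the target space can mean without a fixed threshold, the sketch below weights each training row by the inverse of its target-bin frequency and then draws seed rows with those weights. This is not the thesis's CART-based generator, and it uses density only, not a relevance function. The function names `inverse_density_weights` and `sample_seed_indices`, the histogram binning, and the `n_bins` parameter are all assumptions made for this example.

```python
import random
from collections import Counter


def inverse_density_weights(targets, n_bins=10):
    """Return normalized sampling weights, one per target value.

    Assumption for illustration: density is approximated by an
    equal-width histogram, and a value's weight is 1 / (bin count),
    so values in sparsely populated bins are sampled more often.
    """
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant target
    # Bin index per value; the maximum is clamped into the last bin.
    bins = [min(int((y - lo) / width), n_bins - 1) for y in targets]
    counts = Counter(bins)
    raw = [1.0 / counts[b] for b in bins]
    total = sum(raw)
    return [w / total for w in raw]


def sample_seed_indices(targets, k, n_bins=10, seed=0):
    """Draw k row indices (with replacement) using the weights above."""
    rng = random.Random(seed)
    weights = inverse_density_weights(targets, n_bins)
    return rng.choices(range(len(targets)), weights=weights, k=k)


# Hypothetical toy target: eight common values and two rare extremes.
targets = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 9.0, 10.0]
weights = inverse_density_weights(targets, n_bins=3)
seeds = sample_seed_indices(targets, k=5, n_bins=3)
```

In this toy case each of the two rare values receives a larger weight than any single common value, so a downstream generator seeded from `seeds` would see the extremes more often than their raw frequency suggests. No cutoff separating "rare" from "normal" targets is set by hand, although the bin count remains a tuning choice.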

Details

1010268
Business indexing term
Title
Synthetic Tabular Data Augmentation for Imbalanced Regression
Number of pages
113
Publication year
2025
Degree date
2025
School code
5896
Source
MAI 87/5(E), Masters Abstracts International
ISBN
9798265423139
University/institution
Universidade do Porto (Portugal)
University location
Portugal
Degree
M.C.S.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32306567
ProQuest document ID
3275479854
Document URL
https://www.proquest.com/dissertations-theses/synthetic-tabular-data-augmentation-imbalanced/docview/3275479854/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic