Imbalanced target distributions remain a critical challenge in regression, particularly for tabular data, where sparsely populated regions of the target space can degrade predictive accuracy. Data-level approaches, such as random sampling and SMOTE-inspired techniques, tackle this by extending classification strategies to regression tasks. However, these methods often rely on strict, manually set thresholds on the target variable. Such thresholds are inherited from imbalanced classification and can lead to arbitrary and unintuitive problem definitions. More recent solutions based on generative models, such as Generative Adversarial Networks and Variational Autoencoders, allow for more adaptive sample generation but often suffer from high computational demands and reduced interpretability. In this work, we present a modified version of a CART-based synthetic data generation method, specifically designed for imbalanced regression problems. Our approach incorporates both relevance and density information to focus sampling on underrepresented regions of the target space. It replaces fixed thresholds with a feature-guided, threshold-free generation process, which makes it suitable for heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. We validate our method through experiments targeting the prediction of rare or extreme values across benchmark datasets. The findings show that our method performs competitively with other resampling and generative approaches, while offering notable advantages in speed and interpretability. Overall, the proposed technique emerges as a transparent and scalable data-level solution for enhancing regression performance in imbalanced settings.
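To make the idea of density-guided, threshold-free sampling concrete, the following is a minimal sketch and not the method proposed in this work. It covers only the density component: each observation receives a sampling probability inversely proportional to a kernel density estimate of its target value, so sparse regions of the target space are drawn more often without any cut-off separating "rare" from "common" values. The relevance function and the CART-based generator are omitted. The function names `inverse_density_weights` and `draw_seed_rows`, the Gaussian kernel, the normal-reference bandwidth, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def inverse_density_weights(y, bandwidth=None):
    """Return sampling probabilities that grow as the target density shrinks."""
    y = np.asarray(y, dtype=float)
    n = y.size
    if bandwidth is None:
        # Normal-reference rule-of-thumb bandwidth (an illustrative assumption).
        bandwidth = 1.06 * y.std(ddof=1) * n ** (-1 / 5)
    # Gaussian kernel density estimate evaluated at every observed target value.
    diffs = (y[:, None] - y[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * diffs ** 2).sum(axis=1)
    density = kernel_sum / (n * bandwidth * np.sqrt(2 * np.pi))
    # Each point contributes to its own density, so density is strictly positive.
    weights = 1.0 / density
    return weights / weights.sum()


def draw_seed_rows(X, y, n_samples, rng):
    """Pick seed rows with probability inversely proportional to target density."""
    p = inverse_density_weights(y)
    idx = rng.choice(len(y), size=n_samples, replace=True, p=p)
    return X[idx], y[idx]


rng = np.random.default_rng(0)
# Toy skewed target: many values near 0, few values near 8.
y = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(8.0, 1.0, 50)])
X = rng.normal(size=(y.size, 3))
X_seed, y_seed = draw_seed_rows(X, y, n_samples=1000, rng=rng)

print("share of targets above 4 in the original data:", (y > 4).mean())
print("share of targets above 4 among the seed rows: ", (y_seed > 4).mean())
```

In a fuller pipeline, seed rows drawn this way would be passed to a generator (here, a CART-based one) that synthesizes new feature values instead of duplicating existing rows. The pairwise kernel computation is quadratic in the number of observations, so a binned or tree-based density estimate would be a more reasonable choice for large datasets.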