Data imbalance poses a severe challenge in hydrological machine learning (ML) applications by limiting model performance and interpretability, whereas solutions remain limited. This study evaluates the impact of advanced sampling methods, particularly feature space coverage sampling (FSCS), on model performance in predicting forest cover types and saturated hydraulic conductivity (Ks), the mechanism underlying its efficacy, and its impact on model interpretability. Using ML algorithms such as random forest (RF) and LightGBM (LGB) across various training set sizes, we demonstrated that FSCS significantly mitigates data imbalance, enhancing model accuracy, feature importance estimation, and interpretability. Two widely used hydrological data sets were analyzed: a large multiclass forest cover type data set from Roosevelt National Forest (110,393 samples) and a continuous-valued data set of soil properties from the USKSAT database (18,729 samples). In total, 1,720 models were constructed and optimized, combining different sampling methods, training set sizes, and algorithms. Balanced sampling, conditioned Latin hypercube sampling, and FSCS consistently outperformed simple random sampling. Despite using smaller training sets and simpler RF models, FSCS-trained models matched or surpassed the performance of those using larger data sets or more complex LGB models. SHAP analysis revealed that FSCS enhanced feature–target relationship clarity, emphasizing feature interactions and improving model interpretability. These findings highlight the potential of advanced sampling methods for not only addressing data imbalance but also providing more accurate prior information for model training, thereby enhancing reliability, accuracy, and interpretability in ML for hydrological applications.
Introduction
The analysis, estimation, prediction, and interaction of hydrological features have long attracted attention in hydrology, water resources, and related earth and environmental sciences. Key areas of interest include soil properties, forest cover types, climate change, and surface and groundwater dynamics (Allan et al., 2020; Best, 2019; Gleeson et al., 2020; Richardson et al., 2024; Vereecken et al., 2022; Zhang et al., 2017). Accurate hydrological modeling is essential for advancing hydrological research and informed decision-making, especially under increasingly variable and extreme hydrological events (Fatichi et al., 2016; Peredo et al., 2022; Singh & Woolhiser, 2002; van Kempen et al., 2021). Machine learning (ML) has emerged as a fundamental tool for the robust analysis of complex nonlinear relationships within hydrological systems and their interactions with other environmental systems (Feng et al., 2023; Hsu et al., 1995; Karniadakis et al., 2021; Kashinath et al., 2021; Schaap et al., 2001; Tartakovsky et al., 2020). ML-based approaches can greatly enhance the prediction of hydrological features and aid in elucidating their involvement in earth and environmental systems (Ibrahim et al., 2022; Mosavi et al., 2018; Prodhan et al., 2022; Raghavendra & Deka, 2014; Sit et al., 2020; Tao et al., 2022; Xu & Liang, 2021). ML has been applied in simulating and estimating groundwater level dynamics, water quality, and irrigation responses (El Bilali et al., 2021; Singha et al., 2021; Tao et al., 2022), as well as generating high-resolution data sets for soil moisture, soil types, evaporation, and runoff (Abowarda et al., 2021; Hasan et al., 2024; Hengl et al., 2017; Koppa et al., 2022; Newman et al., 2015). Furthermore, ML-based studies have uncovered global hydrological trends, such as declining evapotranspiration (Jung et al., 2010), and threats from chemical contaminants in global groundwater (Podgorski & Berg, 2020, 2022).
Despite these advancements, ML-based applications are limited by data imbalance (Bedi et al., 2020; Erickson et al., 2021; Wang et al., 2024; Xu & Liang, 2021; Zhu & Pierskalla, 2016)—the uneven representation of features and events in hydrological data sets, such as rare extreme floods or drought events—leading to biased predictions and reduced model generalizability (Wang et al., 2024). Additionally, heterogeneity in environmental variables, such as elevation, slope, soil structure, and temperature, complicates model training and evaluation (Lange & Sippel, 2020). These issues arise under uneven data distributions across training and testing subsets, resulting in models that perform well on training data but poorly on testing data or fail to generalize across different data sets (Ghanbarian & Pachepsky, 2022; Schaap & Leij, 1998), ultimately promoting the misestimation of rare phenomena such as the misclassification of special landform types (Zhu & Pierskalla, 2016), overestimation of ultra-low permeability (Ahmadisharaf et al., 2024), omission of high-level flood risks (Wang et al., 2024), and underestimation of groundwater contamination (Bedi et al., 2020).
Data imbalance manifests not as absolute data deficiency but as relative underrepresentation of key hydrological phenomena. Issues caused by data imbalance such as poor model generalization do not disappear with larger training data sets but may be masked by performance improvements as the model learns the dominant patterns. Furthermore, selecting representative samples based on widely available environmental covariates, such as potential evapotranspiration, precipitation, topography, and remote sensing data sets, is critical for building accurate ML models. These covariates can guide the selection of field sampling locations to reduce the costs and maximize the availability of direct observations on target variables, such as groundwater levels or soil structure. Identifying representative samples helps ensure a balanced data set and maximizes prior information. Similar challenges in soil science have underscored the importance of leveraging accessible covariates to optimize sampling strategies and predict properties in unsampled areas (Biswas & Zhang, 2018; Žížala et al., 2024).
Advanced sampling methods, such as feature space coverage sampling (FSCS), have shown superior performance over simple random sampling (SRS) in building ML models for soil property prediction under small sample sizes (Ma et al., 2020; Wadoux et al., 2019; Žížala et al., 2024), demonstrating its potential for addressing the challenges related to data imbalance. FSCS optimizes feature space representation by minimizing the distance between sampling points and cluster centers, ensuring comprehensive coverage of input covariate feature distributions (Brus, 2019).
Existing studies were primarily conducted under conditions of limited data availability (sample sizes ranging from tens to hundreds), often comparing a narrow range of sampling methods without extensive hyperparameter optimization. Consequently, the focus has remained primarily on performance improvements in small-sample cases, while the more complex data imbalance issues in large-sample cases, underlying mechanisms, and impact on model interpretability remain unclear. Further research is needed to elucidate the mechanisms determining the efficacy of sampling methods beyond empirical validation, and the impact of sampling method on model interpretability—a key determinant of ML performance in hydrology. To address these gaps and robustly evaluate sampling methods, we designed a comprehensive computational experimental framework to systematically compare the performance of FSCS with other advanced sampling methods, including balanced sampling and conditional Latin hypercube sampling (CLHS), using SRS as the baseline. All training subsets were selected directly from the original data set using the respective sampling methods without generating synthetic samples, ensuring the integrity of the data distribution.
To encompass common problem types in hydrological ML models, we utilized two distinct data sets: a multi-class forest cover type and continuous-valued soil hydraulic conductivity (Ks) data set. To reflect the increasing volume of data in hydrological sciences, we extended our investigation to training set sizes of 1,000–20,000 samples. This range accounts for the growing data availability and allows examination of how imbalance issues manifest at larger scales (Ma et al., 2021; Slater et al., 2025; Tran et al., 2023). Each sampling method was applied 20 times at each size level to ensure statistical reliability.
We employed random forest (RF) as a representative simple model and Light Gradient Boosting Machine (LightGBM or LGB) as a complex algorithm, both of which have demonstrated effectiveness in hydrological applications (Guo et al., 2023; Jing et al., 2023; Li et al., 2022; Mohtaram et al., 2025). To guarantee fair comparisons and optimal model performance for each sampling method, we implemented hyperparameter optimization.
Moving beyond performance, we assessed the mechanisms by which FSCS achieved its effectiveness in terms of its impact on covariate and target variable distributions. Recognizing the importance of model interpretability in hydrological research, we evaluated how sampling methods affect model interpretability using SHapley Additive exPlanations (SHAP) analysis (Scott & Su-In, 2017). This approach provides insights into feature importance and interactions, bridging the gap between complex ML models and practical hydrological decision-making (Zhang et al., 2023).
The main objectives of this study were as follows: (a) Evaluate the effectiveness of advanced sampling methods, particularly FSCS, in addressing data imbalance and improving forest cover type and saturated hydraulic conductivity (Ks) predictions; (b) analyze how the FSCS-generated data distributions enhance ML model performance; and (c) explicitly describe the interpretability of ML models through SHAP analysis, highlighting the role of advanced sampling methods.
The remainder of this paper is organized as follows: Section 2 describes the data sets, including forest cover types and saturated soil hydraulic conductivity, representing common multiclass and continuous problems in hydrological ML modeling. It also covers data preparation, an introduction to the sampling methods, the ML algorithms, hyperparameter tuning procedures, performance metrics, statistical significance evaluation, and SHAP tools.
This study provides a comprehensive practical framework for advancing the application of ML in hydrology by addressing data imbalance-related challenges. This is the first large-scale, mechanism-driven, and interpretability-focused evaluation of advanced sampling methods that has direct implications in both hydrological and soil science research.
Materials and Methods
Data Collection and Preprocessing
This study utilized two primary data sets: a forest cover type data set from the Roosevelt National Forest in northern Colorado, provided by the UC Irvine ML Repository, and a soil properties data set from the USKSAT database (Blackard, 1998; Pachepsky & Park, 2015).
Forest Cover Type Data Set
The forest cover type data set comprised 110,393 samples, each with 54 features. Due to the computational overhead of model construction and sampling, we used a stratified subset of the full data set, provided by OpenML (Vanschoren et al., 2014). As covariates, we used only the 10 continuous numerical features that past studies identified as the most important environmental covariates for predicting forest cover type: elevation, aspect, slope, horizontal distance to hydrology, vertical distance to hydrology, horizontal distance to roadways, hillshade at 9 a.m., hillshade at noon, hillshade at 3 p.m., and horizontal distance to fire points (Sjöqvist et al., 2020; Tavakol Sadrabadi & Innocente, 2023). Forest cover type was classified into seven categories: aspen, cottonwood-willow, Douglas fir, krummholz, lodgepole pine, ponderosa pine, and spruce-fir. Figure 1 shows the distribution of forest cover types for the subset. Figure A1 (Appendix A) shows the distributions of environmental covariate features and forest cover types for the forest cover type data set, including the subset and full data set, demonstrating their comparable features and the representativeness of the subset.
[IMAGE OMITTED. SEE PDF]
Soil Properties Data Set
The USKSAT data set contains 20,851 samples compiled from 45 source data sets, with >27,000 laboratory measurements from the USA. For this study, 18,729 soil samples were selected based on data availability, and a feature subset—including clay, bulk density (Db), very fine sand (VFS), medium sand (MS), organic carbon (OC), silt, coarse sand (COS), fine sand (FS), depth to the top (DT), and very coarse sand (VCOS)—was selected to estimate Ks based on previous findings (Ahmadisharaf et al., 2024; Pham & Won, 2022). Figure 2 shows the distribution of Ks on a natural logarithmic scale. Figure A2 (Appendix A) shows the distributions of the environmental covariate features and Ks.
[IMAGE OMITTED. SEE PDF]
Data Standardization
To compare sampling methods, we performed data standardization using the z-score normalization method, ensuring that each feature has a mean of 0 and a standard deviation of 1:

$z_i = \dfrac{x_i - \bar{x}}{\sigma_x}$

Here, $x$ is the desired variable, $x_i$ is the ith sample value of variable $x$, $\bar{x}$ is the average of variable $x$, $\sigma_x$ is the standard deviation of variable $x$, and $z_i$ represents the normalized value of $x_i$.
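The standardization step above can be sketched in a few lines of NumPy (equivalent to scikit-learn's StandardScaler); the data here are illustrative:

```python
import numpy as np

def zscore(X):
    """Standardize each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Example: standardize two covariate columns.
X = np.array([[2.0, 10.0], [4.0, 20.0], [6.0, 30.0]])
Z = zscore(X)
```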
Sampling Methods
For the forest cover type data set, we established four different data set levels for training: 1,000, 5,000, 10,000, and 20,000 samples. For the Ks data set, we used three levels: 1,000, 5,000, and 10,000 samples. We employed several sampling methods, including balanced sampling (Brus, 2015; Grafström & Lisic, 2018), CLHS (Minasny & McBratney, 2006), fuzzy c-means clustering sampling (FCM), FSCS (Brus, 2019), and SRS as a benchmark. Each sampling method was applied at the specified levels for both data sets, generating 20 training subsets at each level. The remaining samples from each data set were used as the test subset. No target variable (i.e., forest cover type or Ks) was used by any sampling method. Below, we briefly introduce each sampling method.
Balanced Sampling
Balanced sampling aims to improve efficiency by ensuring that the sample reflects the mean of a covariate that is linearly related to the variable of interest (Brus, 2015; Grafström & Lisic, 2018). This strategy is particularly useful when the variable of interest has a known relationship with one or more covariates. This approach selects a sample such that the average of the balancing covariate matches the population mean of that covariate. This promotes the efficient estimation of the variable of interest, often with reduced sampling variance compared to traditional sampling methods such as SRS. In probability sampling, a sampling design is considered balanced on variable $x$ if the following condition holds:

$\sum_{i \in S} \dfrac{x_i}{\pi_i} = \sum_{i \in U} x_i$

where $S$ is the sample, $U$ is the population, and $\pi_i$ is the inclusion probability of unit $i$.
Conditional Latin Hypercube Sampling
CLHS, one of the most widely used methods in soil mapping, is an optimization-based sampling strategy designed to select a representative sample from a data set with multiple environmental variables (Minasny & McBratney, 2006). The approach aims to choose $n$ samples from a data set of size $N$ with $K$ covariate features. The covariate features are utilized to form a Latin hypercube in the sample while maintaining the statistical characteristics of the original data set. The objective function for CLHS (Equation 3) is composed of three components: $O_1$ for continuous variables, $O_2$ for categorical variables, and $O_3$ to preserve correlations between the variables:

$O = w_1 O_1 + w_2 O_2 + w_3 O_3$

These components are weighted by $w_1$, $w_2$, and $w_3$, which are typically set to 1 for all components.

$O_1$ is determined as follows:

$O_1 = \sum_{i=1}^{n} \sum_{j=1}^{K} \left| \eta(q_i^j) - 1 \right|$

where $\eta(q_i^j)$ is the number of selected samples falling in the ith quantile stratum of covariate $j$ (each covariate is divided into $n$ equally probable strata).

$O_3$ is determined as follows:

$O_3 = \sum_{j=1}^{K} \sum_{k=1}^{K} \left| c_{jk} - t_{jk} \right|$

where $c_{jk}$ and $t_{jk}$ are elements of the correlation matrices of the sample and the full data set, respectively.

In this and most other studies, because the covariate features do not include categorical variables, the objective function degenerates as follows:

$O = w_1 O_1 + w_3 O_3$
Annealing simulations were used to search for sample points. The CLHS algorithm was implemented using a Python rewrite of the original MATLAB code (Minasny & McBratney, 2006).
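As a concrete illustration, the continuous-variable component $O_1$ can be sketched as below. This is a simplified, hypothetical implementation for scoring one candidate subset; the actual CLHS algorithm minimizes this criterion over many candidate subsets via simulated annealing:

```python
import numpy as np

def clhs_o1(X, idx):
    """O1 component of the CLHS objective: for each covariate, count the
    sampled points falling in each of n equally probable quantile strata
    and sum the absolute deviations from the ideal of one point per stratum."""
    n = len(idx)
    o1 = 0.0
    for j in range(X.shape[1]):
        # n + 1 quantile edges define n equally probable strata per covariate
        edges = np.quantile(X[:, j], np.linspace(0.0, 1.0, n + 1))
        counts, _ = np.histogram(X[idx, j], bins=edges)
        o1 += np.abs(counts - 1).sum()
    return o1

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))               # illustrative covariate matrix
idx = rng.choice(500, size=20, replace=False)  # one candidate training subset
score = clhs_o1(X, idx)                      # lower is more Latin-hypercube-like
```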
Fuzzy C-Means Cluster Sampling
FCM sampling is based on fuzzy clustering, in which each object (data point) has a membership value between 0 and 1 in each cluster, reflecting its degree of belonging. Unlike hard-clustering methods such as k-means, which assign each sample to exactly one cluster, FCM quantifies the affiliation of each sample with every cluster.

Mathematically, the FCM algorithm minimizes the following objective function:

$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| x_i - c_j \right\|^2, \quad \text{subject to} \quad \sum_{j=1}^{C} u_{ij} = 1$

where $N$ is the number of samples, $C$ is the number of clusters, $u_{ij}$ is the membership of sample $x_i$ in cluster $j$, $c_j$ is the centroid of cluster $j$, and $m > 1$ is the fuzzifier controlling the degree of fuzziness.
- Cluster as Classes: In this strategy, the number of clusters is set equal to the number of categories in the target variable. For instance, when classifying forest cover types, the data set was divided into seven clusters, corresponding to the seven forest-cover categories. The total data set size (e.g., 5,000 samples) was evenly distributed among the clusters, with any remainder assigned randomly. Within each cluster, samples were selected in descending order based on their membership values. If a sample had already been allocated to another cluster, its membership values across all clusters were compared, and the sample was reassigned to the cluster with the highest membership value. After initial allocation, the sample distribution in each cluster was reviewed to ensure the required number of samples was achieved. The remaining samples were iteratively selected in each cluster based on membership values, and conflicts were resolved by reallocation to ensure that each cluster has an adequate number of samples.
- Cluster as Sample Size: In this strategy, the number of clusters corresponds to the required sample size of the training data set. For example, if the training data set for Ks requires 10,000 samples, the number of clusters is set to 10,000. After clustering, initial sampling is performed, with each cluster selecting the sample closest to its centroid. Similar to the "Cluster as Classes" approach, conflicts are resolved by comparing membership values and reallocating samples to the cluster with the highest membership value. Empty clusters are filled iteratively by selecting samples based on proximity to centroids or membership values. This process continues until all clusters meet the required sample size criteria.
Importantly, the first strategy was applied only in the estimation of forest cover types, whereas the second was used for both forest cover types and Ks.
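The underlying fuzzy c-means iteration can be sketched in NumPy as below. This is a hypothetical minimal implementation with fuzzifier m = 2 (production work would typically use a dedicated library such as scikit-fuzzy), followed by the "Cluster as Classes" selection of top-membership samples:

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: returns cluster centers and a membership
    matrix U whose rows sum to 1. m > 1 is the fuzzifier."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    p = 2.0 / (m - 1.0)
    for _ in range(iters):
        W = U ** m
        # centroid update: membership-weighted means
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # distance of every sample to every centroid (small epsilon avoids /0)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # membership update: u_ij = d_ij^(-p) / sum_k d_ik^(-p)
        U = (d ** -p) / (d ** -p).sum(axis=1, keepdims=True)
    return centers, U

# Illustrative two-cluster data; pick the 10 highest-membership samples per cluster.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centers, U = fuzzy_cmeans(X, c=2)
top = {j: np.argsort(U[:, j])[::-1][:10] for j in range(2)}
```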
Feature Space Coverage Sampling and k-Means Cluster Sampling
FSCS aims to achieve optimal coverage of the feature space defined by quantitative variables by minimizing the average distance between population units (e.g., raster cells) and the nearest sampling unit in the standardized feature space (Brus, 2019). The process ensures that selected samples comprehensively represent the feature space.
FSCS optimization is performed according to the mean squared shortest standardized distance (MSSSD) criterion:

$\mathrm{MSSSD} = \dfrac{1}{N} \sum_{i=1}^{N} \min_{j} \, d^2(x_i, c_j)$

Here, $N$ is the total number of raster cells and $d^2(x_i, c_j)$ is the squared distance between raster cell $x_i$ and the nearest cluster centroid $c_j$ in the standardized feature space.
To address the varying ranges of different features, all feature values were standardized by dividing them by their standard deviations. The k-means clustering algorithm was used to minimize the MSSSD iteratively, and the process was terminated when no further reductions were achieved.
Because the k-means result is sensitive to the initial cluster selection, the k-means++ algorithm was employed to probabilistically select cluster centers based on distance, reduce the risk of suboptimal clustering, and enhance the initialization process. To ensure robustness, the algorithm was configured to iterate up to 1,000 times with five replicates, employing the k-means++ initialization strategy. We evaluated the computational overhead of the optimal advanced sampling method using k-means++ clustering (see Appendix A2).
Similar to FSCS, which focuses on achieving optimal coverage in the feature space, k-means sampling sets the number of clusters to match the number of classification categories (cluster = class). For example, in the forest cover type data set, the number of clusters was set equal to the number of target categories (i.e., 7).
Samples were selected from each cluster in descending order of their proximity to the cluster center. The number of samples from each cluster was determined by dividing the data set level by the number of categories. Remainders were allocated randomly to one of the clusters to ensure complete coverage.
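The FSCS selection described above can be sketched with scikit-learn's k-means++ implementation. This is an illustrative sketch, not the study's exact code; the sample size and data are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def fscs_sample(X, n_samples, seed=0):
    """FSCS sketch: scale features by their standard deviations, run k-means
    with k = n_samples (k-means++ init, 5 replicates), then pick the real
    sample nearest each centroid. Duplicates (two centroids sharing a
    nearest point) are removed, so slightly fewer points may be returned."""
    Z = X / X.std(axis=0)
    km = KMeans(n_clusters=n_samples, init="k-means++", n_init=5,
                max_iter=1000, random_state=seed).fit(Z)
    idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, Z)
    return Z, np.unique(idx)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
Z, idx = fscs_sample(X, 30)

# MSSSD of the selection: mean squared distance from every point to its
# nearest selected sample in the standardized feature space.
mssd = (pairwise_distances_argmin_min(Z, Z[idx])[1] ** 2).mean()
```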
Simple Random Sampling
SRS without replacement, in which locations are selected independently from the population with equal probability, does not consider the covariate data structure and is one of the most straightforward sampling methods (Cochran, 1977).
Machine Learning Algorithms and Model Building
Random Forest Algorithm
RF is a robust ensemble learning method widely used for both classification and regression tasks. It operates by constructing multiple decision trees during training and aggregating their predictions (via majority voting for classification or averaging for regression) to improve accuracy and reduce overfitting (Breiman, 2001). Each tree is trained on a bootstrap sample of the training data, with randomness introduced by selecting a subset of features at each split. This ensures diversity among the trees and enhances the model's generalizability.
Key parameters for RF, including the number of trees (n_estimators), the maximum depth of the trees (max_depth), and the number of features considered at each split (max_features), were fine-tuned to optimize model performance. RF model construction was performed using the scikit-learn library in Python.
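A minimal sketch of constructing an RF classifier with the three named parameters (the data and parameter values here are hypothetical placeholders, not the tuned values from the study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative 3-class problem with 10 covariates, echoing the feature count used.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

rf = RandomForestClassifier(n_estimators=200,   # number of trees
                            max_depth=12,       # maximum tree depth
                            max_features="sqrt",  # features per split
                            random_state=0)
rf.fit(X, y)
acc = rf.score(X, y)  # training accuracy of the fitted ensemble
```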
LightGBM Algorithm
LGB is a gradient-boosting framework known for its high efficiency and scalability, particularly for large data sets with numerous features (Ke et al., 2017). Unlike traditional gradient boosting methods, LGB uses a histogram-based algorithm for faster training and supports leaf-wise tree growth, which can reduce loss more effectively than level-wise growth. A comparison among algorithms showed that LGB achieves the optimal overall result in terms of speed and performance (Bentéjac et al., 2021).
Important parameters for LGB, including the number of leaves (num_leaves), learning rate (learning_rate), and number of boosting iterations (num_boost_round), as well as the regularization parameters lambda_l1 and lambda_l2 used to prevent overfitting, were systematically optimized to achieve the best predictive performance. Lambda_l1 penalizes the absolute size of each tree-leaf contribution, pushing minor effects toward zero to highlight the most influential features, while lambda_l2 penalizes the square of each leaf's contribution, limiting extreme values to produce smoother and more reliable predictions. LGB model construction was performed using the LightGBM library in Python.
Hyperparameter Optimization
Hyperparameter optimization is critical for enhancing the performance and robustness of ML-based models. In this study, it ensured that the training sets obtained from the different sampling methods produced optimal or near-optimal models.
The Tree-structured Parzen estimator (TPE) algorithm in Optuna was employed for hyperparameter optimization (Akiba et al., 2019). This Bayesian approach improves efficiency by selecting hyperparameters that maximize expected improvement. Evaluation metrics include root mean square error for regression tasks and accuracy for classification tasks. To ensure reliability, 10-fold cross-validation was applied, with a Bayesian search of 100 iterations to identify the optimal hyperparameters. The best hyperparameters were used to build the final predictive models.
Model Performance and Interpretability Analyses
Model Performance Metrics
To evaluate the performance of the models, we utilized specific metrics tailored to classification and regression tasks.
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}, \quad \text{Precision} = \dfrac{TP}{TP + FP}, \quad \text{Recall} = \dfrac{TP}{TP + FN}$

Here, TP, TN, FP, and FN represent the true positives, true negatives, false positives, and false negatives. In addition, $F1 = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$.
We determined the area under the curve (AUC) of the receiver operating characteristic (ROC)—a plot of the true positive rate against the false positive rate across all possible classification thresholds—to represent the model's ability for distinguishing between positive and negative classes.
$\mathrm{AUC} = \dfrac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ \mathbb{I}(p_i > n_j) + \tfrac{1}{2}\,\mathbb{I}(p_i = n_j) \right]$

Here, $M$ is the number of positive samples; $N$ is the number of negative samples; $p_i$ is the positive sample prediction score, that is, the probability that a positive sample is predicted to be positive; and $n_j$ is the negative sample prediction score, that is, the probability that a negative sample is predicted to be positive. AUC values range from 0.5 (random discrimination) to 1 (perfect discrimination); the AUC was computed for each class and averaged as the final AUC (referred to as ROC-AUC).
For the regression model of Ks estimation, the coefficient of determination (R2) and root mean square log-transformed error (RMSLE) were used as performance metrics:

$R^2 = 1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \mathrm{RMSLE} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} \left( \ln \hat{y}_i - \ln y_i \right)^2}$

Here, $\hat{y}_i$ and $y_i$ are the estimated and measured values, respectively; $\bar{y}$ is the mean of the measured values; and $n$ is the number of samples.
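The two regression metrics can be computed directly in NumPy. Note one assumption: because Ks is modeled on the natural-log scale here, RMSLE is taken as the RMSE of the already ln-transformed values:

```python
import numpy as np

def r2(y, yhat):
    """Coefficient of determination."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmsle_ln(y_ln, yhat_ln):
    """RMSE of log-transformed values (inputs assumed already on ln scale)."""
    return float(np.sqrt(np.mean((yhat_ln - y_ln) ** 2)))

# Illustrative measured vs. estimated values.
y = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.1, 1.9, 3.2, 3.8])
```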
Statistical Comparison and Effect Size Measures of Sampling Methods
To compare the performance of different sampling methods across models and training sizes, we employed the two-sided Mann–Whitney U test, a nonparametric test suitable for independent samples without the assumption of normality. The test assesses whether the distribution of performance metrics differs significantly between two combinations (e.g., LGB + FSCS vs. RF + CLHS). The null hypothesis of the test is as follows:

$H_0: P(X > Y) = P(Y > X)$

that is, a value drawn from one group is equally likely to exceed or fall below a value drawn from the other group.
We quantified the magnitude of the observed differences using two nonparametric effect size measures: rank-biserial correlation ($r_{rb}$, Equation 16) and Cliff's delta ($\delta$, Equation 17).

$r_{rb} = \dfrac{2U}{n_1 n_2} - 1 \qquad (16)$

$\delta = \dfrac{\#(x_i > y_j) - \#(x_i < y_j)}{n_1 n_2} \qquad (17)$

Here, $U$ is the Mann–Whitney U statistic; $n_1$ and $n_2$ are the sample sizes of the two groups; and $\#(\cdot)$ counts the pairs $(x_i, y_j)$ satisfying the condition. Both measures range from −1 to +1, with 0 indicating no difference between groups, +1 representing complete dominance of Group 1 over Group 2, and −1 indicating the opposite.
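A short sketch computing the test and both effect sizes with SciPy; in the absence of ties, the rank-biserial correlation and Cliff's delta coincide:

```python
from scipy.stats import mannwhitneyu

def effect_sizes(x, y):
    """Two-sided Mann-Whitney U test plus rank-biserial correlation and
    Cliff's delta for two independent groups."""
    u, p = mannwhitneyu(x, y, alternative="two-sided")  # U statistic of x
    n1, n2 = len(x), len(y)
    r_rb = 2.0 * u / (n1 * n2) - 1.0
    # Cliff's delta: difference between the proportions of pairs where
    # x exceeds y and pairs where y exceeds x.
    gt = sum(xi > yj for xi in x for yj in y)
    lt = sum(xi < yj for xi in x for yj in y)
    delta = (gt - lt) / (n1 * n2)
    return r_rb, delta, p

# Illustrative case of complete dominance of group 1 over group 2.
r_rb, delta, p = effect_sizes([5, 6, 7], [1, 2, 3])
```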
Model Interpretability Analysis
SHAP was used to analyze model interpretability based on feature importance and interactions (Scott & Su-In, 2017). Based on Shapley's game theory concept, SHAP quantifies each feature's contribution to the model's output. Positive SHAP values indicate a positive impact on predictions, and negative values indicate a negative impact. Additionally, the analysis of feature interactions can reveal how combinations of features influence model behavior. This approach enhances model transparency by elucidating important predictive features and how they interact, thereby clarifying the model's decision-making process.
Results and Discussion
Performance Comparison of Different Sampling Method Combinations
We first evaluated how performance—accuracy, F1 score, and ROC-AUC for forest cover type; R2 and RMSLE for ln(Ks)—varies with model type, sampling method, and training set size (Section 3.1.1). Next, we applied two-sided Mann–Whitney U tests with effect sizes rrb and δ to assess the statistical significance of FSCS gains over alternative sampling strategies in both fair (same-model) and unfair (cross-model) comparisons (Section 3.1.2). Finally, we compared the performance of LGB under different sampling methods with that of RF + SRS, demonstrating how FSCS can offset model-architecture limitations.
Performance Metrics Comparison Between Sampling Method Combinations
Figures 3 and 4 show the prediction results for forest cover type and ln(Ks) based on the test set. We used the different sampling methods to obtain sample subsets of the environmental covariate features for the training set, while the remaining samples were used in the test set.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
As expected, most performance metrics improved with increasing training set size, reflecting the typical benefit of larger data sets. However, the improvements were more pronounced when using advanced sampling methods—such as balanced sampling, CLHS, and FSCS—with FSCS showing the strongest enhancement. This suggests that advanced sampling methods such as FSCS generate training sets with richer prior information by accounting for the covariate feature structure, enabling more uniform estimation accuracy across forest cover types and the Ks range.
Notably, with the FCM (cluster = class) sampling method, the training set of 10,000 samples yielded poorer performance than the smaller training set of 5,000. This indicates that simply increasing the training set size does not necessarily improve pattern learning. Moreover, the training sets selected by this method could equally well have been obtained by random sampling.
Among the combinations of models and sampling methods, FSCS consistently delivered the most promising results. For forest cover type, the performance improvement of RF/LGB combined with FSCS was more noticeable than that of RF/LGB with SRS as the training set size increased. Remarkably, the RF/LGB + FSCS combination achieved performance comparable to that of RF/LGB + SRS with a training set size of 20,000 when using only 10,000 samples.
For Ks, the performance gains with FSCS were even more pronounced. For example, RF + FSCS achieved median R2 and RMSLE values of 0.79 and 0.47, respectively, with a training set size of 5,000. In contrast, LGB + SRS required a training set size of 10,000 to achieve similar performance (R2 = 0.83, RMSLE = 0.45). Furthermore, as the training set size increases or more advanced models are used, the superiority of FSCS becomes even more apparent.
To validate these findings, we compared our results with those in a previous study (Ahmadisharaf et al., 2024). In the previous study, XGBoost + SRS achieved R2 and RMSLE values of 0.87/0.89/0.72 and 0.77/0.72/0.69, respectively, with varying training set sizes (11,200/12,800/14,392). In contrast, RF/LGB + FSCS achieved median R2 values of approximately 0.86/0.95 and RMSLE values of 0.53/0.33 with a training set size of 10,000. The present and past results show that FSCS improves the performance of simpler models, such as RF, to levels comparable to, or even better than, more advanced models, including XGBoost and LGB, using fewer data.
Figures 5 and 6 quantitatively compare the performance improvements gained by better models versus advanced sampling methods. For instance, in forest cover type estimation, RF + FSCS with a training set size of 20,000 showed a 3.8% improvement in accuracy, compared to a 5.1% improvement with LGB + SRS. This indicates that selecting a better sampling method alone can achieve 74.5% of the improvement obtained by switching to a more advanced model. Moreover, the combination of LGB + FSCS yielded a 7.5% improvement, which was 147% that of LGB + SRS. The advantages of FSCS are even more pronounced for ln(Ks). For LGB + FSCS with a training set size of 10,000, the R2 improved by 0.14 over RF + SRS, representing a 17% increase, while RMSLE decreased by 0.19, corresponding to a 36.7% reduction.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Given the clear performance advantages of FSCS, we next assessed the statistical significance of the observed gains.
Significance of FSCS Gains in Performance
Among the three advanced sampling methods that outperform SRS—balanced sampling, CLHS, and FSCS—FSCS yielded the most significant gains. In “fair” comparisons using the same model, the two-sided Mann–Whitney U tests (p ≤ 0.05), and effect sizes of rrb and δ, revealed predominantly significant medium-to-large advantages for FSCS over the other sampling methods (Figures A3–A4). Only with small training sizes (ln(Ks) n = 1,000 or forest cover type n ≤ 5,000) did FSCS occasionally show non-significant or small-to-medium advantages, even trailing in some metrics of forest cover type predictions, though these disadvantages were mostly non-significant.
In “unfair” comparisons—contrasting RF + FSCS against LGB combined with other sampling methods—the gains with FSCS persisted despite architectural differences. RF + FSCS maintained superior ROC-AUC for forest cover type up to n = 20,000 (medium-to-large effect size from rrb, δ). Similarly, for the prediction of ln(Ks) with the n = 10,000 training data set, RF + FSCS matched LGB + SRS in terms of RMSLE while delivering a significantly greater R2 (large effect size from rrb, δ). When roles were reversed (LGB + FSCS vs. RF + others), FSCS produced consistently significant effects across almost all metrics as the sample size increased.
At low training set sizes, LGB + FSCS underperformed RF combined with other sampling methods in forest cover type prediction, and its advantage narrowed for ln(Ks); we largely attribute this to model architecture differences rather than any deficiency in FSCS itself. As Figures A5–A6 illustrate, for ln(Ks) prediction at n = 1,000, LGB + FSCS showed a strong positive effect on R2 (versus the non-significant, moderately negative effect under LGB + SRS) while greatly reducing its moderate RMSLE, outperforming all other advanced sampling methods and demonstrating the capacity of FSCS to compensate for model limitations. Although FSCS performed the worst for forest cover type prediction at this scale (worse even than SRS), the absolute differences in performance metrics across model and sampling method combinations remained on the order of only ∼0.01. At n = 5,000, LGB + FSCS showed a significant increase in accuracy versus RF + SRS (the other methods remained non-significant with small effect sizes) and a positive effect on ROC-AUC (as opposed to the negative effects observed for the other combinations). At n = 10,000, FSCS was the only method to produce a significant positive effect on ROC-AUC; all others remained non-significant relative to RF + SRS.
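The significance testing used throughout these comparisons can be sketched in a few lines. This is a minimal illustration with scipy, using synthetic metric scores in place of the study's repeated model runs; the score values and the rank-biserial helper are ours, not from the paper.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def rank_biserial(x, y):
    """Rank-biserial correlation from the Mann-Whitney U statistic:
    r_rb = 2U / (n_x * n_y) - 1, in [-1, 1]; positive means x tends
    to exceed y. Without ties this coincides with Cliff's delta."""
    u = mannwhitneyu(x, y, alternative="two-sided").statistic
    return 2.0 * u / (len(x) * len(y)) - 1.0

rng = np.random.default_rng(0)
# Synthetic accuracy scores for repeated runs of two sampling methods.
fscs_scores = rng.normal(0.85, 0.01, 30)
srs_scores = rng.normal(0.83, 0.01, 30)

res = mannwhitneyu(fscs_scores, srs_scores, alternative="two-sided")
r_rb = rank_biserial(fscs_scores, srs_scores)
print(f"U={res.statistic:.0f}, p={res.pvalue:.2e}, r_rb={r_rb:.2f}")
```

With many metric/size combinations, the resulting p-values and effect sizes can be tabulated exactly as in Figures A3–A6.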
These findings demonstrate that advanced sampling methods can provide performance gains comparable to, or even exceeding, those achieved by increasing the training set size or using more complex models. This is especially relevant for hydrological research, where data may be limited. For instance, the USKSAT data set does not consistently record all features for each sample, requiring optimal data usage and appropriate sampling methods.
Imbalances in training and test subsets are a common obstacle to model performance. By controlling factors including hyperparameter influence, model architecture, and training set size, our study isolated the impact of sampling methods. Our finding that the advantages of FSCS become more pronounced as the training sample size increases reveals that data imbalance manifests more strongly at larger scales, where it is often obscured by the performance gains that accompany larger training sets. The results indicate that advanced sampling techniques such as FSCS can effectively address data imbalance and greatly improve model performance, offering a practical solution to incomplete or imbalanced data sets in hydrological ML.
Distribution Differences Between SRS and FSCS Methods
We assessed the specific mechanism by which the advanced sampling method adjusts the feature distribution, addresses data imbalance, and enhances model performance. Given that FSCS demonstrated the most significant performance improvement, we analyzed the differences between the sampling results obtained by FSCS and SRS in terms of feature distribution.
Figures A7 and A8 (Appendix A) show the sampling density distributions for each environmental covariate feature of forest cover type and ln(Ks), including the global distributions and the training set sampling distributions for RF/LGB combined with SRS/FSCS in the median performance model. There was no significant difference between SRS, FSCS, and the global distribution at high densities. The key differences arose in regions of lower density or stronger fluctuation. SRS often failed to capture the lower-density portions of the features, amplifying fluctuations in the density distribution. For example, for the "Vertical Distance to Hydrology" feature of forest cover type, RF/LGB + SRS failed to capture samples above 500 or captured fewer samples in the 500–600 range. In contrast, FSCS oversampled the lower-density regions of the global distribution, its density even exceeding that of the global distribution, a pattern also observed for "Slope." The sampling superiority of FSCS was also evident in its better fit to the global distribution of "Clay" than SRS (Figure A8), increasing the density in lower regions and decreasing it in higher regions.
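The density comparison can be illustrated numerically. Using a skewed synthetic feature (our stand-in, not an actual covariate), normalized histograms show how a simple random sample tracks the high-density region while fluctuating in, or entirely missing, sparse bins:

```python
import numpy as np

rng = np.random.default_rng(1)
# Skewed synthetic "global" feature distribution (a stand-in covariate).
population = rng.lognormal(mean=1.0, sigma=0.8, size=50_000)

# Simple random sample of the size of a small training set.
srs = rng.choice(population, size=500, replace=False)

bins = np.linspace(population.min(), population.max(), 31)
global_density, _ = np.histogram(population, bins=bins, density=True)
srs_density, _ = np.histogram(srs, bins=bins, density=True)

# Bins occupied globally but empty under SRS: sparse regions the
# sample failed to capture, amplifying density fluctuations.
missed = np.sum((global_density > 0) & (srs_density == 0))
print("sparse bins missed by SRS:", missed)
```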
We analyzed the target variable distribution obtained through FSCS. For forest cover types, the sampling density of the two most frequent categories—lodgepole pine and spruce fir—was reduced under FSCS, while additional samples were allocated to the less frequent categories, such as ponderosa pine (Figures 7 and 8). Similarly, for ln(Ks), FSCS achieved greater sampling density for values of <−3 and >6, a range with a low density in the original data set. This indicates that the environmental covariate feature selection strategy of FSCS is reflected in the target variable distribution. Thus, for training sets that include more samples from less-frequent classes or low-density ranges, the model can learn more a priori information, establish better mapping relationships, and improve prediction performance.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Importantly, the FSCS method, like the other sampling methods, operates entirely on the covariate feature data; the distribution of the target variable emerges as a byproduct of analyzing and sampling the covariate space.
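A minimal sketch of covariate-only FSCS, consistent with the k-means++ settings reported in Appendix A (max_iter = 100, single initialization): cluster the standardized feature space into as many clusters as desired samples and keep the observation nearest each centroid. The helper name and the synthetic covariates are ours.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def fscs_sample(X, n_samples, random_state=0):
    """Feature space coverage sampling (hypothetical helper): partition the
    standardized covariate space into n_samples k-means++ clusters and keep,
    per cluster, the observation nearest the centroid. The target variable
    is never consulted."""
    Xs = StandardScaler().fit_transform(X)
    km = KMeans(n_clusters=n_samples, init="k-means++", n_init=1,
                max_iter=100, random_state=random_state).fit(Xs)
    chosen = []
    for c in range(n_samples):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(Xs[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])
    return np.asarray(chosen)

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5))        # synthetic covariate matrix
train_idx = fscs_sample(X, n_samples=100)
print(train_idx.shape)
```

Because every cluster contributes exactly one observation, sparse regions of feature space receive proportionally more representation than under SRS, which is the oversampling of low-density regions described above.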
Interpretative Analysis of Feature Influences With Different Sampling Methods
We demonstrated how the FSCS method effectively addresses data imbalance by constructing training sets with higher densities for less represented classes or low-density samples compared to the global distribution. This approach significantly improved the predictive performance for both forest cover types and Ks. However, it remains unclear whether the sampling method affects feature interpretability. Accordingly, we analyzed the interpretability of features for estimating forest cover types and Ks based on combinations of different models and sampling methods. Using the SHAP tool, we identified the features with the greatest impact on model construction, the influence of feature variations on predictions, and interaction effects between features. For categorical variables, higher SHAP values indicate a stronger tendency for the feature to predict a particular category. For continuous variables, higher SHAP values suggest a stronger influence toward predicting higher values. We hypothesized that better model performance would lead to more accurate feature interpretation.
For forest cover types, Figure 9 and Figures A9–A12 (Appendix A) present the SHAP feature importance analysis and SHAP value distributions for the different model–sampling combinations. Elevation emerged as the most critical feature across all combinations. For RF + SRS/FSCS and LGB + FSCS, the top four features in order of importance were elevation, horizontal distance to roadways, horizontal distance to fire points, and horizontal distance to hydrology. LGB + SRS differed only in the order of the second and third most important features, whose importance was nearly comparable. The importance of the remaining features varied across model–sampling combinations. Compared to RF + SRS/FSCS, LGB + SRS/FSCS reduced the importance of elevation while increasing the importance of most other features. Notably, under FSCS, this shift in feature importance was further amplified when upgrading the model from RF to LGB. For instance, the importance of "Aspect" increased when switching from RF + SRS to LGB + SRS and even more so from RF + SRS to LGB + FSCS. The combination of more complex models and advanced sampling methods further refined the contributions of features to less frequent forest cover types. For example, for the "Aspen" class in Figures A9–A12, as predictive performance improved (RF + SRS < RF + FSCS ≈ LGB + SRS < LGB + FSCS), the SHAP influences of elevation and horizontal distance to roadways increased further, with lower feature values favoring the estimation of aspen.
[IMAGE OMITTED. SEE PDF]
For Ks, the SHAP analysis indicated that clay content and Db are the two most important features (Figure 10). However, their effects differed: lower values of both clay and Db generally pushed predictions toward higher Ks, but with LGB + SRS/FSCS, the positive influence of low Db was stronger than that of low clay content. This suggests that Db plays a stronger role in estimating high Ks values, consistent with the stronger predictive role of Db reported previously (Ahmadisharaf et al., 2024). Notably, this relationship was most evident in LGB + FSCS, where model performance was best. The importance rankings of the remaining features were less consistent. The SHAP average importance values in Figure A13 show that, excluding VCOS, the differences among the other features were relatively minor (0.1–0.25). Assuming that stronger predictive models yield more accurate SHAP interpretations, the LGB + FSCS results suggest that clay, Db, sand, MS, and silt are the most important features for estimating Ks.
[IMAGE OMITTED. SEE PDF]
Given that clay is the most critical feature for estimating Ks, we used the SHAP tool to examine the influence of interactions between clay and other features. SHAP interaction values quantify the joint effect of two interacting features on predictions, analogous to SHAP values. Clay showed non-significant interactions with most features (Figure 11). However, its interaction with OC was relatively distinct. At OC < 1.5%, the clay–OC interaction had minimal influence on the estimated Ks, decreasing slightly as clay increased. At OC > 1.5%, however, increasing clay pushed the estimated Ks higher. This highlights the utility of SHAP interaction analysis in uncovering potential feature interactions. Notably, this does not conflict with the conclusion that lower clay and OC favor higher estimated Ks (Figure 10).
[IMAGE OMITTED. SEE PDF]
Compared to previous studies, we observed differences in the reported feature importance for predicting forest cover type using the cover type data set and estimating Ks using the USKSAT data set. For forest cover prediction, elevation, horizontal distance to hydrology, and vertical distance to hydrology are the three most important continuous environmental features (Sjöqvist et al., 2020; Tavakol Sadrabadi & Innocente, 2023). For estimating Ks, clay content and Db are typically the two most critical features (Ahmadisharaf et al., 2024; Pham & Won, 2022), consistent with the findings of our study.
While previous research highlighted the significance of other environmental features such as soil type for predicting forest cover, soil type was the most important feature only when using the XGBoost algorithm (Tavakol Sadrabadi & Innocente, 2023). Additionally, the 10th percentile particle diameter (d10) has been used for estimating Ks (Rubol et al., 2014), though its exclusion should not significantly impact model performance (Ahmadisharaf et al., 2024; Araya & Ghezzehei, 2019).
In contrast to previous studies, the primary aim of our research was not to emphasize the specific order of importance of environmental features but rather to evaluate the effectiveness of advanced sampling methods, particularly FSCS, in addressing data imbalance and enhancing the prediction of forest cover types and Ks. We also analyzed how the distributions of input data sampled by FSCS contribute to improving ML-based model performance. Furthermore, our study highlights the importance of feature interpretability, as demonstrated by SHAP analysis, for elucidating how advanced sampling methods improve model predictions and interpretations.
Despite these contributions, this study has limitations that merit further exploration: more complex hydrological application scenarios, such as remote sensing downscaling or drought risk assessment, remain to be evaluated; methods like FSCS are designed for feature-space optimization, and their adaptation to temporal forecasting awaits further investigation; furthermore, the evaluation was conducted only on tree-based models without validation in other architectures such as deep learning. These limitations point to valuable directions for future research.
Conclusion
Previous studies have extensively discussed the challenges posed by data imbalance in hydrological research using ML. However, no clear solution has been devised, and no study has adequately addressed these issues. This study compared the effects of different sampling methods on the estimation of forest cover types and Ks. We demonstrated that the problems associated with data imbalance can be significantly mitigated by selecting advanced sampling methods, such as FSCS, thereby improving model performance and elucidating the relationships between covariate and target features.
We systematically investigated the impact of advanced sampling methods on model performance using two data sets: forest cover type data from the Roosevelt National Forest in northern Colorado (110,393 samples) and soil properties data from the USKSAT database (18,729 samples). A total of 1,720 models were built, considering combinations of data sets, training set sizes, model types, and sampling methods. To ensure optimal or near-optimal performance, each model underwent hyperparameter optimization using the TPE method in the Optuna framework.
Balanced sampling, CLHS, and FSCS outperformed SRS. The performance improvement became more pronounced as sample size increased, with FSCS showing the most significant enhancement. Even with smaller training sets and RF models, the results were comparable to or better than those obtained using larger data sets or more complex LGB models. We believe that these sampling methods address the data imbalance in both training and testing subsets, better capturing underrepresented regions in the feature distribution and providing more accurate prior information for model prediction. SHAP analysis further revealed that the FSCS method improved feature importance estimation, clarifying the relationship between covariate and target features and emphasizing the interactions between key features.
Our study demonstrates that advanced sampling methods, exemplified by FSCS, effectively address data imbalance by accounting for the data structure of covariate features. Advanced sampling methods not only enhance the accuracy of predictions for forest vegetation types and soil Ks but also provide a deeper and more precise interpretation of model outcomes. By using a balanced and rigorously optimized comparison framework, this work not only provides robust evidence of the superior performance of advanced sampling in hydrological ML but also delivers a mechanistic explanation of its efficacy. Furthermore, we revealed, for the first time, the exceptional potential of these methods for enhancing model interpretability and facilitating the extraction of meaningful quantitative mechanisms. This contribution extends beyond prior studies that solely established predictive utility, offering a new paradigm for employing sampling as a foundational strategy in data-rich hydrological research. Finally, these sampling methods are fully adaptable to the prediction of a broader range of hydrological features, underscoring their dual advantage of simultaneously strengthening the performance and interpretability of ML models and promoting more reliable and insightful hydrological science.
Appendix A - Supplemental Figures and FSCS Overhead Analysis
Feature Histograms, Significance of Differences in Sampling Methods, and SHAP Results
We observed a difference in SHAP values between RF and LGB in the forest cover type prediction, primarily because RF outputs probabilities, leading to smaller SHAP values, while LGB converts forest types into numerical scores, amplifying the range of feature contributions. This does not affect the results of the analysis.
FSCS Computational Overhead Evaluation
To quantify the computational overhead of FSCS using the k-means++ algorithm, we performed a two-part evaluation on a standard desktop workstation (Intel i5 12600KF @ 3.70 GHz, 64 GB DDR5 4,000 MHz RAM), disabling all non-essential background processes to minimize interference.
We first evaluated the computational overhead of MATLAB's native k-means++ implementation, which is used throughout this study. The experimental variables comprised combinations of three key parameters: the original data set size n; the sampling data set size, that is, the number of clusters k; and the feature dimension d. Every combination was constrained so that the number of clusters did not exceed the data set size; when d < 10, the features were randomly subsampled, and clustering was repeated five times. Other clustering program settings included Start = "plus", MaxIter = 100, and Replicates = 1. The entire run took ∼340 computing hours.
Peak memory usage m (MB) and running time t (s) were fitted using power-law models of the form m = a_m · n^(α_m) · k^(β_m) · d^(γ_m) and t = a_t · n^(α_t) · k^(β_t) · d^(γ_t).
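Fitting such power laws reduces to linear least squares after a log transform. The sketch below recovers known exponents from synthetic benchmark records (all values are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Fabricated benchmark grid: n = data set size, k = clusters, d = dimension.
n = rng.integers(1_000, 100_000, 200).astype(float)
k = np.maximum(2, n * rng.uniform(0.05, 0.5, 200))
d = rng.integers(2, 11, 200).astype(float)
# Ground-truth power law t = a * n^1.1 * k^1.3 * d^0.05, with mild noise.
t = 1e-6 * n**1.1 * k**1.3 * d**0.05 * rng.lognormal(0.0, 0.05, 200)

# A log transform turns the power law into a linear model:
# log t = log a + alpha*log n + beta*log k + gamma*log d.
A = np.column_stack([np.ones_like(n), np.log(n), np.log(k), np.log(d)])
coef, *_ = np.linalg.lstsq(A, np.log(t), rcond=None)
log_a, alpha, beta, gamma = coef
print(f"alpha={alpha:.2f}, beta={beta:.2f}, gamma={gamma:.2f}")
```

Exponents near 1 indicate near-linear scaling, above 1 super-linear scaling, and near 0 insensitivity, matching the interpretation given for Figures A14a and A14b.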
Figures A14a and A14b show that the model fits are highly accurate: memory scales nearly linearly with n and k and is insensitive to d, while runtime exhibits strong super-linear growth in n and k but remains insensitive to d. Thus, the computational overhead was mainly governed by the original data set size and the sampling size rather than the feature dimensionality.
Because feature dimensionality has a negligible effect, we benchmarked MATLAB against two Python implementations—scikit-learn and scikit-learn-intelex (accelerated via Intel OneAPI)—using a fixed feature dimension d = 10 across a range of original data set sizes n and cluster counts k. All combinations were tested five times under identical hardware conditions. For the Python environment, the corresponding clustering settings were max_iter = 100, n_init = 1, and algorithm = "lloyd".
MATLAB exhibited significantly higher peak memory consumption and runtimes, especially for large n and k, with memory usage exceeding scikit-learn by more than 100X in some cases (Figure A14c–A14h). In contrast, both scikit-learn and scikit-learn-intelex were more efficient with memory. Although MATLAB outperformed Python variants in small-scale configurations, scikit-learn-intelex became substantially faster as n and k increased. For example, under the largest configuration tested (n = 100,000, k = 80,000), scikit-learn-intelex completed clustering in ∼60 s—more than 10X faster than scikit-learn (∼1,100 s)—and, based on extrapolation from the fitted model, MATLAB required >4,000 s under the same conditions.
These results demonstrate that implementing FSCS via the scikit-learn-intelex function of k-means++ significantly enhances computational efficiency, confirming that its overhead remains suitable for large-scale applications.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Acknowledgments
This work was supported by the National Key Research and Development Program of China (Grant 2024YFC3211600).
Conflict of Interest
The authors declare no conflicts of interest relevant to this study.
Data Availability Statement
The forest cover type data set can be accessed at and was collected by the author (Yin, 2025). The USKSAT soil data were obtained from by Araya and Ghezzehei (2019) and were collated by the author (Yin, 2025). The CLHS implementation was obtained from and collected by the author (Yin, 2025). The remaining sampling methods and corresponding code tools are described in the main text. All of the data and code implementations have been published at and were collected by the author (Yin, 2025).
Abowarda, A. S., Bai, L. L., Zhang, C. J., Long, D., Li, X. Y., Huang, Q., & Sun, Z. L. (2021). Generating surface soil moisture at 30 m spatial resolution using both data fusion and machine learning toward better water resources management at the field scale. Remote Sensing of Environment, 255, 112301. https://doi.org/10.1016/j.rse.2021.112301
Ahmadisharaf, A., Nematirad, R., Sabouri, S., Pachepsky, Y., & Ghanbarian, B. (2024). Representative sample size for estimating saturated hydraulic conductivity via machine learning: A proof‐of‐concept study. Water Resources Research, 60(8), e2023WR036783. https://doi.org/10.1029/2023WR036783
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next‐generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 2623–2631). ACM. https://doi.org/10.1145/3292500.3330701
Allan, R. P., Barlow, M., Byrne, M. P., Cherchi, A., Douville, H., Fowler, H. J., et al. (2020). Advances in understanding large‐scale responses of the water cycle to climate change. Annals of the New York Academy of Sciences, 1472(1), 49–75. https://doi.org/10.1111/nyas.14337
Araya, S. N., & Ghezzehei, T. A. (2019). Using machine learning for prediction of saturated hydraulic conductivity and its sensitivity to soil structural perturbations. Water Resources Research, 55(7), 5715–5737. https://doi.org/10.1029/2018WR024357
Bedi, S., Samal, A., Ray, C., & Snow, D. (2020). Comparative evaluation of machine learning models for groundwater quality assessment. Environmental Monitoring and Assessment, 192(12), 776. https://doi.org/10.1007/s10661‐020‐08695‐3
Bentéjac, C., Csörgo, A., & Martínez‐Muñoz, G. (2021). A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review, 54(3), 1937–1967. https://doi.org/10.1007/s10462‐020‐09896‐5
Best, J. (2019). Anthropogenic stresses on the world's big Rivers. Nature Geoscience, 12(1), 7–21. https://doi.org/10.1038/s41561‐018‐0262‐x
Biswas, A., & Zhang, Y. (2018). Sampling designs for validating digital soil maps: A review. Pedosphere, 28(1), 1–15. https://doi.org/10.1016/S1002‐0160(18)60001‐3
Blackard, J. (1998). Covertype, UCI machine learning repository. https://doi.org/10.24432/C50K5N
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Brus, D. J. (2015). Balanced sampling: A versatile sampling approach for statistical soil surveys. Geoderma, 253–254, 111–121. https://doi.org/10.1016/j.geoderma.2015.04.009
Brus, D. J. (2019). Sampling for digital soil mapping: A tutorial supported by R scripts. Geoderma, 338, 464–480. https://doi.org/10.1016/j.geoderma.2018.07.036
Cochran, W. G. (1977). Sampling techniques. John Wiley & Sons.
El Bilali, A., Taleb, A., & Brouziyne, Y. (2021). Groundwater quality forecasting using machine learning algorithms for irrigation purposes. Agricultural Water Management, 245, 106625. https://doi.org/10.1016/j.agwat.2020.106625
Erickson, M. L., Elliott, S. M., Brown, C. J., Stackelberg, P. E., Ransom, K. M., Reddy, J. E., & Cravotta, C. A. (2021). Machine‐learning predictions of high arsenic and high manganese at drinking water depths of the glacial aquifer system, northern Continental United States. Environmental Science and Technology, 55(9), 5791–5805. https://doi.org/10.1021/acs.est.0c06740
Fatichi, S., Vivoni, E. R., Ogden, F. L., Ivanov, V. Y., Mirus, B., Gochis, D., et al. (2016). An overview of current applications, challenges, and future trends in distributed process‐based models in hydrology. Journal of Hydrology, 537, 45–60. https://doi.org/10.1016/j.jhydrol.2016.03.026
Feng, D., Beck, H., Lawson, K., & Shen, C. (2023). The suitability of differentiable, physics‐informed machine learninghydrologic models for ungauged regions and climate change impact assessment. Hydrology and Earth System Sciences, 27(12), 2357–2373. https://doi.org/10.5194/hess‐27‐2357‐2023
Ghanbarian, B., & Pachepsky, Y. (2022). Machine learning in vadose zone hydrology: A flashback. Vadose Zone Journal, 21(4), e20212. https://doi.org/10.1002/vzj2.20212
Gleeson, T., Cuthbert, M., Ferguson, G., & Perrone, D. (2020). Global groundwater sustainability, resources, and systems in the Anthropocene. Annual Review of Earth and Planetary Sciences, 48(1), 431–463. https://doi.org/10.1146/annurev‐earth‐071719‐055251
Grafström, A., & Lisic, J. (2018). BalancedSampling: Balanced and spatially balanced sampling. (Version 1.5.4) [Software]. http://www.antongrafstrom.se/balancedsampling
Guo, X., Gui, X. F., Xiong, H. X., Hu, X. J., Li, Y. G., Cui, H., et al. (2023). Critical role of climate factors for groundwater potential mapping in arid regions: Insights from random forest, XGBoost, and LightGBM algorithms. Journal of Hydrology, 621, 129599. https://doi.org/10.1016/j.jhydrol.2023.129599
Hasan, F., Medley, P., Drake, J., & Chen, G. (2024). Advancing hydrology through machine learning: Insights, challenges, and future directions using the CAMELS, CARAVAN, GRDC, CHIRPS, PERSIANN, NLDAS, GLDAS, and GRACE datasets. Water, 16(13), 1904. https://doi.org/10.3390/w16131904
Hengl, T., Mendes de Jesus, J., Heuvelink, G. B. M., Ruiperez Gonzalez, M., Kilibarda, M., Blagotić, A., et al. (2017). SoilGrids250m: Global gridded soil information based on machine learning. PLoS One, 12(2), e0169748. https://doi.org/10.1371/journal.pone.0169748
Hsu, K.‐L., Gupta, H. V., & Sorooshian, S. (1995). Artificial neural network modeling of the rainfall‐runoff process. Water Resources Research, 31(10), 2517–2530. https://doi.org/10.1029/95WR01955
Ibrahim, K. S. M. H., Huang, Y. F., Ahmed, A. N., Koo, C. H., & El‐Shafie, A. (2022). A review of the hybrid artificial intelligence and optimization modelling of hydrological streamflow forecasting. Alexandria Engineering Journal, 61(1), 279–303. https://doi.org/10.1016/j.aej.2021.04.100
Jing, H., He, X., Tian, Y., Lancia, M., Cao, G. L., Crivellari, A., et al. (2023). Comparison and interpretation of data‐driven models for simulating site‐specific human‐impacted groundwater dynamics in the North China Plain. Journal of Hydrology, 616, 128751. https://doi.org/10.1016/j.jhydrol.2022.128751
Jung, M., Reichstein, M., Ciais, P., Seneviratne, S. I., Sheffield, J., Goulden, M. L., et al. (2010). Recent decline in the global land evapotranspiration trend due to limited moisture supply. Nature, 467(7318), 951–954. https://doi.org/10.1038/nature09396
Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S. F., & Yang, L. (2021). Physics‐informed machine learning. Nature Reviews Physics, 3(6), 422–440. https://doi.org/10.1038/s42254‐021‐00314‐5
Kashinath, K., Mustafa, M., Albert, A., Wu, J. L., Jiang, C., Esmaeilzadeh, S., et al. (2021). Physics‐informed machine learning: Case studies for weather and climate modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical & Engineering Sciences, 379(2194), 20200093. https://doi.org/10.1098/rsta.2020.0093
Ke, G. L., Meng, Q., Finley, T., Wang, T. F., Chen, W., Ma, W. D., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in neural information processing systems. Long Beach, CA: 31st annual conference on neural information processing systems (NIPS).
Koppa, A., Rains, D., Hulsman, P., Poyatos, R., & Miralles, D. G. (2022). A deep learning‐based hybrid model of global terrestrial evaporation. Nature Communications, 13(1), 1912. https://doi.org/10.1038/s41467‐022‐29543‐7
Lange, H., & Sippel, S. (2020). Machine learning applications in hydrology. In D. F. Levia, D. E. Carlyle‐Moses, S. Iida, B. Michalzik, K. Nanko, & A. Tischer (Eds.), Forest‐water interaction (pp. 233–257). Springer International Publishing. https://doi.org/10.1007/978‐3‐030‐26086‐6_10
Li, L. B., Qiao, J. D., Yu, G., Wang, L. Z., Li, H. Y., Liao, C., & Zhu, Z. D. (2022). Interpretable tree‐based ensemble model for predicting beach water quality. Water Research, 211, 118078. https://doi.org/10.1016/j.watres.2022.118078
Ma, K., Feng, D. P., Lawson, K., Tsai, W. P., Liang, C. A., Huang, X. R., et al. (2021). Transferring hydrologic data across continents ‐ Leveraging data‐rich regions to improve hydrologic prediction in data‐sparse regions. Water Resources Research, 57(5), e2020WR028600. https://doi.org/10.1029/2020wr028600
Ma, T., Brus, D. J., Zhu, A. X., Zhang, L., & Scholten, T. (2020). Comparison of conditioned Latin hypercube and feature space coverage sampling for predicting soil classes using simulation from soil maps. Geoderma, 370, 114366. https://doi.org/10.1016/j.geoderma.2020.114366
Minasny, B., & McBratney, A. B. (2006). A conditioned Latin hypercube method for sampling in the presence of ancillary information. Computers & Geosciences, 32(9), 1378–1388. https://doi.org/10.1016/j.cageo.2005.12.009
Mohtaram, A., Shafizadeh‐Moghadam, H., & Ketabchi, H. (2025). A flexible multi‐scale approach for downscaling GRACE‐derived groundwater storage anomaly using LightGBM and random forest in the Tashk‐Bakhtegan Basin, Iran. Journal of Hydrology‐Regional Studies, 57, 102086. https://doi.org/10.1016/j.ejrh.2024.102086
Mosavi, A., Ozturk, P., & Chau, K. W. (2018). Flood prediction using machine learning models: Literature review. Water, 10(11), 1536. https://doi.org/10.3390/w10111536
Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., et al. (2015). Development of a large‐sample watershed‐scale hydrometeorological data set for the contiguous USA: Data set characteristics and assessment of regional variability in hydrologic model performance. Hydrology and Earth System Sciences, 19(1), 209–223. https://doi.org/10.5194/hess‐19‐209‐2015
Pachepsky, Y., & Park, Y. (2015). Saturated hydraulic conductivity of US soils grouped according to textural class and bulk density. Soil Science Society of America Journal, 79(4), 1094–1100. https://doi.org/10.2136/sssaj2015.02.0067
Peredo, D., Ramos, M. H., Andréassian, V., & Oudin, L. (2022). Investigating hydrological model versatility to simulate extreme flood events. Hydrological Sciences Journal, 67(4), 628–645. https://doi.org/10.1080/02626667.2022.2030864
Pham, K., & Won, J. (2022). Enhancing the tree‐boosting‐based pedotransfer function for saturated hydraulic conductivity using data preprocessing and predictor importance using game theory. Geoderma, 420, 115864. https://doi.org/10.1016/j.geoderma.2022.115864
Podgorski, J., & Berg, M. (2020). Global threat of arsenic in groundwater. Science, 368(6493), 845–850. https://doi.org/10.1126/science.aba1510
Podgorski, J., & Berg, M. (2022). Global analysis and prediction of fluoride in groundwater. Nature Communications, 13(1), 4232. https://doi.org/10.1038/s41467‐022‐31940‐x
Prodhan, F. A., Zhang, J. H., Hasan, S. S., Pangali Sharma, T. P., & Mohana, H. P. (2022). A review of machine learning methods for drought hazard monitoring and forecasting: Current research trends, challenges, and future research directions. Environmental Modelling & Software, 149, 105327. https://doi.org/10.1016/j.envsoft.2022.105327
Raghavendra, N. S., & Deka, P. C. (2014). Support vector machine applications in the field of hydrology: A review. Applied Soft Computing, 19, 372–386. https://doi.org/10.1016/j.asoc.2014.02.002
Richardson, C. M., Davis, K. L., Ruiz‐González, C., Guimond, J. A., Michael, H. A., Paldor, A., et al. (2024). The impacts of climate change on coastal groundwater. Nature Reviews Earth & Environment, 5(2), 100–119. https://doi.org/10.1038/s43017‐023‐00500‐2
Rubol, S., Freixa, A., Carles‐Brangarí, A., Fernàndez‐Garcia, D., Romaní, A. M., & Sanchez‐Vila, X. (2014). Connecting bacterial colonization to physical and biochemical changes in a sand box infiltration experiment. Journal of Hydrology, 517, 317–327. https://doi.org/10.1016/j.jhydrol.2014.05.041
Schaap, M. G., & Leij, F. J. (1998). Database‐related accuracy and uncertainty of pedotransfer functions. Soil Science, 163(10), 765–779. https://doi.org/10.1097/00010694‐199810000‐00001
Schaap, M. G., Leij, F. J., & van Genuchten, M. T. (2001). ROSETTA: A computer program for estimating soil hydraulic parameters with hierarchical pedotransfer functions. Journal of Hydrology, 251(3–4), 163–176. https://doi.org/10.1016/s0022‐1694(01)00466‐8
Scott, M., & Su‐In, L. (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st international conference on neural information processing systems (Vol. 30, pp. 4768–4777). NIPS. https://doi.org/10.5555/3295222.3295230
Singh, V. P., & Woolhiser, D. A. (2002). Mathematical modeling of watershed hydrology. Journal of Hydrologic Engineering, 7(4), 270–292. https://doi.org/10.1061/(ASCE)1084‐0699(2002)7:4(270)
Singha, S., Pasupuleti, S., Singha, S. S., Singh, R., & Kumar, S. (2021). Prediction of groundwater quality using efficient machine learning technique. Chemosphere, 276, 130265. https://doi.org/10.1016/j.chemosphere.2021.130265
Sit, M., Demiray, B. Z., Xiang, Z., Ewing, G. J., Sermet, Y., & Demir, I. (2020). A comprehensive review of deep learning applications in hydrology and water resources. Water Science and Technology, 82(12), 2635–2670. https://doi.org/10.2166/wst.2020.369
Sjöqvist, H., Längkvist, M., & Javed, F. (2020). An analysis of fast learning methods for classifying forest cover types. Applied Artificial Intelligence, 34(10), 691–709. https://doi.org/10.1080/08839514.2020.1771523
Slater, L., Blougouras, G., Deng, L. K., Deng, Q. M., Ford, E., van Dijke, A. H., et al. (2025). Challenges and opportunities of ML and explainable AI in large‐sample hydrology. Philosophical Transactions of the Royal Society A: Mathematical, Physical & Engineering Sciences, 383(2302), 20240287. https://doi.org/10.1098/rsta.2024.0287
Tao, H., Hameed, M. M., Marhoon, H. A., Zounemat‐Kermani, M., Heddam, S., Kim, S., et al. (2022). Groundwater level prediction using machine learning models: A comprehensive review. Neurocomputing, 489, 271–308. https://doi.org/10.1016/j.neucom.2022.03.014
Tartakovsky, A. M., Marrero, C. O., Perdikaris, P., Tartakovsky, G. D., & Barajas‐Solano, D. (2020). Physics‐informed deep neural networks for learning parameters and constitutive relationships in subsurface flow problems. Water Resources Research, 56(5), e2019WR026731. https://doi.org/10.1029/2019wr026731
Tavakol Sadrabadi, M., & Innocente, M. S. (2023). Vegetation cover type classification using cartographic data for prediction of wildfire behaviour. Fire, 6(2), 76. https://doi.org/10.3390/fire6020076
Tran, V. N., Ivanov, V. Y., & Kim, J. (2023). Data reformation ‐ A novel data processing technique enhancing machine learning applicability for predicting streamflow extremes. Advances in Water Resources, 182, 104569. https://doi.org/10.1016/j.advwatres.2023.104569
van Kempen, G., van der Wiel, K., & Melsen, L. A. (2021). The impact of hydrological model structure on the simulation of extreme runoff events. Natural Hazards and Earth System Sciences, 21(3), 961–976. https://doi.org/10.5194/nhess‐21‐961‐2021
Vanschoren, J., Van Rijn, J. N., Bischl, B., & Torgo, L. (2014). OpenML: Networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2), 49–60. https://doi.org/10.1145/2641190.2641198
Vereecken, H., Amelung, W., Bauke, S. L., Bogena, H., Brüggemann, N., Montzka, C., et al. (2022). Soil hydrology in the Earth system. Nature Reviews Earth & Environment, 3(9), 573–587. https://doi.org/10.1038/s43017‐022‐00324‐6
Wadoux, A. M. J. C., Brus, D. J., & Heuvelink, G. B. M. (2019). Sampling design optimization for soil mapping with random forest. Geoderma, 355, 113913. https://doi.org/10.1016/j.geoderma.2019.113913
Wang, H., Meng, Y., Xu, H., Wang, H., Guan, X., Liu, Y., et al. (2024). Prediction of flood risk levels of urban flooded points though using machine learning with unbalanced data. Journal of Hydrology, 630, 130742. https://doi.org/10.1016/j.jhydrol.2024.130742
Xu, T., & Liang, F. (2021). Machine learning for hydrologic sciences: An introductory overview. WIREs Water, 8(5), e1533. https://doi.org/10.1002/wat2.1533
Yin, X. (2025). HydroML‐DSI (Version 1.0.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.15715605
Zhang, J., Ma, X., Zhang, J., Sun, D., Zhou, X., Mi, C., & Wen, H. (2023). Insights into geospatial heterogeneity of landslide susceptibility based on the SHAP‐XGBoost model. Journal of Environmental Management, 332, 117357. https://doi.org/10.1016/j.jenvman.2023.117357
Zhang, M., Liu, N., Harper, R., Li, Q., Liu, K., Wei, X., et al. (2017). A global review on hydrological responses to forest change across multiple spatial scales: Importance of scale, climate, forest type and hydrological regime. Journal of Hydrology, 546, 44–59. https://doi.org/10.1016/j.jhydrol.2016.12.040
Zhu, J., & Pierskalla, W. P. (2016). Applying a weighted random forests method to extract karst sinkholes from LiDAR data. Journal of Hydrology, 533, 343–352. https://doi.org/10.1016/j.jhydrol.2015.12.012
Žížala, D., Princ, T., Skála, J., Juřicová, A., Lukas, V., Bohovic, R., et al. (2024). Soil sampling design matters ‐ Enhancing the efficiency of digital soil mapping at the field scale. Geoderma Regional, 39, e00874. https://doi.org/10.1016/j.geodrs.2024.e00874
© 2025. This work is published under the Creative Commons Attribution‐NonCommercial 4.0 License (http://creativecommons.org/licenses/by-nc/4.0/).