Abstract

Data imbalance poses a severe challenge in hydrological machine learning (ML) applications by limiting model performance and interpretability, yet solutions remain limited. This study evaluates three aspects of advanced sampling methods, particularly feature space coverage sampling (FSCS): their impact on model performance in predicting forest cover types and saturated hydraulic conductivity (Ks), the mechanism underlying the efficacy of FSCS, and its impact on model interpretability. Using ML algorithms such as random forest (RF) and LightGBM (LGB) across various training set sizes, we demonstrate that FSCS significantly mitigates data imbalance, enhancing model accuracy, feature importance estimation, and interpretability. Two widely used hydrological data sets were analyzed: a large multiclass forest cover type data set from Roosevelt National Forest (110,393 samples) and a continuous‐value data set of soil properties from the USKSAT database (18,729 samples). In total, 1,720 models were constructed and optimized, combining different sampling methods, training set sizes, and algorithms. Balanced sampling, conditioned Latin hypercube sampling, and FSCS consistently outperformed simple random sampling. Despite using smaller training sets and simpler RF models, FSCS‐trained models matched or surpassed the performance of models trained on larger data sets or built with the more complex LGB algorithm. SHAP analysis revealed that FSCS clarified feature–target relationships, emphasizing feature interactions and improving model interpretability. These findings highlight the potential of advanced sampling methods not only to address data imbalance but also to provide more accurate prior information for model training, thereby enhancing reliability, accuracy, and interpretability in ML for hydrological applications.
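
The abstract does not say how FSCS is implemented. One common formulation in the sampling literature clusters the standardized features with k-means and keeps the real observation nearest each cluster center, so that the training set spreads across feature space instead of following the data density. The sketch below illustrates that idea under this assumption only; the function names (`standardize`, `fscs_select`), the parameters, and the toy data are hypothetical and are not taken from the study.

```python
import math
import random


def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fscs_select(rows, n_samples, n_iter=50, seed=0):
    """Return indices of n_samples rows spread across feature space.

    Assumed formulation: k-means on standardized features, then the
    observation closest to each centroid is selected.
    """
    rng = random.Random(seed)
    X = standardize(rows)
    # Start from randomly chosen observations as initial centroids.
    centroids = [X[i][:] for i in rng.sample(range(len(X)), n_samples)]
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for x in X:
            nearest = min(range(n_samples), key=lambda k: sq_dist(x, centroids[k]))
            clusters[nearest].append(x)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Map each centroid back to a distinct real observation.
    chosen = []
    for c in centroids:
        candidates = [i for i in range(len(X)) if i not in chosen]
        chosen.append(min(candidates, key=lambda i: sq_dist(X[i], c)))
    return chosen


if __name__ == "__main__":
    # Synthetic, imbalanced toy data: a dense region and a sparse region.
    rng = random.Random(1)
    dense = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(95)]
    sparse = [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(5)]
    print(fscs_select(dense + sparse, n_samples=6))
```

Simple random sampling would draw most of its points from the dense region of such data, whereas a coverage-based selection tends to place some training points in sparsely populated parts of feature space.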

Details

Title
Addressing Data Imbalance in Hydrological Machine Learning: Impact of Advanced Sampling Methods on Performance and Interpretability
Author
Yin, Xiaoran 1; Shu, Longcang 1; Wang, Zhe 1; Zhou, Long 2; Niu, Shuyao 3; Ren, Huazhun 4; Liu, Bo 1; Lu, Chengpeng 1

1 The National Key Laboratory of Water Disaster Prevention, Hohai University, Nanjing, China; College of Hydrology and Water Resources, Hohai University, Nanjing, China
2 The National Key Laboratory of Water Disaster Prevention, Hohai University, Nanjing, China; College of Hydrology and Water Resources, Hohai University, Nanjing, China; College of Hydraulic and Civil Engineering, Xinjiang Agricultural University, Urumqi, China
3 The National Key Laboratory of Water Disaster Prevention, Hohai University, Nanjing, China; College of Hydrology and Water Resources, Hohai University, Nanjing, China; National Marine Data and Information Service, Ministry of Natural Resources, Tianjin, China
4 The National Key Laboratory of Water Disaster Prevention, Hohai University, Nanjing, China; College of Hydrology and Water Resources, Hohai University, Nanjing, China; Bureau of Rivers and Lakes Protection, Construction, Operation and Safety of Chang Jiang Water Resources Commission, Wuhan, China
Publication title
Volume
61
Issue
10
Number of pages
38
Publication year
2025
Publication date
Oct 1, 2025
Section
Research Article
Publisher
John Wiley & Sons, Inc.
Place of publication
Washington
Country of publication
United States
ISSN
00431397
e-ISSN
19447973
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-10-09
Milestone dates
2025-01-16 (manuscript received); 2025-09-06 (manuscript revised); 2025-10-02 (manuscript accepted); 2025-10-09 (published online, final form)
First posting date
09 Oct 2025
ProQuest document ID
3266095095
Document URL
https://www.proquest.com/scholarly-journals/addressing-data-imbalance-hydrological-machine/docview/3266095095/se-2?accountid=208611
Copyright
© 2025. This work is published under http://creativecommons.org/licenses/by-nc/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-03
Database
ProQuest One Academic