This study compares three popular machine learning (ML) algorithms, namely random forest (RF), boosting regression tree (BRT), and multinomial logistic regression (MnLR), for the spatial prediction of groundwater quality classes and the mapping of salinity hazard. Three hundred eighty-six groundwater samples were collected from an agriculturally intensive area in Fars Province, Iran, and nine hydro-chemical parameters were defined and interpreted. The variance inflation factor and Pearson’s correlation were used to check collinearity between variables. Thereafter, the performance of the ML models was evaluated with statistical indices, namely overall accuracy (OA) and the Kappa index, obtained from the confusion matrix. The results showed that the RF model was slightly more accurate than the other models. Moreover, the analysis of relative importance indicated that sodium adsorption ratio (SAR) and pH were the most and least influential parameters, respectively, in explaining groundwater quality classes. Applying ML algorithms together with the hydro-chemical parameters that affect groundwater quality can produce highly accurate spatial distribution maps for managing irrigation practice.
Introduction
Due to the shortage of surface water resources in arid and semi-arid regions, groundwater nowadays accounts for one of the most valuable irrigation water resources in these areas (Hotzel, 2012). In addition, accurate identification and utilization of these water resources, particularly in arid and semi-arid regions, can have a major effect on the sustainable development of many agricultural, social, and economic activities in those regions. With the growing population and the increasing demand for water in different agricultural, drinking, and industrial sectors, the increasing over-exploitation of groundwater resources leads to the declined quality of these valuable resources (Alipour et al., 2017; Masoudi et al., 2015). This process can have irreversible impacts on the agricultural sector, in particular on public health (Bui et al., 2020). In recent decades, therefore, groundwater resource management has become an important issue worldwide (Raman et al., 2020). The studied basin of Tashk-Bakhtegan and Maharloo in Fars Province also has such a situation concerning the scarcity of surface water resources and over-exploitation of groundwater resources (Ghafarijoo & Zarei, 2016; Hedayat et al., 2016). Proper management and exploitation of surface and groundwater resources require qualitative assessment and monitoring in order to use optimally and thereby impose minimal damage to the environment and reservoirs of these waters.
Since conventional methods for determining groundwater quality are time-consuming, costly, and complex (Jha et al., 2009, 2010), other methods of groundwater quality assessment have been presented in recent years.
It is essential to select a method that can accurately predict water quality with a minimum number of hydro-chemical quality parameters (Jeihouni et al., 2020). Data mining is one of such methods, which is advantageous over other previous approaches as it is stronger and more accurate, does not need usual prerequisites, and can be used for the classification of a target variable. Data mining is defined as the analysis of a huge dataset to discover the hidden and meaningful patterns in data (Sattari et al., 2017) and is divided into supervised and unsupervised types. The former possesses a specific, predetermined target variable seeking a specific pattern, and the latter aims to find patterns or similarities between groups of data with no specific target variable or a set of predetermined groups and patterns (Witten et al., 2011).
The use of diagrams is one of the most widespread and easiest methods of water quality classification (Sattari et al., 2017). The United States Salinity Laboratory (USSL) diagram, introduced by the US Salinity Laboratory Staff in 1954, is one of the most widely used diagrams for the classification of irrigation water quality, as it provides an interpretation of the combined risk of alkalinity and salinity. This diagram has so far been used in many studies for water quality classification in irrigation applications (Goswamee et al., 2015; Jeon et al., 2020; Karakuş & Yıldız, 2019; Kumarasamy et al., 2013; Laze et al., 2016; Mirabbasi et al., 2008; Singh et al., 2020; US Salinity Laboratory Staff, 1954).
After predicting hydro-chemical parameters using ML algorithms, water quality should be classified based on different needs and purposes. Numerous studies have modeled groundwater quality using ML algorithms such as RF (Avila et al., 2018; Chen et al., 2020; Liaw & Wiener, 2002; Mousavi et al., 2019; Naghibi et al., 2016), MnLR (Avila et al., 2018; Debella-Gilo & Etzelmüller, 2009; Kempen et al., 2009; Venkataraman & Uddameri, 2012; Yoo et al., 2016), and BRT (Kim et al., 2019; Kordestani et al., 2018).
Chen et al. (2020) recently conducted a study to predict surface water quality in the basins of ten national rivers in China using statistical and ML methods (decision tree, random forest, and deep cascade forest). They found that the three ML methods yielded better results than the statistical techniques.
Saghebian et al. (2014) used the decision tree to predict and classify groundwater quality based on the USSL diagram in Ardabil basin. They reported that this method was more accurate and efficient than the principal component analysis (PCA) approach for groundwater quality classification. Arabameri et al. (2019) tested four GIS-based models for groundwater potential mapping including RF, weight of evidence (WoE), binary logistic regression (BLR), and technique for order preference by similarity to ideal solution (TOPSIS) multi-criteria. Results showed that land use, land cover, soil type (ST), and slope were the key factors in groundwater occurrence in RF, BLR, and the analytic hierarchy process (AHP), respectively.
In addition, the combination of remote sensing (RS) data and geographic information systems (GIS) with new approaches can serve as a powerful tool for groundwater potential mapping in arid and semi-arid areas.
Due to the importance and shortage of surface water resources in the Tashk-Bakhtegan and Maharloo basin, the utilization and the management of groundwater resources are of paramount importance (Ghafarijoo & Zarei, 2016; Hedayat et al., 2016), which necessitates the need for water resource management and extensive research in this field more than ever (Wu et al., 2020). Modeling of groundwater quality in Tashk-Bakhtegan and Maharloo basin using popular ML methods can provide a suitable guide for planning and adoption of groundwater-compatible managerial strategies. Until now, the quality of groundwater has been evaluated using various machine learning methods based on more than a few quantitative parameters and with a small number of observation points with a long period (El Bilali & Taleb, 2020; El Bilali et al., 2021), and a few studies have been carried out on the quality of irrigation water in the form of salinity hazard classification and sodium (Saghebian et al., 2014). However, digital modeling and zoning of irrigation water quality class on a large scale with a high number of observation points have received less attention, and it can be known as the innovation of this research.
The main purposes of this research are (i) determination of the most important hydro-chemical parameters affecting groundwater quality and (ii) comparison of the performance of three ML algorithms (BRT, RF, and MnLR) in the spatial prediction of groundwater quality classes for agricultural uses.
Materials and methods
Study area
The study area includes the watershed of the Tashk-Bakhtegan and Maharloo basin and is located in the central and northern parts of Fars Province, between latitudes 29° 01′ 59″ and 31° 11′ 46″ N and longitudes 51° 42′ 12″ and 54° 37′ 12″ E, with an area of 3,145,840 hectares. The average annual rainfall is spatially uneven, varying from 200 mm in the south to 700 mm in the north, and most of it falls in winter and autumn (Choubin et al., 2016). This region, which is located in the Zagros Mountains range, has altitudes between 1987 and 3922 m above sea level and is part of the main watershed of the Central Plateau of Iran. This region includes the 9 cities of Shiraz, Neyriz, Estahban, Arsanjan, Marvdasht, Pasargad, Sepidan, Eghlid, and Sarvestan. The water resources of the region are seasonal and temporary rivers, and groundwater, which is mainly discharged by springs (Zomorodian et al., 2013). Over-exploitation of groundwater and decreasing rainfall, and the resulting decline of the groundwater level, have reduced the volume of the reservoir and the quality of groundwater in the study area (Hojati & Boustani, 2010). The geographical location of the study area and the observation wells are shown in Fig. 1.
[See PDF for image]
Fig. 1
Location of the Tashk-Bakhtegan and Maharloo basin and monitoring wells
Workflow
This research was performed in six stages: (1) collecting and interpreting the data, (2) using the variance inflation factor (VIF) and Pearson’s correlation to check collinearity between independent variables, (3) splitting the dataset into two subsets of 70% (n = 270) and 30% (n = 116) for calibration and validation of the ML models, (4) comparing the ML algorithms with statistical indices, namely, overall accuracy and the Kappa index, (5) determining the relative importance of the groundwater quality predictors, and (6) predicting groundwater quality maps.
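The 70/30 partition in stage (3) can be sketched as follows. This is a minimal illustration only: the text does not state whether the split was random or stratified by class, so a plain seeded shuffle is assumed here, and the function name and seed are hypothetical.

```python
import random


def split_calibration_validation(records, calibration_fraction=0.7, seed=42):
    """Shuffle the records and cut them into calibration and validation subsets."""
    shuffled = list(records)
    # A fixed seed (hypothetical value) keeps the partition reproducible.
    random.Random(seed).shuffle(shuffled)
    n_calibration = int(calibration_fraction * len(shuffled))
    return shuffled[:n_calibration], shuffled[n_calibration:]


# 386 wells, identified here only by a running index.
well_ids = list(range(386))
calibration, validation = split_calibration_validation(well_ids)
print(len(calibration), len(validation))  # 270 116
```

With 386 wells, truncating 0.7 × 386 = 270.2 gives the 270 calibration and 116 validation samples reported above.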
In the first stage, after collecting and identifying the primary parameters, the optimal parameters for predicting groundwater quality classes in the study area were selected from the 9 available parameters using the variance inflation factor.
In this study, the performance of three ML methods including random forest (RF), boosting regression tree (BRT), and multinomial logistic regression (MnLR) was compared to predict the spatial situation of groundwater quality in spring and autumn. The flowchart of the proposed modeling is shown in Fig. 2.
[See PDF for image]
Fig. 2
Flowchart for the proposed methodology
Data collection
In this study, monthly data for groundwater quality were obtained from the Iran Water Resource Management Organization. Nine chemical parameters, namely, sodium (mEq/L), calcium (mEq/L), magnesium (mEq/L), chloride (mEq/L), bicarbonate (HCO3−, mEq/L), pH, electrical conductivity (EC, µmhos/cm), SAR, and total dissolved solids (TDS, mg/L), were measured in 386 piezometric wells. Because data were not available for every month, two periods, autumn 2017 and spring 2018, were selected as the seasons with the greatest water abstraction and the greatest recharge. The nine hydro-chemical characteristics were considered the model inputs, and the groundwater quality classes were the model outputs and the target characteristic. The groundwater quality classes for agricultural use were determined based on the USSL diagram, as shown in Table 1.
Table 1. Groundwater quality classes for agriculture use based on the USSL diagram
Class | Application and type |
|---|---|
C1S1 | Suitable for irrigation water |
C2S1, C2S2, C1S2 | Low salinity—it can be used for irrigation |
C1S3, C2S3, C3S1, C3S2, C3S3 | High salinity—for agriculture with proper treatment |
C1S4, C2S4, C3S4, C4S4, C4S3, C4S2, C4S1 | Very high salinity—is not suitable for irrigation |
Reference: US Salinity Laboratory, 1954
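As an illustration of how a well could be assigned to one of the classes in Table 1, the sketch below computes SAR from ion concentrations and derives a class label from EC and SAR. The thresholds are assumptions, not taken from this study: they are the commonly cited fixed limits (EC of 250, 750, and 2250 µmhos/cm; SAR of 10, 18, and 26), whereas the sodium boundaries in the original USSL diagram slope with EC. The function names are hypothetical; the suitability groups follow Table 1.

```python
from math import sqrt


def sodium_adsorption_ratio(na, ca, mg):
    """SAR from sodium, calcium, and magnesium concentrations in mEq/L."""
    return na / sqrt((ca + mg) / 2.0)


def ussl_class(ec, sar):
    """Class label such as 'C3S1' from EC (µmhos/cm) and SAR.

    Assumption: fixed thresholds are used instead of the sloped
    sodium boundaries of the original diagram.
    """
    salinity = 1 + sum(ec >= limit for limit in (250, 750, 2250))
    sodium = 1 + sum(sar >= limit for limit in (10, 18, 26))
    return f"C{salinity}S{sodium}"


# Suitability groups as listed in Table 1.
GROUPS = {
    "suitable for irrigation": ["C1S1"],
    "low salinity, usable for irrigation": ["C2S1", "C2S2", "C1S2"],
    "high salinity, usable with proper treatment": [
        "C1S3", "C2S3", "C3S1", "C3S2", "C3S3"],
    "very high salinity, not suitable for irrigation": [
        "C1S4", "C2S4", "C3S4", "C4S4", "C4S3", "C4S2", "C4S1"],
}
SUITABILITY = {cls: group for group, classes in GROUPS.items() for cls in classes}

# Hypothetical well: Na = 10, Ca = 4, Mg = 4 mEq/L, EC = 1000 µmhos/cm.
sar = sodium_adsorption_ratio(10.0, 4.0, 4.0)
label = ussl_class(1000.0, sar)
print(sar, label, SUITABILITY[label])
```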
Collinearity test
In this study, the qualitative variables predicting the groundwater quality class were selected according to the variance inflation factor (VIF) and Pearson’s correlation. The variables were transferred to the R Studio 1.1456 software as a CSV file, and the test was implemented using the VIF package and the “vifcor” data mining method. This method removes the variables with maximum correlation from the variable set, using a stepwise approach (Dormann et al., 2013; O’brien, 2007).
The purpose of the VIF test is to detect collinearity among the independent variables used in the modeling process. Non-collinearity of the independent variables is one of the conditions and assumptions of regression. A VIF value over 10 indicates critical collinearity (Dodge, 2008; Everitt & Skrondal, 2010), and a value near 1 indicates the desired state with acceptably low collinearity. However, researchers may use different criteria depending on the purpose of their study; Rogerson (2001) and Pan and Jackson (2008) suggested maximum VIF values of 5 and 4, respectively.
VIF can be computed as follows:
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}} \tag{1}$$
where $R_j^{2}$ is the coefficient of determination of the multiple linear regression model that predicts the jth covariate from the remaining covariates. Pearson’s correlation was also calculated, and variables with correlation values above 0.8 were removed to limit model complexity and avoid a loss of ML accuracy, so only variables with a Pearson’s correlation below 0.8 were retained. After selecting the variables, the maps of the effective parameters were plotted using the ordinary kriging (OK) interpolation method in ArcMap 10.6.1.
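The VIF of each covariate, defined as 1 / (1 − R²) for the regression of that covariate on the others, can be sketched without any external library. The example relies on the standard identity that the VIF of covariate j equals the jth diagonal element of the inverse of the Pearson correlation matrix. It is an illustrative stand-in, not the R routine used in the study, and the sample values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per covariate: diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Hypothetical measurements for two covariates (correlation 0.6).
sample = {"x1": [1.0, 2.0, 3.0, 4.0], "x2": [2.0, 1.0, 4.0, 3.0]}
print(vif(sample))
```

For two covariates with correlation 0.6, both VIF values equal 1 / (1 − 0.36) = 1.5625, well below the thresholds of 4, 5, or 10 cited above.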
Modeling algorithms
Random forest
The random forest (RF), presented by Breiman (2001), is one of the algorithms used in this study. It is a development of the classification and regression tree (CART) model. The CART model iteratively splits the data to find a relationship between the response variable and the independent variables and to carry out the prediction. Unlike methods that grow a limited number of trees, the RF model generates hundreds or thousands of classification trees (Breiman & Cutler, 2004). It is an ensemble learning method that classifies by generating a large number of trees (Breiman, 2001). In ensemble learning methods, a group of weak learners comes together to form a strong learner. All modeling steps were carried out with the “randomForest” package, and the codes were written in the R Studio 3.5.1 software.
Boosted regression tree
In recent years, decision tree models have been upgraded, along with the development of statistical methods, into random developed models (Friedman et al., 2000). The boosted regression tree (BRT) is a method used for prediction that employs new statistical algorithms and aims to improve the efficiency of a model by fitting and combining a large number of trees. The random developed model increases prediction efficiency and accuracy by reducing the over-training and over-fitting that occur in simple tree models. The BRT function fitting can be linear, curvilinear, or nonlinear and can follow the normal, binomial, or Poisson error distribution (De'Ath, 2007; Elith et al., 2008).
Multinomial logistic regression
Logistic regression, also called nominal regression, is a statistical method that classifies records according to the input field values. It is based on linear regression, but it takes a qualitative variable (such as a nominal variable) as the response instead of a quantitative one. This method can work for both binomial and multinomial models (the latter for targets with more than two categories) (Pakgohar, 2016).
Therefore, in a regression problem where the response (dependent) variable is qualitative or categorical with more than two categories, the appropriate regression method is multinomial logistic regression. This kind of regression is one of the tools used for classification problems.
The logistic regression models are usually very precise. They can work for the symbolic or quantitative variables and compute the predicted probabilities for all groups of the objective variables.
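How a multinomial logistic model turns input values into predicted probabilities for all classes can be sketched with the softmax function. The coefficients, intercepts, and feature values below are hypothetical placeholders chosen for illustration; they are not fitted values from this study.

```python
from math import exp


def softmax(scores):
    """Convert linear scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict_class_probabilities(features, coefficients, intercepts):
    """One linear score per class, then softmax over the classes."""
    scores = [
        intercept + sum(beta * x for beta, x in zip(betas, features))
        for betas, intercept in zip(coefficients, intercepts)
    ]
    return softmax(scores)


# Hypothetical model with two predictors and three classes.
coefficients = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.1]]
intercepts = [0.0, 0.0, 0.0]
probabilities = predict_class_probabilities([1.0, 2.0], coefficients, intercepts)
predicted = probabilities.index(max(probabilities))
print(probabilities, predicted)
```

The predicted class is simply the one with the highest probability, which is how the model can be used for classification as well as for probability mapping.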
Validation
In this study, the test sample or the validation method was used to evaluate the accuracy. Seventy percent of the data were used for calibration, and the remaining 30 percent were used for validation of the model. Overall accuracy and the Kappa index were used to evaluate the classification accuracy as follows (Byrt et al., 1993).
Overall accuracy (OA)
In Eq. (2), $OA$, $N$, and $\sum_{i} x_{ii}$ are the overall accuracy, the total number of classified pixels, and the sum of the pixels on the main diagonal of the error matrix (the correctly classified pixels), respectively. The overall accuracy summarizes the classification as a whole and provides no information about the individual classes.
$$OA = \frac{1}{N}\sum_{i} x_{ii} \tag{2}$$
Kappa index
The Kappa index compares the automated classifier with a random classification (Table 2). A Kappa of zero indicates a completely “random” classification, a negative value indicates agreement worse than chance, and a value of one indicates a completely “correct” classification. The equation is as follows:
$$\kappa = \frac{N\sum_{i} x_{ii} - \sum_{i} x_{i+}\,x_{+i}}{N^{2} - \sum_{i} x_{i+}\,x_{+i}} \tag{3}$$
where $N$ is the total number of classified pixels, $x_{ii}$ is the number of pixels on the main diagonal of the error matrix, and $x_{i+}$ and $x_{+i}$ are the totals of row $i$ and column $i$, respectively.
Table 2. Classification of Kappa index
Kappa index | Description |
|---|---|
< 0.01 | Less than chance agreement |
0.01–0.2 | Slight agreement |
0.2–0.4 | Fair agreement |
0.4–0.6 | Moderate agreement |
0.6–0.8 | Substantial agreement |
0.8–0.9 | Almost perfect agreement |
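Overall accuracy and the Kappa index can both be derived from a confusion (error) matrix, as in the sketch below. The 2 × 2 matrix is a hypothetical example, not data from this study; Kappa is computed in the equivalent form (observed − expected agreement) / (1 − expected agreement).

```python
def overall_accuracy(matrix):
    """Share of samples on the main diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def kappa_index(matrix):
    """Cohen's Kappa: agreement corrected for chance agreement."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    observed = overall_accuracy(matrix)
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix: rows are predicted, columns are observed classes.
example = [[50, 10],
           [5, 35]]
print(overall_accuracy(example))       # 0.85
print(round(kappa_index(example), 3))  # 0.694
```

Kappa is lower than OA here because part of the observed agreement (0.51 in this example) would be expected by chance alone.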
Producer accuracy (PA) and user accuracy (UA)
In Eq. (4), $x_{ii}$ is the number of correctly classified pixels of class $i$ on the main diagonal, and $x_{+i}$ is the total number of pixels in the column of that class. In Eq. (5), $x_{ii}$ is again the number of correctly classified pixels on the main diagonal, and $x_{i+}$ is the total number of pixels in the row of that class. Producer accuracy and user accuracy range from 0 to 1, with higher values indicating better model performance.
$$PA_i = \frac{x_{ii}}{x_{+i}} \tag{4}$$
$$UA_i = \frac{x_{ii}}{x_{i+}} \tag{5}$$
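Producer and user accuracy per class follow directly from the column and row totals of the confusion matrix. The sketch below uses a hypothetical three-class matrix with generic labels, and assumes rows hold predicted classes and columns hold observed classes. A class that never occurs yields NaN, because its row or column total is zero.

```python
from math import isnan


def producer_user_accuracy(matrix, labels):
    """Return {label: (PA, UA)}; NaN where a row or column total is zero."""
    n = len(matrix)
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])                       # pixels predicted as class i
        col_total = sum(matrix[r][i] for r in range(n))  # pixels observed as class i
        correct = matrix[i][i]
        pa = correct / col_total if col_total else float("nan")
        ua = correct / row_total if row_total else float("nan")
        result[label] = (pa, ua)
    return result


# Hypothetical matrix: class "C" is absent from both predictions and observations.
example = [[40, 4, 0],
           [2, 30, 0],
           [0, 0, 0]]
accuracies = producer_user_accuracy(example, ["A", "B", "C"])
for label, (pa, ua) in accuracies.items():
    print(label, pa, ua)
```

A class with UA below PA is predicted more often than it is observed (overestimation), while PA below UA points to underestimation.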
Results and discussion
Table 3 lists the most appropriate environmental variables selected for modeling based on the variance inflation factor (VIF). To predict groundwater quality classes, five of the nine covariates, with a VIF < 5 and minimum cross-correlation, were finally selected as the optimal parameters. Figure 3 shows the correlation matrix based on Pearson’s correlation. According to Table 3 and Fig. 3, the results of the VIF and Pearson’s correlation confirm each other, so five hydro-chemical parameters (Mg2+, Ca2+, HCO3−, pH, and SAR) were chosen as effective variables.
Table 3. The covariates selected by the VIF method for modeling groundwater quality (autumn 2017 and spring 2018)
Variables | VIF value (autumn 2017) | VIF value (spring 2018) |
|---|---|---|
Mg2+ | 2.65 | 2.91 |
Ca2+ | 2.30 | 2.25 |
HCO3− | 1.16 | 1.07 |
pH | 1.13 | 1.42 |
SAR | 1.65 | 2.24 |
[See PDF for image]
Fig. 3
Correlation matrix map for all groundwater quality variables based on Pearson’s correlation (autumn 2017 and spring 2018)
Since an important aim of this research was to prepare a highly accurate and quality map, the effective variables were first analyzed statistically in both periods of autumn 2017 and spring 2018. Tables 4 and 5 summarize the results of statistical analysis for the five effective variables (Mg2+, Ca2+, HCO3−, pH, and SAR), which were used as inputs of RF, BRT, and MnLR models.
Table 4. Summary statistics assessment of the selected wells quality variables (autumn 2017)
Variables | Unit | Mina | Maxb | Mean | SDc | CVd (%) | Skewness | Kurtosis | Transformation |
|---|---|---|---|---|---|---|---|---|---|
Mg2+ | mEq/L | 0.20 | 199 | 12.5 | 19.5 | 156 | 4.93 | 37 | Lognormal |
HCO3− | mEq/L | 1.00 | 19.0 | 4.08 | 1.55 | 38 | 2.84 | 20 | Box-Cox |
Ca2+ | mEq/L | 1.00 | 75.0 | 11.14 | 14 | 125 | 2.19 | 5.18 | Square-Root |
pH | - | 7.02 | 10.0 | 8.08 | 0.418 | 5.17 | 0.937 | 3.17 | Lognormal |
SAR | - | 0.23 | 55.0 | 5.05 | 6.43 | 127 | 2.72 | 12.2 | Lognormal |
a Minimum value
b Maximum value
c SD: standard deviation
d CV: coefficient of variation
Table 5. Summary statistics assessment of the selected well quality variables (spring 2018)
Variables | Unit | Mina | Maxb | Mean | SDc | CVd (%) | Skewness | Kurtosis | Transformation |
|---|---|---|---|---|---|---|---|---|---|
Mg2+ | mEq/L | 0.00 | 114 | 8.57 | 11.50 | 134 | 3.72 | 22.75 | Lognormal |
HCO3− | mEq/L | 0.00 | 18.6 | 4.37 | 1.71 | 39.0 | 2.60 | 15.18 | Lognormal |
Ca2+ | mEq/L | 0.80 | 57.0 | 8.80 | 9.51 | 108 | 1.80 | 3.13 | Square-root |
pH | - | 7.02 | 8.27 | 7.44 | 0.23 | 3.09 | 0.72 | 0.36 | - |
SAR | - | 0.03 | 86.7 | 6.94 | 9.53 | 137 | 4.26 | 28.50 | Square-root |
a Minimum value
b Maximum value
c SD: standard deviation
d CV: coefficient of variation
Moreover, a requisite for geostatistical analysis is the normality of the data. Therefore, the normality of the data was examined here based on the skewness statistic before fitting the empirical semivariogram and mapping the factors. Skewness values less than 1 indicated a normal distribution; otherwise, the data were transformed and normalized using the Lognormal, Box-Cox, and Square-Root methods (Oliver & Webster, 2014). To select the best interpolation method, spatial correlation was investigated by drawing a semivariogram. Tables 6 and 7 present the best-fitted semivariogram model for each groundwater quality variable and the related model parameters in both periods of autumn 2017 and spring 2018.
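The skewness screening described here can be sketched as follows. The moment-based skewness estimator and the use of a natural-log transform as the only remedy are simplifying assumptions for illustration; the study also applied Box-Cox and square-root transforms, and the sample values are hypothetical.

```python
from math import log


def skewness(values):
    """Moment-based sample skewness (third central moment over variance^1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def normalise_if_skewed(values, threshold=1.0):
    """Log-transform the data when |skewness| is not below the threshold.

    Assumption: all values are strictly positive, as required by the logarithm.
    """
    if abs(skewness(values)) < threshold:
        return list(values), "none"
    return [log(v) for v in values], "log"


# Hypothetical right-skewed concentrations with one extreme well.
raw = [1.0, 2.0, 3.0, 4.0, 100.0]
transformed, method = normalise_if_skewed(raw)
print(method, skewness(raw), skewness(transformed))
```

The transform reduces the skewness of this toy sample but does not necessarily bring it below 1, which is one reason several alternative transforms are usually compared.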
Table 6. Semi-variogram model parameters for groundwater quality variables (autumn 2017)
Variables | Model | Nugget | Sill | Range (m) | Nugget/sill (%) |
|---|---|---|---|---|---|
Mg2+ | Exponential | 0.01 | 0.41 | 23217 | 24.3 |
HCO3− | Exponential | 0.055 | 0.095 | 175070 | 52.6 |
Ca2+ | Exponential | 0.18 | 0.40 | 15598 | 45.0 |
pH | Exponential | 0.0008 | 0.0015 | 2004 | 53.3 |
SAR | Spherical | 0.00 | 31.6 | 4010 | 0.00 |
Table 7. Semi-variogram model parameters for groundwater quality variables (spring 2018)
Variables | Model | Nugget | Sill | Range (m) | Nugget/sill (%) |
|---|---|---|---|---|---|
Mg2+ | Spherical | 23.18 | 75.32 | 9481 | 30 |
HCO3− | Stable | 0.80 | 1.67 | 25444 | 53 |
Ca2+ | Exponential | 0.25 | 0.91 | 162240 | 27 |
pH | Exponential | 0.34 | 0.61 | 6282 | 55 |
SAR | Exponential | 0.00 | 0.78 | 7595 | 0.00 |
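The semivariogram parameters in Tables 6 and 7 can be read through the standard spherical and exponential model equations, sketched below. Two assumptions are made that the text does not confirm: the tabulated sill is treated as the total sill (nugget plus partial sill), and the exponential model uses the practical-range convention with a factor of 3. Conventions differ between software packages.

```python
from math import exp


def spherical(h, nugget, sill, range_m):
    """Spherical semivariance at lag h (same length unit as range_m)."""
    if h <= 0:
        return 0.0
    if h >= range_m:
        return sill
    ratio = h / range_m
    return nugget + (sill - nugget) * (1.5 * ratio - 0.5 * ratio ** 3)


def exponential(h, nugget, sill, range_m):
    """Exponential semivariance; reaches about 95% of the partial sill at range_m."""
    if h <= 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - exp(-3.0 * h / range_m))


def nugget_to_sill_percent(nugget, sill):
    """Nugget/sill ratio in percent, an indicator of spatial dependence."""
    return 100.0 * nugget / sill


# Parameters as listed in Table 6 for SAR (spherical) and Ca2+ (exponential).
print(spherical(2005.0, 0.0, 31.6, 4010.0))
print(exponential(15598.0, 0.18, 0.40, 15598.0))
print(nugget_to_sill_percent(0.18, 0.40))
```

A low nugget/sill ratio, as for SAR, indicates that most of the variance is spatially structured, which favors kriging interpolation.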
The final map of the spatial distribution of groundwater quality variables using the OK method is depicted in Figs. 4 and 5 for autumn 2017 and spring 2018, respectively.
[See PDF for image]
Fig. 4
The final map of the spatial distribution of groundwater quality variables in autumn 2017
[See PDF for image]
Fig. 5
The final map of the spatial distribution of groundwater quality variables in spring 2018
As shown in Fig. 4, the values of Mg2+, SAR, and Ca2+ were lower in the north than in the south of the region in autumn. HCO3− was lower in the northern and southern parts of the study area. In spring 2018 (Fig. 5), the uppermost and lowermost values of all the studied quality parameters (Mg2+, SAR, pH, and Ca2+) were recorded in the southern and northern parts of the study area, respectively. HCO3− was higher in the central and western parts than in the southern and northern zones, with maximum and minimum values of 18.5 and 1.50 mEq/L, showing an increase in comparison to autumn 2017.
Model performance assessment
The OA and Kappa statistics (Tables 8 and 9) were examined for the groundwater quality classes in autumn 2017 and spring 2018 to validate the spatial prediction maps. In both periods, the model performances were close to each other owing to the adequate number of observation wells.
Table 8. The verification values of groundwater quality classes employing data mining algorithms in autumn 2017
Data mining models | Kappa (%) | OA (%) |
|---|---|---|
RF | 92 | 95 |
BRT | 87 | 91 |
MnLR | 89 | 93 |
Table 9. The verification values of groundwater quality classes employing data mining algorithms in spring 2018
Data mining models | Kappa (%) | OA (%) |
|---|---|---|
RF | 88 | 91 |
BRT | 84 | 88 |
MnLR | 85 | 89 |
When the calibration subset of the learning models has an adequate sample size (a minimum of 300 samples), the verification results of the models are close to each other, and the models reach their optima in predicting the target variable (Somarathna et al., 2017). Accordingly, the accuracy of the models applied here can be expected to resemble that of Somarathna et al. (2017), because the number of our calibration observations (n = 270) differs only slightly from that of the mentioned study.
With a slight difference, the RF model was more accurate (Naghibi & Dashtpagerdi, 2017; Norouzi et al., 2017; Sihag et al., 2019; Solaimani et al., 2019; Victoriano et al., 2020) than the other two models (BRT and MnLR), with OA and Kappa values of 95 and 92% in autumn 2017 and those of 91 and 88% in spring 2018.
In line with our observations, similar modeling accuracy (about 80%) was reported for the RF and BRT algorithms in a study on groundwater potential in a water basin in Korea (Kim et al., 2019). Additionally, if the MnLR method is trained with a sufficient number of data and suitable predictor variables, it can provide acceptable results for modeling and mapping a target variable (Abbaszadeh Afshar et al., 2018).
Table 10 shows the user accuracy (UA) and the producer accuracy (PA) of the RF model for the groundwater quality classes. The highest values of UA and PA, with 100% accuracy, were obtained for classes C3-S2, C4-S1, C4-S2, C4-S3, and C4-S4 in autumn 2017 and for classes C3-S2, C4-S3, and C4-S4 in spring 2018. This indicates the good performance of the RF model for the prediction of these quality classes relative to the other ones. The lowest values of PA and UA were respectively observed for classes C3-S1 (88%) and C2-S1 (92%) in autumn 2017 and for classes C4-S1 (50%) and C2-S1 (69%) in spring 2018 (Lacoste et al., 2011).
Table 10. UA and PA for groundwater quality classes
Class | PA (%), autumn 2017 | UA (%), autumn 2017 | PA (%), spring 2018 | UA (%), spring 2018 |
|---|---|---|---|---|
c2-s1 | 96 | 92 | 100 | 69 |
c3-s1 | 88 | 94 | 80 | 96 |
c3-s2 | 100 | 100 | 100 | 100 |
c3-s3 | NaN* | - | NaN | - |
c3-s4 | - | NaN | - | NaN |
c4-s1 | 100 | 100 | 50 | 100 |
c4-s2 | 100 | 100 | 88 | 100 |
c4-s3 | 100 | 100 | 100 | 100 |
c4-s4 | 100 | 100 | 100 | 100 |
*NaN: not a number
The UA and PA values (Table 10) for individual groundwater quality classes make it possible to estimate the overestimation and underestimation levels in the prediction of individual classes by the intended learning model (Lacoste et al., 2011).
According to Table 10, in autumn 2017 the c2-s1 class has a lower UA than PA, suggesting that this class was overestimated in the study area by the RF model. Conversely, PA was lower than UA for the c3-s1 class, indicating that the RF model underestimated this class. In spring 2018, PA was lower than UA for classes c3-s1, c4-s1, and c4-s2, showing the underestimation of these classes by this model in the study area. In contrast, class c2-s1 has a lower UA than PA, indicating the overestimation of the class by this model. For classes c3-s3 and c3-s4 in autumn 2017 and spring 2018, respectively, PA and UA were reported as NaN due to the low frequencies of the two classes in the datasets. Because of low numbers, they were excluded from the model verification repetitions and reported as NaN in the final output of the model accuracy (Kudo et al., 1999). For a better comparison, groundwater quality maps were finally prepared for autumn 2017 and spring 2018 using all three algorithms (Figs. 6 and 7).
[See PDF for image]
Fig. 6
Groundwater quality maps in autumn 2017 using RF, BRT, and MnLR
[See PDF for image]
Fig. 7
Groundwater quality maps in spring 2018 using RF, BRT, and MnLR
As shown in Figs. 6 and 7, irrigation water quality (Table 1) was in an inappropriate class for agricultural use in the prediction map of the three algorithms (BRT, RF, and MnLR) almost from the center to the south of the study area. The percentages of water quality classes from the viewpoint of irrigation use were displayed differently in all three maps.
Due to the higher accuracy of the RF method, it was taken as the base, and the area (in hectares and as a percentage) occupied by each groundwater quality class in the study area was determined from the RF map (Table 11).
Table 11. The area of predicted groundwater quality classes by the RF data mining model
Class (autumn 2017) | Area (ha) | Area (%) | Class (spring 2018) | Area (ha) | Area (%) |
|---|---|---|---|---|---|
C2-S1 | 724,225 | 23.02 | C2-S1 | 800,155 | 25.43 |
C3-S1 | 1,230,276 | 39.1 | C3-S1 | 1,093,369 | 34.75 |
C3-S2 | 93,383 | 2.97 | C3-S2 | 60,637 | 1.93 |
C3-S4 | 425 | 0.01 | C3-S3 | 4741 | 0.15 |
C4-S1 | 8048 | 0.26 | C4-S1 | 40,058 | 1.27 |
C4-S2 | 104,715 | 3.33 | C4-S2 | 213,226 | 6.78 |
C4-S3 | 689,709 | 21.92 | C4-S3 | 248,455 | 7.9 |
C4-S4 | 295,558 | 9.39 | C4-S4 | 685,696 | 21.79 |
Total | 3,146,338 | 100 | Total | 3,146,338 | 100 |
Class C3-S1, with areas of 39.10 and 34.75% in autumn 2017 and spring 2018, respectively, covers the greatest area of the study region. According to Table 1, the salty irrigation water in these areas can be used for agriculture with appropriate preparations. In spring 2018, the percentage of regions with C4-S4 irrigation water quality increased considerably from 9.39 to 21.79% which indicated that the percentage of the area with inappropriate irrigation water quality has increased. To explain and complete the results of groundwater quality prediction, the land use map of the area, obtained from Landsat 8 images (2019), is also depicted in Fig. 8.
[See PDF for image]
Fig. 8
Land use map of the area in 2019
According to the results of the prepared land use map, barren lands, croplands, and grasslands comprise 62, 15.34, and 15.73% of the area. Based on the results of combined land use and groundwater quality maps, most wells with inappropriate irrigation water quality are located in the center to the south of the region, barren lands, and some in farmlands. The obtained results of the groundwater quality prediction maps indicated that land use and the conversion of rangelands and farmlands to barren lands (desertification) affect the water quality (Mirzaei et al., 2016; Salajegheh et al., 2011).
Relative importance (RI) for effective variables on groundwater quality
In this study, the RI of effective variables was determined using the mean decrease accuracy (MDA) index (Fig. 9) (Gayen et al., 2019; Mousavi et al., 2019). As revealed by the RI results, more than 50% of changes in groundwater quality in the study area were explained using SAR, with 50.50% and 50.20% for autumn 2017 and spring 2018, respectively. Other variables, such as Mg2+, Ca2+, HCO3−, and pH with RI values of 52, 18.11, 4.16, and 1.17%, respectively, in autumn 2017, and the same variables with RI values of 27.06, 16.55, 3.89, and 2.27%, respectively, in spring 2018 contributed to groundwater quality modeling.
[See PDF for image]
Fig. 9
The relative importance (%) of the most effective variables on the groundwater quality using the MDA in RF model
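The idea behind the mean decrease accuracy index can be sketched as a permutation test: the values of one predictor are shuffled, and the resulting drop in classification accuracy is recorded. The sketch below is a simplified stand-in. The random forest implementation computes this per tree on out-of-bag samples, whereas here a single hypothetical threshold classifier and a small made-up dataset are used.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def mean_decrease_accuracy(predict, rows, labels, column, repeats=20, seed=0):
    """Average accuracy drop after shuffling one predictor column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [row[:column] + [value] + row[column + 1:]
                    for row, value in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / repeats


def toy_model(row):
    """Hypothetical classifier that looks only at the first predictor."""
    return "high" if row[0] > 10 else "low"


# Made-up rows: [first predictor, second predictor].
rows = [[2.0, 7.1], [4.0, 7.9], [15.0, 7.4], [30.0, 8.0], [1.0, 7.6], [22.0, 7.2]]
labels = ["low", "low", "high", "high", "low", "high"]
print(mean_decrease_accuracy(toy_model, rows, labels, column=0))
print(mean_decrease_accuracy(toy_model, rows, labels, column=1))
```

Shuffling a predictor the model ignores leaves accuracy unchanged, so its importance is zero; expressing each predictor's drop as a share of the total gives relative importance in percent.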
Several studies showed the importance of SAR on the quality of groundwater resources used in agriculture section and, as a result, management of soil and crop quality (Bakhshandehmehr et al., 2017; Piri & Bameri, 2014).
Conclusions
Given the high costs of sampling and chemical analysis, this study compared the efficiency of modern ML methods for modeling and predicting groundwater quality classes in the Tashk-Bakhtegan basin, in order to obtain the most accurate results in a short time. Three data mining algorithms (BRT, RF, and MnLR) were compared to determine the more accurate method for predicting irrigation water quality based on the USSL diagram.
The models used in this study performed very similarly, which is attributed to the sufficient amount of available data and the suitable number of observation wells.
However, the model performance results indicate that the RF model was more accurate in predicting groundwater quality classes than the other models. Accordingly, the tree structure developed between the target variable and the hydro-chemical parameters used in the modeling process resulted in a higher accuracy of the RF model than the BRT and MnLR models.
According to the final map, the irrigation water quality falls in the extreme salinity class in the southern parts of the study area, where barren lands are denser, in both study periods. In autumn 2017, the highest percentage of the land area belonged to the C4-S3 class (very high salinity and high-sodium water, respectively). In spring 2018, the C4-S4 class (very high salinity water and very high sodium water, respectively) covered the highest percentage of land; both classes are harmful to agriculture. Moreover, the results indicated that the groundwater quality belonged to the slightly salty and salty classes in the northern and western areas, consisting of rangelands and shrublands, where the available water can be used for the irrigation of farmlands with appropriate management measures.
The analysis of RI also demonstrated that SAR and pH were the most effective and least effective variables, respectively.
In this investigation, an accuracy of over 80% was obtained for the MnLR, BRT, and RF data mining models in the preparation of a groundwater quality spatial prediction map.
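For readers who want to see how such accuracy figures follow from a confusion matrix, the sketch below computes the overall accuracy (OA) and the Kappa index for a three-class case. The matrix values are hypothetical and are not the results of this study; rows are assumed to hold the observed classes and columns the predicted classes.

```python
def overall_accuracy(cm):
    """Share of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def kappa_index(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(k)) / n
    row_totals = [sum(cm[i]) for i in range(k)]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix (rows: observed class, columns: predicted class).
cm = [
    [45, 3, 2],
    [4, 32, 4],
    [1, 5, 24],
]

oa = overall_accuracy(cm)
kappa = kappa_index(cm)
print(f"OA = {oa:.3f}, Kappa = {kappa:.3f}")
```

Kappa is lower than OA here because part of the observed agreement would be expected by chance given the class proportions.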
According to the present results, the RF method is recommended as an accurate technique for modeling and spatial prediction of groundwater quality, given its advantages: it captures nonlinear relationships, handles outliers, provides an error estimate, and can process large amounts of data without excluding any of them.
Author contribution
All authors, Reyhaneh Masoudi, Seyed Roohollah Mousavi, Pouyan Dehghan Rahimabadi, Mehdi Panahi, and Asghar Rahmani, contributed to conception and design, acquisition of data, or analysis and interpretation of data, drafting the article, and final approval of the version to be submitted for publication.
Availability of data and material
All data generated or used in this study are present in the submitted article.
Declarations
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Conflict of interest
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Abbaszadeh Afshar, F; Ayoubi, S; Jafari, A. The extrapolation of soil great groups using multinomial logistic regression at regional scale in arid regions of Iran. Geoderma; 2018; 315, pp. 36-48. [DOI: https://dx.doi.org/10.1016/j.geoderma.2017.11.030]
Alipour, A; Rahimi, J; Azarnivand, A. Groundwater quality analysis for drinking and agricultural purposes-A prerequisite for land use planning in the arid and semi-arid regions of Iran. Journal of Range and Watershed Management; 2017; 70, pp. 423-434.
Arabameri, A; Rezaei, K; Cerda, A; Lombardo, L; Rodrigo-Comino, J. GIS-based groundwater potential mapping in Shahroud plain, Iran. A comparison among statistical (bivariate and multivariate), data mining and MCDM approaches. Science of the Total Environment; 2019; 658, pp. 160-177. [DOI: https://dx.doi.org/10.1016/j.scitotenv.2018.12.115]
Avila, R; Horn, B; Moriarty, E; Hodson, R; Moltchanova, E. Evaluating statistical model performance in water quality prediction. Journal of Environmental Management; 2018; 206, pp. 910-919. [DOI: https://dx.doi.org/10.1016/j.jenvman.2017.11.049]
Bakhshandehmehr, L; Yazdani, MR; Zolfaghari, AA. The evaluation of groundwater suitability for irrigation and changes in agricultural land of Garmsar Basin. Journal of Water and Soil (agricultural Sciences and Technology); 2017; 30, pp. 1773-1786.
Breiman, L. Random Forests. Machine Learning; 2001; 45, pp. 5-32. [DOI: https://dx.doi.org/10.1023/A:1010933404324]
Breiman, L., & Cutler, A., (2004). Random forests. University of California, 1–33.
Bui, D. T., Khosravi, K., Karimi, M., Busico, G., Khozani, Z. S., Nguyen, H., Mastrocicco, M., Tedesco, D., Cuoco, E., & Kazakis, N. (2020). Enhancing nitrate and strontium concentration prediction in groundwater by using new data mining algorithm. Science of the Total Environment, 715, 136836.
Byrt, T; Bishop, J; Carlin, JB. Bias, prevalence and kappa. Journal of Clinical Epidemiology; 1993; 46, pp. 423-429. [DOI: https://dx.doi.org/10.1016/0895-4356(93)90018-V]
Chen, K; Chen, H; Zhou, C; Huang, Y; Qi, X; Shen, R; Liu, F; Zuo, M; Zou, X; Wang, J. Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Research; 2020; 171, pp. 1-10. [DOI: https://dx.doi.org/10.1016/j.watres.2019.115454]
Choubin, B; Khalighi Sigaroodi, S; Malekian, A. Impacts of large-scale climate signals on seasonal rainfall in the Maharlu-Bakhtegan watershed. Journal of Range and Watershed Management (iranian Journal of Natural Resources); 2016; 69, pp. 51-63.
De'Ath, G. Boosted trees for ecological modeling and prediction. Ecology; 2007; 88, pp. 243-251. [DOI: https://dx.doi.org/10.1890/0012-9658(2007)88[243:BTFEMA]2.0.CO;2]
Debella-Gilo, M; Etzelmüller, B. Spatial prediction of soil classes using digital terrain analysis and multinomial logistic regression modeling integrated in GIS: Examples from Vestfold County, Norway. CATENA; 2009; 77, pp. 8-18. [DOI: https://dx.doi.org/10.1016/j.catena.2008.12.001]
Dodge, Y. (2008). The concise encyclopedia of statistics. Springer Science & Business Media, 1–259.
Dormann, CF; Elith, J; Bacher, S; Buchmann, C; Carl, G; Carré, G; Marquéz, JRG; Gruber, B; Lafourcade, B; Leitao, PJ. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography; 2013; 36, pp. 27-46. [DOI: https://dx.doi.org/10.1111/j.1600-0587.2012.07348.x]
El Bilali, A; Taleb, A. Prediction of irrigation water quality parameters using machine learning models in a semi-arid environment. Journal of the Saudi Society of Agricultural Sciences; 2020; 19, pp. 439-451. [DOI: https://dx.doi.org/10.1016/j.jssas.2020.08.001]
El Bilali, A; Taleb, A; Brouziyne, Y. Groundwater quality forecasting using machine learning algorithms for irrigation purposes. Agricultural Water Management; 2021; 245, 106625. [DOI: https://dx.doi.org/10.1016/j.agwat.2020.106625]
Elith, J; Leathwick, JR; Hastie, T. A working guide to boosted regression trees. Journal of Animal Ecology; 2008; 77, pp. 802-813. [DOI: https://dx.doi.org/10.1111/j.1365-2656.2008.01390.x]
Everitt, B., & Skrondal, A. (2010). The Cambridge dictionary of statistics. Cambridge University Press, 4th Edition, 1–480.
Friedman, J; Hastie, T; Tibshirani, R. Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics; 2000; 28, pp. 337-407. [DOI: https://dx.doi.org/10.1214/aos/1016218223]
Gayen, A; Pourghasemi, HR; Saha, S; Keesstra, S; Bai, S. Gully erosion susceptibility assessment and management of hazard-prone areas in India using different machine learning algorithms. Science of the Total Environment; 2019; 668, pp. 124-138. [DOI: https://dx.doi.org/10.1016/j.scitotenv.2019.02.436]
Ghafarijoo, N., & Zarei, H. (2016). Investigation of catchment plains balance in Bakhtegan-Tashtak-Maharloo basin. First National Congress on Iran’s Irrigation & Drainage, Ferdowsi University of Mashhad. Iran.
Goswamee, DS; Shah, P; Patel, Y. Analysis of quality of groundwater and its suitability for irrigation purpose in Visnagar Taluka. Mehsana District, Gujarat.; 2015; 4, pp. 2907-2911.
Hedayat, S., Zarei, H., Radmanesh, F., & Soltani Mohammadi, A. (2016). Investigation of groundwater resources in Bakhtegan-Maharloo basin, Second national congress on Iran’s irrigation & drainage. Ahvaz, Iran.
Hojati, SMH; Boustani, F. Sustainable groundwater management of Khir Plain by groundwater balance. Journal of Physical Geography; 2010; 2, pp. 57-72.
Hotzel, H. (2012). Climatic caused variations of groundwater recharge in the Middle East and its consequences for the future water management. Hydrogeology of Arid Environments, 10–14.
Jeihouni, M; Toomanian, A; Mansourian, A. Decision tree-based data mining and rule induction for identifying high quality groundwater zones to water supply management: A novel hybrid use of data mining and GIS. Water Resources Management; 2020; 34, pp. 139-154. [DOI: https://dx.doi.org/10.1007/s11269-019-02447-w]
Jensen, J. R. (1996). Introductory digital image processing: A remote sensing perspective (No. Ed. 2). Prentice-Hall Inc. Pearson. 4th Edition. 1–526.
Jeon, C; Raza, M; Lee, JY; Kim, H; Kim, CS; Kim, B; Kim, JW; Kim, RH; Lee, SW. Countrywide groundwater quality trend and suitability for use in key sectors of Korea. Water; 2020; 12, 1193. [DOI: https://dx.doi.org/10.3390/w12041193]
Jha, MK; Chowdary, VM; Chowdhury, A. Groundwater assessment in Salboni Block, West Bengal (India) using remote sensing, geographical information system and multi-criteria decision analysis techniques. Hydrogeology Journal; 2010; 18, pp. 1713-1728. [DOI: https://dx.doi.org/10.1007/s10040-010-0631-z]
Jha, MK; Kamii, Y; Chikamori, K. Cost-effective approaches for sustainable groundwater management in alluvial aquifer systems. Water Resources Management; 2009; 23, pp. 219-233. [DOI: https://dx.doi.org/10.1007/s11269-008-9272-6]
Karakuş, CB; Yıldız, S. Evaluation for irrigation water purposes of groundwater quality in the vicinity of Sivas City Centre (Turkey) by using GIS and an irrigation water quality index. Irrigation and Drainage; 2019; 69, pp. 121-137. [DOI: https://dx.doi.org/10.1002/ird.2386]
Kempen, B; Brus, DJ; Heuvelink, GBM; Stoorvogel, JJ. Updating the 1:50,000 Dutch soil map using legacy soil data: A multinomial logistic regression approach. Geoderma; 2009; 151, pp. 311-326. [DOI: https://dx.doi.org/10.1016/j.geoderma.2009.04.023]
Kim, JC; Jung, HS; Lee, S. Spatial mapping of the groundwater potential of the geum river basin using ensemble models based on remote sensing images. Remote Sensing; 2019; 11, 2285. [DOI: https://dx.doi.org/10.3390/rs11192285]
Kordestani, MD; Naghibi, SA; Hashemi, H; Ahmadi, K; Kalantar, B; Pradhan, B. Groundwater potential mapping using a novel data-mining ensemble model. Hydrogeology Journal; 2018; 27, pp. 211-224. [DOI: https://dx.doi.org/10.1007/s10040-018-1848-5]
Kudo, M; Toyama, J; Shimbo, M. Multidimensional curve classification using passing-through regions. Pattern Recognition Letters; 1999; 20, pp. 1103-1111. [DOI: https://dx.doi.org/10.1016/S0167-8655(99)00077-X]
Kumarasamy, P; Dahms, HU; Jeon, HJ; Rajendran, A; Arthur James, R. Irrigation water quality assessment—an example from the Tamiraparani river, Southern India. Arabian Journal of Geosciences; 2013; 7, pp. 5209-5220. [DOI: https://dx.doi.org/10.1007/s12517-013-1146-4]
Lacoste, M; Lemercier, B; Walter, C. Regional mapping of soil parent material by machine learning based on point data. Geomorphology; 2011; 133, pp. 90-99. [DOI: https://dx.doi.org/10.1016/j.geomorph.2011.06.026]
Laze, P; Rizani, S; Ibraliu, A. Assessment of irrigation water quality of Dukagjin basin in Kosovo. Journal International Science of Public Agricultural Food; 2016; 4, pp. 544-551.
Liaw, A; Wiener, M. Classification and regression by random Forest. R News; 2002; 2, pp. 18-22.
Masoudi, R; Zehtabian, GH; Ahmadi, H; Malekian, A. Assessment of trends in groundwater quality and quantity of Kashan plain. Desert Management; 2015; 3, pp. 65-78.
Mirabbasi, R; Mazloumzadeh, S; Rahnama, M. Evaluation of irrigation water quality using fuzzy logic. Research Journal of Environmental Sciences; 2008; 2, pp. 340-352. [DOI: https://dx.doi.org/10.3923/rjes.2008.340.352]
Mirzaei, M; Solgi, I; Salman Mahini, AR. Investigating the relationship between water quality parameters and land use changes (Zayandehrud watershed). Water Management and Irrigation; 2016; 6, pp. 175-191.
Mousavi, S. R., Sarmadian, F., Rahmani, A., & Khamoshi, S. E. (2019). Digital soil mapping with regression tree classification approaches by RS and geomorphometry covariate in the Qazvin Plain, Iran. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, XLII-4/W18, 773–777.
Naghibi, SA; Dashtpagerdi, MM. Evaluation of four supervised learning methods for groundwater spring potential mapping in Khalkhal region (Iran) using GIS-based features. Hydrogeology Journal; 2017; 25, pp. 169-189. [DOI: https://dx.doi.org/10.1007/s10040-016-1466-z]
Naghibi, SA; Pourghasemi, HR; Dixon, B. GIS-based groundwater potential mapping using boosted regression tree, classification and regression tree, and random forest machine learning models in Iran. Environmental Monitoring and Assessment; 2016; 188, 44. [DOI: https://dx.doi.org/10.1007/s10661-015-5049-6]
Norouzi, H; Nadiri, AA; Asghari Mogaddam, A; Gharekhani, M. Prediction of transmissivity of Malikan plain aquifer using random forest method. Water and Soil Science; 2017; 27, pp. 61-75.
O’Brien, RM. A caution regarding rules of thumb for variance inflation factors. Quality & Quantity; 2007; 41, pp. 673-690. [DOI: https://dx.doi.org/10.1007/s11135-006-9018-6]
Oliver, M; Webster, R. A tutorial guide to geostatistics: Computing and modeling variograms and kriging. CATENA; 2014; 113, pp. 56-69. [DOI: https://dx.doi.org/10.1016/j.catena.2013.09.006]
Pakgohar, A. Performance comparison of logistic regression and classification regression tree models for binary dependent variable. Scientific Research; 2016; 1, pp. 7-14.
Pan, Y; Jackson, RT. Ethnic difference in the relationship between acute inflammation and serum ferritin in US adult males. Epidemiology & Infection; 2008; 136, pp. 421-431. [DOI: https://dx.doi.org/10.1017/S095026880700831X]
Piri, H; Bameri, A. Estimation of sodium absorption ratio (SAR) in groundwater using the artificial neural network and linear multiple regression: Case study: The Baiestan Plain. Water Engineering; 2014; 7, pp. 67-79.
Raman, BV; Bouwmeester, R; Mohan, S. Fuzzy logic water quality index and importance of water quality parameters. Air, Soil and Water Research; 2020; 2, pp. 51-59.
Rogerson, P. Statistical methods for geography; 2001; Thousand Oaks, California, SAGE Publications: [DOI: https://dx.doi.org/10.4135/9781849209953]
Saghebian, SM; Sattari, MT; Mirabbasi, R; Pal, M. Groundwater quality classification by decision tree method in Ardebil region, Iran. Arabian Journal of Geosciences; 2014; 7, pp. 4767-4777. [DOI: https://dx.doi.org/10.1007/s12517-013-1042-y]
Salajegheh, A; Razavizadeh, S; Khorasani, N; Hamidifar, M; Salajegheh, S. Land use changes and its effects on water quality (Case study: Karkheh watershed). Journal of Environmental Studies; 2011; 37, pp. 81-86.
Sattari, MT; Mirabbasi, NR; Abbasgoli, NM. Surface water quality prediction using data mining method (Case study: Rivers of northern side of Sahand Mountain). Iranian Journal of Ecohydrology; 2017; 4, pp. 407-419.
Sihag, P; Karimi, SM; Angelaki, A. Random forest, M5P and regression analysis to estimate the field unsaturated hydraulic conductivity. Applied Water Science; 2019; 9, 129. [DOI: https://dx.doi.org/10.1007/s13201-019-1007-8]
Singh, K. K., Tewari, G., & Kumar, S. (2020). Evaluation of groundwater quality for suitability of irrigation purposes: A case study in the Udham Singh Nagar, Uttarakhand. Journal of Chemistry, 15.
Solaimani, K; Alidadgan, F; Purghasemi, H. Comparison of Shannon entropy data mining techniques and random forest algorithm to preparing underground water potential map of Jahrom. Desert Ecosystem Engineering Journal; 2019; 8, pp. 37-48.
Somarathna, P; Minasny, B; Malone, BP. More data or a better model? Figuring out what matters most for the spatial prediction of soil carbon. Soil Science Society of America Journal; 2017; 81, pp. 1413-1426. [DOI: https://dx.doi.org/10.2136/sssaj2016.11.0376]
US Salinity Laboratory Staff. (1954). Diagnosis and improvement of saline and alkali soils. US Department of Agriculture, 60, 160.
Venkataraman, K; Uddameri, V. Modeling simultaneous exceedance of drinking-water standards of arsenic and nitrate in the Southern Ogallala aquifer using multinomial logistic regression. Journal of Hydrology; 2012; 458, pp. 16-27. [DOI: https://dx.doi.org/10.1016/j.jhydrol.2012.06.028]
Victoriano, JM; Lacatan, LL; Vinluan, AA. Predicting river pollution using random forest decision tree with GIS model: A case study of MMORS, Philippines. International Journal of Environmental Science and Development; 2020; 11, pp. 36-42. [DOI: https://dx.doi.org/10.18178/ijesd.2020.11.1.1222]
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier Science. 3rd edition.
Wu, J., Zhang, Y., & Zhou, H., (2020). Groundwater chemistry and groundwater quality index incorporating health risk weighting in Dingbian County, Ordos basin of northwest China. Geochemistry, e125607.
Yoo, K; Shukla, SK; Ahn, JJ; Oh, K; Park, J. Decision tree-based data mining and rule induction for identifying hydrogeological parameters that influence groundwater pollution sensitivity. Journal of Cleaner Production; 2016; 122, pp. 277-286. [DOI: https://dx.doi.org/10.1016/j.jclepro.2016.01.075]
Zomorodian, MJ; Khakpour, M; Velayati, S. Analysis of hydro-geomorphic landforms of lake Maharlu basin, based on interactive relation of morphotectonic, morphoclimatic and hydro-morphic processes. Journal of Geography and Regional Development; 2013; 10, pp. 47-70.
© The Author(s), under exclusive licence to Springer Nature Switzerland AG 2023.