1. Introduction

Soil salinization, driven by human activities and natural disturbances, severely impacts agricultural production and ecological balance [1,2]. Based on its causes, soil salinization can be classified as primary or secondary. Salinized soil covers over 412 million hectares [3], while China’s saline–alkaline land spans approximately 36.9 million hectares, representing about 4.88% of the country’s arable land [4]. In the irrigated areas of the middle and upper reaches of the Yellow River, soil salts are primarily sulfates and chlorides, with secondary salinization resulting from drought and ineffective irrigation practices. Thus, practical monitoring, quantitative evaluation, analysis, and digital mapping methods are crucial for understanding the temporal and spatial distribution of soil salinization and for its prevention. However, traditional soil salinization monitoring typically employs a single-point measurement method, which is time-consuming, labor-intensive, and costly [5]. This approach requires representative and timely sampling locations and can only provide point-scale soil information [6].

Studies have shown that remote sensing technology has become an efficient method to achieve low-cost information acquisition [7,8], and machine learning combined with remote sensing technology has been widely used in monitoring soil salinity [9,10]. Wang et al. [11] compared the performance of three models, namely geographically weighted regression (GWR), ordinary kriging interpolation (OK), and multivariate linear regression (MLR), for the prediction of organic matter and concluded that the GWR model had the highest accuracy. Meng et al. [12] employed ordinary kriging interpolation, geographically weighted regression, and random forest (RF) methods, integrating environmentally relevant variables such as topography, soil physicochemical properties, and remotely sensed imagery data. Ultimately, they used the RF model to map the distribution of salinity in the surface soil of the Weigan River–Kuche River oasis. Subsequent research found that hyperparameter settings are crucial for machine-learning models [13,14] and that hyperparameter optimization can effectively avoid overfitting caused by sparse samples. Currently, the commonly used methods for hyperparameter optimization are grid search, random search, genetic algorithms, and Bayesian algorithms [15]. However, grid and random searches become computationally inefficient and time-consuming when the number of parameters is large, and genetic algorithms usually converge to a local optimum, which may miss the global optimum. Bayesian algorithms can find the global optimal solution efficiently. However, they use a Gaussian process as their probabilistic model, which is computationally intensive and better suited to continuous hyperparameters, and they are thus less efficient for discrete parameters.
Optuna is a novel hyperparameter optimization framework employing a tree-structured Bayesian algorithm to model the objective function. This approach offers greater flexibility and efficiency than traditional Gaussian-process Bayesian optimization [16]. Currently, research on optimizing soil salinity estimation models using the Optuna algorithm is limited. This study employs Optuna for hyperparameter optimization to identify the best inversion parameters. While machine-learning models excel in soil salinity monitoring, their lack of interpretability remains a challenge and hinders a comprehensive assessment of these models based on a single metric. Traditional variable significance methods can only highlight important variables without clarifying their impact on predictions. The SHAP interpreter model can identify the direction of contribution (positive or negative) of each input feature [17]. It can also provide a changing threshold, allowing managers to adjust feature factors more accurately. Therefore, SHAP is used to interpret the optimal salt content estimation model [18,19]. Since most studies focus on machine learning models, this paper also introduces a deep learning model (LSTM) to evaluate the performance of different methods.

This study addresses current research gaps by collecting soil salinity samples from saline–alkaline land on the south bank of the Yellow River during the spring bare soil period. Multiple soil salinity estimation models were constructed using satellite images, with hyperparameters optimized through the Optuna algorithm. The optimal model was interpreted using the SHAP method. The performance of four models, PLSR, GWR, XGBoost, and LSTM, was compared to determine the most suitable model for estimating soil salinity in the study area. Digital spatial mapping of saline soils for the past five years was completed, providing a scientific basis for saline land management and planning. This research will fill the existing gap in machine learning for soil salinity monitoring and will improve the monitoring and management of saline–alkali land. It will also support the sustainable development of smart agriculture and water resources.

2. Materials and Methods

2.1. Study Area

The south bank of the Yellow River, especially in Inner Mongolia, is important for saline–alkali land research due to severe soil salinization and limited water resources [20]. The region is also characterized by diverse irrigation methods, low precipitation, and high evaporation, which worsen salinization. The study area is located in Dalate Banner (108°59′~109°59′ E, 40°16′~40°33′ N) in the southwestern Inner Mongolia Autonomous Region, within the upper and middle reaches of the Yellow River (Figure 1). The terrain gradually decreases in elevation from south to north. The study area covers 1437.12 km², and the primary soil type is aeolian soil. The area belongs to the mid-temperate semi-arid zone. It has an average temperature of 6.1–7.1 °C and an annual rainfall of 240–360 mm, concentrated mainly in July and August. Annual evaporation ranges from 1450 to 3250 mm [21]. The region receives about 3000 h of sunshine annually. The altitude ranges from approximately 996 to 1140 m (ASL), with an average of 1031 m (ASL). Soil salinization has increased in the area due to the dry climate, low precipitation, high evaporation, elevated groundwater levels, mineralization, and extensive farming practices [22].

2.2. Soil Collection

Random sampling was conducted in the study area. In the spring bare soil period, vegetation has not yet grown in large quantities, and remote sensing data can reflect the true spectral characteristics of the soil. This has the advantage of reducing vegetation interference, providing clearer salt characteristics, and stabilizing soil conditions. A total of 85 and 219 samples were collected during the spring bare soil periods of 13–18 May 2022, and 15–19 April 2023, respectively. The sampling depth was 0–10 cm. The longitude and latitude of each sampling point were recorded using GPS. The collected soil samples were air-dried under natural conditions. They were then ground, sieved, and used to prepare soil leachate with a water-soil mass ratio of 5:1. The prepared extract was analyzed to measure the soil salt content (SSC) [22].

2.3. Remote Sensing Data Selection

Optical remote sensing data from Sentinel-2 (the European Space Agency’s Copernicus program) and Landsat-8 (the United States Geological Survey), together with terrain data, were selected and preprocessed using the Google Earth Engine (GEE) platform. A total of 30 inversion factors were calculated. These included salinity indices, three-band indices, vegetation indices, principal component analysis bands, tasseled cap transformation factors, surface temperature, and DEM derivatives, as shown in Table 1.

Considering the sampling time and cloud cover, the Sentinel-2 Level-2A product data were screened, cloud-masked using the QA60 band, median-composited, and cropped. The specific image screening time and cloud cover are shown in Table 2. For Landsat-8, SR-level images and a single-window algorithm were used to invert surface temperature. The terrain data use the USGS/SRTMGL1_003 dataset with a spatial resolution of 1″ (about 30 m). The images were resampled to 10 m for subsequent analysis, and SAGA was used to calculate various terrain factors.

2.4. Model

2.4.1. Model Selection

Partial least squares regression (PLSR) is a widely used soil salinity inversion model, combining principal component analysis, multiple linear regression, and canonical correlation analysis. Unlike traditional multiple linear regression, it reduces multicollinearity among variables.

Geographically weighted regression (GWR) is a local linear regression model that fits a separate linear regression at each sampling point. A bisquare kernel function weights the neighboring observations when the regression coefficients for each point are estimated [29]. The optimal bandwidth is determined by a golden section search that minimizes the Akaike information criterion (AIC); the bandwidth (gwr_bw) with the smallest AIC value gives the best-fitting GWR model [30]. The GWR model is implemented in Python using mgwr 2.1.2, installed via “conda install mgwr”.
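To make the local weighting concrete, the following minimal sketch computes bisquare kernel weights and fits a weighted simple linear regression at a single location. It uses only the Python standard library and does not reproduce the mgwr implementation; the function names, the fixed bandwidth, and the toy coordinates and values are illustrative assumptions.

```python
import math

def bisquare_weights(distances, bandwidth):
    """Bisquare kernel: w = (1 - (d/b)^2)^2 for d < b, otherwise 0."""
    return [(1 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0
            for d in distances]

def local_linear_fit(target, coords, x, y, bandwidth):
    """Weighted least-squares intercept and slope at one target location."""
    d = [math.dist(target, c) for c in coords]
    w = bisquare_weights(d, bandwidth)
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ym - slope * xm, slope

# Hypothetical sample locations, one predictor, and an exactly linear response.
coords = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [2 * xi + 1 for xi in x]
intercept, slope = local_linear_fit((0.5, 0.5), coords, x, y, bandwidth=3.0)
```

In this toy case the distant point at (5, 5) lies outside the bandwidth and receives zero weight, so only the four nearby samples determine the local coefficients. A full GWR repeats such a fit at every sampling point and searches for the bandwidth rather than fixing it.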

A long short-term memory (LSTM) network is a type of recurrent neural network (RNN) that mitigates the vanishing/exploding gradient problem and is better able to retain long-term memory. An LSTM consists of an input layer, an output layer, and a hidden layer. The storage modules of the hidden layer include memory units and multiplication units (input, output, and forget gates). These storage units are self-connected and adaptive, and they control the flow of information through logic gates [31].

Extreme Gradient Boosting (XGBoost) is an enhanced algorithm based on GBDT, proposed by Chen Tianqi [32]. It is an ensemble learning method built on the Boosting concept. XGBoost has gained wide recognition in data mining and machine-learning challenges. It uses a gradient-based decision tree algorithm to optimize the loss function iteratively, combining weak learners into strong learners with high predictive performance. It uses only decision trees as base learners, runs quickly, and includes regularization terms in the objective function, which speeds up optimization and helps prevent overfitting. XGBoost also offers extensive hyperparameter tuning options. Table 3 compares the advantages and disadvantages of different models. Table 4 lists the eight parameters selected in this study and their optimal values.

2.4.2. Optuna-Based Hyperparameter Optimization Framework

Hyperparameter tuning is the process of selecting the optimal hyperparameters for a machine learning model, and it significantly affects the model’s accuracy. Optuna is used here to tune the hyperparameters of the XGBoost model, improving its training speed and accuracy and reducing manual tuning time. The entire process can be automated by calling the Optuna 2.10.1 package from Python. Optuna uses a sampling-based approach and a pruning strategy to optimize the hyperparameters. Eight hyperparameters are selected for tuning: learning_rate, n_estimators, max_depth, min_child_weight, gamma, reg_alpha, reg_lambda, and subsample. The optimization process consists of four steps: (1) define the objective function to minimize the RMSE on the test set and specify the range of each hyperparameter; (2) in each trial, train the XGBoost model with the given hyperparameters, perform prediction, and calculate the performance metrics, such as R², RMSE, and MAE; (3) perform multiple trials to find the hyperparameters that minimize the RMSE and optimize the performance of the model; and (4) output the optimal hyperparameter set and the corresponding model performance metrics for the training set and the test set.
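The four steps amount to a trial loop around an objective function. The sketch below illustrates that loop with the Python standard library only: plain random search stands in for Optuna's sampler and pruner, and a toy nearest-neighbour regressor with a single hyperparameter k stands in for XGBoost. None of the names, ranges, or data reflect the configuration actually used in this study.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(t for _, t in nearest) / k

def objective(params, train, held_out):
    """Steps 1-2: predict with the given hyperparameters, return held-out RMSE."""
    sq = [(knn_predict(train, x, params["k"]) - t) ** 2 for x, t in held_out]
    return math.sqrt(sum(sq) / len(sq))

def run_trials(train, held_out, n_trials, seed=0):
    """Step 3: sample hyperparameters repeatedly and keep the best trial."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {"k": rng.randint(1, 15)}  # assumed search range
        score = objective(params, train, held_out)
        if best is None or score < best[1]:
            best = (params, score)
    return best

# Synthetic data: a noisy sine curve split into training and held-out parts.
rng = random.Random(42)
data = [(x / 10, math.sin(x / 10) + rng.gauss(0, 0.2)) for x in range(100)]
rng.shuffle(data)
train, held_out = data[:70], data[70:]

# Step 4: report the best hyperparameters and their score.
best_params, best_rmse = run_trials(train, held_out, n_trials=30)
```

With Optuna itself, the same structure is typically expressed by drawing parameters inside the objective with `trial.suggest_int` or `trial.suggest_float`, creating a study with `optuna.create_study(direction="minimize")`, and running it with `study.optimize(objective, n_trials=...)`.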

2.4.3. SHAP

SHAP, proposed by Lundberg and Lee in 2017, is a game theory-based method for interpreting the outputs of machine learning models, providing both global and local interpretations [33]. SHAP values are model-agnostic and measure the marginal contribution of each feature to the model output, indicating both its importance and whether its effect is positive or negative. In this study, SHAP analysis was used to assess the contribution of each factor in the optimal soil salinity inversion model and to construct an explanatory model of the factors affecting salinity estimation. For a feature $x_{i}$ in the set of inversion factors, the SHAP value is calculated as

(1) $ϕ (x_{i}) = \sum_{S \subseteq N_{\{i\}}} \frac{|S|! \, (N - |S| - 1)!}{N!} \left[ f (S \cup \{x_{i}\}) - f (S) \right]$

where N is the total number of input features, $N_{\{i\}}$ is the set of features excluding $x_{i}$, S is a subset of $N_{\{i\}}$, $f (S)$ is the model prediction for S, and $f (S \cup \{x_{i}\})$ is the model prediction for S with $x_{i}$ added.

The SHAP interpretability method interprets the model prediction ŷ as the sum of the SHAP values for each input feature, calculated as

(2) $\hat{y} = ϕ_{0} + \sum_{i = 1}^{N} ϕ (x_{i})$

where $ϕ_{0}$ denotes the base value of the model prediction, that is, the prediction when no feature contributes.
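Equations (1) and (2) can be checked on a small example by enumerating every feature subset directly. The sketch below is a brute-force illustration under stated assumptions, not the algorithm used by the SHAP package; the value function `toy_model` and its three features are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, n_features):
    """Exact Shapley values by enumerating every subset S of the other features.

    f takes a frozenset of feature indices and returns the model output
    when only those features are present.
    """
    players = range(n_features)
    phi = []
    for i in players:
        others = [j for j in players if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (N - |S| - 1)! / N! from Equation (1).
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (f(s | {i}) - f(s))
        phi.append(total)
    return phi

def toy_model(s):
    """Hypothetical value function: additive effects plus one interaction."""
    value = 8.0                 # base value, f(empty set)
    if 0 in s:
        value += 3.0
    if 1 in s:
        value -= 2.0
    if 0 in s and 2 in s:
        value += 1.0            # interaction shared by features 0 and 2
    return value

phi = shapley_values(toy_model, 3)
base = toy_model(frozenset())
full = toy_model(frozenset({0, 1, 2}))
```

Here the interaction term is split equally between features 0 and 2, and the base value plus the sum of the three contributions reproduces the full prediction, as Equation (2) requires. Enumeration scales exponentially with the number of features, so it is practical only for small illustrations.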

2.5. Model Validation

Four validation indicators—coefficient of determination (R²), mean absolute error (MAE), root mean square error (RMSE), and ratio of performance to deviation (RPD)—were used to assess the soil salinity model. When RPD > 2.0, the model shows good prediction accuracy; when 1.40 < RPD < 2.00, the accuracy is low; and when RPD < 1.40, the model’s prediction is poor and unreliable [34]. Higher R² and RPD values and lower MAE and RMSE values indicate better model accuracy. The indicators were calculated using pandas 2.1.1, numpy 1.23.0, and sklearn.metrics 1.0.2 in Python 3.9.18. The formulas for each indicator are as follows:

(3) $R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}$

(4) $\mathrm{RMSE} = \sqrt{\frac{\sum_{i = 1}^{n} (\hat{y}_{i} - y_{i})^{2}}{n}}$

(5) $\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} | \hat{y}_{i} - y_{i} |$

(6) $\mathrm{RPD} = \frac{\mathrm{SD}}{\mathrm{RMSE}}$

where n is the number of samples; $y_{i}$ and $\hat{y}_{i}$ are the measured and predicted values, respectively; $\bar{y}$ is the mean of the measured values; and SD is the standard deviation of the measured soil salinity values.
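For reference, Equations (3) to (6) can be written out in a few lines of plain Python. This is a minimal sketch rather than the pandas/numpy/sklearn code used in the study; the function name and the sample values are made up, and the use of the sample standard deviation (n − 1 denominator) for SD is an assumption.

```python
import math

def validation_metrics(measured, predicted):
    """Return R^2, RMSE, MAE, and RPD for paired measured/predicted values."""
    n = len(measured)
    mean = sum(measured) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean) ** 2 for y in measured)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(p - y) for y, p in zip(measured, predicted)) / n
    # Assumption: SD is the sample standard deviation (n - 1 denominator).
    sd = math.sqrt(ss_tot / (n - 1))
    return {"R2": 1 - ss_res / ss_tot, "RMSE": rmse, "MAE": mae, "RPD": sd / rmse}

# Made-up values in g/kg, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 2.5, 4.5, 5.0]
metrics = validation_metrics(measured, predicted)
```

For these values the residual sum of squares is 0.75 and the total sum of squares is 10, giving R² = 0.925 and MAE = 0.3 g/kg.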

2.6. Rate of Change in Area of Soil Salinization

The rate of change in the soil salinization area is a crucial indicator of whether soil salinization is worsening or easing. The land use dynamics formula is used to calculate the annual rate of change of saline soil in order to better understand the changes in different types of salinized soils. The mathematical expression is:

(7) $K = \frac{U_{b} - U_{a}}{U_{a}} \times \frac{1}{T} \times 100\%$

where $U_{a}$ and $U_{b}$ are the areas of a specific type of saline soil in the study area in the first and second year, respectively; T is the length of time, taken as 1 in both cases; and K is the rate of change of that type of saline soil in one year.
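Equation (7) translates directly into a short function. In the sketch below the function name is hypothetical, and applying the formula to the total saline soil area for 2019 and 2020 (values reported in Section 3.4.2) is only an illustration, since the study reports rates per salinity class.

```python
def annual_change_rate(area_start, area_end, years=1):
    """Equation (7): K = (U_b - U_a) / U_a * (1 / T) * 100, in percent."""
    return (area_end - area_start) / area_start / years * 100

# Total saline soil area in 2019 and 2020 (km^2), as reported in Section 3.4.2.
k_total = annual_change_rate(878.80, 619.00)
```

This gives a rate of about −29.6% for the total saline area between 2019 and 2020; a negative K indicates a shrinking area.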

3. Results and Analysis

3.1. Statistical Analysis of Soil Salinity

The 304 soil salinity samples collected from bare soil in spring were randomly divided into training and test sets. The sample data distribution is shown in Figure 2a. The soil salinity of the entire sample ranged from 0.38 to 51.42 g⋅kg⁻¹, with an average of 8.78 g⋅kg⁻¹ and a coefficient of variation of 1.28. This indicates that surface soil salinity varies greatly and is unevenly distributed. The average, range, and coefficient of variation of the training and test sets are consistent with those of the complete set, which suggests that the SSC of the training and test sets represents the entire data set. According to the third national soil survey salinity classification and grading standard, saline–alkali land is divided into five salinization levels: non-saline soils (0–2 g⋅kg⁻¹), mildly saline soils (2–4 g⋅kg⁻¹), moderately saline soils (4–6 g⋅kg⁻¹), heavily saline soils (6–10 g⋅kg⁻¹), and salted soils (>10 g⋅kg⁻¹). Among the soil samples, non-saline, mildly saline, moderately saline, heavily saline, and salted soil samples accounted for 45.7%, 8.6%, 5.9%, 9.9%, and 29.9%, respectively, as shown in Figure 2b.
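The five-level grading can be expressed as a simple threshold lookup. The sketch below is illustrative only: the function name is hypothetical, and because the handling of values that fall exactly on a class boundary is not stated here, it assumes such values belong to the lower class, which is consistent with salted soils being defined as >10 g⋅kg⁻¹.

```python
from bisect import bisect_left

CLASS_BOUNDS = [2, 4, 6, 10]  # g/kg thresholds of the five-level standard
CLASS_NAMES = ["non-saline", "mildly saline", "moderately saline",
               "heavily saline", "salted"]

def salinity_class(ssc):
    """Map a soil salt content (g/kg) to its salinization level.

    Assumption: a value exactly on a threshold falls into the lower class.
    """
    return CLASS_NAMES[bisect_left(CLASS_BOUNDS, ssc)]

# Minimum, an arbitrary value, mean, and maximum SSC from the sample statistics.
labels = [salinity_class(v) for v in (0.38, 3.1, 8.78, 51.42)]
```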

3.2. Inversion Factor Screening

Thirty inversion factors were selected for the analysis. Since not every factor plays a crucial role in soil salt estimation accuracy, Pearson correlation analysis was used for preliminary screening. The correlation coefficient (R) ranges from −1 to 1. The larger the absolute value of R, the stronger the correlation between the two variables. Variables with an R between −0.30 and 0.30 were eliminated. However, the remaining factors exhibit collinearity and may contain noise, leading to information redundancy and overfitting, which could reduce the prediction accuracy of the regression model. Therefore, the successive projections algorithm (SPA) was used for secondary screening.
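The preliminary screening step can be sketched as follows. The code computes Pearson's R between each candidate factor and SSC and keeps factors whose absolute correlation reaches the threshold; the factor names and values are made up, and keeping a factor whose |R| equals exactly 0.30 is an assumption. The SPA step is not shown.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def screen_factors(factors, ssc, threshold=0.30):
    """Keep factors whose |R| with SSC is at least the threshold."""
    scores = {name: pearson_r(values, ssc) for name, values in factors.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}

# Made-up values for three hypothetical factors and five samples.
ssc = [1.0, 2.0, 4.0, 7.0, 12.0]
factors = {
    "factor_a": [0.1, 0.2, 0.35, 0.6, 0.9],    # rises with SSC
    "factor_b": [5.0, 4.0, 3.5, 2.0, 1.0],     # falls with SSC
    "factor_c": [1.0, -1.0, 0.0, -1.0, 1.0],   # weakly related to SSC
}
kept = screen_factors(factors, ssc)
```

In this toy example the first two factors pass the screen with strong positive and negative correlations, while the third is dropped.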

As shown in Figure 3, a total of 18 factors passed the significance test (p < 0.01), among which CAEX had the highest correlation with SSC (−0.48). The salinity indices and three-band indices retained are SI4 (0.38), S1 (0.42), S2 (0.42), IBI_temp (−0.43), CAEX (−0.48), TBI3 (0.36), TBI4 (0.33), and TBI7 (0.40); the vegetation indices retained are GVMI (0.36), VSDI (0.42), SIWIS (−0.38), and NDI (−0.36); and the DEM derivatives retained are CNBL (−0.36), CND (−0.35), DEM (−0.42), RSP (−0.38), and VD (0.47). The correlations with land surface temperature (LST) and the principal component analysis bands are weak, and among the tasseled cap factors, only Wetness is correlated with SSC (0.38).

The 18 factors were screened again using the SPA algorithm in MATLAB R2018a. The following factors were selected: RSP, CAEX, CND, SI4, Wetness, IBI_temp, TBI7, VD, S1, and CNBL.

3.3. Model Accuracy

PLSR, GWR, LSTM, and XGBoost soil salinity estimation models were constructed using the ten selected inversion factors, and model performance was evaluated. The training set accuracies of the four inversion models are given in Figure 4, with R² ranging from 0.37 to 0.94, MAE from 1.51 to 5.99 g⋅kg⁻¹, RMSE from 2.66 to 8.7 g⋅kg⁻¹, and RPD from 1.26 to 4.13. The closer a model’s fitted line is to the 1:1 line, the closer the predicted values are to the actual values. For the PLSR and GWR models, the fitted lines were rotated clockwise relative to the 1:1 line, indicating that these models greatly overestimated the lightly saline soil samples and greatly underestimated the heavily saline soil samples, so their accuracy was low. The fitted lines of the LSTM and XGBoost models were close to the 1:1 line, and the accuracy of the XGBoost model was high. Figure 5 illustrates the accuracy of the models on the test set. Although the LSTM and XGBoost models perform well on the training set, both underestimate the high-salinity samples. The PLSR and GWR models overestimate the non-saline soil samples, and their accuracy is low; the LSTM and XGBoost models estimate the non-saline soil samples well but show some bias for the light, medium, and heavy saline soil samples, overestimating low values and underestimating high values. This may be because non-saline soil samples make up the largest share of the dataset [35], and machine learning models are sensitive to the statistical distribution of the data values. The models smoothed the dataset’s few extreme salinity samples as outliers, resulting in an underestimation of medium and high values and an overestimation of low values [36].

Among the four models, only the RPD of the XGBoost test set was greater than 2, while the RPD of the other models was less than 2. The above analysis shows that the overall model prediction accuracy ranks as XGBoost > LSTM > GWR > PLSR. XGBoost was therefore selected as the salt content estimation model, and the spatial distribution map of soil salt content in the study area was produced with it.

3.4. Spatial and Temporal Distribution and Variability of Soil Salinity

3.4.1. Spatial Distribution of Soil Salinity and Area Statistics

The image screening times differ between years because the actual sampling time, cloud cover, and image integrity had to be considered (Table 2). The measured soil salinity data from 2022 and 2023 were used to estimate the soil salinity in the study area from 2019 to 2023, and the results were classified and graded according to the soil census salinity standard. The specific spatial distribution of salinity is shown in Figure 6.

The soils in the study area are predominantly non-saline soils, which are distributed in the south. Mildly saline soils are predominantly distributed in the east and west. Moderately and heavily saline soils constitute a smaller proportion and are scattered throughout the study area. Salted soils are mainly distributed in the north. The saline areas are consistent with the actual field survey and analysis results. Regarding spatial distribution, the degree of soil salinity gradually increases from south to north, especially near the Yellow River. This is due to the higher groundwater level caused by the lateral seepage of the Yellow River and the impact of flooding, coupled with the low-lying terrain in the northern region, which exacerbates the salinity problem. In 2019, the soil salinity problem was the most serious, with the proportion of salted soils as high as 28% and non-saline soils accounting for only 35%. Between 2020 and 2023, the proportion of non-saline soil area was relatively stable at about 50%. In 2023, the proportion of non-saline soil area was the largest (56%), and the proportion of salted soil area was the smallest (16%).

3.4.2. Characteristics of Spatial and Temporal Evolution of Soil Salinization

From Figure 7a, it can be seen that the total area of saline soil was 878.80 km², 619.00 km², 682.59 km², 624.59 km², and 592.45 km² from 2019 to 2023. The overall trend was a decrease from 2019 to 2020, an increase from 2020 to 2021, and a further decrease from 2021 to 2023. From 2019 to 2020, the area of non-saline soils increased by 259.80 km², while the area of mildly saline soils decreased by 153.56 km². During this period, the area of moderately saline soils increased by 4.00 km², the area of heavily saline soils increased by 27.94 km², and the area of salted soils decreased by 138.18 km². From 2020 to 2021, the area of non-saline soils decreased by 63.59 km², and the area of mildly saline soils decreased by 1.34 km². During the same period, the area of moderately saline soils decreased by 17.60 km², the area of heavily saline soils decreased by 31.88 km², and the area of salted soils increased by 114.41 km². From 2021 to 2022, the area of non-saline soils increased by 58.00 km², and the area of mildly saline soils decreased by 21.98 km². Meanwhile, the area of moderately saline soils decreased by 0.73 km², the area of heavily saline soils decreased by 1.15 km², and the area of salted soils decreased by 34.14 km². From 2022 to 2023, the area of non-saline soils increased by 32.14 km², and the area of mildly saline soils increased by 3.72 km². In the same period, the area of moderately saline soils increased by 23.21 km², the area of heavily saline soils increased by 40.74 km², and the area of salted soils decreased by 99.81 km².

Figure 7b analyzes the annual rate of change in saline soils from 2019 to 2023. From 2019 to 2020, the non-saline soils grew the fastest, with a growth rate of 55.20%, followed by heavily saline soils, with a growth rate of 28.15%. In contrast, the areas of mildly saline and salted soils decreased significantly, with rates of change of −45.67% and −36.75%, respectively. From 2020 to 2021, the area of salted soils increased sharply by 48.11%. Meanwhile, the areas of non-saline, mildly saline, moderately saline, and heavily saline soils decreased, with rates of change of −8.70%, −0.76%, −22.41%, and −22.05%, respectively. From 2021 to 2022, only the area of non-saline soils increased (by 8.70%), while the areas of the remaining types of saline soils decreased, with mildly saline soils decreasing the fastest, at 12.62%. From 2022 to 2023, only salted soils decreased (by 31.37%), while all the other types of saline soils increased.

Because the study period is short, 2019, 2021, and 2023 were chosen as reference years to better reflect the soil salinity transfer process over the last 5 years. The transfer matrices between different types of saline soils were calculated for 2019–2021, 2021–2023, and 2019–2023, and chord diagrams were plotted to visualize the results (Figure 8). From 2019 to 2023, the area of non-saline soils increased by 286.35 km², the area of mildly saline soils decreased by 173.16 km², the area of moderately saline soils increased by 8.88 km², the area of heavily saline soils increased by 35.65 km², and the area of salted soils decreased by 157.72 km². Overall, the salinity problem in the study area has improved. From 2019 to 2021, the primary transfer processes were mildly saline soils to non-saline soils, salted soils to heavily saline soils, and non-saline soils to mildly saline soils. Among them, 202.04 km² was transferred from mildly saline soils to non-saline soils, 44.11 km² from salted soils to heavily saline soils, and 32.40 km² from non-saline soils to mildly saline soils. Between 2021 and 2023, the primary transfer processes were similar: mildly saline soils to non-saline soils, salted soils to heavily saline soils, and non-saline soils to mildly saline soils. During this period, 95.34 km² was transformed from mildly saline to non-saline soils, 84.58 km² from salted to heavily saline soils, and 61.54 km² from non-saline to mildly saline soils.
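A transfer matrix of this kind can be built by cross-tabulating the class of each pixel in the two years and multiplying the counts by the pixel area. The sketch below is a minimal illustration; the function name, the class labels, and the six-pixel example are hypothetical, and the pixel area assumes the 10 m resampled resolution.

```python
from collections import Counter

def transfer_matrix(classes_start, classes_end, pixel_area_km2):
    """Area (km^2) moving from each start class to each end class."""
    counts = Counter(zip(classes_start, classes_end))
    return {pair: n * pixel_area_km2 for pair, n in counts.items()}

# Hypothetical per-pixel class labels for the same six pixels in two years.
start = ["mild", "mild", "salted", "non", "non", "salted"]
end = ["non", "non", "heavy", "mild", "non", "salted"]

# One 10 m x 10 m pixel covers 0.0001 km^2.
matrix = transfer_matrix(start, end, pixel_area_km2=0.0001)
```

Each key is a (start class, end class) pair; pairs with identical classes hold the unchanged area, and the remaining pairs correspond to the flows drawn in a chord diagram.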

4. Discussion

4.1. Model Comparison

Studies have shown complex nonlinear and indirect relationships between salt inversion factors. Machine learning models have advantages in estimating soil salt content (SSC) [37]. Wang et al. [38] collected 160 soil samples in an arid region and compared the performance of RF, SVM, and ANN models in estimating SSC. The study concluded that the RF model performed best (R² = 0.81). However, Li et al. [23] used SMR, SVR, RF, and PLSR models to monitor soil salinity and found that the PLSR model performed best (R² = 0.66). This could be because the dataset contained only 60 samples, which prevented the machine-learning models from being adequately trained. Mukhamediev et al. [39] used XGBoost, LightGBM, RF, SVM, and RR models to predict soil salinity with the help of an optimized spectral index. The results showed that the XGBoost model had the highest accuracy (R² = 0.701). Zhang et al. [40] compared the effectiveness of CNN-LSTM and RF models for soil organic carbon (SOC) prediction and found that CNN-LSTM was better than the RF model in predicting SOC at the regional spatial scale. Because manually tuning machine learning hyperparameters is resource-intensive, the XGBoost model was optimized in this study using Optuna. The accuracy of the PLSR, GWR, XGBoost, and LSTM models was evaluated to determine the optimal model. Typically, the performance of machine learning models improves with increasing data. Nonlinear models can handle complex relationships between independent and dependent variables, allowing them to predict soil salinity effectively [41]. Nonlinear machine-learning models and deep-learning models usually outperform linear regression models in predicting soil salinity. The GWR model incorporates spatial location factors and assigns weights to data points based on proximity. Therefore, it outperforms the PLSR model.
However, the traditional GWR assumes that the optimal bandwidths of all influencing factors are the same, an assumption that limits its ability to accurately reflect the real spatial processes of soil salinity [42], resulting in relatively poor model performance. Deep learning methods are more suitable for large datasets. When the dataset is small, the number of training iterations needs to be limited to avoid overfitting, resulting in a lower performance than the XGBoost model, which is consistent with the results of Wang [43]. XGBoost outperforms the other models because it handles nonlinearities and interactions between features, is robust to overfitting, and is computationally efficient in high-dimensional, noisy data environments. The results of this study show that the XGBoost model achieved the best R², MAE, RMSE, and RPD values on both the training and test sets, while the PLSR model had the worst prediction performance.

4.2. SHAP Analysis

SHAP analysis was performed on the optimal model, XGBoost, to determine which factors contribute to SSC estimation. Figure 9a explains the difference between the XGBoost model predictions and the model benchmark value. It also identifies the variables responsible for these differences. Each point in the figure represents an actual SSC sampling point. The X-axis shows the specific SHAP value of the variable corresponding to each SSC point. A SHAP value > 0 indicates that the variable positively impacts SSC estimation, while a SHAP value < 0 indicates a negative impact. The Y-axis displays the variables sorted by their SHAP value. The color of each point represents the SHAP value of the variable. Blue indicates lower SHAP values, while red represents higher SHAP values. Figure 9b shows the SHAP value matrix for all samples in the test set. The left Y-axis lists the inversion factors, sorted by importance, while the right side displays the average SHAP value of each factor. In the matrix, darker colors indicate larger absolute SHAP values, meaning the variable has a greater impact on the model. The model prediction results are visualized at the top of Figure 9b.

As can be seen in Figure 9a, CAEX, CNBL, and TBI7 have the highest global importance. VD, RSP, S1, CND, and Wetness follow them. Figure 9b shows that CAEX and CNBL contribute the most to the SSC estimation, while the contributions of the other factors decrease in order of importance. CAEX and CNBL also show prominent clustering characteristics. This suggests that different mechanisms influence SSC estimation at various sampling sites.

Figure 10 reveals the impact of each feature on the SSC for data points 8 and 15 using SHAP analysis. Because of the way a force plot is displayed, only the features with the largest impact are shown. In the figure, red indicates that the feature increases the SSC, while blue means that the feature decreases the SSC. Feature names are shown at the bottom of the bars, with each bar’s length representing the feature’s contribution. The ‘Base value’ is the baseline value of the model (8.525 g⋅kg⁻¹), while ‘f(x)’ is the predicted SSC value. Figure 10a shows the analysis for the eighth data point. From left to right, the S1 value (about 0.553) increases the SSC by 0.36 g⋅kg⁻¹. The CNBL value (about 1017.938) decreases the SSC by 2.01 g⋅kg⁻¹. Similarly, the RSP and CAEX values (about 0.412 and 1.402, respectively) decrease the SSC by 1.71 g⋅kg⁻¹ and 1.58 g⋅kg⁻¹, respectively. Under the combined effect of all factors, the final predicted SSC (1.29 g⋅kg⁻¹) was lower than the baseline value. At this point, CNBL, RSP, and CAEX had the largest impact on reducing SSC. Figure 10b analyzes the 15th data point, where the predicted SSC (f(x)) is 32.20 g⋅kg⁻¹, which is higher than the baseline value. TBI7, Wetness, S1, RSP, VD, CAEX, and CNBL contributed to the SSC increase in this case. These features increased SSC by 1.78, 2.00, 2.75, 3.38, 3.88, 3.98, and 5.35 g⋅kg⁻¹, respectively. Features with smaller effects are not shown in the figure. By comparing the two data points, it can be observed that the same feature may have different impacts on SSC prediction at different points. For example, CNBL, RSP, CAEX, TBI7, and VD decreased SSC at data point 8 but increased SSC at data point 15.

This paper argues that salinity indices and DEM derivatives are the main determinants for estimating soil salt content (SSC) in the study area. Soil salt content has been shown to directly affect the spectral reflectance in each band, so salinity indices can effectively estimate SSC [44]. Topography also plays a crucial role in the water cycle. Flat terrain tends to accumulate water, which affects the salt content. In addition, topography affects the water table, and more salt rises to the soil surface in areas with a higher water table. Topographic factors also determine the direction in which water and solutes move, which affects where soil salt accumulates [45,46]. Studies have shown that CAEX represents soil carbonate content. Particle size determines the soil’s water-holding capacity; the larger the particles, the more easily salt can migrate [47]. Humidity is another important factor, as it reflects soil moisture. Soil moisture is critical for vegetation growth. Water stress caused by insufficient soil moisture can limit vegetation growth and lead to sparse vegetation, which in turn increases evaporation and leads to the accumulation of soil salts [48].

4.3. Data Accuracy Discussion

The field experiment of this study was conducted in Dalate Banner from 13 to 18 May 2022 and from 15 to 19 April 2023. Although sampling was conducted in both years, the number of sampling sites and the timing differed. In addition, only these two years of data were used to estimate soil salinity during the spring bare soil period in 2019–2023, which may have introduced some bias into the results. Future studies should perform a more comprehensive accuracy assessment. Secondly, soil salinity is influenced by many factors. This study only considered topographic factors and did not consider other environmental variables such as climate, groundwater burial depth, and soil type. Future studies should include these factors to improve the predictive performance of soil salinity estimation models. Thirdly, differences in salinity types and sample sizes may lead to errors in data analysis. Future studies should investigate estimation models for different salinity types to improve the accuracy and fit of the models [25].

4.4. Migration of Models

High soil salinity can inhibit crop growth and reduce crop yields. Accurate spatial mapping of soil salinity can guide adjustments to planting structures to avoid the negative effects of high salinity, and it can help farmers adopt targeted crop management and sustainable irrigation methods. However, whether the XGBoost model is transferable remains to be explored. The model is likely applicable only to areas with similar soil and climate characteristics, and this needs to be verified with sufficient field sampling data. If the model can be transferred, it will help build a knowledge base for soil salinization and provide a valuable tool for managing soil salinization and supporting the sustainable development of saline areas.

5. Conclusions

(1) The Optuna-XGBoost model had the highest accuracy, followed by the LSTM model, while the GWR and PLSR models had lower accuracy. The R², MAE, RMSE, and RPD of the optimal SSC model for the test set were 0.76, 3.38 g⋅kg⁻¹, 5.77 g⋅kg⁻¹, and 2.05, respectively.

(2) The SHAP-based interpretable model ranked feature importance and highlighted the role of feature groups. It showed that salinity indices and topographic factors were the dominant inversion factors. In addition, it revealed that the same inversion factor can affect SSC estimates differently at different points, either increasing or decreasing the predicted SSC.

(3) Over time, soils with different degrees of salinization in the study area transformed into each other, and the degree of salinization generally showed a decreasing trend. Spatially, the severity of soil salinization gradually increased from south to north, and it was most severe near the Yellow River.
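For readers who want to reproduce the accuracy metrics reported in conclusion (1), the sketch below computes R², MAE, RMSE, and RPD from paired measured and predicted values. It assumes the common definition of RPD as the standard deviation of the measured values divided by the RMSE and uses the sample standard deviation; the exact convention used in this study is not stated here. The function name and the input numbers are hypothetical and are not data from the study.

```python
import math
import statistics


def regression_metrics(measured, predicted):
    """Return R2, MAE, RMSE, and RPD for paired observations."""
    n = len(measured)
    errors = [m - p for m, p in zip(measured, predicted)]
    mean_measured = sum(measured) / n
    ss_res = sum(e * e for e in errors)                       # residual sum of squares
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    return {
        "R2": 1 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": rmse,
        # Assumption: RPD = sample standard deviation of measurements / RMSE.
        "RPD": statistics.stdev(measured) / rmse,
    }


# Made-up values in g/kg, for illustration only.
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 3.0, 7.0, 7.0]
metrics = regression_metrics(measured, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```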

Author Contributions

X.L. (Xia Liu): Supervision, Project administration, Writing—review and editing. Y.H.: Methodology, Formal analysis, Writing—original draft, Visualization. X.L. (Xiang Li): Software, Data curation, Writing—review and editing. Y.X.: Writing—review and editing. R.D.: Writing—review and editing. F.Z.: Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. Overview of the study area and distribution of sampling sites ((a) geographic location map; (b,c) elevation map, distribution of sampling sites).

Figure 2. (a) Descriptive statistics box plot of measured SSC (CV: coefficient of variation); (b) SSC distribution map of different types of saline–alkali soils.

Figure 3. Correlation analysis between soil salinity and inversion factors.

Figure 4. Performance comparison of different model training sets.

Figure 5. Performance comparison of different model test sets.

Figure 6. Soil salt content grading chart, 2019–2023.

Figure 7. Area of different classes of saline land from 2019 to 2023 ((a) change in area; (b) rate of change).

Figure 8. Transfer matrix between areas of different types of saline soils.

Figure 9. (a) SHAP global interpretation map: feature summary map for SHAP; (b) heat map of SHAP-based features.

Figure 10. SSC inversion data-processing procedure. Panel (a) shows the 8th data point, and panel (b) shows the 15th.

Table 1

Inversion factor.

Data Source	Category	Abbreviation	Expression	Reference
Sentinel-2	Salinity Indices	SI4	$(\mathrm{Nir} \times \mathrm{SWIR1} - \mathrm{SWIR1} \times \mathrm{SWIR1}) / \mathrm{Nir}$	[23]
		S1	$\frac{\mathrm{Blue}}{\mathrm{Red}}$	[12]
		S2	$(\mathrm{Blue} - \mathrm{Red}) / (\mathrm{Blue} + \mathrm{Red})$	[12]
		IBI_temp	$\frac{\mathrm{SWIR2} + \mathrm{Red} - (\mathrm{Blue} + \mathrm{Nir})}{\mathrm{SWIR2} + \mathrm{Red} + (\mathrm{Blue} + \mathrm{Nir})}$	[24]
		CAEX	$\frac{\mathrm{Blue}}{\mathrm{Green}}$	[25]
		NDSI	$(\mathrm{Red} - \mathrm{Nir}) / (\mathrm{Red} + \mathrm{Nir})$	[26]
	Three-band Indices	TBI3	$\frac{\mathrm{RedEdge2} - \mathrm{SWIR2}}{\mathrm{RedEdge2} + \mathrm{SWIR2}}$	[26]
		TBI4	$(\mathrm{SWIR2} - \mathrm{Green}) / (\mathrm{Green} - \mathrm{SWIR1})$	[26]
		TBI7	$\mathrm{SWIR1} - 2 \times \mathrm{SWIR2} + \mathrm{RedEdge1}$	[26]
	Vegetation Indices	RVI	$\frac{\mathrm{Nir}}{\mathrm{Red}}$	[24]
		EVI	$\frac{2.5 \times (\mathrm{Nir} - \mathrm{Red})}{\mathrm{Nir} + 6 \times \mathrm{Red} - 7.5 \times \mathrm{Blue}}$	[24]
		GVMI	$\frac{(\mathrm{Nir} + 0.1) - (\mathrm{SWIR1} + 0.02)}{(\mathrm{Nir} + 0.1) + (\mathrm{SWIR1} + 0.02)}$	[25]
		NDVI	$(\mathrm{Nir} - \mathrm{Red}) / (\mathrm{Nir} + \mathrm{Red})$	[12]
		VSDI	$1 - ((\mathrm{SWIR2} - \mathrm{Blue}) + (\mathrm{Red} - \mathrm{Blue}))$	[27]
		SIWIS	$(\mathrm{SWIR2} - \mathrm{Nir}) / (\mathrm{SWIR2} + \mathrm{Nir})$	[28]
		NDI	$\frac{\mathrm{SWIR2} - \mathrm{RedEdge3}}{\mathrm{SWIR2} + \mathrm{RedEdge3}}$	[26]
	Principal Component Analysis	Pca1, Pca2, Pca3		[12]
	Tasseled Cap	Brightness, Greenness, Wetness		[12]
Landsat-8	Surface Temperature	LST
NASA SRTM	Dem Derivatives	AS
		CNBL
		CND
		DEM		SAGA GIS
		RSP
		TWI
		VD

Key to terms: Blue, Green, Red, Red Edge 1, Red Edge 2, Red Edge 3, Nir, SWIR1, and SWIR2 correspond to the reflectivity of the Sentinel-2 bands B2, B3, B4, B5, B6, B7, B8, B9, and B11; SI4, Salinity Index 4; S1, Salinity Index 1; S2, Salinity Index 2; IBI_temp, Bare Soil Index; CAEX, Carbonate Index; NDSI, Normalized Difference Salinity Index; TBI3, Three Band Index 3; TBI4, Three Band Index 4; TBI7, Three Band Index 7; RVI, Ratio Vegetation Index; EVI, Enhanced Vegetation Index; GVMI, Global Vegetation Moisture Index; NDVI, Normalized Difference Vegetation Index; VSDI, Visible Optical and Short-Wave Infrared Drought Index; SIWIS, Shortwave Infrared Water Stress Index; NDI, Normalized Difference Index; AS, Aspect; CNBL, Channel Network Base Level; CND, Channel Network Distance; DEM, Digital Elevation Model; RSP, Relative Slope Position; TWI, Topographic Wetness Index; VD, Valley Depth.
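As an illustration of how the Sentinel-2 expressions in Table 1 turn band reflectances into inversion factors, the following sketch evaluates a few of them for a single pixel. The function name `spectral_indices` and the reflectance values are hypothetical and chosen only for illustration; they are not data from this study.

```python
def spectral_indices(blue, green, red, nir, swir1, swir2):
    """Evaluate a few Table 1 expressions from surface reflectances (0-1)."""
    return {
        "S1": blue / red,
        "S2": (blue - red) / (blue + red),
        "CAEX": blue / green,
        "NDSI": (red - nir) / (red + nir),
        "NDVI": (nir - red) / (nir + red),
        "SI4": (nir * swir1 - swir1 * swir1) / nir,
        "VSDI": 1 - ((swir2 - blue) + (red - blue)),
    }


# Hypothetical bare-soil pixel (made-up reflectances, not study data).
pixel = spectral_indices(
    blue=0.10, green=0.125, red=0.20, nir=0.25, swir1=0.30, swir2=0.28
)
for name, value in pixel.items():
    print(f"{name}: {value:.3f}")
```

With the definitions in Table 1, NDSI is simply the negative of NDVI, so the two carry the same information with opposite sign.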

Table 2

Sentinel-2 image screening time and cloudiness.

Acquisition period	Satellite	Resolution (m)	Cloud cover (%)
15–19 April 2023	Sentinel-2	10	<6
13–18 May 2022	Sentinel-2	10	<7
15–19 April 2021	Sentinel-2	10	<2
13–19 May 2020	Sentinel-2	10	<5
9–18 May 2019	Sentinel-2	10	<4

Table 3

Comparison of different models.

Model	Strengths	Weaknesses
PLSR	Can handle multicollinearity; dimensionality reduction; easy interpretation	Assumes linearity; inability to capture complex nonlinear relationships
GWR	Accounts for spatial heterogeneity; provides localized regression coefficients	High computational cost; lack of global interpretability; assumption of spatial continuity; difficulty with non-linear relationships
LSTM	Handles sequential and time-series data well; captures complex non-linear patterns	Computationally intensive; requires large datasets and careful tuning
XGBoost	High accuracy; robust to overfitting; handles non-linear relationships well	Requires tuning of hyperparameters; less interpretable than linear models

Table 4

XGBoost model parameters.

S.No.	Parameters	Tuned Parameters
1	learning_rate	0.06411852795588255
2	n_estimators	107
3	max_depth	10
4	min_child_weight	0.4680734688578364
5	gamma	4.232987987199886
6	reg_alpha	9.963626711654813
7	reg_lambda	7.902872405464361
8	subsample	0.9888473010123858
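The tuned values in Table 4 correspond to keyword arguments of XGBoost's scikit-learn style regressor. The sketch below, which assumes the `xgboost`, `optuna`, and `scikit-learn` Python packages are installed, shows how such a configuration might be rebuilt and how a comparable Optuna search could be set up. The search ranges, the cross-validated RMSE objective, the number of trials, and the helper names `build_tuned_model`, `suggest_params`, and `tune` are assumptions for illustration; the article does not report them here.

```python
# Tuned values copied from Table 4.
TUNED_PARAMS = {
    "learning_rate": 0.06411852795588255,
    "n_estimators": 107,
    "max_depth": 10,
    "min_child_weight": 0.4680734688578364,
    "gamma": 4.232987987199886,
    "reg_alpha": 9.963626711654813,
    "reg_lambda": 7.902872405464361,
    "subsample": 0.9888473010123858,
}


def build_tuned_model():
    """Return an XGBoost regressor configured with the Table 4 values."""
    from xgboost import XGBRegressor  # third-party; imported lazily

    return XGBRegressor(objective="reg:squarederror", **TUNED_PARAMS)


def suggest_params(trial):
    """Hypothetical Optuna search space; every range here is an assumption."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_child_weight": trial.suggest_float("min_child_weight", 0.1, 10.0),
        "gamma": trial.suggest_float("gamma", 0.0, 5.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 10.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 10.0),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }


def tune(X, y, n_trials=50):
    """Minimize 5-fold cross-validated RMSE with Optuna (illustrative only)."""
    import optuna
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBRegressor

    def objective(trial):
        model = XGBRegressor(objective="reg:squarederror", **suggest_params(trial))
        scores = cross_val_score(
            model, X, y, cv=5, scoring="neg_root_mean_squared_error"
        )
        return -scores.mean()  # scikit-learn returns negated RMSE

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Calling `tune(X, y)` on a feature matrix and SSC vector would return a dictionary with the same keys as `TUNED_PARAMS`, which can then be passed to `XGBRegressor` in the same way as in `build_tuned_model`.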

References

1. Akça, E.; Aydin, M.; Kapur, S.; Kume, T.; Nagano, T.; Watanabe, T.; Çilek, A.; Zorlu, K. Long-Term Monitoring of Soil Salinity in a Semi-Arid Environment of Turkey. Catena; 2020; 193, 104614. [DOI: https://dx.doi.org/10.1016/j.catena.2020.104614]

2. Gao, Y.; Liu, X.; Hou, W.; Han, Y.; Wang, R.; Zhang, H. Characteristics of Saline Soil in Extremely Arid Regions: A Case Study Using GF-3 and ALOS-2 Quad-Pol SAR Data in Qinghai, China. Remote Sens.; 2021; 13, 417. [DOI: https://dx.doi.org/10.3390/rs13030417]

3. Pennock, D.; McKenzie, N.; Montanarella, L. Status of the World’s Soil Resources; Technical Summary FAO: Rome, Italy, 2015.

4. Yang, J.; Yao, R.; Wang, X.; Xie, W.; Zhang, X.; Zhu, W.; Zhang, L.; Sun, R. Research on salt-affected soils in China: History, status quo and prospect. Acta Pedol. Sin.; 2022; 59, pp. 10-27. [DOI: https://dx.doi.org/10.11766/trxb202110270578]

5. Ding, J.; Yu, D. Monitoring and Evaluating Spatial Variability of Soil Salinity in Dry and Wet Seasons in the Werigan–Kuqa Oasis, China, Using Remote Sensing and Electromagnetic Induction Instruments. Geoderma; 2014; 235–236, pp. 316-322. [DOI: https://dx.doi.org/10.1016/j.geoderma.2014.07.028]

6. Taghizadeh-Mehrjardi, R.; Minasny, B.; Sarmadian, F.; Malone, B.P. Digital Mapping of Soil Salinity in Ardakan Region, Central Iran. Geoderma; 2014; 213, pp. 15-28. [DOI: https://dx.doi.org/10.1016/j.geoderma.2013.07.020]

7. Bannari, A.; El-Battay, A.; Bannari, R.; Rhinane, H. Sentinel-MSI VNIR and SWIR Bands Sensitivity Analysis for Soil Salinity Discrimination in an Arid Landscape. Remote Sens.; 2018; 10, 855. [DOI: https://dx.doi.org/10.3390/rs10060855]

8. Ma, Y.; Chen, H.; Zhao, G.; Wang, Z.; Wang, D. Spectral Index Fusion for Salinized Soil Salinity Inversion Using Sentinel-2A and UAV Images in a Coastal Area. IEEE Access; 2020; 8, pp. 159595-159608. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3020325]

9. Allbed, A.; Kumar, L.; Aldakheel, Y.Y. Assessing Soil Salinity Using Soil Salinity and Vegetation Indices Derived from IKONOS High-Spatial Resolution Imageries: Applications in a Date Palm Dominated Region. Geoderma; 2014; 230–231, pp. 1-8. [DOI: https://dx.doi.org/10.1016/j.geoderma.2014.03.025]

10. Wang, T.; Shen, X.; Luan, W. Quantitative Inversion of Soil Salinity in Different Seasons in Huinong District, Ningxia Based on Sentinel-2 Satellite. Acad. J. Sci. Technol.; 2024; 10, pp. 100-105. [DOI: https://dx.doi.org/10.54097/6fn0q924]

11. Wang, D.; Li, X.; Zou, D.; Wu, T.; Xu, H.; Hu, G.; Li, R.; Ding, Y.; Zhao, L.; Li, W. et al. Modeling Soil Organic Carbon Spatial Distribution for a Complex Terrain Based on Geographically Weighted Regression in the Eastern Qinghai-Tibetan Plateau. Catena; 2020; 187, 104399. [DOI: https://dx.doi.org/10.1016/j.catena.2019.104399]

12. Meng, L.N.; Ding, J.L.; Wang, J.Z.; Ge, X.Y. Spatial distribution of soil salinity in Ugan-Kuqa River delta oasis based on environmental variables. Trans. Chin. Soc. Agric. Eng.; 2020; 36, pp. 175-181. [DOI: https://dx.doi.org/10.11975/j.issn.1002-6819.2020.01.020]

13. Nurmemet, I.; Sagan, V.; Ding, J.-L.; Halik, Ü.; Abliz, A.; Yakup, Z. A WFS-SVM Model for Soil Salinity Mapping in Keriya Oasis, Northwestern China Using Polarimetric Decomposition and Fully PolSAR Data. Remote Sens.; 2018; 10, 598. [DOI: https://dx.doi.org/10.3390/rs10040598]

14. Sun, D.; Xu, J.; Wen, H.; Wang, D. Assessment of Landslide Susceptibility Mapping Based on Bayesian Hyperparameter Optimization: A Comparison between Logistic Regression and Random Forest. Eng. Geol.; 2021; 281, 105972. [DOI: https://dx.doi.org/10.1016/j.enggeo.2020.105972]

15. Putatunda, S.; Rama, K. A Comparative Analysis of Hyperopt as Against Other Approaches for Hyper-Parameter Optimization of XGBoost. Proceedings of the 2018 International Conference on Signal Processing and Machine Learning; Beijing, China, 12–16 August 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 6-10. [DOI: https://dx.doi.org/10.1145/3297067.3297080]

16. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Anchorage, AK, USA, 4–8 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2623-2631. [DOI: https://dx.doi.org/10.1145/3292500.3330701]

17. Mangalathu, S.; Hwang, S.-H.; Jeon, J.-S. Failure Mode and Effects Analysis of RC Members Based on Machine-Learning-Based SHapley Additive exPlanations (SHAP) Approach. Eng. Struct.; 2020; 219, 110927. [DOI: https://dx.doi.org/10.1016/j.engstruct.2020.110927]

18. Yu, S.; Chen, Z.; Yu, B.; Wang, L.; Wu, B.; Wu, J.; Zhao, F. Exploring the Relationship between 2D/3D Landscape Pattern and Land Surface Temperature Based on Explainable eXtreme Gradient Boosting Tree: A Case Study of Shanghai, China. Sci. Total Environ.; 2020; 725, 138229. [DOI: https://dx.doi.org/10.1016/j.scitotenv.2020.138229] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32298895]

19. Zhou, M.; Li, Y. Spatial Distribution and Source Identification of Potentially Toxic Elements in Yellow River Delta Soils, China: An Interpretable Machine-Learning Approach. Sci. Total Environ.; 2024; 912, 169092. [DOI: https://dx.doi.org/10.1016/j.scitotenv.2023.169092]

20. Tong, C.; He, R.; Wang, J.; Zheng, H. Study on Water and Salt Transport Characteristics of Sunflowers under Different Irrigation Amounts in the Yellow River Irrigation Area. Agronomy; 2024; 14, 1058. [DOI: https://dx.doi.org/10.3390/agronomy14051058]

21. Wei, T.; Niu, J.; Jing, Y.; Lu, S.; Yao, Y.; Zhao, P.; Duan, Y.; Tuo, D. The Principal Component Analysis of Salinized in Dalate of Inner Mongolia. J. Inn. Mong. Agric. Univ. (Nat. Sci. Ed.); 2016; 37, pp. 55-61. [DOI: https://dx.doi.org/10.16853/j.cnki.1009-3575.2016.02.010]

22. Wang, Y.; Qu, Z.; Yang, W.; Chen, X.; Qiao, T. Inversion of Soil Salinity in the Irrigated Region along the Southern Bank of the Yellow River Using UAV Multispectral Remote Sensing. Agronomy; 2024; 14, 523. [DOI: https://dx.doi.org/10.3390/agronomy14030523]

23. Li, J.; Zhang, T.; Shao, Y.; Ju, Z. Comparing Machine Learning Algorithms for Soil Salinity Mapping Using Topographic Factors and Sentinel-1/2 Data: A Case Study in the Yellow River Delta of China. Remote Sens.; 2023; 15, 2332. [DOI: https://dx.doi.org/10.3390/rs15092332]

24. Guo, Z.; Li, Y.; Wang, X.; Gong, X.; Chen, Y.; Cao, W. Remote Sensing of Soil Organic Carbon at Regional Scale Based on Deep Learning: A Case Study of Agro-Pastoral Ecotone in Northern China. Remote Sens.; 2023; 15, 3846. [DOI: https://dx.doi.org/10.3390/rs15153846]

25. Wang, N.; Peng, J.; Xue, J.; Zhang, X.; Huang, J.; Biswas, A.; He, Y.; Shi, Z. A Framework for Determining the Total Salt Content of Soil Profiles Using Time-Series Sentinel-2 Images and a Random Forest-Temporal Convolution Network. Geoderma; 2022; 409, 115656. [DOI: https://dx.doi.org/10.1016/j.geoderma.2021.115656]

26. Wang, J.; Ding, J.; Yu, D.; Ma, X.; Zhang, Z.; Ge, X.; Teng, D.; Li, X.; Liang, J.; Lizaga, I. et al. Capability of Sentinel-2 MSI Data for Monitoring and Mapping of Soil Salinity in Dry and Wet Seasons in the Ebinur Lake Region, Xinjiang, China. Geoderma; 2019; 353, pp. 172-187. [DOI: https://dx.doi.org/10.1016/j.geoderma.2019.06.040]

27. Wang, S.; Wang, W.; Wu, Y.; Zhao, S. Surface Soil Moisture Inversion and Distribution Based on Spatio-Temporal Fusion of MODIS and Landsat. Sustainability; 2022; 14, 9905. [DOI: https://dx.doi.org/10.3390/su14169905]

28. Ge, X.; Ding, J.; Teng, D.; Wang, J.; Huo, T.; Jin, X.; Wang, J.; He, B.; Han, L. Updated Soil Salinity with Fine Spatial Resolution and High Accuracy: The Synergy of Sentinel-2 MSI, Environmental Covariates and Hybrid Machine Learning Approaches. Catena; 2022; 212, 106054. [DOI: https://dx.doi.org/10.1016/j.catena.2022.106054]

29. Guo, L.; Chen, Y.; Shi, T.; Zhao, C.; Liu, Y.; Wang, S.; Zhang, H. Exploring the Role of the Spatial Characteristics of Visible and Near-Infrared Reflectance in Predicting Soil Organic Carbon Density. ISPRS Int. J. Geo-Inf.; 2017; 6, 308. [DOI: https://dx.doi.org/10.3390/ijgi6100308]

30. McMillen, D.P. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Am. J. Agric. Econ.; 2004; 86, pp. 554-556. [DOI: https://dx.doi.org/10.1111/j.0002-9092.2004.600_2.x]

31. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature; 2015; 521, pp. 436-444. [DOI: https://dx.doi.org/10.1038/nature14539]

32. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785-794. [DOI: https://dx.doi.org/10.1145/2939672.2939785]

33. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Proceedings of the Advances in Neural Information Processing Systems; Long Beach, CA, USA, 4–9 December 2017; pp. 4765-4774.

34. Wei, G.; Li, Y.; Zhang, Z.; Chen, Y.; Chen, J.; Yao, Z.; Lao, C.; Chen, H. Estimation of Soil Salt Content by Combining UAV-Borne Multispectral Sensor and Machine Learning Algorithms. PeerJ; 2020; 8, e9087. [DOI: https://dx.doi.org/10.7717/peerj.9087] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32377459]

35. Yin, H.; Chen, C.; He, Y.; Jia, J.; Chen, Y.; Du, R.; Xiang, R.; Zhang, X.; Zhang, Z. Synergistic Estimation of Soil Salinity Based on Sentinel-1 Image Texture and Sentinel-2 Salinity Spectral Indices. J. Appl. Remote Sens.; 2023; 17, 018502. [DOI: https://dx.doi.org/10.1117/1.JRS.17.018502]

36. Zhang, S.; Zhao, J.; Yang, J.; Xie, J.; Sun, Z. Feature Selection and Regression Models for Multisource Data-Based Soil Salinity Prediction: A Case Study of Minqin Oasis in Arid China. Land; 2024; 13, 877. [DOI: https://dx.doi.org/10.3390/land13060877]

37. Cui, X.; Han, W.; Zhang, H.; Cui, J.; Ma, W.; Zhang, L.; Li, G. Estimating Soil Salinity under Sunflower Cover in the Hetao Irrigation District Based on Unmanned Aerial Vehicle Remote Sensing. Land Degrad. Dev.; 2023; 34, pp. 84-97. [DOI: https://dx.doi.org/10.1002/ldr.4445]

38. Wang, J.; Peng, J.; Li, H.; Yin, C.; Liu, W.; Wang, T.; Zhang, H. Soil Salinity Mapping Using Machine Learning Algorithms with the Sentinel-2 MSI in Arid Areas, China. Remote Sens.; 2021; 13, 305. [DOI: https://dx.doi.org/10.3390/rs13020305]

39. Mukhamediev, R.; Amirgaliyev, Y.; Kuchin, Y.; Aubakirov, M.; Terekhov, A.; Merembayev, T.; Yelis, M.; Zaitseva, E.; Levashenko, V.; Popova, Y. et al. Operational Mapping of Salinization Areas in Agricultural Fields Using Machine Learning Models Based on Low-Altitude Multispectral Images. Drones; 2023; 7, 357. [DOI: https://dx.doi.org/10.3390/drones7060357]

40. Zhang, L.; Cai, Y.; Huang, H.; Li, A.; Yang, L.; Zhou, C. A CNN-LSTM Model for Soil Organic Carbon Content Prediction with Long Time Series of MODIS-Based Phenological Variables. Remote Sens.; 2022; 14, 4441. [DOI: https://dx.doi.org/10.3390/rs14184441]

41. Ayoubi, S.; Sahrawat, K.L. Comparing Multivariate Regression and Artificial Neural Network to Predict Barley Production from Soil Characteristics in Northern Iran. Arch. Agron. Soil Sci.; 2011; 57, pp. 549-565. [DOI: https://dx.doi.org/10.1080/03650341003631400]

42. Jiang, Z.; Xu, B. Geographically Weighted Regression Analysis of the Spatially Varying Relationship between Farming Viability and Contributing Factors in Ohio. Reg. Sci. Policy Pract.; 2014; 6, pp. 69-84. [DOI: https://dx.doi.org/10.1111/rsp3.12028]

43. Wang, N.; Xue, J.; Peng, J.; Biswas, A.; He, Y.; Shi, Z. Integrating Remote Sensing and Landscape Characteristics to Estimate Soil Salinity Using Machine Learning Methods: A Case Study from Southern Xinjiang, China. Remote Sens.; 2020; 12, 4118. [DOI: https://dx.doi.org/10.3390/rs12244118]

44. Wang, F.; Yang, S.; Wei, Y.; Shi, Q.; Ding, J. Characterizing Soil Salinity at Multiple Depth Using Electromagnetic Induction and Remote Sensing Data with Random Forests: A Case Study in Tarim River Basin of Southern Xinjiang, China. Sci. Total Environ.; 2021; 754, 142030. [DOI: https://dx.doi.org/10.1016/j.scitotenv.2020.142030] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32911147]

45. Nosetto, M.D.; Acosta, A.M.; Jayawickreme, D.H.; Ballesteros, S.I.; Jackson, R.B.; Jobbágy, E.G. Land-Use and Topography Shape Soil and Groundwater Salinity in Central Argentina. Agric. Water Manag.; 2013; 129, pp. 120-129. [DOI: https://dx.doi.org/10.1016/j.agwat.2013.07.017]

46. Taghizadeh-Mehrjardi, R.; Sheikhpour, R.; Zeraatpisheh, M.; Amirian-Chakan, A.; Toomanian, N.; Kerry, R.; Scholten, T. Semi-Supervised Learning for the Spatial Extrapolation of Soil Information. Geoderma; 2022; 426, 116094. [DOI: https://dx.doi.org/10.1016/j.geoderma.2022.116094]

47. Farzamian, M.; Paz, M.C.; Paz, A.M.; Castanheira, N.L.; Gonçalves, M.C.; Monteiro Santos, F.A.; Triantafilis, J. Mapping Soil Salinity Using Electromagnetic Conductivity Imaging—A Comparison of Regional and Location-specific Calibrations. Land Degrad. Dev.; 2019; 30, pp. 1393-1406. [DOI: https://dx.doi.org/10.1002/ldr.3317]

48. Xu, H.; Chen, C.; Zheng, H.; Luo, G.; Yang, L.; Wang, W.; Wu, S.; Ding, J. AGA-SVR-Based Selection of Feature Subsets and Optimization of Parameter in Regional Soil Salinization Monitoring. Int. J. Remote Sens.; 2020; 41, pp. 4470-4495. [DOI: https://dx.doi.org/10.1080/01431161.2020.1718239]

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Soil salinization is a serious form of land degradation that poses a severe threat to regional agricultural resource utilization and sustainable development. Using machine-learning methods to monitor large-scale salinized soil quickly has become a mainstream trend. However, training machine-learning models requires many samples and hyper-parameter optimization, and the resulting models lack interpretability. To compare the performance of different machine-learning models, this study conducted a soil sampling experiment on saline soils along the south bank of the Yellow River in Dalate Banner. The experiment covered two years (2022 and 2023) during the spring bare soil period and collected 304 soil samples. Soil salinity was estimated from multi-source remote sensing satellite data by combining the extreme gradient boosting model (XGBoost), Optuna hyper-parameter optimization, and the SHapley Additive exPlanations (SHAP) interpretable model. Correlation analysis and continuous variable projection were employed to identify key inversion factors. The regression performance of partial least squares regression (PLSR), geographically weighted regression (GWR), long short-term memory networks (LSTM), and XGBoost was compared, and the optimal model was selected to estimate soil salinity in the study area from 2019 to 2023. The results showed that the XGBoost model fitted best: on the test set it achieved an R² of 0.76 and a ratio of performance to deviation of 2.05, and its estimates were consistent with the measured salinity values. SHAP analysis revealed that the salinity index and topographic factors were the primary inversion factors. Notably, the same inversion factor influenced soil salinity estimates differently at different locations. Saline soils accounted for 65% and 44% of the study area in 2019 and 2023, respectively, and soil salinization showed an overall decreasing trend. In terms of spatial distribution, the degree of soil salinization gradually increased from south to north, and it was most serious on the side near the Yellow River. This study is of great significance for the quantitative estimation of salinized soil in the irrigated area on the south bank of the Yellow River, the prevention and control of soil salinization, and the sustainable development of agriculture.

Details

Title

An Interpretable Model for Salinity Inversion Assessment of the South Bank of the Yellow River Based on Optuna Hyperparameter Optimization and XGBoost

Author

Liu, Xia¹; Hu, Yu¹; Li, Xiang¹; Du, Ruiqi²; Xiang, Youzhen²; Zhang, Fucang²

¹ College of Water Conservancy and Civil Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China; [email protected] (X.L.); [email protected] (X.L.); Autonomous Region Collaborative Innovation Center for Integrated Management of Water Resources and Water Environment in the Inner Mongolia Reaches of the Yellow River, Hohhot 010018, China
² College of Water Resources and Architecture Engineering, Northwest A&F University, Yangling 712100, China; [email protected] (R.D.); [email protected] (Y.X.); [email protected] (F.Z.)

Publication year

2025

Publication date

2025

Publisher

MDPI AG

e-ISSN

20734395

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/agronomy15010018

ProQuest document ID

3159265006