Modeling Spatial Flood using Novel Ensemble

Full text

Turn on search term navigation

1. Introduction

Floods and earthquakes are the most common and unpredictable natural events that cause damage to communities by affecting lives and property [1]), and determine the basin landforms [2]. Flash flooding is more disastrous than any other type of flooding because of its very short lag time [3,4]. Flash floods occur suddenly in any region and, thus, it appears to be very difficult to predict such events [5]. Moreover, flash flood depends on volume of run-off, the associated run-off velocity and infiltration by soil and its impact varies from region-to-region based on its geo-physical background [5]. Thus, to identify the probable regions of flash flood accurately is quite difficult. The increase in frequency of such devastating flood events in many parts of the world is mainly driven by climate change and large total rainfall within short time periods [6,7,8]. It is also quite surprising that more than 90% of the loss of lives and properties in Asia is due to flooding rather than other natural or/and man-induced disasters [9,10]. Between 1988 and 2011, natural events caused the death of 390,000 people worldwide, of which 58% was due to flooding [1]. Since complete mitigation of floods and other natural disasters is not possible, reducing the impacts of floods is crucial for livelihoods and sustainable development. That is why flood susceptibility modeling helps to identify the flood prone zones within a vast territory, which becomes helpful for planners and decision-makers for flood management [11]. Thus, ultimately identifying flood-prone zones and mapping these areas in detail can help planners reduce the detrimental impacts. This is an important contribution to achieving the United Nations sustainable development goals and to effectively resolve the problem of the neutrality of land degradation [12] in arid and semi-arid lands in which floods, degradation of land, drought, and desertification endanger people and their natural resources [13,14].

In Iran, floods are one of the most devastating natural disasters, resulting in an almost $3.6 billion loss in Iran [15]. A variety of factors, such as rainfall change and geophysical conditions are believed to be responsible for the rise in the number of flood events causing the loss of life and property [16]. In the last six decades, in Iran, there has been 3700 flood incidents, the most widely affected northern Iran [17]. More specifically, the Gorganroud River Basin in Iran is well-known for its large number of floods, with over 120 recorded floods between 1991 and 2012 [18].

The suitability of a method for flood susceptibility (FS) assessment depends on the capability of this method, with respect to flood conditioning factors (FCFs) used in the modeling. The selection of factors is, therefore, the first and most important step, and is mainly based on previous flood studies. The next task is to determine the order of importance of these flood occurrence factors and apply the best methods to identify and predict the flood hazard zones precisely. There are many physically-based models for predicting hydrological events. Numerous researches, however, indicate that there is a short-term predictive potential difference in current physical models [19,20,21,22,23,24,25]. Consequently, physically-based models do not provide reliable quantitative flood predictions with the help of a limited amount of data, and these models accumulate uncertainty errors in the modeling results [26,27,28]. These shortcomings promote the use of machine learning (ML) models. Soft computing methods are now becoming very popular due to their capability for dealing with a large number of datasets of varying nature, and for generating highly accurate results. These methods include logistic regression [11,29,30,31], neuro-fuzzy inference systems [32,33,34], decision trees [30], support vector machines (SVMs) [31,32,33,34,35], evidential belief functions [36,37], artificial neural networks (ANNs) [38], analytical hierarchy process and fuzzy logic [39,40,41,42,43], frequency ratio [11,30,31,44,45], conditional probability, certainty factor and Shannon entropy [10], Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), and VIseKriterijumska Optimizacija I Kompromisno Resenje (VIKOR) [37].

Several ML algorithms were introduced and used for various hazard-mapping in the current decade [46,47,48,49]. Here, the hybrid ensemble methods are more capable of performing reliable predictive analysis [50]. Ensemble techniques, such as Bagging and Boosting, are widely used for classification problems where multiple independent categorical variables are simultaneously used to address one issue. These techniques are now emerging in the field of spatial analysis where the ensemble models are used in landslide hazard mapping [32,51,52,53] and groundwater potential analysis [54,55], but, so far, have rarely been used to model flood hazards [56,57]. The Real AdaBoost can obtain a strong classifier using fewer decision trees [58]. It combines the AdaBoost with a real-valued classifier; therefore, it gives a more accurate result compared to the AdaBoost model. The key benefit of the approach of Random Subspace is that it uses its samples of different spatial features to achieve sub-classification differences and provide better results [59,60,61,62]. MultiBoosting can generate a decision tree with a lower error than both AdaBoost and Wagging [63,64]. Instead, Real AdaBoosting, MultiBoosting, Random Subspace and J48 are widely used in other sectors, such as banking and finance. Accordingly, in the work reported herein, we applied and tested ensemble methods combining meta-classifiers (Real AdaBoosting, MultiBoosting, and Random Subspace) with a tree-based algorithm (i.e., J48) as novel approaches. In doing so, we tested the new approaches for flood prediction and hazard zoning in the Gorganroud River Basin, in northern Iran. More specifically, we evaluated the contribution of a range of independent factors on flood susceptibility using the FR method. The accuracy of the models were assessed using the efficiency, sensitivity, root-mean square-error (RMSE), and receiver operating characteristics (ROC) technique.

The principal aim of the work discussed in this paper is to evaluate flood susceptibility (FS) model capacity, in particular, the hybrid machine learning ensemble approaches using the Gorganroud River Basin, Iran, as a case study. This will bring information to understand, forecast, and prevent flash floods.

2. Materials and Methods

2.1. Description of the Study Area

Located in the Golestan Province in northern Iran, the Gorganroud River Basin stretches between latitudes 36°25′to 38°15′N and longitudes 56°26′ to 54°10′ (Figure 1). This basin spans an area of 11,290 km², and falls into the Caspian Sea from the eastern Alborz Mountains. The height varies in the western zone (Caspian Sea) from −96 m above sea level (a.s.l). to 3669 m a.s.l. in the southern regions. The estimated annual temperature and precipitation are 17.8 °C and 500 mm, respectively [65]. The main climate types in the study area are semi-arid in the central part, Mediterranean in the northern areas, semi-humid in the south, and humid in the northeastern and northwestern regions. Based on the National Cartographic Centre and Geological Survey of Iran [66], lithological formations are also complex, but can be classified into four major age classes, consisting of Cenozoic, Mesozoic, Paleozoic, and Proterozoic. Agriculture, bare land, dry farmland, timber, rangeland, orchard, residential, water/wetland, and woodland are the nine major land use/land cover (LU/LC) types found in the study basin. Agriculture, woodland, and rangelands occupy much of the watershed. In the study area, seven types of soil have been described: Entisols, Inceptisols, Salt Flats, Alfisols, Aridisols, Inceptisols, and Mollisols, with spatially dominant Mollisols. Slopes in the watershed range from 0° to 79°, with most of the region (61.24%) having low slope angles (<14.2°). Due to extensive and unplanned land-use changes, and especially deforestation and urbanization, combined with heavy rainfall, this study area is recognized as one of the flood susceptible watersheds in Iran [18]. According to available evidence, between 1991 and 2019 more than 120 major and minor floods occurred in the basin [18]. For instance, the Gorganroud river flow reached 3020 cubic meters per second during the flood on 11 August 2001, and the river widened from 10 m to 400 m. This flood has caused over 500 deaths and is, thus, considered to be the most destructive flood in Iranian history [17]. Another example is a flood that occurred on 17 March 2019, during which, along with 12,000 houses, facilities, agricultural land, and gardens, most of the towns, including Aqqala and Gomishan, and at least 70 villages, were affected [67]. Given such devastating impacts arising from flooding, identifying the most susceptible areas in this flood prone region is crucial for targeting management for reducing flood damage.

2.2. Methodology

Flood susceptibility maps were prepared as part of this study using three novel ensemble ML algorithms. Using a number of main steps, the analysis was carried out. Firstly, two kinds of data, i.e., flood incidence and flood conditioning factors (FCFs), were gathered. Secondly, the flood inventory map and thematic layers of the FCFs were prepared (elevation, slope, curvature of the plan, topographic position index (TPI), topographic wetness index (TWI), convergence index (CI), stream power index (SPI), drainage density (DD), stream distance, precipitation, lithology, land use/land cover (LU/LC), normalized difference vegetation index (NDVI), and soil type). Third, 426 flood locations were reported, 70% of which were considered to prepare FS models, and 30% of which were used to validate models produced using training datasets. In step four, the tolerance was applied and the variance inflation factor (VIF) multicollinearity was tested among the selected FCFs. VIF estimates the multicollinearity problem among the independent factors, and if it is present, then that factor should not be used for modeling. In step five, the contribution of the factors was analyzed using the frequency ratio (FR), boosted regression tree (BRT), and AdaBoost. For step six, in order to produce the flood susceptibility map (FSM), training data was first applied to predict the spatial flood susceptibility (FS) using the base classifier of J48. Then, FSMs were prepared using meta-classifiers (i.e., Real AdaBoost, Random Subspace, and MultiBoosting) ensemble models (Real AdaBoost J48, Random Subspace J48, and MultiBoosting J48). Finally, we validated the FSMs generated by the ensemble models using the ROC curve, efficiency, accuracy, and true skill statistics (TSS), and analyzed the sensitivity of the individual flood conditioning factors to the four models. Figure 2 summarizes the key steps in our methods. J48, Real AdaBoost J48, Random Subspace J48, and MultiBoosting J48 models were run in R studio programming software using the CRAN R Weka, party, regRSM packages. Finally, the layouts of the flood susceptibility maps were prepared using ArcGIS software.

2.3. Database

2.3.1. Flood Inventory Map (FIM)

The flood location was regarded as the dependent variable for the calibration of the FSMs, and the effective FCFs were regarded as the independent variables. Using the dependent and independent variable data, the FSMs were prepared based on the J48 and ensemble models. It is necessary to prepare a FIM for the spatial flood probability prediction [10,36]. In previous works, several sources were used for preparing the FIM, such as governmental sources, newspaper articles, field surveys, etc. Modified Normalized Difference Water Index (MDNWI) derived from the Sentinel-2 satellite images through the Google Earth Engine can be used for preparing the FIM [56]. In this study, for the period 2001–2019, we used records from the Ministry of Iranian Water source, investigation documents on disaster management in the Golestan Province, and, three times, field surveys were carried out (on 8 February 2018; 2 April 2019; and 21 May 2019) in the randomly selected points to construct the flood inventory map. Flood points were selected randomly among the flood sites, having the flood frequency of one flood per year. We listed 426 flood sites in the research region (Figure 3), which we randomly divided into two datasets, i.e., a training dataset (70%) and a test data set (30%). We also selected the same amount of non-flood sites, randomly, for training as well as validating the models.

2.3.2. Generating Flood Conditioning Factors (FCFs)

Many factors control the flood susceptibility (FS) of an area, but there are no standard criteria for selecting the FCFs to depict the spatial FS. We initially selected 16 FCFs on the basis of previous literature, but eventually selected six topographical factors (elevation, slope, plan curvature, SPI TPI and CI), four hydrological factors (TWI, DD, stream distance, and rainfall), two environmental factors (LU/LC and NDVI), as well as lithology and soil type after multicollinearity analysis for FS modeling [10,35]. A geological map from the Geological Survey of Iran, rainfall data from the Islamic Republic of Iran Metrological Organization, Landsat 8 Operational Land Imager (OLI) from the United States Geological Survey (USGS) (https:/earthexplorer.usgs.gov/), Phased Array type L-band Synthetic Aperture Radar Digital Elevation Model (PALSAR DEM) of 12.5 × 12.5 m resolution from the Alaska satellite facility, and soil from Mazandaran Agricultural and Natural Resources Research Centre were obtained for the preparation of the thematic FCF layers. PALSAR DEM is considered the most accurate DEM available free of charge [68]. The level of vertical accuracy is most important for the mapping the spatial flood susceptibility. Just like Gesch et al. [69], it was done to calculate the uncertainty in the DEM comparison between Advanced Land Observing Satellite (ALOS) DEM elevations and those of the ground control points (GCPs) [69]. In this analysis, ALOS DEM was used with a vertical precision of 0.3 m.

For geomorphological research and modeling natural disasters, a reliable topographical data source is essential [70]. The Interferometric synthetic aperture radar (InSAR) system and the ALOS–PALSAR data (slave image: 12 August 2009, and master image: 18 September 2012) were used in our research to construct a high precision DEM. The most critical phases in the generation of InSAR DEM include phase estimation and the transformation from phase to height [71]. In Figure 4, the steps for generating a DEM using InSAR are presented. PALSAR sensor images for the study area were downloaded (www.nasa.gov) to provide a DEM of the region. For the entire study area, the PALSAR DEM was generated using the Mosaicing method incorporated in ENVI5.1 software. To boost the accuracy and reduce the errors of the DEM derived from PALSAR, filters were used. The right filters were chosen using experiments, defects, and recommendations from other researchers [72]. A basic filter (smooth 2 × 2) was used to eliminate errors and enhance detection precision and the hydrological calculation precision was improved with the Fill Sinks filter.

Slope: flooding is specifically related to the gradient of slopes [73,74,75]. The surface slope of the area (Figure 5a) is an important factor influencing the frequency of flooding, since there is very little opportunity for rainwater to penetrate steep hills and much more opportunity for rainwater to penetrate plain and gently sloping land [30]. The slope determines not only the velocity of the flow, but also the intensity and volume of run-off. Steeper slopes are more vulnerable to flash floods, but gentle slopes are more likely to experience longer duration flooding. Ultimately, the combination of steep slopes in the upper part of the watershed and gentle slopes in the lower portions magnifies the local flood hazard. The stagnation of water for longer periods mostly affects the flat regions and the most devastating floods also occur in the lower sloping areas. Therefore, the slope of the study area significantly affects the intensity and duration of any flood event.

Elevation: among the different FCFs (Figure 5b), there is a reciprocal relationship between flooding and elevation [30]. The areas at a low altitude are more susceptible to floods than the higher altitude regions [76]. In reality, flooding in high elevation areas is almost impossible [77]. The height of the basin varies from 64 to 3686 m, suggesting the existence of low-lying flat plains as well as hilly terrain and mountains, while the bulk of the sample region is flat plains (Figure 5b).

Plan curvature: the curvature of the plan is a morphological element regulating the vulnerability of flooding by the controlling of water flow on a region’s surface [44]. The plan curvature in the region analyzed ranges from −25.79 to 31.61 (Figure 5c). The area’s plan curvature is categorized into three concave, flat, and convex types, whereby concave areas preserve water and convex areas induce runoff. Interestingly, both concave and convex areas make up similar proportions of the study basin (45% each), which indicates that the basin is susceptible to both high run-off and flood retention.

Topographic wetness index (TWI): introduced by Beven and Kirkby [78], the TWI defines the spatial wetness status in the basin, which influences the susceptibility to flooding [28]. The TWI represents the effect of topography at every point in the catchment on the output of runoff and the volume of flow accumulation [79]. The TWI is calculated from the following Equation (1):

(1) $TWI = In (A_{S} / \tan β)$

where, in respect to per unit contour length (m²/m),

A_{S}

represents the upslope area and β is the slope gradient (degrees). The TWI in this basin ranges from 0.55 and 24.09 (Figure 5d).

Topographic positioning index (TPI): TPI is defined as the altitudinal difference within a specified radius (R) between the central point (Z₀) and the average elevation ( $\bar{Z}$ ) around it [80,81]. Utilizing Equations (2) and (3), the TPI is measured.

(2) ${TPI = Z}_{0} - \bar{Z}$

(3) $\bar{Z} = \frac{1}{n_{R}} \sum_{i \in R} Z_{i}$

The TPI value relies on the unit (R) of the landscape. A small R reveals small valleys and ridges [82]. TPI ranges from −142.59 to 152.29 within the area of study (Figure 5e).

Convergence index (CI): the CI is a major terrain element showing the relief structure of the study region as a series of channels (convergent areas) and ridges (divergent areas). It was posed by Kiss [83]. The CI is calculated using the following Equation (4):

(4) $C I = (\frac{1}{8} \sum_{i = 1}^{8} θ_{i}) - 90^{0}$

where: the average angle is shown by θ_i between the direction of the neighboring cells and the central cell location. The value of CI varies between −100 and +100 in the study region (Figure 5f).

Stream power index (SPI): the SPI demonstrates the erosive capacity of surface running water [84], which has been demonstrated to affect flood vulnerability [85]. DEM was used to derive the SPI using the following Equation (5):

(5) $SPI = A s \times \tan β$

where: the upstream region is As, and the gradient is β. The spatial distribution of the SPI ranges from 6.27 to 26.35 (Figure 5g) in the basin.

Distance to stream: a method for extracting stream buffer in ArcGIS was applied is the Euclidean distance buffer (Figure 5h). Areas along the main river are more susceptible to floods, but less susceptible to remote areas [86].

Drainage density (DD): drainage density (Figure 5i) is able to exert good influence over the frequency of floods and is calculated from the DEM using the ArcGIS line density tool [87]. This research has identified four DD groups (e.g., <0.33, 0.33–0.51, 0.51–0.7 and >0.7).

Rainfall: for the purposes of plotting average rainfall thirty years (1986 through 2016) rainfall data were obtained for 44 meteorological stations (Figure 1). The Wahba first suggested thin plate spline process is an effective and efficient system of interpolating. In order to prepare the raster layer of rainfall data collected from different weather stations, a modified orthogonal least-square plate spline (TPS-M) was used. This technique is more reliable than the traditional method of interpolation. The works of Boer et al. [88] offered detailed explanations for the technique. Rainfall plays a central role in flood occurrence within a region, as it is the source of water in the channels as well as in the ground [89]. We defined five rainfall categories for the purpose of this study, namely <419.71 mm, 419.71–547.87 mm, 547.87–682.61 mm, 682.61–820.63 mm, and >820.636 mm.

Lithology and soil type: lithological and soil maps are of considerable significance when assessing flood-prone areas as the characteristics of the soil, such as density, degree of permeability, composition, and form of substrate directly influence the drainage cycle. The lithology and soil of a region regulate flooding by regulating erosion and infiltration in a watershed [85]. The lithological and soil maps (Figure 5k and 5l for the study area were drawn up on the basis of an available geological map (scale: 100,000) and a soil map (1:250,000) from a Research Centre for Agricultural and Natural Resources in Mazandaran. Nine lithological units (Supplementary Table S1) and seven soil classes, (Entisols, Inceptisols, Salt Flats, Alfisols, Aridisols, Inceptisols, and Mollisols) were identified.

LU/LC and NDVI: Landsat 8 OLI images were used for producing the LU/LC and NDVI maps (Figure 5m,n) in ArcGIS. Before preparing the LU/LC and NDVI, the satellite images are pre-processed through atmospheric and radiometric correction. Classes of LU/LC with bare land and concrete surfaces increase land flow, while dense forest and vegetation limit overland flows [26]. Moreover, 632 ground control points (GPs) were identified to test the classification accuracy of the LU/LC map. Out of the 632 GCPs 190, 239, 6, 10, 4, 56, 116, 8, and 3 GCPs were taken for forest cover, agriculture class, residential class, orchard class, bare land class, dry farming, rangeland, wood land, and water/wetland classes, respectively. Kappa coefficient was calculated using the Equation (6) [90]. The average precision of the LU/LC map was reasonable (Kappa coefficient = 0.972). Nine LU/LC categories were created for the study basin. From the Landsat 8 OLI, NDVI was calculated. For calculating the NDVI, band 4 and band 3 were used. The value above zero (towards +1) indicates the vegetation and the value below zero (towards −1) indicates the water body of the region. Three NDVI categories were defined, (i.e., <0.201, 0.201–0.369 and >0.369) to represent vegetation density. An NDVI value of >0 indicates the presence of vegetation and with increasing values, the probability of it being a forest increases.

(6) $K = \frac{{N \sum_{i = 1}^{r} (X_{i i}) - N \sum_{i = 1}^{r} (X_{i +} \cdot X_{+ i})}}{N^{2}} - \sum_{i = 1}^{r} (X_{i +} \cdot X_{+ i})$

2.4. Multicollinearity Test of Effective Factors

Pradhan [91] reported a standard measure for detecting collinear variables among the factors used in probability modeling. Multicollinearity is a condition in mathematics where one predictor variable can be predicted linearly, based on the others, with a large degree of accuracy in a multiple regression model. It can identify the independent variables that can be used further in the model with the help of variance inflation factors (VIF) [92,93,94]. A VIF value greater than 5 suggests a concern with multicollinearity [95,96]. If the test identifies any collinear variable, then it should not be used for predictive modeling. In this study, fourteen independent variables were used for flood susceptibility analysis based on previous flood-related research, meaning that assessment of multicollinearity was important. The VIF was determined using the following Equation (7) among certain variables:

(7) $V I F = \frac{1}{T o l e r a n c e}$

2.5. Analysis of the Relationship between FCFs and Flood Occurrences Using the Frequency Ratio (FR) Model

The FR was used to determine the relation between FCFs and the frequency of floods. For a given class of variables, the FR can be defined as the ratio of the likelihood of flood to non-flood, which can be calculated in terms of the percentage share of flood pixels divided by the domain type pixel [97]. The FR values represent the importance of the class variables in terms of flood probability:

(8) $FR = \frac{Li / Ci}{L / C}$

where: Li is flood cells in the ith category, Ci is the total cells in the ith category, L is total flood cells, and C is the total cells. An FR value >1 shows the greater flood cell concentration in a given segment as compared to the total data layer. On the other side, if a certain category’s value is below 1, this suggests a small concentration of flood cells in the data layer.

2.6. Flood Susceptibility Spatial Modeling using Machine Learning Ensemble Methods

2.6.1. J48 Decision Tree

J48 is a decision tree algorithm that can be used to determine the attribute vector’s behavior for any given number of instances [98]. The algorithm creates the rules for target variable prediction, and with the help of tree classification and the data distribution, can also be simply conceptualized [95]. The add-on features in J48 include its ability to account for missing values, tree pruning, and setting rules. It is also able to generate data mining results with higher accuracy compared to other tree-based algorithms. For the generation of decision trees, this J48 ensemble model uses the C4.5 algorithm, which is very efficient in statistical classification [99]. It is an attribute selection method performed using input information and entropy equations from any types of data with class levels [100].

2.6.2. Real AdaBoost

Freund and Schapire [101] proposed this new ML ensemble model to make multiple weak classifiers into one strong classifier through the process of boosting. Training data subsets can easily be obtained from such ensemble models. The AdaBoost algorithm is based on the following mechanisms: firstly, a subset of the training samples is developed, and then the initial classification configuration is generated according to the same weighted factors. Secondly, all training dataset events are simulated in this fundamental model. The erroneous cases now weight more, while the correctly graded instances weigh the same. Thirdly, the weights are standardized for all training instances and a new subset is randomly generated to develop the next model classifier-based model. That is working continuously until it stops. Ultimately, we end up with a good classifier model owing to the weighted total of all the classifier-based models previously developed.

The Real AdaBoost is a modified and advanced form of the AdaBoost that can be used to obtain a strong classifier using fewer decision trees [58]. Friedman, Hastie, and Tibshirani [101] first introduced this ensemble method from a formal statistical perspective. Their work has shown that the Real AdaBoost removed the necessity of having a coefficient as the optimal coefficient—α_j = 1 [58]. It combines the AdaBoost with a real-valued classifier or half of the weighted log odds predicted by the classifier; therefore, it gives a more accurate result compared to the AdaBoost model. In the final stage, the Real AdaBoost combines all the decision tree values into one value; hence, the model can be represented as a scorecard [101].

2.6.3. Random Subspace

The problem of over-fitting the training data for classification in decision trees with the highest accuracy can be overcome using Random Subspace [56]. This can be used to create a tree-based decision classifier that can preserve the maximum precision of training data coupled with increased generalization performance with increasing difficulty. Here, this ensemble method is quite similar to Bagging and trees are constructed in randomly selected subspaces [102,103,104,105]. Therefore, from the original training datasets, the subspaces were randomly chosen and through a combination of voting methods, the final result was generated from the subset of each classifier training dataset [106,107]. The key benefit of the approach of learning Random Subspace is that it uses its set of samples of varying spatial characteristics to achieve variations in sub-classification [59,60,61,62].

2.6.4. MultiBoosting

MultiBoost is an enhancement of the AdaBoost concept that uses the power of both AdaBoosting and Wagging to solve the issue of over-fitting [63,64]. MultiBoosting can generate a decision tree with a lower error than both AdaBoost and Wagging. It is also able to perform parallel executions, which makes it more advanced compared to AdaBoost. MultiBoosting uses a random bootstrap sampling approach to establish training subsets and allocate random weights to each subset [64]. Therefore, the step-by-step process of MultiBoosting is as follows. Firstly, using the training dataset, a set of training subsets is formed through random selection and replacement and then it is used to build the classifier-based model. Secondly, the weights are determined based on the overall accuracy of the classifier-based model. Thirdly, the new subsets are developed on the basis of continuous sampling and instance weightings to train the new classifier-based models from which the result is generated in the form of classifiers.

2.7. Model Validation Techniques

Validation of the results generated from flood hazard susceptibility maps (J48, MultiBoosting J48 (MJ48), real AdaBoost J48 (RJ48), and Random Subspace J48 (RSJ48)) is an essential part of the work required to obtain a meaningful conclusion. To quantitatively validate each model, the receiver operating characteristic (ROC) curve was applied [30,108]. For various types of predictive models, ROC is a widely known validation method [109]. The ROC curve defines the reliability of the model by the AUC (area under curve). The AUC values range between 0.5 and 1 where 0.5 reveals that the model cannot estimate the susceptibility to flooding, and 1 implies a good forecast [110]. Of all the flood inventory points, flood susceptibility index values were binarized considering the threshold flood probability index values, which is “the optimal point on the ROC curve closest to the point (0, 1)” using the ‘Optimal Cutpoints’ package in R [111]. For both training and testing data sets, the ROC curve was used to test the susceptibility maps for flood hazards. The ROC curves of training datasets are called success rate curves for existing flood hazards, whereas the ROC curves of validation datasets are termed prediction rate curves for future flood hazard probabilities. The ROC was equipped to equate the true positive (TP) values with the false positives (FP) at different thresholds. The true positive rate (TPR) is seen as precision and the false positive rate (FPR) as the specificity, drop, or probability of an inaccurate calculation. A higher TPR is more reliable and a higher FPR is the representation of model prediction errors. The equations for TPR and FPR are as follows:

(9) $TPR = \frac{TP}{TP + FN}$

(10) $FPR = \frac{FP}{FP + TN}$

(11) $A U C = (\sum T P + \sum T N) / (P + N)$

where: TN, FP, TP, and FN are the true negative, false positive, true positive, and false negative respectively [112]. The scale of TPR ranges from 0 to 1, whereby the categorical subdivisions are as follows, Excellent (0.9–1), very good (0.8–0.9), good (0.7–0.8), moderate (0.6–0.7), and weak (0.5–0.6) [107]. RMSE (Equation (10)) was used to determine the difference between actual and expected flood results. Low values of RMSE indicate better models [106,113].

(12) $R M S E = \sqrt{\frac{1}{n}} \sum_{i = 1}^{n} {(X_{p r e d i c t e d} - X_{a c t u a l})}^{2}$

In addition, two threshold-dependent evaluation criteria, i.e., Efficiency (E) and True Skill Statistics (TSS) [114,115] were implemented to test model robustness. The value of efficiency and TSS ranges from 0 to 1. These two evaluation matrices were calculated based on the following equations:

(13) $E = \frac{TP + TN}{TP + TN + FP + FN}$

(14) $TSS = TPR - FPR$

2.8. Sensitivity Analysis (SA)

The uncertainty in the preparation of the data layers is very difficult to eradicate completely [116]. The sensitivity analysis was used in numerous experiments [117,118,119] to calculate the impact of differences on model outputs, thereby allowing for a systematic estimation of the relative significance of sources of uncertainty. Chen et al. [120] introduced map removal sensitivity analysis (MRSA) method was used in this research. The MRSA approach measures the sensitivity of susceptibility maps to flood by eliminating one or more parameters. Various studies have used this technique to examine the relative importance of the effective factors [121,122,123]. The percentage contribution (PC) of each FCF was calculated using the MRSA process, using the following Equation (14) to examine the relative importance of the model output [124].

(15) $PC = \frac{({AUC}_{all} {- AUC}_{i})}{{AUC}_{i}} \times 100$

where: AUC_all and AUC_i designate the AUC values obtained from flood susceptibility models using all FCFs and the model when the ith FCF has been excluded.

3. Results

3.1. Considering Multicollinearity of Effective Factors

Based on the VIF and tolerance values, it can be inferred that no variables have problem of multicollinearity as the VIF is less than 10, and the tolerance is less than 0.2 (Table 1). Therefore, the variables are independent and can be used for the spatial flood susceptibility modeling.

3.2. Spatial Relationship between Flood Probability and FCFs

The second part of the work investigated the relation between independent variables and flood sensitivity using the FR bivariate statistical method outlined in Table 2. The FR values represent the importance of the class variables in terms of flood probability. The highest FR value (2.43) for the elevation variable is in the <287 m class, which means the lowest elevation range has the highest flood susceptibility, whereas the flood susceptibility is 0 for the elevation class of >784 m, which indicates no flood probability above this elevation. For the slope factor, the highest flood probability is in the slope range of <5.8° as indicated by FR = 2.12, which is the highest for this variable, while it is 0 for slopes between 14.2° and 32.5° (Table 2). The values of FR for the factor plan curvature exhibits a kind of uniform response, but is still highest (1.15) for flat plan curvature and lowest (0.92) in convex areas, while concave areas have a significant FR value (1.05). In terms of the CI, the maximum FR value (1.94) was found for the <−52.9 CI class and the lowest (0.69) for the −16.07 to 14.5 CI class. It can, therefore, be argued that the <−52.9 CI class plays a significant role in controlling susceptibility to floods. From the FR values for SPI, we can see that the lowest SPI level of <8.87 has the maximum FR value (1.88), and as the strength of the stream reduces, the probability of a flood rises. The FR value for SPI is lowest (0.26) in the 12.8 to 15.7 range class, which indicates a lower flood probability in this SPI category. For TPI, the utmost FR value (1.45) was estimated for the −3.71 to 2.34 class range and it was zero for the >9.62 range, which means that positive TPI values are associated with lower or even zero flood susceptibility. However, in the negative range, the flood probability is quite high, which implies that because the central part of the study landscape is low-lying, it is more prone to flooding and vice-versa. For the TWI, the opposite is true in that increasing TWI values indicate a higher flood probability and vice-versa. Here, the highest FR value (1.58) of the TWI classes is in the 7.49 to 11.08 category, whilst the lowest (0.23) for the <5.07 class, which indicates that topographic wetness plays a significant role in flood occurrence. A wetter surface will hinder the infiltration of water and increase the amount of overland flow, thus, encouraging flooding. The FR of the drainage density (DD) classes was highest in the greater DD classes; i.e., 1.48 FR for the >0.7 km/km^{2 class}, and lowest in the smallest DD class. For the distance to stream variable, the FR value is highest (3.56) in the less than 100 m class and lowest (0.31) in the >400 m class. Therefore, it may be inferred that proximity to the river matters the most in terms of flood susceptibility. Quite interestingly, for rainfall, the utmost FR value was estimated for the lower ranges, i.e., an FR of 2.16 in the 419.7 to 547.8 mm class, while the FR was 0 in the highest rainfall range of >820.6 mm. Here, the highest rainfall zones are mainly located in the higher elevation areas, which reduce the chance of having flood pixels located there. Thus, in this region lowest rainfall zones are more flood-prone and due to altitude factor, highest rainfall regions are least flood susceptible. The high rainfall in the upper part of the region is actually causing flash flood in the downstream area due to slope induced high velocity run-off. The wetland and residential areas in the LU/LC divisions had higher FR values of 22.86 and 11.73, respectively, while the equivalent value was calculated as 0 for orchards, flat land, and forest. In terms of the lithology, only the ‘Qsw, Qft2, Qm, Qft1, Qs, d, Qal’ category has an FR value because this lithological segment is a low-lying area. A summary of the lithology of the basin is provided in Supplementary Information. The lowest NDVI class (<0.201) has the highest FR value (1.49) and the highest class (>0.369) has the lowest FR value (0.04), which indicates that the proneness of flood is less in the more forested and vegetated areas. In terms of the soil types, the Aridisols have the highest flood probability (FR = 2.18) whereas Inceptisols, Salt Flat, and Alfisols do not affect flooding. Therefore, from this analysis, we can identify the relative importance of different factors and their classes on flood susceptibility in the study area.

3.3. Flood Susceptibility Models (FSMs)

Our flood hazard susceptibility models have been developed using ML ensemble methods. These are capable of forecasting the susceptibility of flooding with the aid of various categorical factors. Flood hazard susceptibility maps (FSMs), based on Jenks’ natural break mechanisms, were divided into five subclasses: very high, high, moderate, low, and very low. The method of Jenks optimization, also called natural break classification, is a process of data segmentation to estimate the optimal value structure of the different classes. This method of classification aims to reduce the average disparity between classes while increasing the disparity from the rest of the class. The method thus eliminates intraclass and maximizes group disparities. A higher index rating represents areas that are particularly vulnerable to flooding, and vice versa. According to the J48 model, 46.63% of the basin area comes under the low flood susceptibility zone, whereas 18.67% and 29.13% of the area are very high and high flood-prone areas, respectively. The FSM generated using MJ48 suggested that 62.81% of the area falls into the very low flood susceptibility zone, whereas 24.6% of the basin is very highly prone to flooding (Table 3). The results generated in the RJ48 FSM suggested that 76.38% of the study basin has a very low flood occurrence probability, whereas 16.23% has a very high chance of flooding. The other categories have a significantly lower percentage of shares in terms of flood probability. Lastly, the RSJ48 ensemble method gives a less skewed result, predicting that 48.72% of the study area is a very low and 9.21% is a very high flood susceptibility zone, respectively (Figure 6). According to this model, 20.98% (Table 3) of the basin area is a moderately flood-prone region, which is the highest of all the models tested.

As per the output of the models, it is clear that the western and northwestern part of the study basin is a highly flood-susceptible region along with a few flanks of the southeastern and northeastern parts. The lowest elevations, slope, CI, SPI, and NDVI are located in these western and south-western parts of the study basin, which is the reason for the high to very high flood susceptibility along these zones. The rivers route water from the uphill areas and converge in this lower elevation region, which creates floods during the high rainfall periods. The natural setting of the region does not allow rapid transmission of flood water, resulting in numerous flood events. Accordingly, combination of geographic, geologic, and meteorological conditions are the main reasons for a higher propensity for flooding along this western part of the study basin.

3.4. Validation of Machine Learning Ensemble Models

Often a single validation approach is inadequate because of the usage of limited samples to validate model results. In this analysis, the ML models performance was tested using the ROC curve, efficiency, and TSS methods. For the success rate curves of the training sample (70%), the RSJ48 model was more well fitted relative to the others, as the AUC value was the maximum (0.931) followed by MJ48 (0.901), RJ48 (0.889), and J48 (0.850), respectively (Figure 7a). The highest efficiency value was achieved by the RSJ48 model followed by MJ48, RJ48, and J48, respectively (Table 4). Similarly, for the training data the TSS and sensitivity values are highest for the RSJ48. The RMSE values of J48, MJ48, RJ48, and RSJ48 were 0.33, 0.35, 0.39, and 0.30, respectively.

A robust conclusion regarding the accuracy of predictions cannot be reached based on only training datasets. Prediction rate curves denote the predictive capacity of the models for future flood hazard susceptible zones in the study basin. The findings of the prediction rate curve suggest that the RSJ48 model is the most effective as the AUC was maximum (0.951) followed up by MJ48 (0.929). The efficiency values were 0.84, 0.84, 0.85, and 0.86, respectively of the J48, MJ48, RJ48, and RSJ48 models for the validation datasets, whereas TSS values for the J48, MJ48, RJ48, and RSJ48 models were 0.68, 0.70, 0.71, and 0.71, respectively. The sensitivity values for J48, MJ48, RJ48, and RSJ48 were 0.83, 0.80, 0.83, and 0.84, respectively (Table 4). The corresponding respective RMSE values were 0.35, 0.40, 0.34, and 0.33, respectively (Table 4).

Figure 8 shows the values of FR for the selected models. Sub-class FR values were measured by dividing the amount of flood locations contained in the FS class and the total pixels of a particular FS class and the total number of flood location pixels by the FSM’s total pixel. Here, a higher value of FR for the very high FS class indicates the accuracy of the models and vice-versa. In our study, RSJ48 returned the highest FR value for the very high FS class and the lowest value for the very low FS class. However, other models also returned high FR values for the very high FS class.

On the basis of the above, it may be said that the ML ensemble models show uniform results for both success rate curves and prediction rate curves and that these are reasonably accurate. Therefore, these models can demarcate the flood hazard susceptibility zones in the study basin with higher precision compared to other models used by previous research. The RSJ48 model is better able to accurately map the FS within the study area, according to the AUC of the ROC curve, the efficiency and TSS methods, than the other ensemble methods we evaluated.

3.5. Sensitivity Analysis

A sensitivity research was conducted to determine the FCF’s impact on the estimation of flood susceptibility. The results were summarized as percentage contribution (PC; Table 5 and Figure 9). Flood susceptibility maps produced by the ensemble ML method had the greatest sensitivity to elevation, followed by distance to the stream, NDVI, slope, LU/LC, rainfall, TWI, SPI, drainage density TPI, lithology, convergence index and soil type (Figure 9). Study of sensitivity may help in understanding the significant geo-environmental factor, which are very relevant for understanding the model structure in question. According to the sensitivity evaluation, in the case of RSJ48, every factor had a greater percentage contribution compared to the other ML models. Therefore, based on the analysis, it can be said that RSJ48 is better than the other models we tested.

4. Discussion

4.1. Model Performance and Comparison

Identification and zonation of flood-prone areas in a watershed are essential for reducing the damage caused by flooding. Until now, several methodologies have been developed by the researchers around the world for FSM [35,108,125], and each of them has drawbacks and advantages. Remote Sensing (RS) and Geographic Information System (GIS) are quite powerful and effective in analyzing multidimensional incidents like flooding that are affected by many controlling factors [35,126,127]. The widely used conventional machine learning models like artificial neural network (ANN), SVM, and Random Forest (RF) are well capable of predicting flood but ensemble of machine learning algorithms with any statistical or other machine learning techniques provide better result than the single method [32,33,37]. The decision tree based methods like ANN, SVM, and RF gives flexibility in nonlinear applications with large database and these are very much effective in flood modeling but whenever any ensemble classifiers are applied with them, the problem of over-fitting as well as prediction error gets reduced significantly. In this study, we developed flood susceptibility models using ML methods (hybrid J48 approaches) with one of three ensemble techniques (Real AdaBoost, Random Subspace, and MultiBoost). Ensemble approaches combine the art of data mining for classification while the J48 is a tree classifier considered to have good performance in forecasting flood susceptibility. Hybrid ML ensemble frameworks have been shown to deliver preferential outcomes for spatial flood estimation [128,129]. The imperative to implement such models and to develop an appropriate framework for risk management of flood hazards derives from the fact that floods are responsible for significant economic loss and destruction of human infrastructure and natural environments [30,31]. The findings of our study revealed that the accuracy of the J48 model (SRC = 0.850, PRC = 0.871, TSS = 72, E = 0.86) was enhanced after the ensemble with all three meta-classifiers. Our findings are rational as the ensemble classifiers reduce overfitting problems in FS modeling to improve the efficiency of base classifiers [128,130]. The used ensemble models in the study were shown better accuracy than the single model used in the flood studies (8, 10, 24, 31). The results of our analysis show that the RSJ48 model (AUC = 0.931, E = 0.89, TSS = 0.78, sensitivity = 0.87, and RMSE = 0.3 for testing dataset and AUC = 0.951, E = 0.86, TSS = 0.71, sensitivity = 0.84, and RMSE = 0.33 for training dataset) has the maximum predictive power, followed by the MJ48 model and the RJ48 model. The explanation for that the Random Subspace ensemble is more effective in minimizing uncertainty and discrimination compared to other ensemble approaches [131,132] and takes samples at different spatial layers from randomly chosen subspaces [58,101]. All of the models used in this research are efficient enough for predicting the flood susceptibility. The main advantage of these methods is that these can handle both the data mean continuous and categorical simultaneously. CRAN RWeka, party, regRSM packages of R studio were used for preparing these models. The RMSE of these models are low and, particularly for RSJ48, is very low. Besides, in terms of execution time and computational resources consumption, the RSJ48 model performs better than the other models used here.

4.2. Factor Contribution Analysis

Flood hazard susceptibility delineation depends on the interplay of the FCFs. In this analysis, the Real AdaBoost and boosted regression tree (BRT) were used to calculate the relative merits of the chosen variables. The non-parametric model of ML approaches is BRT, which defines a relationship between dependent and independent variables [133]. This is a key tool for evaluating the relevancy of a particular variable and addressing classification and forecasting problems [134]. Details regarding the BRT are available in the literatures of Breiman et al. [134], Therneau et al. [135] and Friedman et al. [101]. We tested fourteen factors and, based on the results, we can identify the relative importance of the factors for flood susceptibility. According to the Real AdaBoost model, NDVI, elevation, LU/LC, rainfall, and stream distance are the top five most significant flood conditioning factors in descending order, whereas SPI, TPI, and TWI are the least relevant variables (Figure 10). However, using the BRT, elevation was determined to be the most important factor controlling flood hazard susceptibility with the highest importance value of 0.534, preceded by distance to stream (0.093), NDVI (0.080), rainfall (0.048), and LU/LC (0.043). Here, plan curvature (0.003) and SPI (0.007) were the two least important factors for flooding susceptibility.

5. Conclusions

Flooding hazards are very sensitive and devastating issues in the world. Recurring flood events cause a huge loss of human life and damage to properties, which should be mitigated at every possible level. Flash floods are a real threat to the sustainability of our societies. Therefore, the accurate prediction of susceptibility to flood events is necessary to improve management practices in effort to reduce flood damage. This work investigated the application of ML ensemble models for predicting flood hazard susceptibility zones in the study basin, along with the identification of the key controlling variables. The ML models suggested that 9.21% to 18.67% of the study basin has a very high susceptibility to flooding. The meta-classifiers we tested increased the accuracy of the J48 machine learning model. Of the three new ensemble models we tested, RSJ48 has better accuracy. Remote sensing and geographical information system integrating with the ML algorithms are creating great foundations for the management of different natural hazards. However, the present work will provide some guidance to other researchers wanting to apply ML ensemble methods to investigate susceptibility to flooding. More applications of the methods reported herein are, however, needed to assess the full extent of the suitability of our methods beyond the case study.

Supplementary Materials

The following are available online at https://www.mdpi.com/2072-4292/12/20/3423/s1.

Author Contributions

Conceptualization, A.A.; methodology, A.A.; formal analysis, A.A.; W.C.; investigation, A.A.; resources, A.A.; supervision, A.A.; writing—original draft preparation, A.A. S.S., K.M., writing—review and editing, A.A. S.S., K.M., T.B., P.T.T.N., S.S.B.; visualization, A.A.; supervision, A.A., S.S.B.; funding acquisition, T.B. All authors have read and agreed to the published version of the manuscript.

Funding

Open Access was funded by the Austrian Science Fund (FWF) through the Doctoral College GIScience (DK W 1237-N23) at the University of Salzburg.

Acknowledgments

Open Access Funding by the Austrian Science Fund (FWF).

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures and Tables

Figure 1. Location of the study area in Iran.

Figure 2. Flowchart showing the methodology of the present work.

Figure 3. Flood inventory map (a) flooded area in the city of Aqqala (b).

Figure 4. InSAR data-processing procedure for DEM production.

View Image - Figure 5. Flash flood conditioning factors, (a) slope (b) elevation, (c) plan curvature, (d) topographic wetness index (TWI), (e) topographic position index (TPI), (f) convergence index (CI), (g) stream power index (SPI), (h) distance to stream, (i) drainage density), (j) rainfall, (k) lithology, (l) soil type, (m) land use/land cover (LU/LC), (n) normalized difference vegetation index (NDVI).

Figure 5. Flash flood conditioning factors, (a) slope (b) elevation, (c) plan curvature, (d) topographic wetness index (TWI), (e) topographic position index (TPI), (f) convergence index (CI), (g) stream power index (SPI), (h) distance to stream, (i) drainage density), (j) rainfall, (k) lithology, (l) soil type, (m) land use/land cover (LU/LC), (n) normalized difference vegetation index (NDVI).

View Image - Figure 6. Flood susceptibility mapping, (a) J48 model, (b) MultiBoosting J48 (MJ48) model, (c) real AdaBoost J48 (RJ48) model, (d) Random Subspace J48 (RSJ48) model.

Figure 6. Flood susceptibility mapping, (a) J48 model, (b) MultiBoosting J48 (MJ48) model, (c) real AdaBoost J48 (RJ48) model, (d) Random Subspace J48 (RSJ48) model.

View Image - Figure 7. Receiver operating characteristic (ROC) curves representing the accuracy of the flood susceptibility models (a) training dataset (success rate curve), (b) validation dataset (prediction rate curve).

Figure 7. Receiver operating characteristic (ROC) curves representing the accuracy of the flood susceptibility models (a) training dataset (success rate curve), (b) validation dataset (prediction rate curve).

Figure 8. Trend of the frequency ratio (FR) in different models.

View Image - Figure 9. Sensitivity analysis results from the percentage of contribution (PC) of each flood conditioning factor (FCF) in different models.

Figure 9. Sensitivity analysis results from the percentage of contribution (PC) of each flood conditioning factor (FCF) in different models.

Figure 10. Relative importance of flood conditioning factors using (a) AdaBoost model (b) boosted regression tree (BRT).

Table 1

Multicollinearity test among flood conditioning factors.

Factors	Collinearity Statistics	Factors	Collinearity Statistics
Factors	VIF	Factors	VIF
Land Use/Land Cover (LU/LC)	1.330	Distance to stream	1.405
Soil type	2.070	Slope	1.33
Elevation	4.677	TWI	6.876
NDVI	1.025	Plan curvature	2.116
Lithology	3.864	TPI	3.246
CI	1.218	Drainage density	1.521
Rainfall	1.523	SPI	4.21

Table 2

Spatial relationship between conditioning factors and floods by frequency ratio model.

Factor	Class	Pixels in Domain		Flood Pixels		FR
Factor	Class	No	%	No	%	FR
Elevation (m)	<287	4,346,424	41.05	266	99.63	2.43
	287–784	2,168,214	20.48	1	0.37	0.02
	784–1331	1,878,667	17.74	0	0	0
	1331–1930	1,595,868	15.07	0	0	0
	>1930	599,235	5.66	0	0	0
Slope (°)	<5.8	4,572,709	43.21	245	91.76	2.12
	5.8–14.2	1,907,632	18.03	21	7.87	0.44
	14.2–22.6	1,985,057	18.76	0	0	0
	22.6–32.5	1,468,311	13.88	0	0	0
	>32.5	648,343	6.13	1	0.37	0.06
plan curvature (100/m)	Concave	4,792,528	45.29	127	47.57	1.05
	Flat	966,851	9.14	28	10.49	1.15
	Convex	4,822,674	45.57	112	41.95	0.92
CI (100/m)	<−52.9	566,739	5.42	28	10.49	1.94
	−52.9–−16.07	1,913,693	18.29	63	23.60	1.29
	−16.07–14.5	5,483,516	52.42	97	36.33	0.69
	14.5–50.5	2,040,048	19.50	70	26.22	1.34
	>50.5	456,989	4.37	9	3.37	0.77
SPI	<8.87	2,478,650	23.64	118	44.53	1.88
	8.87–10.91	3,180,547	30.33	88	33.21	1.09
	10.91–12.8	3,156,931	30.10	41	15.47	0.51
	12.8–15.7	1,367,215	13.04	9	3.40	0.26
	>15.7	303,841	2.90	9	3.40	1.17
TPI	<−10.98	428,583	4.05	2	0.75	0.18
	−10.98–−3.71	1,287,289	12.16	7	2.62	0.22
	−3.71–2.34	6,565,378	62.04	240	89.89	1.45
	2.34–9.62	1,761,349	16.64	18	6.74	0.41
	>9.62	539,452	5.10	0	0	0
TWI	<5.07	4,230,756	39.98	25	9.36	0.23
	5.07–7.49	4047,824	38.25	152	56.93	1.49
	7.49–11.08	1,859,417	17.57	74	27.72	1.58
	>11.08	444,055	4.20	16	5.99	1.43
Drainage density (km/km²)	<0.33	2,292,946	21.67	29	10.86	0.50
	0.33–0.51	3,948,804	37.32	91	34.08	0.91
	0.51–0.7	2,702,068	25.53	86	32.21	1.26
	>0.7	1,638,253	15.48	61	22.85	1.48
Dis to stream (m)	<100	1,092,472	10.32	98	36.70	3.56
	100–200	930,658	8.79	64	23.97	2.73
	200–300	947,940	8.96	33	12.36	1.38
	300–400	777428	7.35	18	6.74	0.92
	>400	6,833,573	64.58	54	20.22	0.31
Rainfall (mm)	<419.7	2,055,068	19.44	64	23.97	1.23
	419.7–547.8	2,772,850	26.23	151	56.55	2.16
	547.8–682.6	2,365,873	22.38	26	9.74	0.44
	682.6–820.6	1,890,891	17.88	26	9.74	0.54
	>820.6	1,488,169	14.08	0	0	0
LU/LC	Forest	3,185,820	30.13	1	0.37	0.01
	Agriculture	4,003,024	37.86	200	74.91	1.98
	Residential	94,551	0.89	28	10.49	11.73
	Orchard	171,849	1.63	0	0	0
	Bare land	8996	0.09	0	0	0
	Dry farming	966,237	9.14	5	1.87	0.20
	Rangeland	1,954,197	18.48	4	1.50	0.08
	Wood land	139,029	1.31	0	0	0
	Water/Wetland	50,232	0.48	29	10.86	22.86
Lithology	Cm, Cl	491,682	4.64	0	0	0
	Dp, DCkh	593,573	5.61	0	0	0
	Ekh, E1m	74,810	0.71	0	0	0
	Jsc, Jd, Jl, Jmz, Jch	1,339,797	12.65	0	0	0
	Kat, Ksn, Ksr, Ku, Kad-ab, Kl, K, Ktr	743,545	7.02	0	0	0
	Murm, Murmg	90,537	0.86	0	0	0
	PlQc, Pz, pC-C, Pr, Pz1a.bv, Pd, pCmt2, Plc, P	576,300	5.44	0	0	0
	Qsw, Qft2, Qm, Qft1, Qs, d, Qal	6057261	57.21	267	100	1.75
	TRe, TRe2, TRJs	620,205	5.86	0	0	0
NDVI	< 0.201	6,200,349	58.77	234	87.64	1.49
	0.201–0.369	1,538,520	14.58	30	11.24	0.77
	> 0.369	2,812,213	26.65	3	1.12	0.04
Soil type	Rock Outcrops/Entisols	1,453,170	13.73	1	0.37	0.03
	Rock Outcrops/Inceptisols	229,933	2.17	0	0	0
	Salt Flats	22,331	0.21	0	0	0
	Alfisols	1,792,754	16.94	0	0	0
	Aridisols	910,811	8.61	50	18.73	2.18
	Inceptisols	1,262,793	11.93	0	0	0
	Mollisols	4,908,833	46.39	216	80.90	1.74

Table 3

Percentage of area under susceptibility classes in different machine learning models.

Flood Susceptibility Classes	J48	MJ48	RJ48	RSJ48
Very high	18.67%	24.60%	16.23	9.21%
High	29.13%	4.13%	2.56%	9.54%
Moderate	0.16%	5.45%	1.53%	20.98%
Low	46.63%	3.02%	3.31%	11.55%
Very low	5.39%	62.81%	76.38%	48.72%

Table 4

Values of efficiency (E), true skill statistic (TSS), sensitivity, and root mean square error (RMSE).

Criteria	Validation Dataset				Training Dataset
Criteria	J48	MJ48	RJ48	RSJ48	J48	MJ48	RJ48	RSJ48
TN	100	95	90	101	221	206	217	221
FP	15	9	4	9	40	23	28	31
FN	18	23	28	17	46	61	50	46
TP	103	109	114	109	227	244	239	236
TPR	0.85	0.83	0.80	0.87	0.83	0.80	0.83	0.84
FPR	0.13	0.09	0.04	0.08	0.15	0.10	0.11	0.12
Efficiency	0.86	0.86	0.86	0.89	0.84	0.84	0.85	0.86
TSS	0.72	0.74	0.76	0.78	0.68	0.70	0.71	0.71
Sensitivity	0.85	0.83	0.80	0.87	0.83	0.80	0.83	0.84
RMSE	0.33	0.35	0.39	0.3	0.35	0.4	0.34	0.33
AUC	0.871	0.929	0.893	0.951	0.850	0.889	0.906	0.931

Table 5

Sensitivity analysis results from the percentage of contribution (PC) of each conditioning factor in different models.

ModelsFactors	J48	RJ48	RSJ48	MJ48
Elevation	18.5	16.5	21	16.5
Distance to stream	14.4	13.4	16.9	13.4
NDVI	11.5	9.5	13.25	9.5
Slope	8.5	7.5	10.25	7.5
LU/LC	7.5	5.5	8.25	5.5
Rainfall	5.5	4.75	7.75	4.75
TWI	3.75	3	4.5	3
SPI	3.2	2.45	3.95	2.45
Drainage density	2.7	1.95	4.2	1.95
TPI	2.25	1.5	3.75	1.5
Lithology	1.75	1.25	3.5	1.25
Plan curvature	1.45	0.7	3.25	0.7
Convergence index	0.75	0.5	2.25	0.5
Soil type	0.5	0.25	1.5	0.25

Word count: 9149

Show less

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

The uncertainty of flash flood makes them highly difficult to predict through conventional models. The physical hydrologic models of flash flood prediction of any large area is very difficult to compute as it requires lot of data and time. Therefore remote sensing data based models (from statistical to machine learning) have become highly popular due to open data access and lesser prediction times. There is a continuous effort to improve the prediction accuracy of these models through introducing new methods. This study is focused on flash flood modeling through novel hybrid machine learning models, which can improve the prediction accuracy. The hybrid machine learning ensemble approaches that combine the three meta-classifiers (Real AdaBoost, Random Subspace, and MultiBoosting) with J48 (a tree-based algorithm that can be used to evaluate the behavior of the attribute vector for any defined number of instances) were used in the Gorganroud River Basin of Iran to assess flood susceptibility (FS). A total of 426 flood positions as dependent variables and a total of 14 flood conditioning factors (FCFs) as independent variables were used to model the FS. Several threshold-dependent and independent statistical tests were applied to verify the performance and predictive capability of these machine learning models, such as the receiver operating characteristic (ROC) curve of the success rate curve (SRC) and prediction rate curve (PRC), efficiency (E), root-mean square-error (RMSE), and true skill statistics (TSS). The valuation of the FCFs was done using AdaBoost, frequency ratio (FR), and Boosted Regression Tree (BRT) models. In the flooding of the study area, altitude, land use/land cover (LU/LC), distance to stream, normalized differential vegetation index (NDVI), and rainfall played important roles. The Random Subspace J48 (RSJ48) ensemble method with an area under the curve (AUC) of 0.931 (SRC), 0.951 (PRC), E of 0.89, sensitivity of 0.87, and TSS of 0.78, has become the most effective ensemble in predicting the FS. The FR technique also showed good performance and reliability for all models. Map removal sensitivity analysis (MRSA) revealed that the FS maps have the highest sensitivity to elevation. Based on the findings of the validation methods, the FS maps prepared using the machine learning ensemble techniques have high robustness and can be used to advise flood management initiatives in flood-prone areas.

Details

Title

Modeling Spatial Flood using Novel Ensemble Artificial Intelligence Approaches in Northern Iran

Author

Arabameri, Alireza¹

; Saha, Sunil²

; Mukherjee, Kaustuv³; Blaschke, Thomas⁴

; Chen, Wei⁵; Phuong Thao Thi Ngo⁶

; Band, Shahab S⁷

¹ Department of Geomorphology, Tarbiat Modares University, Tehran 14117-13116, Iran; [email protected]
² Department of Geography, University of Gour Banga, Malda 732103, West Bengal; [email protected]
³ Department of Geography, Chandidas Mahavidyalaya, Birbhum 731215, India; [email protected]
⁴ Department of Geoinformatics–Z_GIS, University of Salzburg, 5020 Salzburg, Austria; [email protected]
⁵ College of Geology & Environment, Xi’an University of Science and Technology, Xi’an 710054, China; [email protected]; Key Laboratory of Coal Resources Exploration and Comprehensive Utilization, Ministry of Land and Resources, Xi’an 710021, China; Shaanxi Provincial Key Laboratory of Geological Support for Coal Green Exploitation, Xi’an 710054, China
⁶ Faculty of Information Technology, Hanoi University of Mining and Geology, No. 18 Pho Vien, Duc Thang, Bac Tu Liem, Hanoi 10000, Vietnam; [email protected]
⁷ Future Technology Research Center, College of Future, National Yunlin University of Science and Technology, 123 University Road, Section 3, Douliou, Yunlin 64002, Taiwan; Institute of Research and Development, Duy Tan University, Da Nang 550000, Vietnam

First page

3423

Publication year

2020

Publication date

2020

Publisher

MDPI AG

e-ISSN

20724292

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/rs12203423

ProQuest document ID

2550312496

Modeling Spatial Flood using Novel Ensemble Artificial Intelligence Approaches in Northern Iran

Jump to:

Full text

Abstract

Details

Suggested sources