A Comparative Study of Kernel Logistic

1. Introduction

Flooding is considered to be one of the most dangerous natural disasters, associated with damage to properties, infrastructure, and people around the world [1,2]. Approximately 90% of human losses occur from flooding in Asia, especially in tropical cyclone regions such as Southeast Asia [3,4]. There are many types of floods including pluvial (surface), fluvial (riverine), and coastal (surge). The main difference between pluvial and fluvial flood is that pluvial flood caused by heavy rainfall creates a flood event independent of an overflowing water body, whereas fluvial flood is caused by excessive rainfall over an extended period of time which is dependent on overflowing water bodies. Floods also occur due to excessive amounts of snow melt and sudden breaking of natural and manmade dams. Pluvial floods can also occur at higher elevation areas that lie above coastal and river floodplains. Flash flooding is characterized by intense, high-velocity torrential rainfall within a short period. Flash floods can occur on the ground surface as well as on the riverbed. Much environmental research has indicated that human activities affect the water cycle, such as deforestation. Forests play a critical role in the fight against natural disasters. However, there is an increasing trend towards deforestation in recent years regarding development [5]. Erratic rainfall due to climate change, in conjunction with deforestation and un-planned city development, has resulted in the occurrence of more flash floods with disastrous consequences, which require greater attention from government and other organizations. Although it is impossible to prevent flash floods, their accurate prediction by appropriate model studies may help in reducing damage [6].

The determination of flash flood susceptibility zones is essential for risk management strategies and is helpful for the decision-makers to manage land-use planning [7,8]. A flood susceptibility map will show areas where floods are likely to occur. Flood susceptibility is defined as a quantitative or qualitative assessment of an area with spatial distribution of flood, where probability of flood occurrence is likely [9]. This is a measure of the probability of future floods likely to occur depending on meteorological conditions [10]. However, there is a limit to the temporal frequency of floods. Flood hazard is a phenomenon that may cause loss of life, injury or other health impacts, property damage, loss of livelihoods and services, social and economic disruption, or environmental damage (http://www.charim.net/methodology/31). It is a combination of extent, depth, and flow velocity [11]. The information needed depends on the hazard interpretation (evacuation, building damage, early warning etc.). It depends on the intensity of the phenomenon within specified time and area [11]. However, flood risk is a measure of the damage anticipated to occur in an area [12]. Risk is often expressed as a combination of exposure, vulnerability, and flood hazard [13,14]. A hazard map is not a risk map. The risk is dependent on the hazard and potential damage [12]. A risk analysis includes the impact of one or more hazards, taking into account the vulnerability and resilience of the elements at risk [15]. In general, a flash flood susceptibility map is a critical tool for flood risk management [16]. However, it is difficult to accurately predict specific areas which would be affected most, because of the nature and dynamics of meteorological (climatic) conditions [16].

In recent years, different statistical methods have been developed and applied effectively in flood susceptibility mapping. Presently, Machine Learning (ML) or Artificial Intelligence (AI) methods, which are advanced soft computing approaches for natural hazard prediction and assessment, are mostly used for flood studies [17]. These methods are based on effective and objective mathematical algorithms for analysis and prediction [18,19,20,21]. Some popular ML methods used for flood susceptibility assessment are Artificial Neural Networks (ANN) [22,23], Logistic Model Trees (LMT) [24], Support Vector Machines (SVM), Logistic Regression (LR) [25,26], Adaptive Neuro-Fuzzy Inference Systems (ANFIS) [27], and the Neural-Fuzzy (NF) approach [28,29]. So far, no existing model can be applied accurately in all regions for flood susceptibility assessment and mapping [30]. Ongoing research is therefore needed to select appropriate models for accurate identification and mapping of flash flood-susceptible areas. With this objective, we have experimented with four ML models, namely Kernel Logistic Regression (KLR), Radial Basis Function Classifier (RBFC), Multinomial Naïve Bayes (NBM), and LMT, which had not previously been applied and compared in flash flood studies. These models were applied in the Nghe An province, which is one of the flash flood-prone areas of Vietnam. All these models use supervised learning algorithms to solve classification problems with high prediction accuracy. Receiver Operating Characteristic (ROC) and various statistical measures were used to validate and compare the performance of the models. Results were compared to select the best method among these four models for flash flood susceptibility mapping. Arc Map 10.2 and Weka 3.7.12 software were used to process data and generate flash flood susceptibility maps.

2. Description of Study Area

Vietnam in general and Nghe An in particular has been affected by different natural hazards such as flood, arsenic pollution [31], radiation hazard [32], erosion [33,34,35], sea level rise [36,37], earthquakes [38,39,40,41,42], volcanos [43,44], and landslides [45]. Nghe An province is in the North Central Coast region of Vietnam (Figure 1). The morphology of the region consists of mountains, midlands, plains, and coastal areas. The topography of the area is very complicated, with very steep slopes, narrow valleys, and deep gorges. In the study area, the highest peak is Pulaileng peak (2711 m) in the Ky Son district, and the lowest area is the plain in Quynh Luu, Dien Chau, and Yen Thanh districts, which is only 0.2 m above the sea level. Mountains and hills account for 83% of the province’s natural land.

In Nghe An province, rainfall is concentrated in the coastal zone and the eastern slopes of the Truong Son mountain range. The rainy season, lasting until December, has most rain between September and November. These maximums are associated with atmospheric disturbances that develop in the inter-tropical convergence zone, and with tropical cyclones. Agricultural area increase and dam filling are some of anthropogenic causes of deforestation [46,47]. Loss of watershed forest makes flood prevention difficult.

Nghe An province has seven river basins with a total length of rivers and streams in the region of 9828 km, giving an average density of 0.7 km/km². The steep upstream slopes are associated with dense hydrological networks that add to the complexity of flash floods in the event of a rain episode of increasing intensity. In this study, a minor part of Nghe An province (Longitudes: 104.7544° E to 105.0364° E and Latitudes: 19.4890° N to 19.6947° N) is selected for flash flood mapping (Figure 1).

3. Data Used

3.1. Flash Flood Inventory

In the modelling, a knowledge of historical flash floods is important [24,48]. Thus, a flash flooding inventory map is essential. Every year, there are 10–15 flash floods in Vietnam due to extreme weather conditions causing heavy rainfall within a short period. A large part of Nghe An’s surface is covered by forests, which play an essential role in the fight against flash floods and landslides. However, in recent years, forested areas have decreased because of agricultural activity and other anthropogenic activities of development. Therefore, flash floods have become increasingly hazardous in this area. Typhoons in this area also cause flash flood. In 2018 in Nghe An flash flood caused severe damage to properties and material: 6 houses collapsed, 5 schools were affected, more than 19,000 hectares of rice and vegetables damaged, and more than 15,000 m of road was affected besides loss of lives.

In this research, 126 flash flood events (locations) obtained from the Department of Natural Resources and Environment, Nghe An province (Vietnam) and verified from aerial photographs, satellite images, and field surveys were used for the construction of a flash flood inventory map (Figure 1).

3.2. Flash Flood Influencing Parameters

For flash flood modelling, it is crucial to select appropriate influencing factors adapted for flash flood assessment. In our research, the choice of factors is based on the nature of flash flood observations related to different conditions of the study area, such as physical, hydrologic, and climatic conditions and human activity. A total of 10 factors, including soil, slope, curvature, river density, flow direction, distance from rivers, elevation, aspect, land use, and geology (Figure 2), were selected and used for analysis and modelling. In this research, a digital elevation model (DEM) with a resolution of 20 m was constructed from topographic maps at a scale of 1:50,000. The DEM was used to extract the geomorphology factors (slope, aspect, curvature, and elevation) and hydrology factors (river density and distance from the river). These data were verified against the data of the Department of Natural Resources and Environment, Nghe An province (Vietnam).

Slope is an essential factor for studying flash flood susceptibility because it controls the speed of water flow from high to low altitude [49]. In this study, five main classes are used for the slope map (Figure 2a). Aspect is related to the directions of water flow affecting flash flood occurrence [50] and aspect map was built with eight classes: flat, north, northeast, east, southeast, south, southwest, and northwest (Figure 2b). Curvature is a conditioning factor in flash flood modelling that influences accumulation and runoff on the slope. In addition, flash flood zones are linked to convergence of topographic height [51]. Curvature classes used in this research are concave, flat, and convex (Figure 2c). River density is related to surface runoff, which can promote flash flooding. Areas closer to the river are more prone to experience flooding. Density of rivers and distance from rivers are considered the main factors affecting the occurrence of a flash flood [52]. Maps of river density and distance from rivers were constructed with various classes (Figure 2d,f). Flow direction, which is the direction in which water travels, is considered to be a conditioning factor of flash flood. Flow direction of this area was grouped into eight classes: 1, 2, 4, 8, 16, 32, 64, and 128 (Figure 2e). Elevation is a conditioning factor due to the weathering of rocks and soil on the slope [53,54]. An elevation map was constructed with five groups: 77–297.3, 297.3–487.4, 487.4–695.5, 695.5–961.4, and 961.4–1 551.1 m (Figure 2g).

Soil type is considered an essential factor that is strongly related to rainfall runoff mechanisms affecting flash flood occurrence [55]. In this study, soil type was divided in five categories. The soil map was extracted from the MONRE geologic map at a scale of 1:100,000 (Figure 2h). Land use is an essential conditioning factor in flash flood research as it affects surface runoff. Runoff often occurs differently on agricultural and settlement lands. In addition, forests play an important role in reducing runoff speed and reducing the possibility of flash floods. A land use map (1:100,000 scale) of this area was extracted from the Landsat 7 satellite and classified into five types: natural forest land, planted forest land, forest restoration land, agriculture land, and settlement land (Figure 2i). Geology is an essential factor related to the process of runoff and infiltration, thus affecting flash flood occurrence. In this area, a geology map was compiled based on four tiles of the Geoscience and Mineral Resources Map of Vietnam at a scale of 1: 100,000 and constructed with eight classes: eruption rock of Song Ma complex, limestone rock of La Khe formation, eruption rock of Huoi Nhi complex, limestone rock of Muong Long formation, metamorphic and sedimentary rock of Bu Khang formation, eruption rock of Muong Hinh complex, granite rock of Dai Loc complex, and sedimentary and metamorphic of Song Ca formation, quaternary formation (Figure 2j).

4. Methods Used

In this study, selection of ML model depends on the type of data and nature of the problem. In the present study our data is of labeled type. Therefore, we have selected supervised algorithm-based models, namely LMT, KLR, NBM, and RBFC. The reason for the selection of these four ML models is that, as per the literature review, performance and prediction capabilities of these models are good but they were not applied and compared earlier for flash flood studies.

4.1. Logistic Model Tree (LMT)

LMT is a method that integrates two algorithms: C4.5 and LR. In LMT, the information gain ratio of C4.5 is used to split the tree into nodes and leaves, whereas the LogitBoost algorithm is applied to fit the LR functions at each tree node [56]. C4.5 is a standard algorithm for creating classification rules in the form of a decision tree. It is often referred to as a statistical classifier and is an extension of ID3. The information gain ratio is the default criterion for choosing split attributes in C4.5. Unlike the plain information gain used in ID3, the gain ratio avoids the bias towards attributes with many distinct values. Overfitting is a significant problem in the LMT model. To address it, the Classification and Regression Trees (CART) algorithm is used for pruning the tree during training [57]. CART is one of the important machine-learning algorithms, presenting information in a way that is intuitive and easy to visualize. CART includes a nonparametric regression algorithm that “grows” a decision tree by repeated binary splitting. In LMT, let C be the number of classes (flash flood and non-flash flood) and x = x_i (i = 1, ..., n) be the flash flood conditioning factors (n is the number of factors used). The probabilities at the leaf nodes are measured using the linear LR model as follows [56]:

(1) $p(c \mid x) = \frac{\exp(L_{c}(x))}{\sum_{c'=1}^{C} \exp(L_{c'}(x))}$

where L_c(x) is the least-squares fit, constrained by the following equation:

(2) $\sum_{c'=1}^{C} L_{c'}(x) = 0$
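As an illustration of Equations (1) and (2), the following minimal Python sketch converts per-class scores L_c(x) into leaf probabilities. The function name and the example scores are hypothetical and are not taken from the Weka implementation used in this study; centring the scores is only one simple way to satisfy the sum-to-zero constraint.

```python
import numpy as np

def leaf_class_probabilities(scores):
    """Turn per-class scores L_c(x) into probabilities p(c | x) as in Eq. (1)."""
    scores = np.asarray(scores, dtype=float)
    # Enforce the sum-to-zero constraint of Eq. (2) by centring the scores.
    centred = scores - scores.mean()
    # Subtracting the maximum before exponentiating improves numerical
    # stability and leaves the resulting ratios unchanged.
    exp_scores = np.exp(centred - centred.max())
    return exp_scores / exp_scores.sum()

# Hypothetical scores for the two classes (flash flood, non-flash flood).
probs = leaf_class_probabilities([1.2, -1.2])
```

With two classes, the result reduces to the familiar logistic function of the score difference.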

4.2. Kernel Logistic Regression (KLR)

KLR is considered to be one of the best known machine-learning techniques for classification, combining nonlinear LR with probabilistic output [58]. To learn the parameters, this model estimates the class-posterior probabilities with a log-linear combination of kernel functions by applying the penalized maximum likelihood method [59]. In this model, the kernel function is used to build a discriminant function that handles the classification problem by transforming the original input space into a high-dimensional feature space. The predisposing factors of the flash flood are taken as the input vector x, and the kernel function is used to complete the nonlinear transformation of x. As a result, the nonlinear form of the LR can be formulated as follows:

(3) $\operatorname{logit}(p) = w \cdot \varphi(x) + b$

where φ(x) is the nonlinear transformation of x induced by the kernel, w and b are the optimal model parameters obtained by minimizing a cost function, which represents the regularized negative log-likelihood of the data [60], and p represents the probability that a flash flood occurs in an area.
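The following is a minimal sketch of how Equation (3) can be fitted in practice, not the Weka implementation used in this study. It assumes a Gaussian kernel and writes w · φ(x) in its dual form, as a weighted sum of kernel values over the training points. The function names, the learning rate, the kernel width gamma, and the toy data are all hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def fit_klr(X, y, lam=0.01, gamma=1.0, lr=0.5, n_iter=2000):
    """Minimise the regularised negative log-likelihood by gradient descent."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.zeros(len(y))  # dual coefficients, one per training sample
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(K @ alpha + b)))  # inverse of logit(p)
        err = p - y
        # Gradient of mean log-loss plus (lam / 2) * alpha^T K alpha.
        alpha -= lr * (K @ err / len(y) + lam * (K @ alpha))
        b -= lr * err.mean()
    return alpha, b

def predict_proba(X_train, alpha, b, X_new, gamma=1.0):
    """Probability of the positive (flash flood) class for new samples."""
    scores = rbf_kernel(X_new, X_train, gamma) @ alpha + b
    return 1.0 / (1.0 + np.exp(-scores))

# Toy data that no linear boundary can separate (hypothetical, XOR pattern).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])
alpha, b = fit_klr(X, y)
p = predict_proba(X, alpha, b, X)
```

The toy pattern is chosen because plain LR cannot fit it, whereas the kernel transformation makes the two classes separable.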

4.3. Multinomial Naïve Bayes (NBM)

NBM relies on a probabilistic method with separate training and testing processes [61]. For the training process, suppose t = t_i represents the flash flood and non-flash flood classes and c = c_i (i = 1, ..., n) represents the flash flood conditioning factors (n is the number of factors used). The probability of each event in a class can be measured using the following formula:

(4) $P(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}},$

where $T_{ct}$ is the number of times t occurs in the training data of factor c, and $\sum_{t' \in V} T_{ct'}$ is the total count of attributes in factor c. To avoid problems that occur when $T_{ct}$ is zero, that is, when some events are not present in the training data, Laplace (add-one) smoothing is applied by adding one to each count:

(5) $P (t | c) = \frac{T_{c t} + 1}{\sum_{t^{'} \in V} (T_{c t^{'}} + 1)} = \frac{T_{c t} + 1}{(\sum_{t^{'} \in V} T_{c t^{'}}) + B^{'}}$

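Here B' = |V| is the number of elements of V, which results from adding one for each element of the sum. The best class is then selected with the maximum a posteriori (MAP) rule, which is evaluated with log-probabilities to avoid numerical underflow in the test process: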

(6) $c_{\mathrm{map}} = \arg\max_{c \in C} \left[ \log P(c) + \sum_{1 \leq k \leq n} \log P(t_{k} \mid c) \right]$

where P(c) is given by $P(c) = \frac{N_{c}}{N}$, N_c is the number of samples in class c, and N is the total number of samples in the dataset.
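The training and test steps above can be sketched in a few lines of Python. This is an illustrative reading of Equations (5) and (6), in which c is treated as the class and each sample is a list of observed factor values. The function names, the value labels, and the four toy samples are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(samples, labels):
    """Estimate log priors and Laplace-smoothed log likelihoods."""
    vocab = {value for sample in samples for value in sample}
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    for sample, label in zip(samples, labels):
        value_counts[label].update(sample)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Denominator of Eq. (5): total count in class c plus |V|.
        total = sum(value_counts[c].values()) + len(vocab)
        log_like[c] = {v: math.log((value_counts[c][v] + 1) / total) for v in vocab}
    return log_prior, log_like

def predict_map(sample, log_prior, log_like):
    """Return the class maximising Eq. (6); values never seen in training are skipped."""
    def score(c):
        return log_prior[c] + sum(log_like[c][v] for v in sample if v in log_like[c])
    return max(log_prior, key=score)

# Hypothetical training samples described by two categorical factors.
samples = [
    ["slope:steep", "river:near"],
    ["slope:steep", "river:far"],
    ["slope:gentle", "river:far"],
    ["slope:gentle", "river:near"],
]
labels = ["flood", "flood", "non-flood", "non-flood"]
log_prior, log_like = train_multinomial_nb(samples, labels)
predicted = predict_map(["slope:steep", "river:near"], log_prior, log_like)
```

Because of the add-one smoothing, a factor value that never occurs with a class still receives a small non-zero probability instead of forcing the whole product to zero.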

4.4. Radial Basis Function Classifier (RBFC)

RBFC is a supervised neural network that treats classification as an approximation problem in multidimensional space and is used for tasks such as interpolation and recognition [62]. In the learning process, the network searches for a surface in multidimensional space that best fits the training dataset. Correspondingly, the test data can be interpolated using this multidimensional surface [62]. The network is composed of three layers: the input layer, the hidden layer, and the output layer. Each layer is made up of elements that receive inputs and produce outputs. Elements of adjacent layers are linked to transmit information, whereas elements within the same layer are not connected to each other.

In the process of transmitting information, a Gaussian function is used as the following radial basis function:

(7) $h_{j}(x) = \exp\left(-\frac{\| x - c_{j} \|^{2}}{r^{2}}\right)$

where $h_{j}(x)$ is the output, defined as flash flood or non-flash flood classes, from $j$, the element in the hidden layer where the activation function is applied to analyze the relationship between input and output variables; $x = (x_{1}, \dots, x_{n})$ is the input data vector of flash flood conditioning factors linked to the element in the hidden layer; $c_{j}$ is the centre point of the basis function; and r is the radius of the basis function.
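Equation (7) can be evaluated for all hidden elements at once, as in the following minimal sketch. The function name, the two centres, and the radius are hypothetical values chosen only for illustration.

```python
import numpy as np

def rbf_hidden_layer(x, centres, r):
    """Evaluate h_j(x) = exp(-||x - c_j||^2 / r^2) for every centre c_j."""
    x = np.asarray(x, dtype=float)
    centres = np.asarray(centres, dtype=float)
    # Squared Euclidean distance from x to each centre (one row per centre).
    sq_dist = ((centres - x) ** 2).sum(axis=1)
    return np.exp(-sq_dist / r**2)

centres = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical centres c_1 and c_2
h = rbf_hidden_layer([0.0, 0.0], centres, r=1.0)
```

The activation equals 1 when the input coincides with a centre and decays towards 0 as the distance grows relative to r.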

4.5. Validation Methods

Validation methods such as Area Under the ROC Curve (AUC) and various statistical measures were used to validate and compare the models in this study. The ROC curve is a popular tool for evaluating the accuracy of a model and can be used to determine the accuracy of natural hazard susceptibility mapping [63,64,65,66,67,68]. Two values are used to build the ROC curve: sensitivity and 100-specificity [69,70,71,72,73,74]. Performance of the models is analyzed quantitatively using the AUC [75,76,77,78,79,80]. An AUC value of 1 indicates perfect classification, while a value of 0.5 corresponds to a model that performs no better than chance [81,82,83,84,85]. AUC values are calculated according to the equation:

(8) $AUC = \frac{\sum TP + \sum TN}{P + N}$

where TP and TN are considered the rate of pixels classified correctly as flood and non-flood, P and N are the total number of flash floods and non-flash floods, respectively.
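The area under the ROC curve can also be computed directly from the susceptibility scores. The sketch below uses the standard pairwise (Mann-Whitney) formulation rather than the expression given above: the AUC equals the probability that a randomly chosen flood sample receives a higher score than a randomly chosen non-flood sample. The function name and the example scores are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical susceptibility scores: label 1 = flood, 0 = non-flood.
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few hundred inventory points but not for full raster maps.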

Various statistical measures such as accuracy (ACC), sensitivity (SST), specificity (SPF), root mean squared error (RMSE), kappa (K), positive predictive value (PPV), and negative predictive value (NPV) were also selected to validate flash flood modelling [86]. PPV and NPV are the probabilities that pixels classified as “flood” and “non-flood”, respectively, are classified correctly [87]. SST is the proportion of flash flood pixels that are correctly classified, and SPF is the proportion of non-flash flood pixels that are correctly classified. K is used to analyze the reliability of modelling [88]. K varies between -1 and 1, and values close to 1 represent better reliability [8]. ACC is the ratio of the number of correct predictions to the total number of predictions [88]. RMSE represents the difference between observed and estimated data [89,90,91,92,93,94,95,96,97,98,99,100,101,102,103]. Equations for the different measures are given below:

(9) $SST = \frac{TP}{TP + FN}$

(10) $SPF = \frac{TN}{TN + FP}$

(11) $PPV = \frac{TP}{FP + TP}$

(12) $NPV = \frac{TN}{FN + TN}$

(13) $K = \frac{P_{p} - P_{exp}}{1 - P_{exp}}$

(14) $ACC = \frac{TP + TN}{TP + TN + FP + FN}$

(15) $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_{predicted} - X_{actual})^{2}}$

where FP and FN are the rates of pixels classified incorrectly as flood and non-flood, respectively. P_p is the rate of pixels classified correctly as flood or non-flood, and P_exp is the expected agreement. $X_{predicted}$ and $X_{actual}$ are the predicted and real values in the training or testing samples of the models, and n is the total number of samples in the training or testing samples.
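The measures in Equations (9) to (15) can be computed from the four confusion-matrix counts, as in the minimal sketch below. The expected agreement P_exp is not spelled out in the text, so the standard Cohen's kappa definition is assumed here. The function names are hypothetical, and the counts in the example are illustrative values chosen because they happen to reproduce the LMT validation column of Table 1; they are not reported by the study.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Statistical measures of Eqs. (9)-(14) from confusion-matrix counts."""
    n = tp + fp + tn + fn
    acc = (tp + tn) / n
    # Expected chance agreement (standard Cohen's kappa definition, assumed).
    p_exp = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    return {
        "SST": tp / (tp + fn),
        "SPF": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "ACC": acc,
        "K": (acc - p_exp) / (1 - p_exp),
    }

def rmse(predicted, actual):
    """Root mean squared error of Eq. (15)."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative counts (assumed, not taken from the study's data).
m = confusion_metrics(tp=36, fp=2, tn=37, fn=1)
```

With these counts, PPV is 36/38, NPV is 37/38, SST is 36/37, SPF is 37/39, and ACC is 73/76, which gives K of about 0.9211.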

5. Modelling Methodology

The methodology used for constructing the flash flood susceptibility map of the study area includes five steps (Figure 3): (1) Collection of data: Various thematic maps of factors were constructed using ArcGIS software in raster format with 20 m pixel size. These maps were sampled with the flash flood inventory to generate the sampling data for further processing; (2) Dataset preparation: In this study, the sampling data were randomly split into two parts: the training data (70%) used for constructing the models and maps, and the validation data (30%) used for validation of the models and maps; (3) Model configuration and implementation: Four models, namely KLR, RBFC, NBM, and LMT, were constructed using training data. Out of these models, RBFC was constructed with batch size, number of functions, number of threads, ridge, and seed of 100, 2, 1, 0.01, and 1, respectively; NBM was built with batch size of 100; LMT was built with batch size, minimum number of instances, and number of boosting iterations of 100, 15, and 1, respectively; KLR was built with batch size, lambda, number of threads, and seed of 100, 0.01, 1, and 1, respectively; (4) Model validation: In this step, validation of the flash flood susceptibility models was conducted by using PPV, NPV, SST, SPF, ACC, RMSE, K, and AUC values; (5) Development of flash flood susceptibility maps: In this step, flash flood susceptibility was evaluated using flash flood susceptibility indices that were produced from the model construction processes. These indices were then transferred to all the pixels of the flash flood zone in the study space and classified to determine susceptibility levels using the natural breaks classification method in the ArcGIS application, a popular method for classifying natural hazard susceptibility classes [104].
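The random 70/30 split of step (2) can be sketched as follows. The function name and the seed are hypothetical, and the total of 252 samples assumes that the 126 flash flood locations are paired with an equal number of non-flood points, which the text does not state explicitly.

```python
import random

def split_samples(samples, train_fraction=0.7, seed=1):
    """Shuffle the samples reproducibly and split them into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Assumed sample size: 126 flood locations plus 126 non-flood points.
train, valid = split_samples(range(252))
```

Fixing the seed makes the split reproducible, so that all four models are trained and validated on the same samples.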

6. Results and Analysis

6.1. Models Validation and Comparison

Performance of the models (RBFC, NBM, LMT, and KLR) is shown in Figure 4, Figure 5 and Figure 6 and summarized in Table 1 for both the training and validation datasets. For the training data, the results show that KLR and RBFC have the highest value of PPV (94.32%), and KLR has the highest values of NPV (95.45%), SST (95.4%), SPF (94.38%), and ACC (94.89%) compared with those of the other models. In the case of the validation data, LMT and NBM achieve the highest value of PPV (94.74%); LMT, KLR, and RBFC have the highest value of NPV (97.37%); and LMT has the highest values of SST (97.3%), SPF (94.87%), and ACC (96.05%) (Figure 4). In terms of K value, KLR has the highest value (0.8977) with training data whereas LMT has the highest value (0.9211) with validation data (Figure 5). Regarding the RMSE value, KLR has the lowest value of RMSE (0.215) with training data whereas LMT has the lowest value of RMSE (0.184) with validation data (Table 1). Based on these results, it can be stated that the performance of KLR is better than that of the other models on the training dataset; however, LMT has the best predictive capability compared to the other models on the validation dataset.

ROC curve results indicate that RBFC model (AUC = 0.983) outperforms three other models in terms of the training prediction rate (KLR:AUC = 0.982; NBM:AUC = 0.970; and LMT:AUC = 0.970). In terms of validation, LMT is more accurate in comparison to the other models with the AUC of 0.988, followed by KLR with AUC of 0.985, RBFC with AUC of 0.984 and NBM with AUC of 0.983, respectively (Figure 6).

6.2. Flash Flood Susceptibility Map

Flash flood susceptibility maps were constructed using four ML models (KLR, RBFC, NBM, and LMT) with five classes: very low, low, moderate, high, and very high (Figure 7). The distribution of each susceptibility class on the maps obtained with different methods is shown in Figure 8. The map generated by the KLR model indicates that 61.84% of the pixels are in the very low class, 6.372% in the moderate class, and 13.18% in the very high class. In the map constructed by the RBFC model, 47.63% of the study area is in the very low class, 11.33% in the moderate class, and 12.94% in the very high class. The map built by the NBM model shows 62.59% of the study area in the very low class, 6.641% in the moderate class, and 11.96% in the very high class. Finally, the map constructed by the LMT model shows that 40.06% of the area is in the very low class, 6.163% in the moderate class, and 9.589% in the very high class (Figure 8). The maps were also validated using the frequency ratio, which is the ratio of the percentage of flash flood pixels observed in each susceptibility class to the percentage of all pixels in that class, as shown in Figure 8. Validation results show that most of the flash flood pixels were observed in the high and very high classes. Moreover, the frequency ratio of flash floods observed in the high and very high classes of the map produced by LMT is higher than those of the maps produced by the other models (KLR, RBFC, and NBM). Thus, it can be stated that the map produced by LMT is more reliable than those of the other models.
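The frequency ratio used for map validation can be sketched as below. The function name and the pixel counts are hypothetical and serve only to show the calculation; they are not the values plotted in Figure 8.

```python
def frequency_ratio(flood_pixels, class_pixels):
    """Share of flash flood pixels in each class divided by the share of all pixels in that class."""
    total_flood = sum(flood_pixels.values())
    total_pixels = sum(class_pixels.values())
    return {
        name: (flood_pixels[name] / total_flood) / (class_pixels[name] / total_pixels)
        for name in class_pixels
    }

# Hypothetical pixel counts per susceptibility class.
class_pixels = {"very low": 600, "high": 300, "very high": 100}
flood_pixels = {"very low": 1, "high": 3, "very high": 6}
fr = frequency_ratio(flood_pixels, class_pixels)
```

A ratio above 1 means that flash floods are over-represented in a class relative to its area, which is the behaviour expected of the high and very high classes on a reliable map.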

7. Discussion

Determining the areas that are most susceptible to flash floods is considered to be the most critical issue for risk management and land-use planning. Although there are several different methods developed and applied for the flash flood zone prediction around the world, generation of a flash flood susceptibility map using suitable methods for a specific area remains a topic of concern among researchers. In this study, the main purpose is to assess and compare various methods to choose the best for generating an accurate flash flood susceptibility map of the mountain area of the Nghe An province, which is one of the most affected flash flood disaster area in Vietnam. For flash flood modelling, four methods, namely KLR, RBFC, NBM, and LMT, were selected as these are advanced and effective ML models for natural hazard prediction and assessment [105,106,107]. Conditioning factors may change depending on the local geo-environmental conditions of the study area [108]. In general, flash flooding occurs mainly on watersheds, especially in hilly areas, where the topography is favorable to rapid flow (runoff) in the event of heavy rainfall within a short time. Loss of vegetation accentuates the flooding process. Topography and river density affect the occurrence of flash flood [109]. Considering this, ten factors, namely soil, slope, curvature, river density, flow direction, distance from rivers, elevation, aspect, land use, and geology, were used to construct the flood database for modelling.

In the context of spatial planning, selection of suitable models for the generation of accurate flood susceptibility maps is desirable to avoid damage to property and human losses [110]. Out of the four models considered in this paper, KLR is the best on the training data. However, LMT achieves a higher predictive capability during the validation process and is therefore more reliable than the other models for flash flood susceptibility mapping. The performance of LMT is related to its robustness and noise reduction, as well as its reduced variance and overfitting. In addition, KLR uses the fractal dimension for input data, and thus performed well on the training dataset. Results also indicate that NBM is less accurate than the other three models, as it rests on the assumption of independence among the conditioning factors, which could influence its accuracy. Overall, the four flash flood models have an acceptable performance for assessing flash flood susceptibility, but LMT is the best of the four.

Even though flash flood prediction ability may decrease when a low proportion of training samples is used, in the present case the models demonstrated robustness. Given the complexity of flash floods and the interaction of several factors, a comparison of more modelling methods is required. Different sets of characteristics and factors can also be determined using various techniques, which would provide different points of view regarding feature selection and improvement of the performance of machine-learning models.

8. Conclusions

In this study, four ML models, namely LMT, KLR, RBFC, and NBM, were used to generate flash flood susceptibility maps of Nghe An province in Vietnam. For this purpose, 126 historic flash flood events and ten conditioning factors (soil, slope, curvature, river density, flow direction, distance from rivers, elevation, aspect, land use, and geology) were used for the construction of the flash flood database for modelling. Various methods such as area under the ROC curve (AUC) and several statistical measures were used for the validation and comparison of the models.

Validation results show that LMT had the best performance (AUC = 0.988), followed by KLR (0.985), RBFC (0.984), and NBM (0.983), respectively. The LMT model also achieved the highest PPV (94.74%), NPV (97.37%), SST (97.3%), SPF (94.87%), and ACC (96.05%) in comparison to the other models. Therefore, this method can also be used for flash flood susceptibility mapping of other areas. There is always scope for improving the performance of the methods adopted in this study by using different combinations of ML models and by considering greater numbers of flash flood events and influencing factors, depending on the physical, hydrological, and meteorological conditions of the area.

Author Contributions

Conceptualization, B.T.P., N.A.-A., H.D.N., L.S.H., H.-B.L., I.P., A.A., and D.T.B.; Data curation, L.S.H., H.D.N., T.T.T. and H.P.H.Y.; Formal analysis, T.V.P., H.D.N., C.Q., N.A.-A., L.S.H., T.T.T., H.P.H.Y. and H.-B.L.; Funding acquisition, N.A.-A.; Methodology, B.T.P., T.V.P., and D.T.B.; Project administration, B.T.P., N.A.-A., and I.P.; Supervision, B.T.P., H.-B.L., I.P. and D.T.B.; Validation, H.P.H.Y., H.-B.L., A.A., and I.P.; Visualization, H.D.N., A.A., T.T.T. and H.P.H.Y.; Writing (original draft), B.T.P., T.V.P., H.D.N., A.A., C.Q., N.A.-A., L.S.H., T.T.T., H.P.H.Y. and H.-B.L.; Writing (review and editing), A.A., B.T.P., N.A.-A., and I.P. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financially supported by the research fund of Vinh University, Vietnam in Nghe An Province, Vietnam.

Acknowledgments

We thank the Department of Natural Resources and Environment, Nghe An province (Vietnam) for providing the data used in this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Figures and Table

Figure 1. Location of the study area and flash floods.

Figure 2. Maps of flash flood conditioning factors: (a) slope, (b) aspect, (c) curvature, (d) river density, (e) flow direction, (f) distance from rivers, (g) elevation, (h) soil, (i) land use, and (j) geology.

Figure 3. Methodological flow chart of this study.

Figure 4. Value of statistical measures of the models.

Figure 5. Kappa values of the models.

Figure 6. ROC analysis of the models: (a) training dataset; and (b) testing dataset.

Figure 7. Flood susceptibility maps using various models: (a) KLR, (b) RBFC, (c) NBM, (d) LMT.

Figure 8. Analysis of the frequency of flash floods on the susceptibility maps (class pixels represents the total number of pixels in whole susceptibility class and flash flood pixels is the total number of flash flood pixels observed in the susceptibility class).

Table 1. Summary of validation results of the models.

Statistical Measure	Training Dataset				Validation Dataset
	KLR	RBFC	NBM	LMT	KLR	RBFC	NBM	LMT
PPV (%)	94.32	94.32	92.05	93.18	92.11	92.11	94.74	94.74
NPV (%)	95.45	94.32	92.05	93.18	97.37	97.37	92.11	97.37
SST (%)	95.4	94.32	92.05	93.18	97.22	97.22	92.31	97.3
SPF (%)	94.38	94.32	92.05	93.18	92.5	92.5	94.59	94.87
ACC (%)	94.98	94.32	92.05	93.18	94.47	94.74	93.42	96.05
RMSE	0.215	0.222	0.254	0.241	0.205	0.207	0.217	0.241
K	0.8977	0.8864	0.8409	0.8636	0.8947	0.8947	0.8684	0.9211
AUC	0.982	0.983	0.970	0.970	0.985	0.984	0.983	0.988
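
The percentage measures and the Kappa index (K) in Table 1 all follow from a 2 × 2 confusion matrix. The sketch below uses the standard definitions of these measures. The confusion counts (TP = 36, FP = 2, TN = 37, FN = 1) are not reported in the text; they are an assumption, chosen because they are consistent with the LMT validation column. The function name `validation_measures` is likewise hypothetical.

```python
def validation_measures(tp, fp, tn, fn):
    """Standard confusion-matrix measures for a binary flood/non-flood classifier."""
    n = tp + fp + tn + fn
    acc = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "PPV": tp / (tp + fp),   # positive predictive value
        "NPV": tn / (tn + fn),   # negative predictive value
        "SST": tp / (tp + fn),   # sensitivity
        "SPF": tn / (tn + fp),   # specificity
        "ACC": acc,              # overall accuracy
        "K": (acc - chance) / (1 - chance),  # Cohen's kappa
    }


# Assumed counts (not given in the source), consistent with the LMT validation column.
lmt = validation_measures(tp=36, fp=2, tn=37, fn=1)
for name in ("PPV", "NPV", "SST", "SPF", "ACC"):
    print(name, round(100 * lmt[name], 2))
print("K", round(lmt["K"], 4))
```

With these assumed counts the sketch gives PPV = 94.74%, NPV = 97.37%, SST = 97.3%, SPF = 94.87%, ACC = 96.05%, and K = 0.9211, matching the LMT validation column of Table 1.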

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Flash flood risk is currently an important problem in many parts of Vietnam. In this study, we used four machine-learning methods, namely Kernel Logistic Regression (KLR), Radial Basis Function Classifier (RBFC), Multinomial Naïve Bayes (NBM), and Logistic Model Tree (LMT), to generate flash flood susceptibility maps for a small part of Nghe An province in the Central region of Vietnam, where floods occur recurrently. The performance of these four methods was evaluated to select the best method for flash flood susceptibility mapping. Ten flash flood conditioning factors, namely soil, slope, curvature, river density, flow direction, distance from rivers, elevation, aspect, land use, and geology, were chosen based on the topography and geo-environmental conditions of the site. The models were validated using the area under the Receiver Operating Characteristic (ROC) curve (AUC) and various statistical indices. The results indicated that all the models perform well in generating flash flood susceptibility maps (AUC = 0.983–0.988). However, the LMT model performs best among the four methods (LMT: AUC = 0.988; KLR: AUC = 0.985; RBFC: AUC = 0.984; and NBM: AUC = 0.983). The present study would be useful for constructing accurate flash flood susceptibility maps that identify flood-susceptible areas/zones for proper flash flood risk management.

Details

Title

A Comparative Study of Kernel Logistic Regression, Radial Basis Function Classifier, Multinomial Naïve Bayes, and Logistic Model Tree for Flash Flood Susceptibility Mapping

Author

Binh Thai Pham¹; Tran Van Phong²; Nguyen, Huu Duy³; Chongchong Qi⁴; Al-Ansari, Nadhir⁵; Amini, Ata⁶; Lanh Si Ho⁷; Tran Thi Tuyen⁸; Hoang Phan Hai Yen⁹; Hai-Bang Ly¹; Prakash, Indra¹⁰; Dieu Tien Bui¹¹

¹ University of Transport Technology, Hanoi 100000, Viet Nam; [email protected] (B.T.P.); [email protected] (H.-B.L.)
² Institute of Geological Sciences, Vietnam Academy of Sciences and Technology, 84 Chua Lang Street, Dong da, Hanoi 100000, Viet Nam
³ Faculty of Geography, VNU University of Science, 334 Nguyen Trai, Hanoi 100000, Viet Nam; [email protected]
⁴ School of Resources and Safety Engineering, Central South University, Changsha 410083, China; [email protected]
⁵ Department of Civil, Environmental and Natural Resources Engineering, Lulea University of Technology, 971 87 Lulea, Sweden
⁶ Kurdistan Agricultural and Natural Resources Research and Education Center, AREEO, Sanandaj 66177-15175, Iran; [email protected]
⁷ Institute of Research and Development, Duy Tan University, Da Nang 550000, Vietnam
⁸ Department of Resource and Environment Management, School of Agriculture and Resources, Vinh University, Nghe An 470000, Vietnam; [email protected]
⁹ Department of Geography, School of Social Education, Vinh University, Nghe An 470000, Vietnam
¹⁰ Department of Science & Technology, Bhaskarcharya Institute for Space Applications and Geo-Informatics (BISAG), Government of Gujarat, Gandhinagar 382002, India; [email protected]
¹¹ Geographic Information System group, Department of Business and IT, University of South-Eastern Norway, 3674 Notodden, Norway

First page

239

Publication year

2020

Publication date

2020

Publisher

MDPI AG

e-ISSN

20734441

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/w12010239

ProQuest document ID

2550490565
