Probabilistic Models Significantly Reduce

Introduction

Quantifying the future economic risk of pluvial flooding is critical for climate change adaptation of an increasing urban population. Pluvial flooding, often referred to as surface water flooding, is directly caused by extreme rainstorms with rainfall rates exceeding the capacity of the urban drainage system. Cities around the globe have been impacted by recent pluvial flood events. Large‐scale pluvial flooding in the Houston area in Texas during Hurricane Harvey led to 68 deaths and estimated total damage in the range of U.S.$90 to 160 billion, making it the second most expensive natural disaster in the history of the United States (Blake & Zelinsky, ). Other examples include flooding after a rainstorm in Copenhagen in 2011, causing total economic damage of U.S.$1 billion (Wojcik et al., ), and in Beijing in 2012, causing total economic damage of U.S.$1.86 billion and 79 fatalities (Wang et al., ). An increasing pluvial flood risk caused by an expected increase in the intensity and frequency of heavy precipitation events (Donat et al., ; Kundzewicz et al., ), combined with ongoing urbanization and a concentration of population and assets in cities (Skougaard Kaspersen et al., ), motivates the need to assess the current and future risk of pluvial flooding. A review by Rosenzweig et al. () identified the lack of knowledge in the quantification of present and future pluvial flood impacts as one of three key research areas for the development of flood resilient cities. However, pluvial flood risk is mostly excluded or neglected in flood risk analysis, although there is evidence that the high frequency of these events leads to long‐term cumulative losses comparable to those of less frequent but severe flood events (Veldhuis, ). This lack of knowledge extends to risk management and mitigation plans. With few exceptions, official flood hazard maps are exclusively focused on fluvial and coastal flood risk. For the conterminous United States, Wing et al. () found that the poor coverage of urban catchments in flood hazard maps produced by the Federal Emergency Management Agency (FEMA) has led to an underestimation of the population affected by pluvial and fluvial flooding by a factor of 2.6–3.1. With scarce information on the hazard, only a few loss estimation models for pluvial floods have been developed. Existing approaches include adapting water depth‐damage functions (also known as stage‐damage models) from river floods (Freni et al., ; Olsen et al., ; Zhou et al., ), using multiple linear regression models (Van Ootegem et al., ), or correlating rainfall measurements with insurance claims or survey data (Spekkers et al., ; Van Ootegem et al., ). However, the lack of data, the complex nature of the hazard and impact, and the lack of a consistent quantification of the associated uncertainties have so far hampered an extensive estimation of expected pluvial flood losses needed to decide on adaptation strategies in cities. Van Ootegem et al. (, ) construct different multivariate pluvial flood damage models from survey data of a study in Belgium based on water depth‐damage and rainfall‐damage relationships. Key findings of their study include the importance of additional nonhazard variables such as risk awareness and the effect of reported zero loss cases. However, the results do not provide information as to whether additional variables can also improve loss estimates.

In this study, we use probabilistic high‐resolution loss models to estimate pluvial flood losses on different spatial scales. Unlike widely used deterministic stage‐damage functions, these probabilistic univariable and multivariable loss models provide a consistent approach to quantify how certain a loss prediction is by providing predictive distributions instead of point estimates. Application and validation of different high‐resolution probabilistic loss models in Harris County, Texas, reveal significant differences in the dispersion and reliability of property and county level pluvial flood loss predictions for Hurricane Harvey. Only two of the six tested models reliably predicted the reported loss, and the widths of their 90% prediction intervals differ by 78%, equal to an absolute difference of U.S.$3.8 billion for pluvial flood building structure loss in Harris County. These results have major implications for cost‐benefit analysis of flood risk management and adaptation decisions in cities.

Background

With the need to adapt cities to an expected increase in pluvial flood risk, decision makers face the challenge of making appropriate decisions under uncertainty about how the risk of pluvial flooding, including the expected losses, will evolve in the future. As uncertainties in flood loss estimates are usually high, probabilistic loss models could greatly aid comprehensive pluvial flood risk management (Todini, ). Unlike deterministic estimates, probabilistic predictions provide continuous predictive distributions, where the dispersion of the distribution gives the range in which an expected loss would fall with a certain probability (e.g., 90%). The reliability of a probabilistic prediction can be expressed as the ability of the predictive distribution to cover the actual observed loss. Although probabilistic loss models have been developed for river floods (Dottori et al., ; Kreibich et al., ; Schröter et al., ), these models are the exception, and deterministic estimates based on empirical or synthetic relationships between the water depth and the absolute or relative building loss are still widely used for loss estimation for all types of flooding (Gerl et al., ; Merz et al., ; Scawthorn et al., ). The resulting loss estimates in these stage‐damage functions are commonly expressed as point estimates for the repair and/or replacement costs in monetary values (i.e., U.S.$) or as a percentage of the depreciated value of a building. Instead of the direct quantification of uncertainty inherent to probabilistic predictions, uncertainty in stage‐damage functions is often based on expert judgment and/or on a range of possible outcomes calculated using different loss functions (Dittes et al., ). Missing information and/or a lack of consistency in quantifying how certain a loss estimate is makes it challenging for decision makers to, for example, evaluate the potential of an adaptation measure to reduce future losses. 
While the deviations of point estimates for deterministic loss models are often shown to be reasonably small for loss estimates on large spatial scales typical for river or coastal flooding, loss predictions become highly uncertain on smaller scales (i.e., individual buildings; Merz et al., ; Scawthorn et al., ). However, due to the local extent and small‐scale variations of pluvial flooding, reliable small‐scale loss models are required to quantify pluvial flood risk for a specific location. In this context, we use machine learning as well as different univariable and multivariable probabilistic approaches to investigate three main research objectives: we (i) identify important loss influencing variables and their effect on the uncertainty of loss predictions; (ii) analyze the potential of parametric and nonparametric probabilistic approaches for reducing the dispersion and increasing the reliability of building‐level loss estimates; and (iii) evaluate the applicability of probabilistic multivariable loss models in the context of new sensors and data sources for pluvial flood loss estimation on different spatial scales (Ford et al., ; Schröter et al., ).

Materials and Methods

Data

We construct a data set that consists of self‐reported pluvial flood losses and related information of private households. The data were obtained through a standardized questionnaire using computer‐aided telephone surveys after pluvial flood events in five cities in Germany between 2005 and 2014 (Rözer et al., ; Spekkers et al., ). Based on 120 items in the questionnaire, a data set with 56 predictors and two loss variables is constructed covering eight groups: reported loss, hazard, warning, emergency response, precaution, experience, building information, and socioeconomic information. The loss variables are the relative loss (rloss) and a binary variable indicating whether a building suffered structural damage or not (dam). rloss is on the scale from 0 (no loss) to 1 (total loss) and is obtained by normalizing the reported direct building loss in Euro [EUR] with the total replacement cost value less depreciation of the respective building. We exclude observations where rloss could not be derived due to missing information on the building replacement value or the reported loss itself, resulting in a total of 431 observations. Out of 56 predictors in the data set, 12 are excluded from the analysis because of their zero or near‐zero variance, resulting in 44 variables to be considered for further analysis. To address the issue of censoring zero loss observations, pluvial flood affected households without direct building loss are included in the data set if water intrusion into the building was reported (9% of observations; see Van Ootegem et al., ). Missing values in other variables were imputed using complementary information available in the questionnaire (e.g., missing information on the total living area was derived from the building footprint and the number of habitable floors). In the few cases where such inference was not possible, missing values are imputed using nearest neighbor imputation. 
A more detailed description of the data including a table describing all 56 predictors, the two loss variables, the variables excluded from the analysis, and the percentage of imputed missing values is provided in the supporting information (SI; Data section).

Detection of Important Loss Influencing Variables

Prior to the actual model development, we use machine learning to screen the previously described data set for variables with the highest predictive power given the complex correlations and interdependencies in the data set. A reduced set of variables out of the full 44 variables is then used to develop the multivariable probabilistic models described in the following section. The most important loss influencing variables are detected by using an ensemble of variable importance measures of two tree‐based (Bagging [cRF; Strobl et al., ] and Boosting [GBM; Friedman, ]) and two linear regression‐based (Ridge [Hoerl & Kennard, ] and LASSO [Tibshirani, ]) machine learning algorithms. The four different types of algorithms are used in two different settings: a binary classification between loss/no loss (dam) and a regression analysis modeling the degree of loss (rloss) of a building. Based on the variable importance score of each variable, its rank within each ensemble member as well as its overall rank is determined. The top five variables with the highest overall rank for rloss and dam are further considered in the model development process.

For details on the variable selection procedure, see SI (Materials and Methods section).
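The rank aggregation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the helper names `rank_scores` and `overall_rank` are hypothetical, and the input ranks are the rloss ranks reported in the table of mean variable importance scores for the five listed variables.

```python
import numpy as np

def rank_scores(scores):
    """Rank importance scores in descending order (1 = most important)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def overall_rank(member_ranks):
    """Median rank across ensemble members (one row per variable)."""
    return np.median(np.asarray(member_ranks, dtype=float), axis=1)

# Ranks within cRF, GBM, LASSO, and Ridge as reported for rloss.
variables = ["wd", "d", "bu", "con", "hs"]
member_ranks = [
    [1, 1, 1, 1],    # water depth
    [2, 2, 3, 2],    # duration
    [9, 13, 2, 3],   # basement
    [8, 17, 4, 4],   # contamination
    [4, 8, 7, 5],    # household size
]
median_ranks = dict(zip(variables, overall_rank(member_ranks).tolist()))
print(median_ranks)
```

Using the median rather than the mean keeps a single outlying rank (for example, rank 17 for contamination in GBM) from dominating the overall rank.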

Probabilistic Loss Estimation Models

Bayesian zero‐inflated beta regression (Ospina & Ferrari, ) is used to predict the relative loss to a building by pluvial flooding (rloss) using the previously selected important loss influencing variables. The probabilistic prediction y for rloss on the interval [0,1) is modeled as follows. We define z_i to be a binary variable for the occurrence of flooding in the ith observation and estimate it with a logistic regression: $z_{i} \sim \mathrm{Bernoulli}(\gamma X_{i})$, where X_i is the vector of predictors for the ith observation, γ is the vector of coefficients, and Bernoulli(θ) indicates a Bernoulli trial with probability θ. Once z_i is known, we can calculate y_i following a zero‐inflated Beta regression model: $y_{i} = \begin{cases} \mathrm{Beta}(\alpha_{i}, \beta_{i}), & z_{i} = 1 \\ 0, & z_{i} = 0 \end{cases}$ where α_i > 0 and β_i > 0 are the two shape parameters of the Beta distribution. To estimate these parameters, we define $\begin{aligned} \alpha_{i} &= \mu_{i} \phi \\ \beta_{i} &= (1 - \mu_{i}) \phi \end{aligned}$ following Ferrari and Cribari‐Neto (). This parameterization allows us to define $\mu_{i} = X_{i} \beta$, where β is the coefficient vector for the Beta regression. In summary, our zero‐inflated beta regression model conducts simultaneous inference on the vector γ, the vector β, and the scalar ϕ, given observations of flood occurrence z, flood damage y (i.e., the variable rloss), and predictive variables X.
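To make the generative structure of the zero‐inflated beta model concrete, the following sketch draws relative losses for given coefficients. It is an illustration under stated assumptions, not the fitted model: logit links are assumed for both linear predictors so that the probabilities and means stay inside (0, 1), and the coefficient values, the precision `phi`, and the water depths are hypothetical.

```python
import numpy as np

def inv_logit(x):
    """Map a real-valued linear predictor to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sample_zero_inflated_beta(X, gamma, beta, phi, rng):
    """Draw one relative loss per row of X from a zero-inflated beta model."""
    p_loss = inv_logit(X @ gamma)        # assumed logit link for P(z_i = 1)
    mu = inv_logit(X @ beta)             # assumed logit link for the Beta mean
    alpha_i = mu * phi                   # mean-precision parameterization
    beta_i = (1.0 - mu) * phi
    z = rng.random(X.shape[0]) < p_loss  # loss / no loss indicator
    y = np.where(z, rng.beta(alpha_i, beta_i), 0.0)
    return y, z

rng = np.random.default_rng(42)
wd = rng.uniform(0.0, 1.0, size=5000)    # hypothetical water depths
X = np.column_stack([np.ones_like(wd), np.sqrt(wd)])
gamma = np.array([1.5, 2.0])             # hypothetical coefficients
beta = np.array([-4.0, 2.0])             # hypothetical coefficients
phi = 20.0                               # hypothetical precision
y, z = sample_zero_inflated_beta(X, gamma, beta, phi, rng)
print(f"zero-loss share: {np.mean(~z):.2f}, mean rloss: {y.mean():.3f}")
```

In a Bayesian fit, the same sampling step would be repeated for every posterior draw of γ, β, and ϕ, which yields a full predictive distribution per building instead of a single value.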

The probabilistic predictions of rloss from the Bayesian zero‐inflated beta model (Beta) are compared with probabilistic predictions of two additional model types used for empirical flood loss estimation in previous studies. The first is a simple Bayesian parametric model based on a Gaussian response distribution, which serves as a probabilistic representation of a model type widely used in flood loss estimation (Gerl et al., ; Van Ootegem et al., ). The second is a nonparametric model based on the RandomForest algorithm, which has been used for probabilistic flood loss estimation in previous studies (Schröter et al., ). The three model types (Beta, Gaussian, and RandomForest) are fit as univariable and multivariable models (i.e., with a single predictor in X or with multiple predictors) to investigate the effect of additional variables on the predictive performance, resulting in six different models in total. The univariable models are fit using water depth wd as their only predictor, reflecting the current standard in flood loss estimation (Gerl et al., ; Merz et al., ). The univariable parametric models (Beta and Gaussian) are fit with the square root of the water depth to be comparable with reference functions in previous studies (Merz et al., ; Schröter et al., ; Wagenaar et al., ). All multivariable models use the set of predictors shown in Table . For more details on the models, including details on the priors of the Bayesian models, see SI (Materials and Methods section).

Mean Variable Importance Scores of the Five Most Important Predictors for rloss and dam on the Scale (0, 100) for Each Ensemble Member (Tree‐Based Bagging [cRF] and Boosting [GBM]; Penalized Regression With L1 [LASSO] and L2 [Ridge] Regularization)

Name	Variable	cRF	GBM	LASSO	Ridge	Avg. rank	Corr
Degree of loss (rloss)
Water depth	wd	100 (1)	100 (1)	94 (1)	97 (1)	1	+
Duration	d	38 (2)	50 (2)	81 (3)	90 (2)	2	+
Basement [Y/N]†	bu	12 (9)	11 (13)	84 (2)	85 (3)	6	+
Contamination [Y/N]	con	15 (8)	9 (17)	77 (4)	81 (4)	6	+
Household size†	hs	17 (4)	17 (8)	45 (7)	64 (5)	6	‐
Loss/no loss (dam)
Water depth	wd	99 (1)	100 (1)	89 (1)	90 (2)	1	+
Household size	hs	84 (2)	14 (2)	67 (3)	93 (1)	2	‐
Knowledge hazard	pre1	72 (3)	6 (4)	48 (7)	81 (3)	3.5	‐
Age of respondent†	age	69 (4)	13 (3)	3 (32)³	42 (9)	6.5	+
Multifamily home [Y/N]	bt	49 (7)	1 (11)³	50 (6)	51 (6)	6.5	‐

Note. Corr indicates direction of the trend: “+” increasing; “‐” decreasing. Numbers in parentheses indicate the rank within each ensemble member. Avg. rank indicates the overall rank based on the median rank across ensemble members. Variables marked with a “†” showed no improvement in the predictive performance of the probabilistic loss models and were therefore not considered in the final models. Importance scores not stable.

Model Validation and Comparison

We validate the probabilistic loss predictions on the building level for the previously described models and data using 10‐fold cross validation. For determining the error of the point estimate (median of the predictive distribution), the root‐mean‐square error (RMSE) and the mean bias error (MBE) are used. For validating and comparing the reliability of the loss estimate, we calculate the hit rate (HR), meaning the percentage of cases where the observed value lies within the 90% highest density interval (HDI) of the predictive distribution. We use the width of the 90% HDI to evaluate the dispersion of the predictive distribution. In addition, we calculate the interval score, a combined dispersion and reliability score, penalizing predictions based on the width of the 90% HDI and the percentage of observations that are outside the 90% HDI of the respective predictive distributions (Gneiting & Raftery, ). To evaluate the effect of including the option to have no building loss in the model, we validate and compare the different models for three scenarios: one where zero‐loss observations are removed from the data set prior to fitting the model, one where the zero‐loss observations are kept in the data set (zero‐loss proportion 9%), and one where the proportion of zero‐loss observations is upsampled to 20%. Details on the validation procedure and the different scores used to compare the models are provided in SI (Materials and Methods section).
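The reliability and dispersion measures above can be computed directly from samples of a predictive distribution. The sketch below is one possible implementation under stated assumptions: the HDI is taken as the shortest interval containing 90% of the samples, and the interval score follows the general form in Gneiting and Raftery, $S_{\alpha}(l, u; x) = (u - l) + \frac{2}{\alpha}(l - x)\,\mathbb{1}\{x < l\} + \frac{2}{\alpha}(x - u)\,\mathbb{1}\{x > u\}$ with α = 0.1. The function names are hypothetical, and any scaling used in the SI may differ.

```python
import numpy as np

def hdi(samples, mass=0.9):
    """Shortest interval that contains a share `mass` of the samples."""
    s = np.sort(np.asarray(samples, dtype=float))
    n = len(s)
    k = int(np.ceil(mass * n))              # samples inside the interval
    widths = s[k - 1:] - s[: n - k + 1]     # widths of all candidate intervals
    i = int(np.argmin(widths))
    return s[i], s[i + k - 1]

def hit_rate(lower, upper, obs):
    """Share of observations that fall inside their interval."""
    lower, upper, obs = map(np.asarray, (lower, upper, obs))
    return float(np.mean((obs >= lower) & (obs <= upper)))

def interval_score(lower, upper, obs, alpha=0.1):
    """Interval width plus a penalty for observations outside the interval."""
    lower, upper, obs = map(np.asarray, (lower, upper, obs))
    below = (2.0 / alpha) * np.maximum(lower - obs, 0.0)
    above = (2.0 / alpha) * np.maximum(obs - upper, 0.0)
    return (upper - lower) + below + above

# Hypothetical predictive samples for three buildings and their observed rloss.
rng = np.random.default_rng(1)
pred = rng.beta(1.0, 30.0, size=(3, 2000))
obs = np.array([0.016, 0.0, 0.25])
bounds = np.array([hdi(p) for p in pred])
print(hit_rate(bounds[:, 0], bounds[:, 1], obs))
print(interval_score(bounds[:, 0], bounds[:, 1], obs).mean())
```

A lower interval score is better: narrow intervals are rewarded, but each missed observation adds a penalty proportional to its distance from the interval.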

Application Harris County, TX

We apply the previously trained probabilistic loss models in Harris County, TX, to analyze the potential for reducing the dispersion and improving the reliability of probabilistic loss estimates for direct building damage of private households caused by pluvial flooding during Hurricane Harvey. To demonstrate the feasibility of probabilistic building‐level loss estimation, we construct a high‐resolution data set from publicly available data sources for Harris County, TX. Based on refined pluvial flood inundation maps for Hurricane Harvey provided by JBA Risk Management (), detailed information on affected properties is gathered from the Harris County Appraisal District Real & Personal Property Database, including the type and value of each affected building (HCAD, ). In addition, census information is used to derive the average household size on the block level (U.S. Census Bureau, ). Besides this information, the constructed data set contains data on the knowledge about the flood hazard, based on whether a property is within the 100‐year flood zone derived by FEMA (Zone A), and the probability of a property being affected by contamination. The contamination data were created by spatially interpolating reported point sources of contamination from the National Response Center of the United States Coast Guard and volunteered geographic information using 2‐D kernel density interpolation (National Response Center, ; Sierra Club). The resulting data set for Harris County contains information on more than 304,000 individual buildings affected by pluvial flooding during Hurricane Harvey. For validation and visualization, the property‐level loss distributions of each model are aggregated on the zip code as well as on the county level. 
The aggregated loss estimates are validated using the sum and average total building damage from FEMA's Housing Assistance Program available on the zip code level as well as for the entire county for Hurricane Harvey (Federal Emergency Management Agency, ). Details on the data sets and models used in Harris County including the validation data are provided in SI (Materials and Methods section).

Results

Important Loss Influencing Variables

Screening the high‐dimensional data set for the most important loss influencing variables to be considered in the probabilistic loss models, we find that the drivers for having loss or not having any loss (dam) and the drivers for the degree of loss (rloss) to a building are different, indicating different damaging mechanisms. While both cases share the water depth as the most important predictor, other important predictors hardly overlap. Looking at the second to fifth most important predictors for dam, the resistance of a building and its inhabitants is decisive. Given a low inundation depth, larger households, multifamily buildings, younger residents, and residents who previously informed themselves about pluvial flooding have a lower probability of having any loss. In contrast, the second and fourth most important predictors influencing rloss are directly related to the flood intensity. Higher inundation depths, longer flood duration, and contamination of the flood water lead to higher losses. The variable importance scores of the five most important predictors for the four machine learning algorithms, their rank within each ensemble member, and the median rank of all ensemble members are summarized in Table . After the most important predictor, both the overall rank and the importance scores drop sharply. Of the five preselected important loss influencing variables shown in Table , we find three variables for rloss and four variables for dam to improve the predictive performance in the probabilistic loss models. Variable importance values for all 44 variables and differences between the machine learning algorithms are shown in SI (Results section).

Predictive Performance of Probabilistic Models

The prediction performance of the six probabilistic models (univariable and multivariable models for Gaussian, RandomForest, and Beta) for the cross‐validated predictions is summarized in Table . Looking solely at the error of the point estimate of the predictions (median of the predictive distribution), we find only minor, nonsignificant differences in root‐mean‐square error among the three model types for both the univariable and multivariable versions. However, for the 90% HDI of each predictive distribution, the parametric Beta and Gaussian models are significantly more reliable, with an average HR of 97% and 95% for the univariable and multivariable Beta models and 91% for both Gaussian models, compared to 49% and 67% for the RandomForest counterparts. Yet when we control the HR of the predictive distributions for dispersion and distance to missed observations using the interval score, the high HR scores of the Gaussian models can be attributed to consistently wider 90% HDIs (see Figure b) compared to the other two models. The difference in shape and width of the predictive distributions of the different models is illustrated in Figure a for the example of a loss estimate for a single building with an observed rloss of 0.016. While the RandomForest models tend to give very sharp predictive distributions with shapes close to a normal distribution, the predictive distributions of the Gaussian and Beta models both have longer tails. The almost lognormal shape of the Gaussian models is caused by the backtransformation of the logit‐transformed predictive distribution. Although the sharp predictive distributions of the RandomForest models lead to considerably narrower prediction intervals, they significantly increase the risk of the 90% HDI not covering the actual observed loss (see Table ). With their flexibility in shape and clearly defined interval of the response distribution, we find the Beta models to provide the best trade‐off between reliability and dispersion. 
Compared to the widely used reference function (univariable Gaussian), the univariable and multivariable Beta models have between 47% and 50% narrower HDIs with HRs above 90%. Comparing the univariable and multivariable models, we find an increase in the variability in shape and width of the predictive distributions for all multivariable models. Although this increase in variability shows only a minor, nonsignificant improvement in accuracy, reliability, and dispersion (see Table ), we find that multivariable models perform significantly better than models using the water depth as the only predictor when individual predictions are aggregated (see Figure c).

Performance of Loss Model Predictions for Out of Sample Observations (Median)

Model type	Variables	RMSE	MBE	Hit rate (90% PI)	Interval score (90% PI)
Gaussian	univariable	0.028 (0.018)	0.015 (0.008)	0.91 (0.01)	0.26 (0.01)
	multivariable	0.027 (0.017)	0.013 (0.007)	0.91 (0.02)	0.25 (0.02)
RandomForest	univariable	0.028 (0.017)	0⁵ (0.009)	0.49 (0.07)	0.17⁵ (0.11)
	multivariable	0.025 (0.016)	0.005 (0.008)	0.67 (0.08)	0.11⁵ (0.08)
Beta	univariable	0.027 (0.017)	0.010 (0.008)	0.97 (0.06)	0.09⁵ (0.08)
	multivariable	0.025 (0.017)	0.009 (0.008)	0.95 (0.07)	0.08⁵ (0.08)

Note. Standard deviation in brackets. RMSE = root‐mean‐square error; MBE = mean bias error. Significantly different from Gaussian model for the 0.05 significance level (univariable and multivariable models, respectively). Significantly different from univariable models for the 0.05 significance level for each model type.

Probabilistic predictive distributions of different univariable and multivariable models (RandomForest, Gaussian, and Beta) for cross‐validated observations. The predictive distributions for Gaussian and Beta models are based on 2,000 MCMC samples from the respective posterior predictive distributions. The predictive distributions from the RandomForest model are based on the predictions of 2,000 individual trees used for training the forest. (a) The different predictive distributions for a single household (single‐family home) with a recorded relative loss of 0.016 (dotted vertical line). The upper plot of (a) shows the predictive distributions for three univariable models using the water level as only predictor (dashed lines). The lower plot of (a) shows the same three model types, but with five additional predictors (solid lines). (b) The widths of the 90% HDI for the predictive distributions of all cross‐validated observations (n = 431) are summarized. The points show the medians for the univariable (hollow) and multivariable (solid) models for the three different model types. The gray boxes show the 25th to 75th percentile ranges for each model. HDI = highest density interval.

Effect of Zero‐Loss Cases on the Damage Estimates

The often low water levels of pluvial flooding compared to river or coastal flooding increase the chances that direct building loss can be completely avoided even though water entered the building. Analyzing different zero‐loss proportions, we find that not explicitly accounting for these cases can considerably affect model predictions in terms of reliability and dispersion of the predictive distribution. For the Gaussian models, none of the 38 zero‐loss observations in the data set were inside the respective 90% HDI; for the multivariable RandomForest model, 28 were. With increasing zero‐loss proportions, we observe a significant increase in the reliability of the RandomForest model and a significant increase in the width of the 90% HDI of the loss prediction for the Gaussian model (Figure ). The increase in reliability of the RandomForest model reflects the capability of the model to learn implicitly to account for zero‐loss cases when the learning sample becomes large enough. Without the possibility to consider zero‐loss cases, a higher proportion of zero‐loss observations simply adds additional variability, which the Gaussian models cannot explain. Bias caused by varying zero‐loss proportions is found to be reduced to a minimum by explicitly accounting for zero‐loss observations in the (zero‐inflated) Beta models (see Beta model in Figure ). Findings for the univariable models are, for the sake of readability, shown in SI (Results section).
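The scenario with an increased zero‐loss proportion can be reproduced with simple resampling. The sketch below is an assumption about how such upsampling could be done (drawing zero‐loss rows with replacement until they make up roughly 20% of the data); the resampling scheme and the function name `upsample_zero_loss` are hypothetical, and the synthetic rloss values only mirror the reported counts of 431 observations with 38 zero‐loss cases.

```python
import numpy as np

def upsample_zero_loss(rloss, target=0.2, rng=None):
    """Return row indices in which zero-loss cases make up about `target`."""
    rng = np.random.default_rng() if rng is None else rng
    rloss = np.asarray(rloss, dtype=float)
    zero_idx = np.flatnonzero(rloss == 0)
    pos_idx = np.flatnonzero(rloss > 0)
    if len(zero_idx) == 0:
        raise ValueError("no zero-loss observations to resample")
    # Number of zero-loss rows needed so that zeros / total is about target.
    n_zero = int(round(target * len(pos_idx) / (1.0 - target)))
    extra = max(n_zero - len(zero_idx), 0)
    resampled = rng.choice(zero_idx, size=extra, replace=True)
    return np.concatenate([pos_idx, zero_idx, resampled])

# Synthetic data with the reported sample composition (393 losses, 38 zeros).
rng = np.random.default_rng(7)
rloss = np.concatenate([rng.beta(1.0, 30.0, size=393), np.zeros(38)])
idx = upsample_zero_loss(rloss, target=0.2, rng=rng)
print(len(idx), np.mean(rloss[idx] == 0))
```

Because the added rows are copies of existing zero‐loss cases, upsampling changes the class balance but adds no new information about when zero losses occur.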

Trade‐off between reduction in uncertainty and reliability for cross‐validated predictions for different multivariable loss models and different proportions of zero‐loss observations in the data set. Results for univariable models are shown in SI (Results section). Uncertainty is represented as the mean width of the 90% HDI for all observations. Reliability is represented as the proportion of out‐of‐sample observations that are inside the respective 90% HDI. Error bars represent the 90% interval for the HDI width of all out‐of‐bag predictions. HDI = highest density interval.

Hurricane Harvey Building Loss for Harris County, TX

Modeled direct losses to the building structure caused by pluvial flooding during Hurricane Harvey in Harris County, TX, are summarized in Figure . Our main finding is that the width of the 90% HDI of the predictive distribution for individual buildings can be reduced by 21% or U.S.$3,685 on average when using the multivariable Beta model instead of the univariable Gaussian model representing the current standard in empirical flood loss estimation. Panel (b) shows the mean relative reduction in the width of the 90% HDI between the two models for individual buildings on the zip code level. For individual buildings, we find spatial variations in the average building structure loss ranging from U.S.$544 to U.S.$10,134, with the majority of areas being in the range of U.S.$2,000 to U.S.$5,000. The highest average building structure losses, with values above U.S.$7,500, are found west and southwest of Downtown Houston (panel a).

Modeled direct building structure losses for Harris County, TX, caused by pluvial flooding during Hurricane Harvey. (a) The modeled average building structure loss per building aggregated on the zip code level using the multivariable Beta model. (b) The average relative reduction in uncertainty (expressed through the width of the 90% HDI) per building between the univariable Gaussian model (reference function) and the multivariable Beta model in percent aggregated on the zip code level. Crosses in (a) and (b) indicate zip code areas where the reported average building loss is outside the 90% HDI of the modeled average building loss. (c) Box plots of the aggregated predictive distributions of the absolute direct building structure damage for the entire county for three different model types (RandomForest, Gaussian, and Beta) in their univariable (hollow) and multivariable (solid) versions. Bars indicate the median absolute loss, boxes the 90% HDI, and whiskers the 98% HDI of the absolute direct building loss for Harris County. The red dashed line represents the official reported absolute building structure loss based on data from the Federal Emergency Management Agency Housing Assistance Program. HDI = highest density interval.

For the aggregated predictive distribution of the absolute loss to the building structure of over 304,000 affected residential buildings (single‐family and multifamily homes) in Harris County, the corresponding samples of the individual predictive distributions of each building are summed up. This leads to an effect, known as the central limit theorem, where the Beta‐distributed predictive distributions for individual buildings coming from the Beta model tend to form a normal distribution when enough individual predictive distributions are summed. In combination with a higher variability, introduced by the additional variables, the considerably higher reliability and lower dispersion of the multivariable Beta model compared to the univariable Gaussian model on the building level vanish when the predictions are aggregated over a large number of individual buildings (panel c).
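The aggregation effect can be illustrated with a small simulation. The following is a minimal sketch, not the code used in this study: it assumes that every building shares the same hypothetical Beta(0.5, 5) predictive distribution for its relative loss and that buildings are independent. Sample skewness serves as a rough indicator of how close the summed distribution is to a normal shape. All function names and parameter values are illustrative assumptions.

```python
import random
import statistics


def skewness(values):
    """Sample skewness; close to 0 for a symmetric, normal-like distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def aggregated_samples(n_buildings, n_samples, rng, alpha=0.5, beta=5.0):
    """Sum one Beta draw per building, repeated n_samples times.

    alpha and beta are hypothetical values chosen to give a strongly
    right-skewed relative loss; they are not taken from the study.
    """
    return [
        sum(rng.betavariate(alpha, beta) for _ in range(n_buildings))
        for _ in range(n_samples)
    ]


rng = random.Random(42)
single_building = aggregated_samples(1, 2000, rng)
many_buildings = aggregated_samples(200, 2000, rng)

skew_single = skewness(single_building)
skew_many = skewness(many_buildings)
print(f"skewness, 1 building:    {skew_single:.2f}")
print(f"skewness, 200 buildings: {skew_many:.2f}")
```

Under these assumptions the single-building samples stay strongly right-skewed, while the sum over 200 buildings is close to symmetric. In the study the individual distributions differ between buildings, so this sketch only shows the direction of the effect.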

This effect is also described by Sieg () and provides further evidence why univariable stage damage functions based on Gaussian response distributions yield sufficiently accurate loss predictions on larger scales while the same model produces highly uncertain loss estimates on the building level. For results aggregated to the county level, we find that the univariable and multivariable Gaussian models overestimate the absolute building structure losses by U.S.$0.7 and U.S.$3.4 billion, respectively. This can be partly attributed to the underestimation of zero‐loss cases described in the previous section, which leads to higher intercepts in the model. For the multivariable model this effect is considerably stronger as the model is fit as a linear instead of a square root function (see section ). Of the six models tested, only the aggregated predictive distributions of the multivariable RandomForest and Beta models cover the reported loss from FEMA's Housing Assistance Program (U.S.$1.04 billion); none of the univariable models do. Here the multivariable Beta model performs significantly better, with a 90% HDI that is U.S.$3.8 billion (or 78%) narrower than that of the multivariable RandomForest model, providing the best trade‐off between dispersion and reliability.
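The two quantities compared here, the width of the 90% HDI (dispersion) and whether the interval covers the reported loss (reliability), can both be computed directly from samples of a predictive distribution. The sketch below is one common sample-based way to do this and is not necessarily the procedure used in this study. The function names and the toy numbers are hypothetical.

```python
import math


def hdi(samples, mass=0.90):
    """Narrowest interval that contains at least `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    k = math.ceil(mass * n)  # number of samples the interval must contain
    if k < 1:
        raise ValueError("need at least one sample inside the interval")
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


def covers(interval, reported):
    """True if the reported value lies inside the interval."""
    low, high = interval
    return low <= reported <= high


def relative_width_reduction(reference, candidate):
    """Share by which the candidate interval is narrower than the reference."""
    ref_width = reference[1] - reference[0]
    cand_width = candidate[1] - candidate[0]
    return 1.0 - cand_width / ref_width


# Toy, right-skewed samples (hypothetical values, not study results).
skewed_losses = [0, 0, 0, 0, 0, 0, 0, 0, 1, 10]
interval = hdi(skewed_losses)
print(interval, covers(interval, 0.5))
print(relative_width_reduction((0, 10), (2, 4)))
```

For skewed distributions the HDI is generally narrower than an equal-tailed interval with the same mass, which is why it suits loss distributions with many values at or near zero.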

Discussion and Conclusions

Despite causing severe losses in cities around the globe, pluvial flooding is still widely neglected when estimating the current and future flood risk in urban areas. This results in a widespread underestimation of flood risk, especially in urban areas where fluvial or coastal floods are not the dominant sources of flooding (Rosenzweig et al., ). One key limitation in reliably quantifying pluvial flood risk is the local extent of pluvial floods, which requires loss estimates on spatial scales where damaging processes are still hardly understood and the associated uncertainties are often unknown. We present the first consistent quantification of uncertainties in pluvial flood loss models for private buildings in the shape of predictive distributions using a fully probabilistic modeling approach. We train and validate different univariable and multivariable probabilistic loss models with a local training data set and use these models for a probabilistic estimate of building structure losses of over 304,000 individual buildings in Harris County during Hurricane Harvey. Our analysis reveals significant differences in the dispersion and reliability of the continuous predictive distributions between different models depending on (i) the use of additional predictors, (ii) the choice of response distribution, (iii) the ability of the model to account for zero‐loss cases, and (iv) the spatial scale of the analysis. We find that the assumption of a normal or lognormal distribution of uncertainties in loss estimates, which most loss models implicitly use today, results in unnecessarily wide prediction intervals. In the case of property level predictive distributions, we find that the width of the 90% HDI exceeds the median of the prediction by a factor of 30 on average.
Our results suggest that the width of the 90% HDI for pluvial flood loss estimates on the property level can be significantly reduced by 47% when using a zero‐inflated beta distribution instead of normal response distributions without sacrificing the reliability (Table ). While not evident on the property level, we find that using water depth as the only predictor results in an underestimate of the prediction intervals, leading to unreliable loss estimates when spatially aggregating loss predictions (Figure c). Here, we find additional predictors to improve the pluvial flood loss predictions in two ways: (i) by increasing the variability of individual predictive distributions, leading to a more realistic representation of uncertainties when aggregating estimates, and (ii) by improving the detection of cases where water entered the building but did not cause any monetary damage to the structure (Figure ). For the latter, our analysis indicates that the ability of households to prevent direct damage to their homes should be included in loss models.
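To make the role of the response distribution concrete, the following minimal sketch draws relative losses from a zero‐inflated beta distribution and from a normal distribution with the same mean and standard deviation. It is an illustration under assumed, fixed parameters (a zero‐loss probability of 0.3 and a Beta(2, 8) degree of loss); in the models discussed here these quantities depend on predictors, and none of the values below are fitted results.

```python
import random
import statistics


def sample_zero_inflated_beta(p_zero, alpha, beta, n, rng):
    """Draw n relative losses: exactly 0 with probability p_zero,
    otherwise a Beta(alpha, beta) value between 0 and 1."""
    return [
        0.0 if rng.random() < p_zero else rng.betavariate(alpha, beta)
        for _ in range(n)
    ]


rng = random.Random(0)

# Hypothetical parameters for a single building, not fitted values.
zib = sample_zero_inflated_beta(p_zero=0.3, alpha=2.0, beta=8.0, n=5000, rng=rng)

# Normal response distribution matched to the same mean and spread.
mu = statistics.fmean(zib)
sigma = statistics.pstdev(zib)
gaussian = [rng.gauss(mu, sigma) for _ in range(5000)]

share_zero = sum(1 for x in zib if x == 0.0) / len(zib)
share_negative = sum(1 for x in gaussian if x < 0.0) / len(gaussian)
print(f"zero-loss share (zero-inflated beta): {share_zero:.2f}")
print(f"negative-loss share (normal):         {share_negative:.2f}")
```

The zero‐inflated beta samples stay within the physically meaningful range of relative loss and carry an explicit point mass at zero, whereas the matched normal distribution places part of its probability on negative losses.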

The analysis of important loss‐influencing variables has further shown that the probability that a household has no monetary loss to the building structure is, unlike the degree of loss, strongly influenced by household characteristics such as the number of people living in a household and their prior knowledge about the pluvial flood hazard. This highlights the need to account for differences in the ability of households to reduce or avoid damage to their homes in loss models for pluvial floods.

For loss estimates in Harris County, the use of additional predictors in zero‐inflated beta models considerably increases the reliability while at the same time significantly reducing the dispersion of the predictive distribution given validation data. For direct building losses aggregated on the county level this reduction amounts to U.S.$3.8 billion or 78% compared to loss models based on normal response distributions. These findings are relevant for a larger discussion on using probabilistic loss estimates for decision making in flood risk management. This includes the potential of probabilistic approaches to improve the spatial transferability of loss models. We further demonstrate the potential to significantly improve the dispersion and reliability of pluvial flood loss estimates using probabilistic models, which goes beyond previous studies considering only point estimates (Van Ootegem et al., ; Zhou et al., ). Although these results are limited to a quantification of uncertainties of loss predictions, they can easily be extended for robust decision making on adaptation strategies based on exceedance probabilities, which can be directly derived from predictive distributions. While our results suggest that models that use a zero‐inflated beta response distribution provide predictive distributions with a significantly lower dispersion and higher reliability, a general paradigmatic change toward probabilistic models would greatly aid a better understanding of uncertainties in loss models (Todini, ). The same is true for multivariable models, where emerging cloud‐based reporting systems and open data portals now allow the use of high‐dimensional data sets in flood loss modeling.
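As a sketch of how exceedance probabilities follow from a predictive distribution, the share of samples above a loss threshold can be used as an estimate. The function name, the threshold, and the sample values below are hypothetical and serve only as an illustration.

```python
def exceedance_probability(samples, threshold):
    """Estimate P(loss > threshold) as the share of samples above it."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(1 for s in samples if s > threshold) / len(samples)


# Hypothetical aggregated loss samples in billion U.S.$ (not study results).
loss_samples = [0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4, 1.7, 2.1, 3.0]
p_exceed = exceedance_probability(loss_samples, threshold=1.5)
print(f"P(loss > 1.5 billion U.S.$) = {p_exceed:.1f}")
```

In practice the estimate would be based on many more samples than in this toy example, since the precision of a sample-based exceedance probability depends on the number of samples in the tail.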

Acknowledgments

The data collection campaign after the flood event in Münster, Germany, in 2014 was supported by the project “EVUS Real‐Time Prediction of Pluvial Floods and Induced Water Contamination in Urban Areas” (BMBF, 03G0846B), the University of Potsdam, and Deutsche Rückversicherung AG. The data collection campaigns after the pluvial floods in Lohmar and Hersbruck in 2005 were undertaken within the project “URBAS ‐ urban flash floods”; we thank the German Ministry of Education and Research (BMBF; 0330701C) for financial support. Data collection after the pluvial flood in Osnabrück in 2010 was funded by the University of Potsdam, the German Research Centre for Geosciences GFZ, and the Deutsche Rückversicherung AG. Additional financial support is gratefully acknowledged from the German‐American Fulbright Commission for V. R. J. D.‐G. thanks the NSF GRFP program for support (Grant DGE 16‐44869). We would also like to acknowledge JBA Risk Management, who supported our work by providing the pluvial flood inundation map for Hurricane Harvey. The pluvial flood inundation map from JBA Risk Management is available via the OASIS Hub (https://oasishub.co/dataset/surface-water-flooding-footprint-hurricane-harvey-august-2017-jba). The data sets of the flood events in Germany from 2005 and 2010 are available via the German flood damage database HOWAS21 (http://howas21.gfz-potsdam.de/howas21/). The data set from 2014 will be made available via the HOWAS21 database in June 2023. All other data sets used for the application in Harris County, TX, are openly available and cited in the text and SI. Detailed information on all data sets used for this study and how to access them is available in the supporting information (SI; Data section).

© 2019. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Pluvial flood risk is mostly excluded in urban flood risk assessment. However, the risk of pluvial flooding is a growing challenge, with a projected increase of extreme rainstorms compounding with ongoing global urbanization. Although pluvial flooding, which occurs when rainfall rates exceed the capacity of urban drainage systems, is often considered a flood type with minimal impacts, the aftermath of rainfall‐triggered flooding during Hurricane Harvey and other events shows the urgent need to assess its risk. Due to the local extent and small‐scale variations, the quantification of pluvial flood risk requires risk assessments at high spatial resolutions. While flood hazard and exposure information is becoming increasingly accurate, the estimation of losses is still a poorly understood component of pluvial flood risk quantification. We use a new probabilistic multivariable modeling approach to estimate pluvial flood losses of individual buildings, explicitly accounting for the associated uncertainties. Apart from water depth, which is the most important predictor for both, we find that the drivers of whether a loss occurs differ from the drivers of the degree of loss. Applying this approach to estimate and validate building structure losses during Hurricane Harvey using a property level data set, we find that the reliability and dispersion of predictive loss distributions vary widely depending on the model and aggregation level of property level loss estimates. Our results show that the use of multivariable zero‐inflated beta models reduces the 90% prediction intervals for Hurricane Harvey building structure loss estimates on average by 78% (totalling U.S.$3.8 billion) compared to commonly used models.

Details

Title

Probabilistic Models Significantly Reduce Uncertainty in Hurricane Harvey Pluvial Flood Loss Estimates

Author

Rözer, Viktor¹; Kreibich, Heidi²; Schröter, Kai²; Müller, Meike³; Sairam, Nivedita⁴; Doss‐Gollin, James⁵; Lall, Upmanu⁶; Merz, Bruno¹

¹ Section Hydrology, Helmholtz Centre Potsdam GFZ German Research Centre for Geosciences, Potsdam, Germany; Institute for Environmental Sciences and Geography, University Potsdam, Potsdam, Germany
² Section Hydrology, Helmholtz Centre Potsdam GFZ German Research Centre for Geosciences, Potsdam, Germany
³ Deutsche Rückversicherung AG, Düsseldorf, Germany
⁴ Section Hydrology, Helmholtz Centre Potsdam GFZ German Research Centre for Geosciences, Potsdam, Germany; Geography Department, Humboldt University of Berlin, Berlin, Germany
⁵ Columbia Water Center, Columbia University, New York, NY, USA
⁶ Columbia Water Center, Columbia University, New York, NY, USA; Department of Earth and Environmental Engineering, Columbia University, New York, NY, USA

Pages

384-394

Section

Research Articles

Publication year

2019

Publication date

Apr 2019

Publisher

John Wiley & Sons, Inc.

e-ISSN

23284277

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1029/2018EF001074

ProQuest document ID

2267006878
