Introduction
Geospatial data describing tree species or forest structure are required for many analyses and models of forest landscape dynamics, including estimating stocks of terrestrial carbon (Jenkins et al. ), forest biomass (Blackard et al. ), forest growth and mortality (Brown and Schroeder , Falkowski et al. ), national‐level fire planning and risk assessment (Schmidt et al. ), simulating continental‐scale burn probabilities (Finney et al. ), simulating wildfire intensity patterns and fuel treatment strategies (Finney et al. ), tree species abundance and distribution (Wilson et al. ), basal area (Wilson et al. ), estimating timber volume (Franco‐Lopez et al. , Muinonen et al. ), mapping wildland fuels for simulating fire growth (Keane et al. ), and simulating wildfire risk transmission from federal lands to the wildland–urban interface (Haas et al. ). Forest data must have resolution and continuity sufficient to reflect site gradients in mountainous terrain and stand boundaries imposed by historical events, such as wildland fire and timber harvest. Such detailed forest structure data are not available for large areas of public and private lands in the United States, which rely on forest inventory at fixed plot locations at sparse densities of one plot per 6000 acres (Burkman ). While direct sampling technologies such as light detection and ranging (LiDAR) may eventually make broad coverage of detailed forest inventory feasible (e.g., Hudak et al. , Latifi et al. ), no such data sets at the scale of the western United States are currently available.
Models of geospatial forest structure have utilized various statistical methods to assign the measured plots from a sparse sample to unmeasured locations using a set of predictor variables. These methods have a common goal: to take more detailed observations at relatively few locations (e.g., field plots or stand inventories) and assign their characteristics to the unmeasured locations on the landscape in order to provide seamless information about all locations. The detailed observations are often referred to as the “reference data,” while the landscape data (often derived from aerial photographs, stand records, or satellite imagery) are referred to as the “target data.” These methods include linear models, image classification (Van Wagtendonk and Root ), classification and regression trees (CART), kriging (Krige ), universal kriging (UK; Cressie ), and an assortment of nearest neighbor methods, including normalized and unnormalized Euclidean distance, Mahalanobis distance, independent component analysis (ICA), most similar neighbor (MSN, also called canonical correlation analysis), gradient nearest neighbor (GNN, also called canonical correspondence analysis), and random forests (RF; Moeur and Stage , Pierce et al. , Hudak et al. , Wilson et al. , Breiman ). A distinguishing factor among all of these methods is whether they allow for the use of categorical predictor variables as well as continuous variables.
Linear models use the values of one or more predictor variables and a set of coefficients to predict the value of a response variable. Most classification and regression tree (CART) analyses use a look‐up table or classification rules to match a response variable with input characteristics (Breiman , Pierce et al. , Rollins ). Random forests uses a set of decision trees (a “forest”) to predict which among the reference data (e.g., forest plots) are most similar to the characteristics at a target location, including both continuous and categorical variables (Cutler et al. ). Kriging is a form of interpolation, using “a Gaussian process governed by prior covariances” (Krige ). Universal kriging is similar to kriging, but with a local trend; it can be viewed as a point interpolation, using a point map as input and returning a raster map with estimations (Cressie ). Nearest neighbor methods represent a more recent approach to mapping forest attributes, and use a set of numerical predictors (often spectral and environmental continuous variables) to assess which of the candidate reference data (e.g., field plots) are most similar to each target (map) location (Pierce et al. ). Using continuous variables only, most similar neighbor (MSN) uses a similarity measure that employs canonical correlation analysis to summarize the multivariate relationships between the set of target data and the set of reference data derived from field samples (Moeur and Stage ). Gradient nearest neighbor (GNN) imputation instead utilizes canonical correspondence analysis, and incorporates direct gradient analysis in assigning weights to predictor variables (Ohmann and Gregory , Pierce et al. ). GNN and MSN techniques assign values at each target location that are the original values measured at a field plot, while regression‐based and interpolation methods assign the modeled values (Pierce et al. ). The GNN and MSN techniques give users the option to choose multiple nearest neighbors in order to predict continuous variables, as in k‐nearest neighbor techniques (e.g., McRoberts , Wilson et al. ).
Recent studies have used a variety of these methods to impute forest structure variables or forest plots to target data, and have in some cases compared the robustness of various methods. In a project focused on estimating tree mortality, Drury and Herynk () used a CART procedure to create a national‐scale 30‐m grid of tree‐list plots having the median bark thickness within stratifications of existing vegetation type, biophysical setting, succession class, and canopy bulk density. Among several methods, Pierce et al. () found that GNN performed best for forest structure variables in Oregon, while linear models and universal kriging demonstrated stronger performance for both forest structure and canopy variables in Washington and California. Among the nearest neighbor methods, Hudak et al. () found that random forests produced the best results for predicting plot‐level basal area and tree density and was the most robust and flexible method for a study area in north‐central Idaho.
In this study, we describe the methods for developing a tree‐list data set for the purpose of using forest inventory data with national‐scale wildfire simulations, among others. Potential uses of this tree list include estimating the effect of fuel treatments and wildfires on mortality and carbon stores. For use in these research applications, the tree‐list data set needed to be compatible with several other existing data sets: (1) Landscape Fire and Resource Management Planning Tools (LANDFIRE) vegetation and fuels data, which provide landscape inputs to the wildfire simulations (Rollins ; data available at
Because the tree‐list data had to be compatible with the vegetation and fuels data provided by LANDFIRE, most of the methods listed above were precluded. For example, kriging would model the values based on the point plot data, but would not have good correspondence with the LANDFIRE data. In addition, the vegetation group data are categorical, while other predictor variables are numerical and continuous. This limits the set of possible methodologies to classification trees (e.g., Pierce et al. ). We chose random forests as our methodology because it leverages a “forest” of classification trees in order to produce high accuracies and model complex interactions among predictor variables, two notable strengths of this methodology over other statistical classifiers (Breiman , Cutler et al. ). The modified random forests method used here evaluates a set of forest plots and identifies the best‐matching plot for each grid cell on the landscape. Several important differences exist between the random forests methodology used here and the CART procedure of Drury and Herynk (): (1) We limited our scope to a single set of nationally consistent plot data, whereas they obtained a variety of fixed‐ and variable‐radius plot designs from multiple agencies; (2) because tree mortality was not the primary variable of interest, we did not use it as a predictor; and (3) we wanted to identify a single best‐matching plot for each point on the landscape rather than utilizing the median plot in a class, thus retaining more variability on the landscape.
Given our objective to find the single best‐matching plot for each pixel on the landscape, we optimized our model for a set of response variables linked to the prediction of terrestrial carbon: forest cover, forest height, and existing vegetation group (EVG). Here, we demonstrate high model accuracies and high levels of agreement between the target (LANDFIRE) and imputed data for a random forests imputation run on all 997,153,322 forested pixels in the western United States at 30‐m grid resolution. The primary output of this project is a raster grid of plot identifiers, which can in turn be used to generate a list of the number, size, and species of trees assigned to each pixel.
Methods
In this section, we first describe our data sources, then present a brief description of the modified random forests methodology and how we applied it specifically to our problem. We finish by offering a description of the methods we used for verifying our outputs.
Data sources
Forest Inventory and Analysis forest plot data
We obtained the measurements of tree size, height, species, and status (dead or alive) from the US Forest Service's Forest Inventory and Analysis (FIA) program. FIA measures forest attributes on a network of plots in all 50 states using a standardized plot design that was implemented beginning in 1999 (Fig. ) (O'Connell et al. ). Version 5.1 data were downloaded from the FIA Data Mart (
The LANDFIRE Reference Database
The subset of FIA plots used in this study was further restricted to those in the LANDFIRE Reference Database (LFRDB), a database of plots leveraged by LANDFIRE in the production of their spatial data sets. The LFRDB was the sole source of three stand‐level descriptions not available from the native FIA data: existing vegetation cover (EVC), existing vegetation height (EVH), and existing vegetation group (EVG), which were assigned to FIA plots by the LANDFIRE project based on geographic location and the characteristics of the trees recorded. Existing vegetation group describes the ecological system (NatureServe ). Existing vegetation cover represents the vertically projected percent cover of the live canopy layer. Existing vegetation height is the average height of the dominant vegetation. Once plots were restricted to single‐condition forested plots that appear in the LFRDB, 15,333 plots in the western United States were available for imputation.
LANDFIRE Target Landscape Data
LANDFIRE provides a suite of topographic, biophysical, and vegetation data at 30‐m grid resolution for the western United States that served as the target data for this project. Existing vegetation group is assigned to each pixel using a set of hierarchical and iterative CART models, Landsat imagery, biophysical gradients, and training databases developed from the LFRDB (Rollins ). Existing vegetation cover and height are mapped using regression tree‐based models, empirical models, and spectral mixture models, leveraging spectral information from Landsat imagery and land cover information from the National Land Cover Database (Homer et al. ).
Random forests imputation
In this modified random forests imputation, a set of reference observations comprising the FIA plot data was imputed to a set of target points corresponding to the center of each 30‐m pixel of the LANDFIRE landscape grid (Crookston and Finley ). The output consists of a raster grid attributed with the best‐matching plot ID for each pixel (Fig. ). The modified random forests model was created by inputting the forest plot data to the yaImpute package in the statistical program R (Crookston and Finley , R Foundation ). We refer to our method as a modified random forests approach because yaImpute adapts the randomForest package in several ways, most notably: (1) by using the “nodes” matrix directly to compute proximity, without the need to hold the often‐large proximity matrix in memory, and (2) by allowing more than one response variable. We chose the modified random forests approach because our study design involved more than one response variable. Although we used the modified random forests approach as coded in yaImpute, we often refer to our method in the remainder of the manuscript as simply “random forests” for the sake of brevity.
Random forests requires all predictor variables to be available for both the reference and target data, which greatly constrains the list of possible variables. After examining the variables used in other imputation efforts and testing a more extensive list of predictor variables, we chose to include the variables listed in Table based on their variable importance scores, the expected relevance to forest characteristics, and the lack of redundancy with other predictor variables: three topographic variables (slope, aspect, and elevation), two location variables (latitude and longitude), three vegetation variables (forest cover, forest height, and vegetation group), and six biophysical variables (maximum temperature, minimum temperature, relative humidity, precipitation, photosynthetically active radiation, and vapor pressure deficit).
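Predictor variables for reference (FIA plots) and target (LANDFIRE raster) data
| Category | Variable | Source for reference (FIA plot) data | Source for target (LANDFIRE raster) data |
| Topographic | Slope | FIADB | LANDFIRE National topographic layers |
| | Aspect (sine and cosine) | FIADB | LANDFIRE National topographic layers |
| | Elevation | FIADB | LANDFIRE National topographic layers |
| Location | Latitude | MOC with FIA | Center of 30‐m pixels for LANDFIRE Refresh 2008 layers |
| | Longitude | MOC with FIA | Center of 30‐m pixels for LANDFIRE Refresh 2008 layers |
| Vegetation | Existing Vegetation Cover (forest cover) | LFRDB (derived from FIA plot data) | LANDFIRE Refresh 2008 layers |
| | Existing Vegetation Height (forest height) | LFRDB (derived from FIA plot data) | LANDFIRE Refresh 2008 layers |
| | Existing Vegetation Group (vegetation group) | LFRDB (derived from FIA plot data) | LANDFIRE Refresh 2008 layers |
| Biophysical | Maximum temperature | Overlay of plot location with LANDFIRE biophysical grid | LANDFIRE biophysical grid |
| | Minimum temperature | Overlay of plot location with LANDFIRE biophysical grid | LANDFIRE biophysical grid |
| | Relative humidity | Overlay of plot location with LANDFIRE biophysical grid | LANDFIRE biophysical grid |
| | Precipitation | Overlay of plot location with LANDFIRE biophysical grid | LANDFIRE biophysical grid |
| | Photosynthetically active radiation | Overlay of plot location with LANDFIRE biophysical grid | LANDFIRE biophysical grid |
| | Vapor pressure deficit | Overlay of plot location with LANDFIRE biophysical grid | LANDFIRE biophysical grid |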
The predictor variables for the reference (plot) data were derived as follows. FIA collects numerous variables at plots, but for the imputation we directly utilized only elevation, slope, aspect, latitude, and longitude (the latter two attributes are not publicly available, but were acquired via a Memorandum of Cooperation signed between the authors of this manuscript and FIA). The three vegetation variables are calculated based on FIA plot characteristics by the LANDFIRE program, as noted above, which calculates EVH (hereafter referred to as “forest height”), EVC (“forest cover”), and EVG (“vegetation group”) for each plot in their LANDFIRE Reference Database. These data are also not publicly available, but were acquired via the Memorandum of Cooperation. A suite of biophysical predictors was derived via an overlay of the plot locations with gridded biophysical data from the LANDFIRE project (maximum temperature, minimum temperature, relative humidity, precipitation, photosynthetically active radiation, and vapor pressure deficit).
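The predictor table lists aspect as a sine and cosine pair. The text does not state how the transformation was computed, so the following is only a minimal sketch of the conventional approach, assuming aspect is recorded in degrees clockwise from north; the function name `aspect_to_sin_cos` is hypothetical. The transformation keeps aspects of 350° and 10°, which both face nearly north, close together in predictor space.

```python
import math


def aspect_to_sin_cos(aspect_degrees):
    """Return (sine, cosine) of an aspect in degrees clockwise from north.

    Assumption: the conventional circular transform; the source only states
    that aspect enters the model as a sine and a cosine.
    """
    radians = math.radians(aspect_degrees % 360.0)
    return math.sin(radians), math.cos(radians)


# 350 and 10 degrees differ by 340 as raw numbers, but their cosines
# (the "northness" component) are nearly identical.
sin_350, cos_350 = aspect_to_sin_cos(350.0)
sin_10, cos_10 = aspect_to_sin_cos(10.0)
```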
The same suite of predictor variables was obtained from various LANDFIRE raster data sets for the target (gridded) data. The three topographic variables (elevation, slope, and aspect) and the three vegetation variables (forest cover, forest height, and vegetation group) are publicly available as 30 × 30 m rasters (
Because we wanted to optimize the tree list for estimating carbon storage, we chose the response variables of forest cover, forest height, and vegetation group. Note that these also appear nominally in the list of predictor variables. Although the terms “predictor” and “response variable” are used in the descriptions of the random forests methodology, they do not have the same meaning as in other statistical approaches where the predictor variables are used to predict the value of the response variable. In random forests, the predictor and response variables are used to find the associations among the reference data and to find which observations are most like one another. In that sense, a variable can serve as both predictor and response without conflict. It is important that the response variable should not be used also as a predictor variable for the target data, and our methodology meets this criterion, as the three response variables were derived using different data sources and methodologies than were used to derive the predictor variables for the target data: For the reference data, the response variables we call forest cover, forest height, and vegetation group were based on the characteristics of the trees in the FIA plots, and for the gridded target data, the predictor variables we call forest cover, forest height, and vegetation group were based on satellite imagery and land cover data, among other inputs. We expect that using several of the predictor variables from the reference data as response variables will have the effect of increasing accuracy for these three variables over more traditional approaches, an innovation of our approach. It is also important to note that although the response variables are forest cover, forest height, and vegetation group, that is not what we are predicting: We are predicting the plot that is the best match for each pixel and outputting a map of plot IDs, which users can link to data in the FIA databases.
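Because the output is a map of plot IDs rather than of any single attribute, mapping a plot characteristic reduces to a table look‐up. The sketch below illustrates the idea with a tiny in‐memory grid; the plot IDs, attribute names, and values are invented for illustration and do not come from the FIA databases.

```python
# Hypothetical plot attribute table keyed by plot ID (invented values).
plot_attributes = {
    101: {"basal_area_m2_ha": 22.5, "trees_per_ha": 410},
    102: {"basal_area_m2_ha": 35.0, "trees_per_ha": 620},
}

# Hypothetical 2 x 3 raster of imputed plot IDs; None marks a nonforested pixel.
plot_id_grid = [
    [101, 101, 102],
    [None, 102, 102],
]


def map_attribute(grid, attributes, name):
    """Build a grid of one plot attribute by looking up each pixel's plot ID."""
    return [
        [attributes[pid][name] if pid is not None else None for pid in row]
        for row in grid
    ]


basal_area_grid = map_attribute(plot_id_grid, plot_attributes, "basal_area_m2_ha")
```

The same plot ID grid can be reused for any attribute stored with the plots, which is why a single imputation can support many derived maps.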
At the time of this study, the randomForest package in R had a maximum of 32 classes of data it could use, and there were more than 32 vegetation groups present in the plot data; therefore, we performed the imputation on one zone of LANDFIRE data at a time, which always limited the number of classes to fewer than 32 (Fig. ). From the list of 15,333 plots in the western United States, we created a subset for each zone consisting only of the plots with vegetation groups appearing in that zone. For example, when performing the imputation for zone 9, we limited the population of plots available for imputation to only those plots with vegetation groups that LANDFIRE had mapped in zone 9. Then, we formed the random forest model using the suite of predictor variables listed in Table . For each zone, we employed 249 total decision trees, with these trees divided equally among the three response variables: In other words, there were 83 decision trees to predict forest height, 83 for forest cover, and 83 for vegetation group. We decided on 249 trees after finding that the error rates barely differed from those of a model parameterized with 500 trees, while the smaller forest saved a significant amount of computational time. A short description of how each decision tree is constructed follows, based on Cutler et al. (). Each decision tree is formed using a random sample of 66% of the plots (the bootstrap sample), and the remainder (referred to as the out‐of‐bag observations) are set aside to assess the accuracy. Then, the bootstrap sample is divided into two groups in a process called binary partitioning. In order to determine how to partition the data, a small number of randomly selected variables (in this case, the square root of the number of predictor variables) are examined to see which best minimizes the variance in the response variable. That variable is chosen, at a point called a “node,” and the bootstrap sample is divided into two groups, or “buckets.” Binary partitioning continues until the variance in each bucket cannot be reduced significantly, or until further divisions cannot be made without reducing the number of observations in a bucket to fewer than 5. Each “fully grown” decision tree is used to predict the out‐of‐bag observations, in order to assess the model's accuracy. To better illustrate this process, we show a simplified schematic of two trees in Fig. . In random forests, the decision trees cannot be viewed, so we cannot show one here, but Fig. is provided as an illustration of how the method works.
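The tree‐building steps described above can be sketched in a few lines. This is an illustrative simplification and not the randomForest or yaImpute implementation: it assumes a continuous response, draws the 66% sample without replacement, rounds the square root of the number of predictors down, and evaluates only one split. All function and variable names are hypothetical.

```python
import math
import random
from statistics import pvariance


def draw_in_bag(n_plots, fraction, rng):
    """Split plot indices into an in-bag sample and out-of-bag remainder."""
    in_bag = set(rng.sample(range(n_plots), round(fraction * n_plots)))
    out_of_bag = [i for i in range(n_plots) if i not in in_bag]
    return sorted(in_bag), out_of_bag


def choose_candidates(predictors, rng):
    """Randomly pick about sqrt(number of predictors) variables for a node."""
    return rng.sample(predictors, max(1, math.isqrt(len(predictors))))


def best_split(rows, response, candidate_vars):
    """Return (variable, threshold, weighted variance) of the best binary split."""
    best = None
    for var in candidate_vars:
        values = sorted({row[var] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [r[response] for r in rows if r[var] <= threshold]
            right = [r[response] for r in rows if r[var] > threshold]
            score = (len(left) * pvariance(left)
                     + len(right) * pvariance(right)) / len(rows)
            if best is None or score < best[2]:
                best = (var, threshold, score)
    return best


# Invented example: height responds to elevation, not slope.
rows = [
    {"elevation": 1000.0, "slope": 5.0, "height": 10.0},
    {"elevation": 1100.0, "slope": 30.0, "height": 11.0},
    {"elevation": 2000.0, "slope": 6.0, "height": 25.0},
    {"elevation": 2100.0, "slope": 29.0, "height": 26.0},
]
split = best_split(rows, "height", ["elevation", "slope"])
```

In a full tree the same search would be repeated within each resulting bucket until the stopping rules described in the text are met.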
Once the “forest” of decision trees has been grown using the plot data, the trees are used to predict the best‐matching plot for each pixel of the gridded target data. For a chosen pixel, its suite of predictor variables is used to determine its terminal bucket for each decision tree. The plots that appear in the same terminal bucket with the pixel observation are recorded for each of the 249 trees (recall that there are 83 decision trees to predict forest height, 83 for forest cover, and 83 for vegetation group). The plot that most frequently co‐occurs with the pixel observation across all 249 trees (and thus all three response variables) is chosen as the best match for that pixel, with ties split randomly. In this project, we used random forests to find the best‐matching FIA plot for each forested 30‐m pixel of LANDFIRE data, imputing an FIA plot number to each pixel, and generating a raster grid of the best‐matching plot numbers. The plot numbers can be linked back to the database of plot characteristics, so for any pixel on the landscape, maps can then be made of any number of plot characteristics, ranging from the response variables (cover, height, and vegetation group) to other plot characteristics that were neither predictor nor response variables (such as terrestrial carbon, basal area, or the number of trees).
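The matching rule described above amounts to counting, for each reference plot, the number of trees in which the plot and the pixel fall in the same terminal bucket. The following is a minimal sketch of that counting step, assuming the terminal bucket of every plot and of the pixel is already known for each tree; the data structures and the name `best_matching_plot` are hypothetical and are not the yaImpute interface.

```python
import random
from collections import Counter


def best_matching_plot(plot_nodes, pixel_nodes, rng=None):
    """Pick the plot sharing a terminal bucket with the pixel in the most trees.

    plot_nodes:  {plot_id: [terminal bucket id in tree 0, tree 1, ...]}
    pixel_nodes: [terminal bucket id of the pixel in tree 0, tree 1, ...]
    Ties are split randomly, as described in the text.
    """
    rng = rng or random.Random(0)
    counts = Counter()
    for plot_id, nodes in plot_nodes.items():
        counts[plot_id] = sum(1 for p, q in zip(nodes, pixel_nodes) if p == q)
    top = max(counts.values())
    tied = sorted(pid for pid, c in counts.items() if c == top)
    return rng.choice(tied), counts


# Invented example with three plots and four trees.
plot_nodes = {"A": [1, 2, 3, 4], "B": [1, 5, 3, 9], "C": [7, 2, 8, 4]}
pixel_nodes = [1, 5, 3, 9]
best, counts = best_matching_plot(plot_nodes, pixel_nodes)
```

Here plot B shares a bucket with the pixel in all four trees, plot A in two, and plot C in none, so B is selected.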
We obtained an overall accuracy of the model by taking the out‐of‐bag misclassification for each tree and considering them in aggregate to assess the overall quality of the random forests model. We report the error rates for four randomly chosen zones in Table . Error rates were similar across the four zones for the three response variables. The low error rates indicate high model accuracy.
Out‐of‐bag error rates for each response variable in four randomly chosen LANDFIRE zones
| Zone | Forest cover (%) | Forest height (%) | Existing vegetation group (%) |
| 7 | 7.79 | 2.77 | 1.13 |
| 9 | 6.99 | 1.79 | 0.897 |
| 12 | 7.98 | 2.29 | 1.08 |
| 21 | 9.85 | 3.02 | 1.87 |
FIA data security restrictions did not allow the distribution of the tree list if any plot with its center located in a pixel was imputed to that pixel, which occurred at 3679 of the 15,333 plot locations. For these plots, we normalized each of the predictor variables to a scale of 0–1, then found the second best‐matching plot, and imputed that plot ID instead. In the research (nondistributed) version of the tree list, we retained the original plot ID values.
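The text does not specify how the second best‐matching plot was identified once the predictors were normalized. Purely as an illustration, the sketch below rescales each predictor to 0–1 and selects the nearest other plot by Euclidean distance; the distance measure, the restriction to continuous predictors, and all names are assumptions rather than the procedure actually used.

```python
import math


def min_max_normalize(table):
    """Rescale each column of {plot_id: [values]} to the range 0-1."""
    columns = list(zip(*table.values()))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return {
        pid: [(v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(vals, lows, highs)]
        for pid, vals in table.items()
    }


def nearest_other_plot(table, plot_id):
    """Return the plot closest to plot_id in normalized space, excluding itself.

    Assumption: Euclidean distance on continuous predictors only.
    """
    scaled = min_max_normalize(table)
    target = scaled[plot_id]
    others = (pid for pid in scaled if pid != plot_id)
    return min(others, key=lambda pid: math.dist(scaled[pid], target))


# Invented predictors: [elevation in m, maximum temperature in degrees C].
predictors = {1: [1000.0, 10.0], 2: [1100.0, 12.0], 3: [2500.0, 40.0]}
substitute = nearest_other_plot(predictors, 1)
```

Normalizing first keeps a variable with a large numeric range, such as elevation, from dominating the distance.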
Fidelity compared to LANDFIRE attributes and random chance
Another measure of the performance of this modified random forests method was to compare the characteristics of the imputed plots to the gridded target data. Specifically, we compared the forest cover, forest height, and vegetation group of the imputed plot data with those of the gridded target data and report the percentage agreement for each LANDFIRE zone and the US West as a whole. For the US West, we also calculated and reported Cohen's kappa statistic to account for agreement by random chance in forest cover, forest height, and vegetation group, with complete agreement being κ = 1 and agreement only by random chance being κ = 0 (Cohen ). Bar plots were used to compare the proportion of the data in each cover, height, and vegetation class across the three data sources (FIA plots, target LANDFIRE data, and imputed data). In addition, we calculated the physical distance from each pixel center to the center of the plot that was imputed, to assess whether random forests preferentially imputed plots from nearby.
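Cohen's kappa adjusts the observed proportion of agreement by the agreement expected from the class frequencies alone. The sketch below computes both statistics for two lists of class labels; it is an illustration of the standard formula, with hypothetical names and invented labels, and it assumes at least two classes are present so that the expected agreement is less than 1.

```python
from collections import Counter


def agreement_and_kappa(target, imputed):
    """Return (proportion agreement, Cohen's kappa) for two label sequences."""
    n = len(target)
    observed = sum(t == i for t, i in zip(target, imputed)) / n
    target_counts = Counter(target)
    imputed_counts = Counter(imputed)
    # Chance agreement: product of the marginal class proportions, summed.
    expected = sum(
        target_counts[c] * imputed_counts[c] for c in target_counts
    ) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Invented labels for four pixels.
target = ["a", "a", "b", "b"]
imputed = ["a", "a", "b", "a"]
observed, kappa = agreement_and_kappa(target, imputed)
```

In this example three of four pixels agree (0.75), the expected chance agreement is 0.5, and kappa is therefore 0.5.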
Results
Plot identification numbers were imputed at 30 × 30 m resolution to 997,153,322 forested pixels in the US West. Plots tended to impute to clusters of pixels (Fig. ). Because clustering is not imposed by the random forests method, this pattern reflects similarities in the topographic and biophysical predictor variables among neighboring pixels.
During fidelity assessment, we compared the values of the response variables (forest cover, forest height, and vegetation group) in the imputed plot data to the LANDFIRE target raster grids in order to obtain the estimated levels of agreement for the tree list. For all forested pixels in the US West, within‐class agreement was 79% for forest cover, 96% for forest height, and 92% for vegetation group. In addition, agreement for these variables is high in most zones (Table ).
Within‐class agreement in percentage between LANDFIRE target data and imputed plot data, summarized by LANDFIRE zones in the US West
| Zone | Forest cover (%) | Forest height (%) | Vegetation group (%) | No. of pixels |
| z01 | 68 | 90 | 92 | 78,355,830 |
| z02 | 73 | 82 | 91 | 39,468,642 |
| z03 | 60 | 91 | 94 | 35,678,822 |
| z04 | 57 | 94 | 82 | 19,058,444 |
| z05 | 65 | 95 | 88 | 4,431,257 |
| z06 | 70 | 95 | 92 | 59,959,878 |
| z07 | 73 | 96 | 92 | 73,522,690 |
| z08 | 40 | 93 | 87 | 3,897,792 |
| z09 | 86 | 97 | 84 | 44,138,635 |
| z10 | 80 | 96 | 97 | 113,675,398 |
| z12 | 92 | 99 | 94 | 32,167,916 |
| z13 | 90 | 85 | 96 | 2,311,150 |
| z14 | 92 | 100 | 97 | 695,799 |
| z15 | 85 | 98 | 93 | 58,799,790 |
| z16 | 84 | 100 | 93 | 42,812,230 |
| z17 | 95 | 99 | 95 | 23,553,327 |
| z18 | 86 | 92 | 63 | 7,734,012 |
| z19 | 87 | 99 | 96 | 57,959,737 |
| z20 | 59 | 96 | 83 | 11,108,613 |
| z21 | 81 | 96 | 92 | 45,645,598 |
| z22 | 88 | 98 | 48 | 5,777,871 |
| z23 | 92 | 100 | 96 | 38,898,791 |
| z24 | 91 | 98 | 97 | 36,193,270 |
| z25 | 70 | 98 | 91 | 18,027,523 |
| z27 | 85 | 97 | 95 | 12,839,794 |
| z28 | 78 | 98 | 93 | 103,288,833 |
| z29 | 77 | 97 | 91 | 27,151,679 |
| Overall | 79 | 96 | 92 | 997,153,322 |
Note
Results are reported for the three response variables: forest cover, forest height, and vegetation group.
Forest height
LANDFIRE maps forest height in five classes: 0–5 m, 5–10 m, 10–25 m, 25–50 m, and greater than 50 m. The LANDFIRE organization also computes the height of FIA forest plots in its LFRDB to tenths of a meter. We compared the height of each imputed plot to the height class mapped by LANDFIRE for the corresponding pixel. Overall within‐class agreement for height varied between 82% and 100% across the 27 completed zones (Table ). For the western United States as a whole, the overall percentage agreement was 96% and Cohen's kappa was 0.93. The imputation reproduced the patterns in the gridded LANDFIRE data (Fig. ).
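The height comparison requires assigning each plot height, recorded to tenths of a meter, to one of the five mapped classes. The sketch below shows one way to do this; the treatment of values falling exactly on a class boundary (assigned here to the lower class) is an assumption, since the text does not state it, and the names and sample values are hypothetical.

```python
from bisect import bisect_left

HEIGHT_BREAKS_M = [5.0, 10.0, 25.0, 50.0]
HEIGHT_LABELS = ["0-5 m", "5-10 m", "10-25 m", "25-50 m", ">50 m"]


def height_class(height_m):
    """Assign a height in meters to a LANDFIRE height class.

    Assumption: a value exactly on a break (e.g., 5.0) goes to the lower class.
    """
    return HEIGHT_LABELS[bisect_left(HEIGHT_BREAKS_M, height_m)]


def within_class_agreement(plot_heights_m, pixel_classes):
    """Proportion of pixels whose imputed plot falls in the mapped class."""
    matches = sum(
        height_class(h) == c for h, c in zip(plot_heights_m, pixel_classes)
    )
    return matches / len(pixel_classes)


# Invented example: four pixels, one of which disagrees.
imputed_heights = [4.0, 12.5, 30.0, 60.0]
mapped_classes = ["0-5 m", "10-25 m", "10-25 m", ">50 m"]
agreement = within_class_agreement(imputed_heights, mapped_classes)
```

The same pattern applies to forest cover, with class breaks at each 10% step.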
The proportion of the landscape in each height class was similar across the LANDFIRE data, the imputed data, and the FIA plots (Fig. ). However, the LANDFIRE and imputed distributions were more similar to each other than either was to the FIA distribution, with the LANDFIRE and imputed data somewhat underrepresenting the lowest height class (0–5 m) and overrepresenting the 5‐ to 10‐m height class compared with the FIA plot data. As the FIA data constitute a random sample of forested points on the landscape, they likely represent the proportion of height classes present on the landscape quite well. It is not surprising that the imputed data would better match the LANDFIRE data, however, because the LANDFIRE gridded data were used as target data in this project. The close correspondence between the three distributions indicates that LANDFIRE data accurately capture the distribution of height classes present in a random sample of the landscape (as conveyed by the FIA data) and that the imputed data correspond closely with the LANDFIRE target data, an indication of high model accuracy. Within‐class agreement was related to the number of plots in a height class, with rarer classes having lower agreement rates (Fig. ).
Forest cover
Forest cover is mapped in nine classes by LANDFIRE: 10–19%, 20–29%, 30–39%, 40–49%, 50–59%, 60–69%, 70–79%, 80–89%, and 90–100%, with areas of tree cover less than 10% not considered forested. For FIA plots, forest cover is estimated to the nearest percentage in the LFRDB. In the 27 LANDFIRE zones in the western United States, overall within‐class agreement for cover varied between 40% and 95% (Table ). For the western United States as a whole, percentage agreement was 79% and Cohen's kappa was 0.75. In general, landscape patterns of forest cover in the target data were well reproduced by the imputed data (Fig. ). The proportion of the landscape in each cover class was similar for the LANDFIRE target data, the imputed data, and the FIA plots (Fig. ). The imputed data, however, underrepresented the lowest cover class (10–19%) compared with the FIA and LANDFIRE data. The number of plots in a cover class affected the within‐class agreement rates, with rarer cover classes having lower rates of agreement (Fig. ).
Vegetation group
The third response variable, vegetation group, is mapped to the gridded target data by LANDFIRE, and assigned to FIA forest plots in the LFRDB. Note that all vegetation groups that appear in the LANDFIRE and imputed data also appear in the FIA plots (n = 36), but not all vegetation groups represented by FIA plots appear in the LANDFIRE (n = 31) or imputed (n = 30) data. In other words, FIA plots could be keyed to a vegetation group that did not appear in the gridded LANDFIRE data (details are given in Table ). Vegetation group 701 (introduced riparian vegetation) appeared in the gridded LANDFIRE data, but only one FIA plot keyed to this vegetation group, and it was never used in the imputation, presumably because it was not a good match for the pixels where this group appeared in the LANDFIRE data in terms of the other predictors (cover, height, x, y, and biophysical variables).
Existing vegetation group names and codes that were assigned to FIA plots in the western United States in the LANDFIRE Reference Database
| Name | Code | Present in LANDFIRE gridded target data | Present in imputed data set |
| Unclassified Forest and Woodland | 201 | | |
| Unclassified Savanna | 204 | | |
| Aspen Forest, Woodland, and Parkland | 602 | x | x |
| Aspen–Mixed Conifer Forest and Woodland | 603 | x | x |
| Bigtooth Maple Woodland | 605 | x | x |
| Chaparral | 607 | x | x |
| Conifer–Oak Forest and Woodland | 610 | x | x |
| Deciduous Shrubland | 612 | | |
| Desert Scrub | 613 | | |
| Douglas‐fir Forest and Woodland | 614 | x | x |
| Douglas‐fir–Western Hemlock Forest and Woodland | 615 | x | x |
| Grassland and Steppe | 618 | | |
| Juniper Woodland and Savanna | 620 | x | x |
| Limber Pine Woodland | 621 | x | x |
| Lodgepole Pine Forest and Woodland | 622 | x | x |
| Douglas‐fir–Ponderosa Pine–Lodgepole Pine Forest and Woodland | 625 | x | x |
| California Mixed Evergreen Forest and Woodland | 626 | x | x |
| Mountain Hemlock Forest and Woodland | 627 | x | x |
| Mountain Mahogany Woodland and Shrubland | 628 | x | x |
| Western Oak Woodland and Savanna | 629 | x | x |
| Pinyon–Juniper Woodland | 630 | x | x |
| Ponderosa Pine Forest and Woodland and Savanna | 631 | x | x |
| Red Alder Forest and Woodland | 632 | x | x |
| Red Fir Forest and Woodland | 633 | x | x |
| Redwood Forest and Woodland | 634 | x | x |
| Western Riparian Woodland and Shrubland | 635 | x | x |
| Sitka Spruce Forest | 638 | x | x |
| Spruce‐Fir Forest and Woodland | 639 | x | x |
| Subalpine Woodland and Parkland | 640 | x | x |
| Western Hemlock–Silver Fir Forest | 642 | x | x |
| Douglas‐fir–Grand Fir–White Fir Forest and Woodland | 643 | x | x |
| Western Larch Forest and Woodland | 644 | x | x |
| Western Red‐cedar–Western Hemlock Forest | 645 | x | x |
| Bur Oak Woodland and Savanna | 659 | | |
| Juniper–Oak | 696 | x | x |
| Introduced Riparian Vegetation | 701 | x | |
Note
Note that not all vegetation groups assigned to FIA plots are present in the forested pixels of the LANDFIRE gridded target data for the same region, or in the imputed tree‐list data set.
Overall agreement for vegetation group varied between 48% and 97% across the 27 zones (Table ). For the western United States, within‐class agreement was 92% and Cohen's kappa was 0.92. In general, the random forests imputation accurately reproduced the patterns in vegetation group in the LANDFIRE data (Fig. ), but the rates of agreement for the Western Riparian Woodland and Shrubland category (group 635) were low in many zones. This category had few plots (n = 135), and most of these plots were located in mesic coastal areas of Washington and Oregon. Hence, random forests rarely imputed them in the drier colder continental sections of the US West and instead tended to impute the plots with other vegetation types common to the area. Even with this limitation, the proportion of the landscape in each vegetation group was similar across the FIA plots, LANDFIRE, and imputed data (Fig. ). Vegetation group 615 (Douglas‐fir–Western Hemlock Forest and Woodland) was somewhat overrepresented in both the LANDFIRE and imputed data, while 622 (Lodgepole Pine Forest and Woodland) was somewhat underrepresented in both the LANDFIRE and imputed data compared with the FIA plots. Interestingly, the LANDFIRE data overrepresented 635 (Western Riparian Woodland and Shrubland) compared with the FIA data, while the proportion imputed by random forests was similar to that of the FIA plots.
Similar to the results for height and cover, agreement was lower in rarer classes of vegetation group, although there were exceptions (Fig. ). This result makes sense: in rare types, random forests is unlikely to match all three of the response variables (forest cover, height, and vegetation group) when choosing from a limited pool of candidate forest plots, and must in essence choose which of these response variables is most important to match.
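As an illustration of the two agreement statistics reported above, the following minimal sketch computes overall agreement and Cohen's kappa from a confusion matrix of pixel counts. The three-class matrix is hypothetical and is not drawn from the study; the function names are ours.

```python
def overall_agreement(matrix):
    """Fraction of pixels on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohens_kappa(matrix):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    observed = overall_agreement(matrix)
    # Chance agreement expected from the row and column marginals alone.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1 - expected)


# Hypothetical pixel counts (rows: target class, columns: imputed class).
example = [
    [90, 5, 5],
    [10, 80, 10],
    [0, 10, 90],
]
print(round(overall_agreement(example), 3))
print(round(cohens_kappa(example), 3))
```

Because kappa discounts the agreement expected by chance, it is generally lower than the raw agreement rate when classes are unevenly matched.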
Distance
In most cases, random forests chose nearby plots for imputation to a pixel. In some cases, likely when a rare plot was required, the plots were imputed from over 1500 km away (Fig. ). Distance from the pixel center to the imputed plot center is shown for a subset of the landscape in Fig. . Nearby plots are preferred for imputation not only because of similarity in the x and y coordinates, but also because of similarity in biophysical variables, which reflect similarity in site descriptors that were not directly included in the imputation, including species composition, site productivity, disturbance history and regime, and climatic characteristics.
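The distance summarized above can be sketched as a straight-line distance between two points in a projected coordinate system. The coordinates below are hypothetical, and the assumption that both centers are expressed in the same projected system with units of meters is ours.

```python
import math


def distance_km(pixel_xy, plot_xy):
    """Straight-line distance (km) between two points given in projected meters."""
    dx = plot_xy[0] - pixel_xy[0]
    dy = plot_xy[1] - pixel_xy[1]
    return math.hypot(dx, dy) / 1000.0


# Hypothetical centers: the plot lies 3 km east and 4 km north of the pixel.
pixel_center = (-1_500_000.0, 2_400_000.0)
imputed_plot_center = (-1_497_000.0, 2_404_000.0)
print(distance_km(pixel_center, imputed_plot_center))
```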
Frequency of imputation
Most plots were imputed to many pixels, with counts between 10,000 and 100,000 pixels per plot occurring most frequently (Fig. ). It was rare for a plot to impute to fewer than 10 pixels; this occurred around 1500 times.
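A tally of this kind can be sketched by counting how many pixels carry each plot ID and then grouping the plots by the order of magnitude of that count. The small grid below is hypothetical, with `None` standing in for non-forest pixels; the function names are ours.

```python
from collections import Counter


def pixels_per_plot(plot_id_grid):
    """Count the pixels assigned to each plot ID, skipping unassigned pixels."""
    return Counter(pid for row in plot_id_grid for pid in row if pid is not None)


def order_of_magnitude_histogram(counts):
    """Number of plots per bin k, where bin k holds counts in [10**k, 10**(k + 1))."""
    bins = Counter()
    for n in counts.values():
        bins[len(str(n)) - 1] += 1  # digits minus one gives floor(log10(n)) exactly
    return dict(bins)


# Hypothetical 4 x 4 window of the imputed plot-ID grid.
grid = [
    [101, 101, 101, 101],
    [101, 101, 101, 101],
    [101, 101, 101, 101],
    [102, 102, 102, None],
]
counts = pixels_per_plot(grid)
print(counts)
print(order_of_magnitude_histogram(counts))
```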
As mentioned above, of the 15,333 plots used in this imputation, 3679 (24%) imputed to the pixel where their own centroid was located. There are many reasons why a plot might not impute to the pixel containing its centroid. The footprint of a single plot covers approximately 13 pixels of a 30‐m grid, due to the splayed four‐subplot design seen in Fig. , even though the combined area of a single plot is less than that of a single 30‐m grid cell. We found that the characteristics of a pixel of LANDFIRE data often do not match the characteristics recorded by FIA for a plot centered on that pixel. This mismatch may arise because the characteristics of the plot as a whole (including the cover, height, and vegetation type) will be driven more by the three subplots that are not located at the center pixel than by the one subplot located at the center pixel. Thus, the summary characteristics of a plot would not necessarily be expected to be a good representation of the characteristics of the center pixel, especially where landscape variability is high. In addition, LANDFIRE gridded data contain some level of error, as do the measurements taken at FIA plots. There may also be discrepancies between the plot characteristics and LANDFIRE data due to temporal mismatches between when the plot was measured and the year the LANDFIRE data were mapped. Two examples of this are as follows: (1) The forest at the pixel grew between the time the plot was measured and the year LANDFIRE data were mapped, resulting in higher cover values and a higher number of shade‐tolerant species, causing the vegetation group to change, and (2) the forest burned, resulting in changes in cover, height, and vegetation group. Changes of either type could cause the plot centered on that pixel to no longer be the best match for the pixel, and a different plot to be chosen by random forests.
Temporal mismatches are not important to our analysis, because we wanted to choose the plot that best represented the characteristics of each pixel circa 2008, regardless of when the plot itself was measured.
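The arithmetic behind two of the statements above can be checked with a short sketch. The subplot radius of 24 ft (7.3152 m) is our assumption, taken from the standard FIA four‐subplot design rather than stated in the text; the plot counts are those given above.

```python
import math

SUBPLOT_RADIUS_M = 7.3152  # assumption: standard FIA subplot radius of 24 ft
N_SUBPLOTS = 4
PIXEL_SIZE_M = 30.0

# Combined sampled area of the four circular subplots versus one grid cell.
plot_area_m2 = N_SUBPLOTS * math.pi * SUBPLOT_RADIUS_M**2
pixel_area_m2 = PIXEL_SIZE_M**2

# Share of plots imputed to the pixel containing their own centroid.
share_self_imputed = 3679 / 15333

print(round(plot_area_m2), pixel_area_m2)
print(round(100 * share_self_imputed))
```

Under this assumption the four subplots together sample less area than one 30‐m cell, even though they are spread over several cells.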
Case study in zone 22: When random forests is wrong. Or is it?
In a departure from most other LANDFIRE zones, zone 22 had low agreement (48%) between the imputed data and target data for vegetation group. Examination of the confusion matrix for vegetation group for this zone suggests that misclassification for vegetation group number 621 (Limber Pine Woodland) is driving the low accuracies for the zone (Table ). Zone 22 comprises the Wyoming Basin, a high‐elevation inland cold plateau with few trees (Fig. ). Many, if not most, of the forested pixels in zone 22 were located in riparian areas, which LANDFIRE had generally classified as Western Riparian Woodland and Shrubland (shown in the table as vegetation group number 635). As noted above, most of the FIA plots available for imputation in this vegetation group were located in the milder mesic climate of coastal Oregon and Washington, and were not suitable matches in this environment, based on the biophysical predictor variables. Because there were no biophysically appropriate Western Riparian Woodland and Shrubland plots available for imputation, random forests tended to choose plots with a vegetation group of Limber Pine Woodland (621) for imputation in zone 22 instead (Fig. ). We checked the FIA plots in the area and found that the most frequent vegetation group in this area was in fact limber pine rather than riparian, as suggested by the imputation. In this case, where the imputation appeared to be “wrong” based on comparison with the LANDFIRE data, it in fact accurately reflected the attributes of FIA plots in the vicinity. Random forests was able to discern that in this case, the biophysical and location predictor variables were more important than the vegetation group predictor variable, and assign the appropriate plots from a limited population.
Confusion matrix for vegetation group in zone 22 (rows: LANDFIRE target vegetation group; columns: imputed plot vegetation group)
| LANDFIRE (target) vegetation group | 602 | 603 | 614 | 620 | 621 | 622 | 625 | 628 | 630 | 631 | 635 | 639 | 640 | Accuracy |
| 602 | 283,552 | 1976 | 16,957 | 1080 | 214,371 | 3699 | 28,111 | 109 | 2149 | 38,398 | 22,699 | 1033 | 117 | 0.46 |
| 603 | 173 | 10,911 | 4425 | 0 | 1592 | 9083 | 4744 | 1 | 0 | 2678 | 0 | 4523 | 2 | 0.29 |
| 614 | 1989 | 965 | 26,626 | 3 | 7640 | 213 | 1163 | 715 | 729 | 1128 | 1419 | 145 | 0 | 0.62 |
| 620 | 13 | 4 | 239 | 2025 | 70,455 | 37 | 222 | 14 | 7570 | 982 | 910 | 20 | 8 | 0.02 |
| 621 | 1307 | 566 | 5746 | 628 | 1,228,828 | 2708 | 3350 | 154 | 1902 | 18,686 | 11,871 | 1963 | 2363 | 0.96 |
| 622 | 6 | 0 | 358 | 0 | 12 | 52,342 | 2 | 0 | 0 | 0 | 0 | 67 | 16 | 0.99 |
| 625 | 560 | 155 | 10,448 | 0 | 1246 | 1636 | 18,505 | 75 | 61 | 122 | 115 | 3654 | 95 | 0.50 |
| 628 | 637 | 131 | 12,741 | 340 | 234,846 | 14,103 | 9454 | 116,512 | 29,603 | 47,933 | 24,971 | 33,397 | 1414 | 0.22 |
| 630 | 42 | 3 | 176 | 92 | 4061 | 59 | 65 | 13 | 643,964 | 1107 | 1217 | 247 | 87 | 0.99 |
| 631 | 37 | 15 | 27 | 0 | 36,888 | 1211 | 31 | 397 | 1296 | 78,507 | 69 | 46 | 17 | 0.66 |
| 635 | 11,069 | 5062 | 42,199 | 3809 | 1,700,003 | 49,878 | 47,304 | 24,667 | 106,053 | 31,430 | 255,595 | 6292 | 700 | 0.11 |
| 639 | 57 | 18 | 961 | 0 | 2305 | 2018 | 726 | 21 | 111 | 585 | 227 | 31,434 | 0 | 0.82 |
| 640 | 170 | 132 | 526 | 0 | 1430 | 3970 | 1927 | 27 | 108 | 820 | 0 | 122 | 3195 | 0.26 |
| Accuracy | 0.95 | 0.55 | 0.22 | 0.25 | 0.35 | 0.37 | 0.16 | 0.82 | 0.81 | 0.35 | 0.80 | 0.38 | 0.40 | 0.48 |
Notes
Overall agreement for this zone was 48%, the lowest of any zone in the western United States. The most plentiful vegetation group in this zone was 621, Limber Pine Woodland. The low agreement in that vegetation group is one of the major drivers of the low agreement for the zone as a whole, due to the large proportion of pixels mapped as 621. Counts on the diagonal denote the instances where the model predicted a given vegetation group correctly. (Note that the codes can be looked up in Table to yield the name of the vegetation group.)
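The accuracy margins of this table can be recomputed from the pixel counts. The sketch below transcribes the counts from the confusion matrix above (rows are LANDFIRE target groups, columns are imputed groups); the function names are ours and the code is an illustration, not the procedure used in the study.

```python
CODES = [602, 603, 614, 620, 621, 622, 625, 628, 630, 631, 635, 639, 640]
MATRIX = [
    [283552, 1976, 16957, 1080, 214371, 3699, 28111, 109, 2149, 38398, 22699, 1033, 117],
    [173, 10911, 4425, 0, 1592, 9083, 4744, 1, 0, 2678, 0, 4523, 2],
    [1989, 965, 26626, 3, 7640, 213, 1163, 715, 729, 1128, 1419, 145, 0],
    [13, 4, 239, 2025, 70455, 37, 222, 14, 7570, 982, 910, 20, 8],
    [1307, 566, 5746, 628, 1228828, 2708, 3350, 154, 1902, 18686, 11871, 1963, 2363],
    [6, 0, 358, 0, 12, 52342, 2, 0, 0, 0, 0, 67, 16],
    [560, 155, 10448, 0, 1246, 1636, 18505, 75, 61, 122, 115, 3654, 95],
    [637, 131, 12741, 340, 234846, 14103, 9454, 116512, 29603, 47933, 24971, 33397, 1414],
    [42, 3, 176, 92, 4061, 59, 65, 13, 643964, 1107, 1217, 247, 87],
    [37, 15, 27, 0, 36888, 1211, 31, 397, 1296, 78507, 69, 46, 17],
    [11069, 5062, 42199, 3809, 1700003, 49878, 47304, 24667, 106053, 31430, 255595, 6292, 700],
    [57, 18, 961, 0, 2305, 2018, 726, 21, 111, 585, 227, 31434, 0],
    [170, 132, 526, 0, 1430, 3970, 1927, 27, 108, 820, 0, 122, 3195],
]


def overall_agreement(matrix):
    """Diagonal total divided by the total number of pixels."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def row_accuracy(matrix, i):
    """Share of pixels in target class i that received a plot of the same class."""
    return matrix[i][i] / sum(matrix[i])


def column_accuracy(matrix, j):
    """Share of pixels imputed as class j whose target class is also j."""
    return matrix[j][j] / sum(row[j] for row in matrix)


i621, i635 = CODES.index(621), CODES.index(635)
print(round(overall_agreement(MATRIX), 2))      # 0.48
print(round(row_accuracy(MATRIX, i635), 2))     # 0.11
print(round(column_accuracy(MATRIX, i621), 2))  # 0.35
```

The single largest off‐diagonal cell, `MATRIX[i635][i621]`, holds the 1,700,003 pixels that LANDFIRE mapped as group 635 but that received a group 621 plot.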
Discussion
Where the sparseness of forest inventory data limits direct estimates of forest biomass (Blackard et al. ), a method such as imputation can fill in the interstitial data values. Our effort to employ a random forests imputation suggests that it holds advantages over other methods because it allows both categorical and continuous predictor variables and the fidelity of predicted multivariate response variables can be easily quantified. The technique is also repeatable when revisions or updates of target data or reference data become available.
While not new here, the use of multivariate response variables is relatively recent (Crookston and Finley ) and constitutes one of the strengths of the modified random forests method used here. In order to optimize the output for predicting risk to carbon from wildfire, we chose three response variables: forest cover, forest height, and vegetation group. We allocated the decision trees equally among them (83 for forest cover, 83 for forest height, and 83 for vegetation group). However, for other applications, if one of the response variables were considered more critical than the others, it could be weighted more heavily (by allocating more of the decision trees to it).
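The allocation described above can be sketched as a proportional split of a fixed number of decision trees among the response variables. The helper `allocate_trees` is hypothetical, the total of 249 trees is inferred from the three counts of 83, and the 2:1:1 weighting is an invented example; how such counts are passed to the imputation software is not shown here.

```python
def allocate_trees(total_trees, weights):
    """Split total_trees among response variables in proportion to integer weights.

    Uses the largest-remainder method so the counts always sum to total_trees.
    """
    weight_sum = sum(weights.values())
    shares = {k: divmod(total_trees * w, weight_sum) for k, w in weights.items()}
    counts = {k: quotient for k, (quotient, _) in shares.items()}
    leftover = total_trees - sum(counts.values())
    # Hand out any remaining trees to the variables with the largest remainders.
    by_remainder = sorted(shares, key=lambda k: shares[k][1], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


equal = allocate_trees(249, {"cover": 1, "height": 1, "vegetation_group": 1})
weighted = allocate_trees(249, {"cover": 1, "height": 1, "vegetation_group": 2})
print(equal)
print(weighted)
```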
While we found strong agreement between the gridded target LANDFIRE data and the imputed data, agreement could be improved in the future by including more forest plots, especially those in rare types. For the purposes of this project, a “type” can be considered the combination of forest cover, forest height, and vegetation group for a plot. We were restricted in the number of plots by both the FIA and LFRDB databases. However, we considered the reliability of the data in the FIA plots to be paramount and superior to other sources. The population of possible predictor variables was severely restricted because the variables must appear in both the reference (plot) and target (gridded) data sets, and of this population, we deemed that forest cover, forest height, and vegetation type were necessary to our study. It was possible to obtain these variables for the FIA plots only in the LFRDB, so we were thus restricted in the number of plots available for imputation. Nonetheless, the 15,333 plots in our data set represent a large amount of the variability present in the western United States.
The data set described here will enable future research in a number of directions. Because the imputation in essence assigns the best statistically matching FIA plot to each 30 × 30 m pixel, the resulting gridded map of plot IDs can be linked back to the FIA databases, from which many characteristics of the plot can be extracted. Thus, the map of plot IDs can be used to generate maps of basal area, or lists of the trees present at any pixel or group of pixels, among many other applications. Because each pixel is linked to a tree list, this data set provides the necessary information for initializing models such as the Forest Vegetation Simulator, including tree number, species, size, and status (Dixon ). Although the tree list was designed for this type of use, we have recently become aware of a number of other potential uses, including initializing wind fields in WindNinja and initializing forest stands in FireBGC.
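As a minimal sketch of this linkage, the example below looks up a tree list for each pixel through its plot ID and derives a basal area grid. The plot IDs, species codes, diameters, and trees‐per‐acre expansion factors are hypothetical, and the in‐memory dictionary stands in for a query against the FIA databases.

```python
import math

# Hypothetical tree records keyed by plot ID: (species, dbh in inches, trees per acre).
TREE_LISTS = {
    1001: [("PSME", 12.0, 6.0), ("PIPO", 18.0, 6.0)],
    1002: [("PICO", 6.0, 75.0)],
}

# Hypothetical 2 x 2 window of the imputed plot-ID grid.
PLOT_ID_GRID = [
    [1001, 1001],
    [1002, 1001],
]


def basal_area_per_acre(trees):
    """Sum of stem cross-sectional areas (square feet per acre) for one tree list."""
    # dbh / 24 converts a diameter in inches to a radius in feet.
    return sum(math.pi * (dbh_in / 24.0) ** 2 * tpa for _, dbh_in, tpa in trees)


basal_area_grid = [
    [basal_area_per_acre(TREE_LISTS[plot_id]) for plot_id in row]
    for row in PLOT_ID_GRID
]
for row in basal_area_grid:
    print([round(value, 1) for value in row])
```

Any other plot‐level attribute could be mapped the same way by swapping the summary function.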
The original purpose of this research effort was to predict the carbon risk from wildfire. To this end, the tree list can be intersected with modeled burn probability and fire intensity distributions from the western United States (Finney et al. ) to estimate risk to terrestrial carbon resources from wildfire. In this process, mortality and carbon storage for each plot are modeled at each fire intensity class; summaries by vegetation type or geographic area are obtained by weighting the plot results by probabilities of each intensity. Stand and landscape effects of earlier wildfires and intentional fuel treatments on carbon or forest structure can be estimated by introducing changes to the landscape and re‐running the fire simulations (which changes the fireline intensity probability distributions). Thus, the implications of fuel treatments on risk to carbon from wildfire can be illuminated.
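The weighting step can be sketched as an expected value over fire intensity classes. The function below is hypothetical: it assumes that the intensity probabilities are conditional on the pixel burning and sum to one, and that a modeled carbon loss is available for each class. All numbers are invented for illustration.

```python
def expected_carbon_loss(burn_probability, intensity_probs, loss_by_intensity):
    """Expected carbon loss for one pixel, weighting class losses by probability."""
    if len(intensity_probs) != len(loss_by_intensity):
        raise ValueError("one loss value is required per intensity class")
    if abs(sum(intensity_probs) - 1.0) > 1e-9:
        raise ValueError("conditional intensity probabilities must sum to 1")
    conditional_loss = sum(p * loss for p, loss in zip(intensity_probs, loss_by_intensity))
    return burn_probability * conditional_loss


# Hypothetical pixel: 1% burn probability and three intensity classes.
loss = expected_carbon_loss(
    burn_probability=0.01,
    intensity_probs=[0.5, 0.3, 0.2],
    loss_by_intensity=[10.0, 20.0, 40.0],  # modeled carbon loss per class, any unit
)
print(loss)
```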
The imputations are also useful for examining the potential thinning volume associated with fuel treatments and alternative fire risk reduction activities. Thinning is a critical part of restoration treatments prior to the introduction of prescribed burning in many low‐ to mid‐elevation forests of ponderosa pine and mixed conifer in the western United States (Graham et al. , Agee and Skinner , Martinson and Omi ). Because this data set can be linked back to the number, size, and species of trees modeled for each pixel, the data set can be used to apply treatment prescriptions and estimate the characteristics of trees that would thus be removed. Estimating the potential volume of merchantable and nonmerchantable forest products available from treatment activities is helpful for treatment planning for national forests, as well as economic cost‐benefit analyses.
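One way to picture applying a prescription to a pixel's tree list is a simple diameter‐limit thin, sketched below. The 10‐inch limit, the tree records, and the function names are hypothetical; estimating merchantable volume would additionally require volume equations that are not shown here.

```python
def thin_below_diameter(trees, dbh_limit_in):
    """Split a tree list into (removed, retained) using a diameter limit in inches."""
    removed = [tree for tree in trees if tree[1] < dbh_limit_in]
    retained = [tree for tree in trees if tree[1] >= dbh_limit_in]
    return removed, retained


def trees_per_acre(trees):
    """Total trees per acre represented by a list of (species, dbh, tpa) records."""
    return sum(tpa for _, _, tpa in trees)


# Hypothetical tree list for one pixel: (species, dbh in inches, trees per acre).
pixel_tree_list = [("PIPO", 22.0, 6.0), ("PSME", 8.0, 30.0), ("ABGR", 4.0, 75.0)]
removed, retained = thin_below_diameter(pixel_tree_list, dbh_limit_in=10.0)
print(trees_per_acre(removed), trees_per_acre(retained))
```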
Conclusions
The modified random forests approach presented here produced high within‐class agreement with gridded landscape data for the western United States for our three response variables. Within‐class agreement was 79% for forest cover, 96% for forest height, and 92% for vegetation group. The methodology presented here is novel in that the three response variables served also as predictor variables for the reference data, with the expected result of producing high rates of agreement between the gridded target landscape data and the imputed data for these three variables. We chose this methodology in order to optimize the imputed data for the prediction of risk to carbon resources from wildland fire and for biomass calculations. In that sense, this tree list might not be the optimal data set for all applications (e.g., mapping of tree species envelopes). However, the methodology presented here is flexible and allows users to select predictor and response variables best suited to their research goals.
Because the spatial data set presented here can be linked to the extensive list of attributes recorded by FIA for each plot, a number of forest characteristics can be estimated directly from the data set. In addition, the data set can be used to initialize a number of forest simulation models. Because the data set in essence provides a tree‐level model of the forests in the western United States, it greatly augments the information available to researchers and managers who previously relied on data from sparse forest plots or stand inventories.
Acknowledgments
The Rocky Mountain Research Station and the National Fire Decision Support Center supported this effort. We are grateful to Nicholas Crookston for assistance in understanding and using his yaImpute package in R. Elizabeth Burrill at FIA and Chris Toney with the USFS Fire Laboratory in Missoula assisted us with obtaining FIA and LFRDB data via a Memorandum of Cooperation. We would also like to thank two anonymous reviewers for their comments on a previous draft.
© 2016. This work is published under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Maps of the number, size, and species of trees in forests across the western United States are desirable for many applications, such as estimating terrestrial carbon resources, predicting tree mortality following wildfires, and conducting forest inventory. Detailed mapping of trees for large areas is not feasible with current technologies, but statistical methods for matching forest plot data with biophysical characteristics of the landscape offer a practical means to populate landscapes with a limited set of forest plot inventory data. We used a modified random forests approach with Landscape Fire and Resource Management Planning Tools (