Comparing Machine Learning Approaches for

Full text

Turn on search term navigation

1. Introduction

Meeting continuously increasing food and fuel demands while protecting environmental integrity is a grand challenge. Agriculture as the primary stage for food and fuel production is associated with a range of environmental pollution issues ranging from global warming to nutrient degradation. Specifically, agriculture contributes to 8% of national greenhouse gases (GHGs) [1] and ranks as a leading contributor to nutrient pollution nationally and globally [2,3,4]. The estimated economic loss of environmental damage caused by nitrogen release alone already exceeds $210 billion/year in the United States (U.S.) [5]. In addition, the continuous increasing of food and fuel demands accompanied by population growth, energy and water shortages, and weather unpredictability, will further accelerate environmental pollution from agricultural expansion [6,7,8,9,10]. Ensuring environmentally friendly agricultural production is important to achieving multiple United Nations’ Sustainable Development Goals, such as zero hunger, climate action, clean water and responsible production. Globally, food production ranks as a top contributor to water quality degradation in the form of eutrophication and hypoxia. To effectively mitigate agricultural pollution, it is urgent to accurately and rapidly assess the environmental impacts of agriculture.

The life cycle assessment (LCA) is the most prominent research method for quantifying environmental releases of agricultural supply chains [11,12,13,14,15]. Agricultural LCA is capable of quantifying environmental releases from both on-farm and supply chain activities, such as fertilizer production. The systematic scope of LCA is necessary in order to avoid shifting environmental burdens, such as between global warming and eutrophication impacts [16]. The recent LCA studies indicated that life cycle environmental releases of crop production present substantial spatial heterogeneity [17,18,19,20], as influenced by weather, soil and farming practices. For example, the reported life cycle nutrient releases of corn in the Midwest region varied by a factor of 200, spanning from 0.001 to 0.2 kg N-eq/kg of corn [13,17,20,21,22,23,24,25,26,27,28,29]. The spatially-explicit life cycle environmental impacts from agricultural production serve as a scientific basis for locating spatial hot-spots and designing spatially targeted remediation strategies.

Various approaches exist to estimating the life cycle environmental impacts of agricultural production [30,31]. However, it remains challenging to efficiently estimate, spatially and temporally, explicit life cycle environmental impacts at the county scale over a large region. The existing approaches belong to two general groups, based on their fundamental differences in the structure of life cycle inventory. The first group features the integration of spatially and temporally explicit on-farm releases from biogeochemical models and supply chain releases from process-based LCA. For crop systems, the existing crop LCAs utilized biogeochemical models such as the daily CENTURY (DAYCENT) model [1,29,32], environmental policy integrated climate model (EPIC) [13], and the denitrification-decomposition model (DNDC) [33,34] to estimate GHG and nutrient releases of corn [22,35], soybeans [35], alfalfa [35], hybrid poplar [35], switchgrass [22,35], and miscanthus [13,33]. These biogeochemical models are capable of capturing spatial explicit environmental releases, as influenced by location-specific weather, soil, topography and farming practices at finer spatial scales, such as farm or county scales [1,36,37]. Despite being powerful, these biogeochemical models are data-intensive and time-consuming, necessitate expert inputs and often require supercomputer clusters to compute the spatially explicit environmental impacts over a large region (i.e., the Midwest) [38,39]. The second group utilizes the input-output structure for the life cycle inventory. The multi-regional input-output LCA approaches, such as EORA and EXIOBASE [40], are utilized to track the environmental impacts of economic interdependencies among regions, often at a sector level. Although the input-output LCA approaches are capable of investigating the environmental impacts of trades over a large geographic area, the life cycle environmental inventory is often aggregated at a sector level and reside at coarse spatial scales, such as regional or county scales. EXIOBASE does not track the influences of temporally dynamic factors such as weather, soil and farming practices on life cycle environmental impacts. Overall, methods capable of rapidly and accurately estimating spatially and temporally an explicit life cycle environmental inventory of agricultural production at the county scale over a large region are lacking.

Machine learning (ML) approaches were applied to various disciplines, such as predicting traffic speed in urban places and identifying fertilizer recommendation classes. While these studies are valuable, they demonstrate applications of ML in LCA [41,42]. Machine learning (ML) approaches may present promising alternatives to traditional approaches for accurately and efficiently estimating spatially and temporally explicit LCA at the county scale over a large region across multiple years. Distinct from traditionally process-based or input-output LCA approaches, ML models are capable of efficiently computing large datasets at the county scale across multiple years. Although ML techniques have promising potentials for transforming LCAs, the application of ML models to spatially and temporally LCA has been limited to date [43,44,45,46,47,48]. Previously, ML techniques were used to estimate either spatially generic life cycle release inventory of production processes [48,49], or life cycle toxicity impacts of chemicals [50]. A few recent studies applied neural networks to examine the life cycle GHG emissions associated with producing various crops based on literature sources or the spatially generic emission factors of the Intergovernmental Panel on Climate Change (IPCC) [6,7,51,52,53,54,55,56]. These valuable efforts greatly contributed to fusing LCA and ML. However, none of the existing studies assessed the possibility of using ML techniques for predicting spatially and temporally explicit life cycle environmental releases.

To fill in the aforementioned knowledge gaps, this study tested and compared five ML approaches for computing spatially and temporally explicit life cycle environmental impacts at the county scale, with corn production in the Midwest region as a case study. The research question this study addresses is which machine learning model will yield highest predictive accuracy and computational efficiency for estimating the spatially and temporally explicit life cycle environmental impacts of corn. The Midwest region contains more than 39 million hectares of land and produced more than 80 percent of U.S. corn in the year 2018 [57]. This study identified the most efficient the and accurate combination of ML approaches and datasets for predicting spatially explicit life cycle global warming (GW) and eutrophication (EU) impacts of corn production. To the authors’ best knowledge, this is the first study applying and comparing ML approaches for predicting the spatially explicit life cycle environmental impacts of agricultural production.

2. Models and Data Sets 2.1. Overview of Models

Five modeling approaches were tested and compared in this study, including linear regressison (LR), support vector machine regression (SVR) (Section 2.1.1), artificial neural network (ANN) (Section 2.1.2), two tree based model, gradient boosted regression tree (GBRT) (Section 2.1.3), and extreme gradient boosting (XGBoost) (Section 2.1.4) to determine the modeling approach which presented best predictive accuracy and the highest computational efficiency [58].

2.1.1. Support Vector Machine Regression

Support vector machines (SVMs) are a class of learning based non-linear modeling algorithms with proven performance in a wide range of practical applications in classification and regression [59,60]. SVMs were initially developed for qualitative modeling problems. With the introduction of anϵ-insensitive loss function, SVMs have been extended to solve quantitative modeling problems. In this study, the SVM regression (SVR) technique was employed to model the county-level life cycle GW and EU impacts influenced by climate, soil characteristics, and farming practices features.

Suppose that a training set S containing n data points is given as:

S=_x1,_y1,…,_xi,_yi,…,_xn,_yn

where_xi∈^Rd, i.e., d-dimensional space, and_yi∈R. The goal of SVR is to find a function f which can estimate all these data well. It is equivalent to the following constrained optimization problem:

min12||w||+C∑i=1n(_ξi+_ξi*)s.t._yi−<w,_xi>−b≤ϵ+_ξi<w,_xi>+b−_yi≤ϵ+_ξi*_ξi,_ξi*≥0

where w is the normal vector of the hyperplane,_ξiand_ξi*are the slack variables, and C is the coefficient of the regularization terms. In this case, it is set to equal to 1.^wT _xi+bis the regression line which is the center of a tunnel with radiusγ=1||w||. Points on the surface so-called support vectors make up the tunnel. Involving_ξi,_ξi*allows left certain points outside tunnel as errors in order to avoid over-fitting._ξi,_ξi*are set as 0.1.

The main purpose of SVR is to represent complicated relationships via non-linear mapping. The original input space is mapped into a high-dimensional feature space using kernel functions where it becomes linearly separable. Some common used kernel functions are linear, polynomial, sigmoid, and radial basis kernel functions (RBF) [61]. The RBF is the most often used kernel function in SVR.

k(_xi,_xj)=exp(−γ‖_xi−_xj^‖2)

The kernel function can be chosen according to data structure and the aim of project. Due to the small size of features, Gaussian kernel is used in this study.

Two parameters, C andγ, must be appropriately set in SVR. The accuracy will be very high in the training stage and very low in the testing stage, when the value of C is set too large, while an extremely small value of may result in under-fitting, and excessively large value of may lead to over-fitting.

2.1.2. Artificial Neural Networks

Artificial neural networks (ANN) represent a type of computing that is based on the way that the human brain performs computations [62], which has been extensively used for pattern recognition and classification [63,64]. It consists of simple computational units called neurons. The numbers of neurons in the input and output layers are typically fixed by the type of application. The number of hidden layers and their neurons is typically determined by trial and error [65]. ANN can model complex non-linear relationships. The extra layers enable the composition of features from lower layers, giving the potential of modeling complex data with fewer units than a similarly performing shallow network [66].

The multi-layer perceptron (MLP) is the most frequently used ANN method, which consists of layered feedforward networks typically trained with static back propagation (see Figure 1).

The neuron receives a set of inputs or signals (x), calculates a weighted average of them (z) using the summation function and weights (w), and then uses some activation function (f) to produce an output, where

z=∑i=1n_wi _xi

The connections between the input layer and the hidden layer contain weights which are usually determined through training the system. The hidden layer sums the weighted inputs and uses the transfer function to create an output value. The transfer function is a relationship between the internal activation level of the neuron (called activation function) and the outputs. The rectified linear unit (ReLU) is the most commonly used activation function in deep learning models.

f(x)=^x+=max(0,x)

where x is the input to a neuron. The MLP was trained with back-propagation algorithm (BPA), the most frequently used in the ANN literature, and was adopted in this study.

The values of the weights are changed by using stochastic gradient descent (SGD) to minimize a suitable function used as the training stopping criterion. One of the functions most commonly used in supervised learning is the sum-of squared residuals given by Equation (6):

Loss=_∑i=1N ^{(_yi−_y^i)2}n

where_yiand_y^iare the actual and predicted values, respectively.

The current weight change on a given layer is given by Equation (7).The weights will be updated by following the slope of the error surface downwards towards its minimum.

_winew←_wiold−η(∂Loss∂^wold)

whereη(usually between 10⁻⁶ and 1.0) is the learning rate.

A high learning rate corresponds to rapid learning which may push the training towards a local minimum or cause oscillation. In turn, when applying small learning rates, the time to reach a global minimum will be considerably increase [67].

To achieve faster learning and avoid local minima, an additional term, called “momentum,” is adopted to smooth out the weight changes, which helps to protect network learning from oscillation. To control the overfitting of a neural network,_L2regularization and dropout are also used in this study.

2.1.3. Gradient Boosting Regression Tree

Gradient boosting regression tree (GBRT) is an ensemble learning method that integrates the regression tree model with gradient boosting method. It builds a set of decision trees in a greedy manner at training time and makes a class prediction by majority voting. The algorithm is shown below.

Initialize

_F0(x)=argminγ∑i=1NL(_yi,γ)

where L is the loss function, defined by Equation (10)
Compute

_rim=−_{∂L(_yi,f(_xi))∂f(_xi)f=_fm−1(x)}

throughi=1,…,Nandm=1,…,M. The_rimare called pseudo-residuals.
Fit a regression tree to target_rimgiving terminal regions_Rmj, forj=1,…,J(_Rmjis a tree consisting of J leaf nodes).

_γjm=argminγ∑_xi∈_RmjL(_yi,_fm−1(_xi)+γ_hm(_xi))

The new multiplier_γjmis updated based on the tree_fm−1(x)and_γjm−1.
The new tree is updated by

_fm(x)=_fm−1(x)+a∑j=1Jm_γjmI(x∈_Rmj)

I is the indicator function which equals to 1 if x is in the region_Rmj. In this paper, we prevent over-fitting by controlling the learning rate and subsample. We set the learning rate to 0.01 and subsampling rate to 0.75.
Output_yi^=_FM(_xi)by updating the following equation until m reaches M.

_fM(_xi)=∑m=1M_fm(_xi)=∑m=1M∑j=1J_γjmI(x∈_Rmj)

The multiplier_γjmshrinks in small steps again to avoid over-fitting sufficiently. The maximum depth of the model is set to be 8 to constrain the function space. It controls the complexity of trees. In general, the GBRT model is voted by majority of M trees to decide the best step function to approximate the regression line. The total estimators is equal to 2000.

2.1.4. Extreme Gradient Boosting

Extreme gradient boosting (XGBoost) is an improved algorithm based on the gradient boosting decision tree and can construct boosted trees efficiently and operate in parallel [68]. Compared with GBRT, XGBoost provides four additional features to achieve better performance. They are: (1) The weights of each new tree can be scaled down by given constant leaves. It reduces the influence of a single tree on the final score. (2) In column sampling, which works in a similar way to random forests, each tree is built using only a column-wise sample from the training dataset. (3) The introduction of the regularized loss function to avoid overfitting. (4) The loss function is approximated by Taylor expansion which speeds up the optimization procedure. We want to minimize the following regularized objective function in order to choose_fk.

L=∑i=1nl(_yi^,_yi)+∑k=1KΩ(_fk)

l is the loss function. In order to prevent too large a complexity of the model, the regularization termΩis included as follows:

Ω(f)=γT+12^λ||w||2

whereγis the complexity of each leaf. T is the number of leaves in the decision tree, andλis a parameter to scale the penalty which is set to be 0.85 in this study.

2.2. Datasets and Modeling Scenarios

The predictor variables (Xs) included the features of temperature, precipitation, soil organic content, soil texture, application rates of nitrogen and phosphorus fertilizers, and farming practices. The outcomes variables (Ys) were defined as county-level life cycle GW and EU impacts. A total of 10 modeling scenarios reflecting various combinations of algorithms and features were tested and compared in this study (Table 1). Modeling scenarios S1A through S5A included total of 32 features, containing monthly average temperature and precipitation features. Modeling scenarios S1A through S5A included a total of 18 features, containing seasonally average temperature and precipitation features.

The data on climate, soil characteristics, and farming practices were collected from various governmental databases for 12 U.S. Midwest states during the years 2000–2008 (Table 2). The monthly and seasonal average temperature (°C) and precipitation (mm) were obtained from the National Oceanic and Atmospheric Administration (NOAA)’s [69] National Climatic Data Center (NCDC) [70]. Soil data such as soil organic content and soil texture were extracted from the U.S. Department of Agriculture Soil Survey Geographic Database (SSURGO) [71]. The nitrogen and phosphorus fertilizer application rates and tillage practices were obtained from the U.S. Department of Agriculture’s (USDA) [72] National Agricultural Survey Statistics and the Conservation Technology Information Center. The life cycle GW and EU impacts of corn production in Midwest states at the county-level during the years 2000-2008 are reported in our previous work (Lee et al. [17]). The total records are approximately 8000 over 9 years from 2000 through 2008. After removing records containing missing values, around 6000 complete records were left. To the authors’ best knowledge, this is the largest database for spatially explicit life cycle impacts of corn production. The data sources for each feature are described in Table 2.

2.3. Training, Validation, and Test Approaches

The modeling procedures include four sequential steps: data preprocessing, hyper-parameter tuning, validation, and testing. The process is shown in Figure 2.

To preprocess the raw data, we performed some trivial data cleaning, such as removing the missing values from the data using case-wise deletion. Categorical variables were one-hot encoded, while the numeric variables were normalized. The training step is equivalent to tuning the parameters for each model to seek the best fitted hyper-parameter combination. We call the parameters for the model hyper-parameters for distinguishing them from parameters of the data set. In order to find the best performing hyper-parameter combination, the grid search method was used. Two hyper-parameters, C and r, have to select the optimal value from two chooses of each kind. The grid, in this case, can be imagined as a 2 by 2 matrix. The names of hyper-parameters are taken to be the columns, and the two possible choices are taken to be the rows. The algorithm chooses one hyper-parameter from each kind at a time becoming a unique combination. Every model used in this study has different hyper-parameters. The different models should be assigned a distinct hyper-parameter grid. The grid search method would try every combination of preset parameters until it accessed all the combinations once. Preset parameters are given by experience. The algorithm would record every result from every hyper-parameter combination, and eventually return the best result and the related hyper-parameter combination. The best-performed parameter set would settle for the corresponding ML model. After the first round of grid search, the rough selection provides a range of high performing hyper-parameter. This happens only if the hyper-parameters are continuous. For higher precision, the range is divided into several instances depending on the computer’s power. In this study, we divided the hyper-parameters, such as the bag fraction for the GBRT model, into 10 instances. In order to address the overfitting concern, we constructed models using a stricter 3-fold cross-validation methodology, in which we divided the three folds into training (two folds) and test (one fold) sets. The construction of the folds for cross-validation was created using the same stratification procedure. The process repeated three times until every fold was used as a validation fold once. The output from the 3-fold cross-validation would be the average of all folds of cross-validation results. Three metrics were used to evaluate the predictive accuracy of machine learning models, including cross-validation correlation, mean square error, and coefficient of determination:

(i)

Cross-validation correlation coefficient (Corr, see Equation (15)),

corr=_{σ_yi} _{σ_yi^}_{σ_yi_yi^}=∑^(_yi−y¯)2∑^{(_yi^−y^¯)2}∑(_yi−y¯)(_yi^−y^¯)

was adopted to measure the strength of relationship between label_yiand prediction_yi^.

(ii)

Mean-square-error (MSE, see Equation (16))

MSE=1n∑i=1n^{(_yi−_yi^)2}

measures the average of the errors between_yiand_yi^. This metric was involved to be the measure of training, testing, and cross-validation performance.

(iii)

Coefficient of determination (^R2 , see Equation (17))

^R2=1−_∑i=1n ^{(_yi−_yi^)2}_∑i=1n ^{(_yi−_yi¯)2}

was used to evaluate how well the regression line represents the data. The computational efficiency was determined by the required time for training ML models and for using ML models for predicting spatially and temporally explicit life cycle environmental impacts, based on the computer configuration of Intel i7-4790k CPU, 8 GB memory at 2400 Mhz and Nvidia GTX 970 graphic card. Finally, we are able to calculate the scores under different metrics. The relative importance of input parameters such as weather, soil, and farming practices on life cycle GW and EU were ranked by sorting their corresponding coefficients. All the algorithms and metrics are adopted from Python package scikit-learn [73].

3. Results 3.1. Predictive Accuracy

3.1.1. Influences of Input Variables

The models built on monthly features generally yielded higher predictive accuracy than the models built on seasonal features for the same ML algorithms. For example, both cross-validation (CV) correlation and^R2 values of S1A (linear model based on monthly features) were larger than those of S1B (linear model based on seasonal features) for the life cycle GW. Similarly, both CV correlation and R-squared values of S1A were larger than those of S1B for the life cycle EU. The same trends of CV correlation and R-squared values were observed for the remaining modeling algorithms (Table 3), including XGBoost, GBRT, SVR, and MLP. Consistently, XGBoost and GBRT using monthly features achieved lower MSE values than their corresponding algorithms with seasonal features. The only exception was that LM, SVR, and MLP with monthly features presented slightly higher MSE values than their corresponding algorithms with seasonal features.

3.1.2. Influences of Modeling Algorithms

Based on Table 3, GBRT presented the superior predictive accuracy. GBRT showed the highest CV correlation value (0.8) and R squared score (0.63), and the lowest MSE value (0.27) among all five models for the life cycle GW. Similarly, GBRT showed the highest CV correlation (0.87) and R squared score (0.75), and the lowest MSE value (10) among all five models for the life cycle EU. XGBoost was the second-best choice in terms of predictive accuracy. When the XGBoost model was applied for predicting life cycle GW, it produced the second-highest CV correlation value (0.78), the second-highest R squared score (0.61), and the lowest MSE value (0.28). Similarly, when the XGBoost model was used for predicting life cycle EU, it produced the second-highest CV correlation value (0.86), the second-highest R squared score (0.73), and the lowest MSE value (9). In contrast, the benchmark linear model exhibited the lowest predictive accuracy for both GW and EU. For example, the linear model had the lowest CV correlation value (0.45) and^R2scores (0.2), and the highest MSE (0.58) among all five models for life cycle GW. Similarly, the linear model showed the lowest CV correlation (0.65) and R squared (0.63), and the lowest MSE (0.27) among all five models for the life cycle EU. Overall, XGBoost, GBRT, SVR, and MLP all showed higher predictive accuracy than the benchmark linear model, indicating that these models were capable of capturing non-linearity and complex interactions among features.

3.1.3. Influences of Sample Sizes

Figure 3 and Figure 4 presented the influences of training data size on models’ predictive capabilities. The CV correlation values for predicting life cycle GW exponentially increased for all model algorithms when the sample size increased from 100 to 4000. The CV correlation values for predicting life cycle GW reached the peak values at a sample size of round 4000. When the sample size continuously increased beyond 4000, the peak point, the CV correlation values slightly decreased. The CV correlation values for predicting life cycle EU dramatically increased for all model algorithms when the sample size increased from 100 to 5000. Additionally, the CV correlation values for GB and XGBoost exceeded 0.85 when 5000 input datasets were used to generate the models.

Figure 3 and Figure 4 also confirmed that GBRT and XGBoost presented the highest predictive accuracy while the linear model always had the lowest predictive accuracy. For the same sample sizes across different algorithms, the GBRT model consistently showed the highest predictive accuracy among all models for both the life cycle GW and EU. The XGBoost model always ranked as the second-best fitted model, whose CV correlation values were slightly lower than the GBRT model under the same sample sizes.

3.2. Predictive Efficiency

All five models were a million times more efficient than the traditional process-based LCA models for estimating spatially explicit life cycle GW and EU. With the configuration of an Intel i7-4790k CPU, 8 GB memory at 2400 Mhz, and a Nvidia GTX 970 graphic card, the training time for five distinct algorithms ranged from 0.02 to 112 s for each simulation of hyper-parameter combination. While the linear model required the shortest duration of 0.02 s/simulation, GBRT needed the longest duration of 112 s/simulation for model training(Table 4). The predictive time for all models was less than 1 s. The total training and prediction times for all five models were less than 4 h for estimating spatially explicit life cycle GW and EU at the county scale in the entire Midwest region. In contrast, the traditional process-based LCA model approach relying on detailed bio-geochemistry models, such as EPIC, DNDC, and DAYCENT, may require months to estimate spatially explicit life cycle GW and EU for a large region such as the entire Midwest [17,39].

The comparison across five algorithms indicated trade-offs existed between computational efficiency and predictive accuracy for both the life cycle GW and EU. The linear model was the most efficient approach but presented the lowest predictive accuracy for both the life cycle GW and EU. In contrast, GBRT required the longest training time, but yielded the highest predictive accuracy for both impacts. In addition, although the time required for training GBRT for each hyper-parameter combination was three times longer than for training XGBoost, the CV correlation value for GBRT was only improved over 0.2 than for XGBoost. 3.3. Key Influential Factors

The GBRT results showed that nitrogen and phosphorus application rates, soil organic content, and clay and silt types of soil were the top influencing factors for life cycle GW impacts (Figure 5). Nitrogen fertilizers contributed to life cycle GW impacts directly by causing on-farm nitrous oxide (N₂O) emissions from volatilization and denitrification processes in soil [74,75], and indirectly by emitting GHGs from supply chain activities such as manufacturing nitrogen fertilizers [76]. Phosphorus fertilizers contributed to life cycle GW primarily by generating GHGs from supply chain activities such as manufacturing phosphorus fertilizers [76]. Moreover, soil characteristics also had a significant role in regulating GHG emissions. Recent studies based on field experiments [77,78] and mechanistic model (i.e., DNDC) [70] suggested that soil organic matter increased the availability of nitrogen and carbon to soil microorganisms, consequently enhancing N₂O emissions. Furthermore, finer-textured soils, such as clay or silt soils, which tend to retain more water, often resulted in higher N₂O emissions due to higher denitrification rates in soil [79]. Additionally, meteorological factors such as temperature also had a significant contribution to life cycle GW. For instance, March’s temperature potentially influenced the shift of the N₂O:N₂ ratio [80], thereby increasing N₂O emissions during the freeze-thaw cycle [81].

The present study also found that the top influencing factors of life cycle EU impacts were precipitation in February, soil organic content, the temperature in May, and precipitation in May and June (Figure 6). Precipitation can promote the penetration of nutrients through soil layers and wash away the nutrients from topsoil, thereby increasing nutrient leaching and runoff [82]. The contributions of precipitation to life cycle GW impacts were consistent with previous studies examining N fluxes in the watersheds of the contiguous U.S. [83,84] and Lake Michigan Basin [85], and field-based studies in the U.S. Midwest [86], where N losses to water bodies were higher during wet seasons/years compared to dry seasons/years. Furthermore, few studies [87,88] explained that soil organic carbon can lead to the accumulation of carbon and nutrients in the soil, subsequently resulting in more nutrients available for leaching and runoff, which corroborates the findings of this study. Amery and Vandecasteele also reported that lower phosphate binding capacity to soil organic portion may increase susceptibility to phosphorus leaching [89].

4. Discussion 4.1. Syntheses of Machine Learning Modeling Comparison The comparison among all modeling scenarios indicated that the GBRT algorithm along with 4000 records of monthly features yielded the highest predictive accuracy for life cycle GW. The model based on the GBRT algorithm with 6000 records of monthly features showed the highest predictive accuracy for the life cycle EU. It is interesting to note that predictive accuracy was improved at the cost of simulation time. Although GBRT showed the highest predictive accuracy, GBRT required the longest training time. Compared with GBRT, the XGBoost model had slightly lower predictive accuracy but significantly lower computational time. Additionally, more training datasets (larger than 6000) will likely enable higher predictive accuracy for the life cycle EU. However, more training datasets (larger than 4000) cannot further improve the predictive accuracy for life cycle GW. Overall, when predictive accuracy is the key criterion, the GBRT model is the best choice. In contrast, when both predictive accuracy and model efficiency are top concerns, the XGBoost model is highly recommended. The majority of the existing machine learning applications in LCA used the ANN approach and did not compare performances between ANN and other ML approaches. However, our study found that BRT and XGBoost models showed higher predictive accuracy than the ANN approach for estimating corn’s life cycle global warming and eutrophication impacts. 4.2. Application of Machine Learning Based Modeling Approaches for Supply Chain Management and Environmental Policies Compared with traditional process-based models, the ML models discussed in this study present unique merits, such as faster execution and reduced storage needs to estimate modeling outputs, and more flexible integration into other processes and simulation platforms. Such advantages permit ML models to rapidly complete a high number of simulation runs, thereby making ML models superior approaches for a range of computationally intensive tasks, such as optimization, prediction, and validation. For example, it is often infeasible for traditional process-based models to identify the optimal corn supply chains for corn-based biorefineries in the U.S. Midwest region at a fine spatial resolution over a large region, due to billions of simulation scenarios reflecting various combinations of weather, soil, farming practices, and supply scenarios. In contrast, the ML models compressed the model simulation time from months to seconds, consequently enabling the identification of the optimal corn supply chains for corn users such as biorefineries and animal sectors at a fine spatial resolution over a large geographic area. The ML models in this study are valuable for policy assistance at the county, state, and regional scales. At the county and state scales, these models are capable of identifying the top polluting counties and states, and of suggesting spatially targeted remediation strategies. For example, different fertilizer application rates and tillage practices could be recommended for different counties to approximate an optimal trade-off level among life cycle GW and EU impacts. The results could also be aggregated into region scale assessments for supporting regional policies and planning. For example, to support regional biofuel and land use policies, our ML models could be used for corn supply chain optimization, corn-based biorefinery siting, and feedstock landscape optimization.

There are also several drawbacks associated with ML models. First, the computational burden for generating spatially explicit training datasets at fine spatial scales (such as a county) over a large geographic area (i.e., the Midwest) is substantial. The spatially explicit life cycle analyses at a fine spatial resolution over a large geographic area are lacking to date. The majority of LCAs that use spatially explicit biogeochemical models focused on a few farms or a small region due to data and computational constraints. This study obtained spatially explicit life cycle environmental impacts of corn at the county scale in the Midwest by integrating process-based LCA and the spatially explicit EPIC model with supercomputer clusters [17]. The computational burden for generating the spatially explicit training datasets likely will restrain the adoption of ML models, especially when the computing infrastructure is not available for a large number of process-based modeling simulations. However, this study showed that increasing sample size beyond 4000 would not improve model accuracy for life cycle GW. This indicated that reducing the sample size to 4000 would be appropriate for reaching desirable predictive accuracy for life cycle GW, and is capable of reducing computational burden associated with generating training datasets and fitting ML models. We recommend future studies to identify the minimal sample size for spatially and temporally explicit LCAs of other crop and animal systems with machine learning approaches. Second, besides inheriting the uncertainty of process-based LCA and the EPIC model, the training algorithms also introduce additional uncertainty to our ML models. The uncertainty of process-based LCA and EPIC models was extensively discussed in previous studies [17,20]. This study focuses on the uncertainty introduced by training algorithms. By comparing four different algorithms and cross-validation procedures, this study found that the GBRT algorithm was the best statistical approach to approximate the traditional process-based model at a lower computational cost. We recommend future studies to identify the minimal sample size for spatially and temporally explicit LCAs of other crop and animal systems with machine learning approaches. Additionally, caution should be taken when applying the ML models generated from this study to other regions, whose weather, soil, or farming practice values may fall out of the ranges of our training datasets in the Midwest. In these cases, one should follow the procedures described in Figure 4 to construct and test ML models before employing them for other regions.

5. Conclusions and Recommendations for Future Work While machine learning approaches were suggested as alternative approaches for predicting the life cycle environmental impacts of agricultural production, no studies yet identified the appropriate sample sizes and most suitable machine learning for predicting spatially and temporally explicit life cycle environmental impacts. This study comprehensively compared the influences of ML algorithms, training datasets, and sample sizes on the predicative accuracy and efficiency of ML models for estimating spatially explicit life cycle GW and EU impacts, with the Midwest corn as a case study. Among all LCA studies which used ML algorithms, this study included the largest training datasets, covering over 1000 counties in the Midwest over 9 years. This study found that GBRT and XGBoost models yielded the highest predictive accuracy for both the life cycle GW and EU. To further improve predictive accuracy, this study suggests expanding spatially and temporally explicit training datasets, particularly for spatially explicit life cycle EU. Moreover, the methods demonstrated in this study can be applied to other crop and animal systems. We recommend future studies to test these methods for predicting spatially explicit life cycle environmental impacts of soybean, beef, and dairy production, which are among the top contributors to the national GW and EU impacts. Additionally, the ML models open up new opportunities in solving computation-intensive challenges, such as optimization, prediction, and validation. Future interdisciplinary work, based on integrating the ML models in this study with optimization and economic models, will enable identifying the optimal supply chains, and supporting spatially and temporally explicit economic and policy analyses.

Scenarios	Algorithms	Model Inputs (Features)
S1A	LM	monthly temperature (12 covariates),
S2A	SVR	monthly precipitation (12 covariates),
S3A	MLP	soil organic content, soil texture (3 covariates),
S4A	GBRT	Nitrogen fertilizer rate, phosphorus fertilizer rate,
S5A	XGBoost	percentage of conventional, no tillage (2 covariates)
S1B	LM	seasonal temperature (4 covariates),
S2B	SVR	seasonal precipitation (4 covariates),
S3B	MLP	soil organic content, soil texture (3 covariates),
S4B	GBRT	Nitrogen fertilizer rate, phosphorus fertilizer rate,
S5B	XGBoost	percentage of conventional, no tillage (2 covariates)

Predictors (Xs)	Data Description	Data Sources
Temperature (°C)	Monthly mean temperature (Jan-Dec)	NOAA [69]
Precipitation (mm)	Monthly mean precipitation (Jan-Dec)	NOAA [69]
Soil organic content (%)	Percentage of soil organic content measured in soil depth up to 6 m	USDA SSURGO [71]
Soil type: Clay, Sand, Silt (%)	Percentage of soil types	USDA SSURGO [71]
Nitrogen and phosphorus fertilizers	fertilizer application rate (lbs/acre)	USDA NASS [72]
Farming practices (NT & CT)	Farming practices in fractions (No Tillage & Conventional Tillage)	USDA [72]
Outcomes (Ys)
Life cycle GW values	Life-cycle GW (kg CO2-eq. kg corn−1)	Lee et al. [17]
Life cycle EU values	Life-cycle EU (mg N-eq. kg corn−1)	Lee et al. [17]

Scenario A		S1A	S2A	S3A	S4A	S5A
Scenario A		LM	SVR	MLP	GBRT	XGBoost
GW	CV corr	0.45	0.68	0.64	0.80	0.78
	MSE	0.58	0.40	0.44	0.27	0.28
	^R2	0.20	0.45	0.40	0.63	0.61
EU	CV corr	0.65	0.80	0.74	0.87	0.86
	MSE	20	14	18	10	9
	^R2	0.42	0.60	0.50	0.75	0.73
Scenario B		S1B	S2B	S3B	S4B	S5B
Scenario B		LM	SVR	MLP	GBRT	XGBoost
GW	CV corr	0.35	0.63	0.64	0.78	0.76
	MSE	0.55	0.38	0.37	0.28	0.30
	^R2	0.11	0.39	0.40	0.61	0.58
EU	CV corr	0.63	0.74	0.74	0.84	0.83
	MSE	22	17	18	10	11
	^R2	0.39	0.53	0.49	0.71	0.62

		LM	SVR	MLP	GBRT	XGBoost
GW	Training	0.02	4.11	65.4	112	33.8
GW	Predicting	7 × 105	0.13	1 × 103	0.09	0.06
EU	Training	0.03	4.72	33.1	128	41.6
EU	Predicting	6 × 105	0.04	5 × 104	0.02	0.09

Author Contributions

Conceptualization, X.X.R.; data curation, X.X.R. and X.Z.; formal analysis, X.X.R., Z.G., and Y.P.; funding acquisition, X.X.R.; methodology, Z.G. and Y.P.; project administration, X.X.R.; resources, X.X.R.; supervision, X.X.R.; validation, X.X.R., Z.G. and Y.P.; visualization, Z.G.; writing-original draft, X.X.R.; writing-review and editing, X.X.R., Z.G., Y.P., E.K.L. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This document is the result of the research project funded by Presidential Innovation Award at University at Albany, State University of New York.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Word count: 8513

Show less

© 2020. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Agriculture ranks as one of the top contributors to global warming and nutrient pollution. Quantifying life cycle environmental impacts from agricultural production serves as a scientific foundation for forming effective remediation strategies. However, methods capable of accurately and efficiently calculating spatially explicit life cycle global warming (GW) and eutrophication (EU) impacts at the county scale over a geographic region are lacking. The objective of this study was to determine the most efficient and accurate model for estimating spatially explicit life cycle GW and EU impacts at the county scale, with corn production in the U.S.’s Midwest region as a case study. This study compared the predictive accuracies and efficiencies of five distinct supervised machine learning (ML) algorithms, testing various sample sizes and feature selections. The results indicated that the gradient boosting regression tree model built with approximately 4000 records of monthly weather features yielded the highest predictive accuracy with cross-validation (CV) values of 0.8 for the life cycle GW impacts. The gradient boosting regression tree model built with nearly 6000 records of monthly weather features showed the highest predictive accuracy with CV values of 0.87 for the life cycle EU impacts based on all modeling scenarios. Moreover, predictive accuracy was improved at the cost of simulation time. The gradient boosting regression tree model required the longest training time. ML algorithms demonstrated to be one million times faster than the traditional process-based model with high predictive accuracy. This indicates that ML can serve as an alternative surrogate of process-based models to estimate life-cycle environmental impacts, capturing large geographic areas and timeframes.

Details

Title

Comparing Machine Learning Approaches for Predicting Spatially Explicit Life Cycle Global Warming and Eutrophication Impacts from Corn Production

Author

Xiaobo Xue Romeiko; Guo, Zhijian; Pang, Yulei; Lee, Eun Kyung

; Zhang, Xuesong

First page

1481

Publication year

2020

Publication date

2020

Publisher

MDPI AG

e-ISSN

20711050

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/su12041481

ProQuest document ID

2443896801

Comparing Machine Learning Approaches for Predicting Spatially Explicit Life Cycle Global Warming and Eutrophication Impacts from Corn Production

Jump to:

Full text

Abstract

Details

Suggested sources