1 Introduction
Atmospheric chemistry is central to many environmental problems, including climate change, air quality degradation, stratospheric ozone loss and ecosystem damage. Atmospheric chemistry models are important tools to understand these issues and to formulate policy. These models solve the three-dimensional system of coupled continuity equations for an ensemble of species concentrations expressed as number density (molec. cm) via operation splitting of transport and local processes: denotes the wind vector, are the local chemical production and loss, is the emission rate, and is the deposition rate of species . We ignore here molecular diffusion as it is negligibly slow compared to advection. The first term of Eq. () is the transport operator and involves no coupling between the chemical species. The second term is the chemical operator, which connects the chemical species through a system of simultaneous ordinary differential equations (ODEs) that describe the chemical production and loss:
2 The numerical solution of Eq. () is computationally expensive as the equations are numerically stiff and require implicit integration schemes such as Rosenbrock solvers to guarantee numerical stability . As a consequence, 50 %–90 % of the computational cost of an atmospheric chemistry model such as GEOS-Chem can be spent on the integration of the chemical kinetics .
Previous efforts to increase the efficiency of the integration (with an associated reduction in accuracy) have involved dynamical reduction in the chemical mechanism (adaptive solvers) , separation of slow and fast species , quasi-steady state approximation or approximation of the chemical kinetics using polynomial functions (repro-modelling) . Other approaches have attempted to simplify the chemistry leading to a reduction in the number of reactants and species . However, none of these approaches have been transformative in their reduction in time spent on chemistry.
We discuss here the potential of a machine learning algorithm (in this case random forest regression) as an alternative approach to explicitly solving Eq. () with a numerical solver in the chemistry model GEOS-Chem. Figure illustrates the approach: during each model time step, GEOS-Chem sequentially solves a suite of operations relevant to the simulation of atmospheric chemistry. In the original model, solving the chemistry is the computationally most expensive step. Our aim is to replace it with a machine learning algorithm while keeping all other processes unchanged. Conceptually, this approach is comparable to previous efforts to speed up the solution of the chemical equations through more efficient integration.
Figure 1Schematic overview of the use of a random forest regression algorithm as an alternative to the chemistry solver. The original numerical model (GEOS-Chem) sequentially solves the operations relevant to atmospheric chemistry, with the chemical integrator being the computationally most expensive step (left side). Using training data produced from the full model, we generate a machine learning emulator that can then be used instead of the chemical integrator (right side). All other model processes are the same as in the original model.
[Figure omitted. See PDF]
Machine learning is becoming increasingly popular within the natural sciences and specifically within the Earth system sciences to either simulate processes that are poorly understood or to emulate computationally demanding physical processes (notably convection) . Machine learning has also been used to replace the chemical integrator for other chemical systems such as those found in combustion and been shown to be faster than solving the ODEs . Recently, found order-of-magnitude speed-ups for an atmospheric chemistry box model using a neural network emulator, although their solution suffers from rapid error propagation when applied over multiple time steps. Machine learning emulators have also been explored to directly predict air pollution concentration in future time steps , as well as for chemistry–climate simulations focusing on model predictions of time-averaged concentrations for selected species such as ozone () and the hydroxyl radical (OH) over timescales of days to months . In contrast, the algorithm presented here is optimized to capture the small-scale variability of the entire chemical space within a timescale of minutes, with only a small loss of accuracy when used repeatedly over multiple time steps. To do so, we use the numerical solution of the GEOS-Chem chemistry model to produce a training data set of output before and after the chemical integrator (Sect. and ), train a machine learning algorithm to emulate this integration (Sect. , and ), and then describe and assess the trained machine learning predictors (Sect. , , and ). Section describes the results of using the machine learning predictors to replace the chemical integrator in GEOS-Chem. In Sect. we discuss potential future directions for the uses of this methodology and in Sect. we draw some conclusions.
2 Methods2.1 Chemistry model description
All model simulations were performed using the NASA Goddard Earth Observing
System Model, version 5 (GEOS-5) with version 10 of the GEOS-Chem chemistry
embedded . GEOS-Chem (
The model chemistry scheme includes detailed tropospheric chemistry of oxides of hydrogen, nitrogen, bromine, volatile organic compounds and ozone (–––VOC–ozone), as originally described by , with the addition of halogen chemistry by plus updates to isoprene oxidation as described by . Photolysis rates are computed online by GEOS-Chem using the Fast-JX code of as implemented in GEOS-Chem by and . The gas-phase mechanism comprises 150 chemical species and 401 reactions and is solved using the kinetic pre-processor (KPP) Rosenbrock solver . There are 99 (very) short-lived species which are not transported, and we seek to emulate the evolution of the other 51 transported species.
While the GEOS model with GEOS-Chem chemistry can be run as a chemistry–climate model where the chemical constituents (notably ozone and aerosols) directly feed back to the meteorology, we disable this option here and use prescribed ozone and aerosol concentrations for the meteorology instead. This ensures that any differences between the reference model and the machine learning model can be attributed to imperfections in the emulator rather than changes in meteorology due to chemistry–climate feedbacks.
2.2 Training data
To produce our training data set we run the model for 1 month (July 2013). Each hour we output the three-dimensional instantaneous concentrations of each transported species immediately before and after chemical integration, along with a suite of environmental variables that are known to impact chemistry: temperature, pressure, relative humidity, air density, cosine of the solar zenith angle, cloud liquid water and cloud ice water. In addition, we output all photolysis rates since those are an essential element for chemistry calculations. Alternatively, one could also envision directly embedding the (computationally demanding) photolysis computation into the machine learning model, such that the emulator takes as input variables additional environmental variables relevant to photolysis (e.g. cloud cover, overhead ozone and aerosol loadings) and then emulates photolysis computation along with chemistry.
Each grid cell 1 h output constitutes one training sample, consisting of 126 input “features”: the 51 transported species concentrations, 68 photolysis rates and the 7 meteorological variables. We restrict our analysis to the troposphere (lowest 25 model levels) since this is the focus of this work. Each hour thus produces a total of 327 600 training samples, and so an overall data set of samples is produced over the full month. We withhold a randomly selected 10 % of the samples to act as validation data while the remaining samples act as training data.
2.3 Random forest regression
We use the random forest regression (RFR) algorithm to emulate the integration of atmospheric chemistry. Figure shows a schematic of RFR. It is a commonly used, and conceptually simple, supervised learning algorithm that consists of an ensemble (or forest) of decision trees. Each tree contains a tree-like sequence of decision nodes, based on which the tree splits into its various branches until the end of the tree (“the leaf”) is reached. This leaf is the prediction of the decision tree. Each decision node is based on whether one of the input features is above a certain value. An important aspect of the random forest is that each tree of the forest is trained on a subset of the full training data, thus providing a slightly different approximation of the model. A prediction is then made by averaging the predictions of the individual trees.
Figure 2
Schematic of random forest algorithm. For each species , we use a random forest consisting of 30 individual decision trees, each up to 12 layers deep (only the first four layers are shown). All decision trees take the same inputs (e.g. species concentration vector at given location, photolysis rates J, temperature , humidity ) and each decision tree node uses one of the input features plus a threshold value to determine the tree path for the given set of input features. The final prediction is made by averaging the 30 individual tree predictions ().
[Figure omitted. See PDF]
The RFR algorithm is less prone to over-fitting and produces predictions that are more stable than a single decision tree . Random forests are widely used since they are relatively simple to apply, suitable for both classification and regression problems, do not require data transformation, and are less susceptible to irrelevant or highly correlated input features. In addition, random forests allow for easy evaluation of the factors controlling the prediction, the decision structure and the relative importance of each input variable. Analysing these features can offer valuable insights into the control factors of the underlying mechanism, as discussed later. We discuss the potential for other algorithms in Sect. .
2.4 ImplementationFor each of the 51 chemical species transported in the chemistry model, we generate a separate random forest predictor. This predictor can be applied to all model grid cells, i.e. it captures all chemical regimes encountered by the respective target species. Conceptually, one can imagine that each tree path represents a different chemical regime, so it is important to generate trees that are large enough to encompass the entire solution space. We find a good compromise between computational complexity and accuracy of the solutions for random forests consisting of 30 trees with a maximum of 10 000 leaves (prediction values) per tree. These hyper-parameter were determined by trial and error, and we find very little sensitivity of our results to changes ( %) to the number of trees and/or number of leaves. Each tree is trained on a different sub-sample of the training data by randomly selecting 10 % of the training sample. In order to balance the training samples across the full range of model values, the training samples are evenly drawn from each decile of the predictor variable. This prevents over-sampling of ocean grid cells, which are typically characterized by very uniform chemistry. Our results show very little sensitivity to the size of the training sample as long as it covers the full solution space.
The Python software package scikit-learn
(
The forests were then embedded as a Fortran 90 subroutine into the GEOS-Chem chemistry module. Using an ad-hoc approach, the module first loads all tree nodes (archived after the training) into local memory and then evaluates each of the 1530 trees in series upon calling the random forest emulator. Each grid cell calls the same random forest emulator separately, passing to it all local information required to evaluate the trees (species concentrations, photolysis rates, environmental variables). No attempts were made to optimize the prediction algorithm beyond the existing Message Passing Interface grid-domain splitting.
2.5 Choice of predictor
We find that the quality of the RFR model (as implemented back into the GEOS-Chem model) depends critically on the choice of the predictor. Most simplistically, we could predict the concentration of a species after the integration step. However, many of the species in the model are log-normally distributed in which case predicting the logarithm of the concentration may provide a more accurate solution; we could also predict the change in the concentration after the integrator, the fractional change in the concentration, the logarithm of the fractional change, etc. After some trial and error, and based on chemical considerations, we choose two types of prediction: the change in concentration after going through the integrator, and the concentration after the integrator. We describe the first as the “tendency”. This fits with the differential equation perspective for chemistry given in Eq. (). However, if we incorporate only this approach we find that errors rapidly accrue. This is due to errors in the prediction of short-lived species such as NO, and Br. For these compounds, concentrations can vary by many orders of magnitude over an hour, and even small errors in the tendencies build up quickly when they are included in the full model. For these short-lived compounds, we use a second type of prediction where the RFR predicts the concentration of the compound after the integrator. We describe this as a prediction of the “concentration”. From a chemical perspective, this is similar to placing the species into steady state, where the concentration after the integrator does not depend on the initial concentration but is a function of the production () and loss () such that . We imitate this process by explicitly removing the predictor species from the input features, which we find improves performance.
The choice between predicting the tendency or the concentration is based on the standard deviation of the ratio of the concentration after chemistry to the concentration before chemistry () in the training data. This ratio is relatively stable and close to 1.00 for long-lived species but highly variable for short-lived species. Based on trial and error, we use a standard deviation threshold of 0.1 to distinguish between long-lived species () and short-lived species (). Table lists the prediction type used for each species. We discuss the treatment of NO and species in Sect. .
Table 1
Overview of the performance of the RFR model with the family treatment. Shown are the Pearson correlation , normalized root mean square error (NRMSE) and normalized mean bias (NMB). Comparison against the validation data set (10 % of training data withheld from training) are indicated with a “V”. Comparisons between the RFR simulation and the full GEOS-Chem model for July 2014 at 00:00 UTC after the 1st, 5th and 30st simulation day are indicated with “D1”, “D5” and “D30”, respectively. Prediction type of each species (concentration, tendency, family treatment) is given in the prediction column.
Nr | ID | Name | Prediction | NRMSE (%) | NMB (%) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
V | D1 | D5 | D30 | V | D1 | D5 | D30 | V | D1 | D5 | D30 | ||||
1 | ACET | Acetone | Tend | 0.98 | 1 | 1 | 1 | 15 | 0.88 | 2.3 | 3.7 | 0.17 | |||
2 | ALD2 | Acetaldehyde | Tend | 0.93 | 0.99 | 0.98 | 0.93 | 26 | 12 | 16 | 27 | 3.7 | |||
3 | ALK4 | alkanes | Tend | 0.98 | 1 | 1 | 1 | 13 | 2.6 | 5.8 | 7.6 | 0.06 | |||
4 | Br | Atomic bromine | Conc | 0.82 | 0.26 | 0.18 | 0.063 | 45 | 130 | 250 | 410 | 73 | 120 | 170 | |
5 | Br2 | Molecular bromine | Conc | 0.84 | 0.87 | 0.84 | 0.47 | 40 | 38 | 49 | 82 | ||||
6 | BrNO2 | Nitryl bromide | Conc | 0.87 | 0.82 | 0.84 | 0.76 | 40 | 46 | 45 | 57 | 8.5 | 46 | 54 | 39 |
7 | BrNO3 | Bromine nitrate | Conc | 0.42 | 0.33 | 0.4 | 0.42 | 110 | 150 | 140 | 140 | 110 | 190 | 170 | 160 |
8 | BrO | Bromine monoxide | Conc | 0.83 | 0.48 | 0.29 | 0.05 | 47 | 73 | 110 | 250 | 23 | 52 | 120 | |
9 | C2H6 | Ethane | Tend | 0.98 | 1 | 1 | 1 | 13 | 1.4 | 4.8 | 9.1 | 0.0082 | |||
10 | C3H8 | Propane | Tend | 0.97 | 1 | 1 | 0.98 | 17 | 4.1 | 5.1 | 15 | 0.85 | 0.8 | ||
11 | CH2Br2 | Dibromomethane | Tend | 0.97 | 1 | 1 | 1 | 19 | 0.86 | 2.9 | 7.1 | 0.0036 | |||
12 | CH2O | Formaldehyde | Tend | 0.93 | 0.97 | 0.95 | 0.95 | 26 | 17 | 24 | 27 | 3.4 | 17 | 12 | |
13 | CH3Br | Methyl bromide | Tend | 0.97 | 1 | 1 | 1 | 17 | 0.26 | 0.97 | 1.8 | 0.0013 | |||
14 | CHBr3 | Bromoform | Tend | 0.99 | 1 | 1 | 1 | 8.4 | 0.86 | 2.3 | 4.1 | ||||
15 | CO | Carbon monoxide | Tend | 0.98 | 1 | 1 | 1 | 13 | 0.89 | 2.2 | 2.4 | 0.09 | 0.017 | ||
16 | DMS | Dimethylsulfide | Tend | 0.98 | 0.99 | 0.89 | 0.87 | 12 | 11 | 38 | 58 | ||||
17 | GLYC | Glycoaldehyde | Tend | 0.97 | 0.99 | 0.99 | 0.98 | 17 | 11 | 14 | 16 | ||||
18 | H2O2 | Hydrogen peroxide | Tend | 0.96 | 0.97 | 0.91 | 0.86 | 20 | 19 | 31 | 45 | 0.1 | 3.5 | ||
19 | HAC | Hydroxyacetone | Tend | 1 | 0.99 | 0.99 | 0.98 | 0.95 | 8.4 | 15 | 16 | 0.025 | |||
20 | HBr | Hydrobromic acid | Conc | 0.68 | 0.74 | 0.72 | 0.6 | 56 | 52 | 53 | 66 | 1.7 | 9.8 | 8.9 | 19 |
21 | HNO2 | Nitrous acid | Conc | 0.91 | 0.85 | 0.96 | 0.76 | 34 | 48 | 43 | 64 | 23 | 37 | 50 | |
22 | HNO3 | Nitric acid | Conc | 0.88 | 0.88 | 0.87 | 0.77 | 37 | 36 | 39 | 55 | 2.3 | 12 | 27 | 37 |
23 | HNO4 | Peroxynitric acid | Conc | 0.71 | 0.72 | 0.74 | 0.69 | 55 | 60 | 56 | 64 | 4.2 | 40 | 50 | 65 |
24 | HOBr | Hypobromous acid | Conc | 0.7 | 0.59 | 0.54 | 0.47 | 57 | 73 | 73 | 86 | 12 | 23 | 16 | 28 |
25 | IEPOX | Isoprene epoxide | Tend | 0.98 | 0.98 | 0.97 | 0.97 | 15 | 17 | 21 | 19 | 0.06 | |||
26 | ISOP | Isoprene | Tend | 0.99 | 0.94 | 0.93 | 0.88 | 12 | 31 | 31 | 38 | ||||
27 | ISOPN | Isoprene nitrate | Tend | 0.94 | 0.94 | 0.92 | 0.78 | 24 | 28 | 30 | 48 | ||||
28 | MACR | Mathacrolein | Tend | 0.97 | 0.98 | 0.96 | 0.88 | 17 | 18 | 27 | 38 | 2.3 | |||
29 | MAP | Peroxyacetic acid | Tend | 0.96 | 0.99 | 0.98 | 0.98 | 20 | 8.6 | 17 | 15 | 0.27 | |||
30 | MEK | Methyl ethyl ketone | Tend | 0.91 | 0.98 | 0.98 | 0.96 | 31 | 15 | 14 | 25 | 28 | |||
31 | MMN | MACR MVK nitrate | Tend | 0.97 | 0.98 | 0.95 | 0.89 | 17 | 14 | 22 | 38 | 0.61 | |||
32 | MOBA | 5C acid from isoprene | Conc | 0.98 | 0.95 | 0.93 | 0.87 | 15 | 25 | 29 | 37 | ||||
33 | MP | Methylhydroperoxide | Tend | 0.89 | 0.97 | 0.8 | 0.8 | 33 | 19 | 54 | 48 | ||||
34 | MPN | Methyl peroxy nitrate | Conc | 0.85 | 0.62 | 0.4 | 0.43 | 50 | 87 | 130 | 140 | 26 | 100 | 160 | 130 |
35 | MSA | Methanesulfonic acid | Tend | 0.99 | 0.99 | 0.97 | 0.92 | 11 | 9.4 | 19 | 34 | ||||
36 | MVK | Methylvinylketone | Tend | 0.96 | 0.98 | 0.96 | 0.83 | 19 | 17 | 27 | 42 | 1.3 | |||
37 | N2O5 | Dinitrogen pentoxide | Conc | 0.69 | 0.02 | 0.02 | 0.041 | 56 | 390 | 490 | 340 | 28 | 1700 | 2400 | 1800 |
38 | NO | Nitric oxide | tend | 0.95 | 0.89 | 0.86 | 0.79 | 26 | 34 | 40 | 47 | 23 | 31 | 17 | |
39 | NO2 | Nitrogen dioxide | tend | 0.94 | 0.9 | 0.9 | 0.91 | 28 | 34 | 33 | 31 | 2.2 | 19 | 28 | 29 |
40 | NO3 | Nitrate radical | Conc | 0.74 | 0.064 | 0.065 | 0.095 | 60 | 690 | 620 | 470 | 30 | 780 | 840 | 850 |
41 | O3 | Ozone | Tend | 0.95 | 0.99 | 0.9 | 0.75 | 23 | 8.3 | 35 | 67 | 0.19 | 4.2 | 13 | |
42 | PAN | Peroxyacetylnitrate | Tend | 0.91 | 0.95 | 0.89 | 0.77 | 30 | 22 | 35 | 59 | 1.3 | 8.3 | 23 | |
43 | PMN | Peroxymethacroyl nitrate | Tend | 0.86 | 0.92 | 0.89 | 0.86 | 38 | 36 | 46 | 47 | 19 | 33 | 32 | |
44 | PPN | Peroxypropionyl nitrate | Tend | 0.92 | 0.95 | 0.91 | 0.36 | 29 | 24 | 32 | 610 | 1.9 | 10 | 700 | |
45 | PROPNN | Propanone nitrate | Tend | 0.89 | 0.99 | 0.97 | 0.97 | 33 | 11 | 17 | 31 | 0.05 | 9.8 | ||
46 | PRPE | alkenes | Tend | 0.96 | 0.99 | 0.95 | 0.88 | 20 | 11 | 22 | 36 | ||||
47 | R4N2 | alkylnitrates | Tend | 0.88 | 0.94 | 0.94 | 0.84 | 35 | 26 | 27 | 90 | 2.4 | 7.4 | 60 | |
48 | RCHO | aldehydes | Tend | 0.85 | 0.95 | 0.89 | 0.0 | 39 | 23 | 35 | 4900 | 1.3 | 4.1 | 13 000 | |
49 | RIP | Peroxide from RIO2 | Tend | 0.97 | 0.95 | 0.94 | 0.95 | 17 | 24 | 27 | 23 | ||||
50 | SO2 | Sulfur dioxide | 0.99 | 1 | 1 | 1 | 12 | 0.49 | 1.3 | 2.9 | 8.6 | 0.53 | 0.79 | ||
51 | SO4 | Sulfate | Tend | 0.99 | 1 | 0.99 | 0.95 | 12 | 6.4 | 9.3 | 23 | 0.03 | 0.34 | 2.3 | |
52 | NOx | Tend | 0.96 | 0.98 | 0.98 | 0.95 | 21 | 14 | 16 | 22 | 0.28 | 20 | 28 | 26 |
The importance of different input variables (features) for making a prediction of tendency is shown in Fig. a. The importance metric is the fraction of decisions in the forest that are made using a particular feature, with the variability indicating the standard deviation of that value between the trees. Consistent with our understanding of atmospheric chemistry, features such as NO, formaldehyde (), the cosine of the solar zenith angle (“SUNCOS”), bromine species and nitrogen reservoirs all appear within the top 20. From a chemical perspective, these features make sense given the global sources and sinks of in the lower to middle troposphere.
Figure 3
Characteristics of random forest trained to predict tendencies of due to chemistry. (a) Importance of input variables (features) for random forests trained to predict tendency of ozone due to chemistry. Shown are the 20 most important features for the entire random forest, as averaged over all 30 decision trees. The black bars indicate the standard deviation for each feature across the 30 decision trees. The arrows indicate photolytic conversion (i.e. N photolyses to plus O). (b) Validation of random forest prediction skill for ozone: comparison of ozone tendency validation data ( axis) vs. predicted values ( axis). Number of validation points (), correlation coefficient (), normalized root mean square error (NRMSE) and normalized mean bias (NMB) are given in the inset. (c) Same validation but with tendency added to the concentration before integration.
[Figure omitted. See PDF]
For ozone prediction, 6 out of the 20 most important input features are related to photolysis. Most of the photolysis rates are highly correlated, and the individual decision trees use different photolysis rates for decision making. This results in very large standard deviations for the photolysis input features across the 30 decision trees, as indicated by the black bars in Fig. a.
Note that the concentration of is not among the 20 most important input features for the prediction of tendency. If, instead, the random forest model is trained to predict the concentration of , the initial concentration dominates the input feature importance, explaining more than 99 % of the prediction. However, when predicting the ozone tendency, the random forest algorithm is more sensitive to availability of , VOCs, photolysis, etc., rather than the initial concentration of . For regions producing ozone (dominated by the reaction) the concentration is not the primary source of variability. Similarly, for regions loosing ozone the dominant source of variability is the variability in photolysis rates (multiple orders of magnitude) rather than the variability in concentration (less than an order of magnitude).
Fig. b shows the performance of the tendency predictor against the validation data. The predictor is not perfect, with an of 0.95 and a normalized root mean square error (NRMSE) of 23 %, but it is essentially unbiased with a normalized mean bias (NMB) of % (descriptions of the metrics can be found in Sect. ). However, as shown in Fig. c, the model becomes almost perfect when the tendency is added to the initial concentration – which is the operation to be performed by the chemistry model.
2.7Prediction of
For NO and we find that the random forest has difficulties predicting the species concentrations independently of each other. This can result in unrealistically large changes in total (). Given the central role of for tropospheric chemistry, a quick deterioration of model performance occurs (see Sect. ). For these species we thus adopt a different methodology: instead of making predictions for the species individually, we predict the tendency for a family comprising their sum (NO ) and then predict the ratio of NO to . is then calculated by subtracting NO from . Thus, the overall number of forests that needs to be calculated does not change. This has the advantage of treating as a long-lived family “species” and includes a basic conservation law, but it allows the NO and concentration to still vary rapidly.
Figure shows the feature importance and the comparison with the validation data for the prediction of the family tendency. The features make chemical sense, with and NO but also acetaldehyde (a tracer of PAN chemistry) and H, a short-lived nitrogen species, playing important roles. The importance of may reflect heterogeneous chemistry, with being a proxy for available aerosol surface area (note that we do not provide any aerosol information to the RFR). As shown in Fig. b, the predictor gives the “true” tendencies from the validation data with an of 0.96, NRMSE of 21 % and NMB of 0.28 %. While the NRMSE is relatively high, we find that the ability of the model to produce an essentially unbiased prediction is more critical for the long-term stability of the model. As for , the skill scores become almost perfect when adding the tendency perturbations to the concentration before integration (Fig. c).
Figure 4As Fig. but for (NO ).
[Figure omitted. See PDF]
Figure shows the feature importance and performance of the predictor for the ratio of NO to . Again the features make chemical sense with the top three features (photolysis, temperature and ) being those necessary to calculate the NO-to- ratio from the well known Leighton relationship . The performance of the NO-to- ratio predictor is very good, and the prediction is also unbiased.
Figure 5Characteristics of random forest trained to predict the ratio after chemistry. (a) 20 most important features for the random forest, as averaged over all 30 decision trees. The black bars indicate the standard deviation of the feature importance; (b) Comparison of predicted ratios ( axis) vs. true ratios ( axis) for the validation data (not used for training). Number of validation points (), correlation coefficient (), normalized root mean square error (NRMSE) and normalized mean bias (NMB) are given in the inset.
[Figure omitted. See PDF]
2.8 Evaluation metricsWe now move to a systematic evaluation of the performance of the RFR models, both against the validation data and when implemented back into the GEOS-Chem model. We use three standard statistical metrics for this comparison. For each species , we compute the Pearson correlation coefficent (),
3 the root mean square error normalized by the standard deviation (NRMSE), 4 and the normalized mean bias (NMB): 5 where denotes the concentration predicted by the RFR model, is the concentration calculated by GEOS-Chem, and N is the total number of grid cells.
2.9 Performance against the validation dataTen percent of the training data was withheld to form a validation data set. Columns “V” in Table provide an evaluation of each predictor against the validation data for the three metrics discussed in Sect. . For most species the RFR predictors do a good job of prediction: values are greater than 0.90 for 35 of the 51 species, NRMSEs are below 20 % for 21 species and NMBs are below 1 % for 29 species, respectively. Those species which do less well are typically those that are shorter lived, such as inorganic bromine species or some nitrogen species (, ). The performance of NO and after implementing the family and ratio methodology is consistent with other key species.
Although we do not have a perfect methodology for predicting some species, we believe that it does provide a useful approach to predicting the concentration of the transported species after the chemical integrator. We now test this methodology when the RFR predictors are implemented back into GEOS-Chem.
3 Long-term simulation using the random forest model
To test the practical prediction skill of the RFR models, we run four simulations of GEOS-5 with GEOS-Chem for the same month (July) but a different year (2014) than was used to train the RFR model. This simulation differs from the training simulation not only in meteorology but also in emissions, with local differences in , CO and VOC emissions of up to 20 %. As such, this experiment also evaluates the ability of the RFR model to capture the sensitivity of chemistry to changes in emissions.
The first simulation is a standard simulation where we use the standard GEOS-Chem integrator; the second is a simulation where we replace the chemical integrator with the RFR predictors described earlier (with the family treatment of ); the third uses the RFR predictors but directly predicts the NO and concentrations instead of ; the fourth has no tropospheric chemistry and the model just transports, emits and deposits species. In all simulations the stratospheric chemistry uses a linearized chemistry scheme . This buffers the impact of the RFR emulator over the long-term since all simulations use the same relaxation scheme in the stratosphere. For the time frame of 1 month considered here, we consider this impact to be negligible in the lowest 25 model levels.
We evaluate the performance of the second, third and fourth model configuration against the first. We first focus on the statistical evaluation of the best RFR model configuration (second model configuration) for all species and then turn our attention to the specific performance of surface and , two critical air pollutants.
3.1 Statistics
Table and summarizes the prediction skill of the random forest regression model (using the family method) for all 51 species plus . We sample the whole tropospheric domain at three time steps during the 2014 test simulation: after 1 simulation day (“D1”), after 5 simulation days (“D5”) and after 30 simulation days (“D30”). For each time slice, we calculate a number of metrics (Sect. ) for the RFR model performance.
The model with the RFR predictors shows good skill (, root mean square error (RMSE) %, NMB %) for key long-lived species such as , CO, , and and for most VOCs, even after 30 days of integration. The NRMSEs can build up to relatively large numbers over the period of the simulation, with getting up to 67 % after 30 days, but the mean bias remains relatively low at 13 %. For the stability of the simulation, it is more important to have an overall unbiased estimation, as this prevents systematic buildups or drawdowns in concentrations that can eventually render the model unstable. For 36 of the 52 species (including ), the NMB remains below 30 % at all times. The model has more difficulties with shorter-lived species such as inorganic bromine species (e.g. atomic bromine, bromine nitrate) and nitrogen species such as and . These species show poor performance with values below 0.1 even after the first day.
The hourly evolution of the metrics for over a 30-day simulation is shown in Fig. . We show here the performance of the model with the family treatment of (solid line), with separate NO and (dashed line), and with no chemistry at all (dotted line). For all metrics, the random forest simulation predicting family treatment of performs better than a simulation predicting NO and independently and a simulation with no chemistry. We use the latter as a minimum threshold to compare the RFR methodology. The metrics of the RFR model decrease over the course of the first 15 simulation days (1440 integration steps) but stabilize with an of 0.8, an NRMSE of % and an NMB of less than %. The simulation with the chemistry switched off degrades rapidly, highlighting the comparative skill of the RFR model to predict ozone over the entire 30-day period. The simulation with NO and predicted independently of each other closely follows the family simulation during the first 2–3 days but quickly deteriorates afterwards, as the compounding effect of NO and prediction errors leads to an accelerated degradation of model performance.
Figure 6
Thirty-day evolution of (a), NRMSE (b) and NMB (c) for three different model simulations of run for July 2014 compared to the full GEOS-Chem simulation. The solid line represents the standard random forest (RF) simulation using the family prediction of . The dashed line uses RF predictors for NO and individually (this simulation becomes unstable after 23 days). The dotted line represents a simulation with no chemistry. The grey line in (c) indicates a 0 value.
[Figure omitted. See PDF]
Although there are some obvious issues associated with the RFR simulation, it is evident that for many applications, the model has sufficient fidelity to be useful. We now focus on the model's ability to simulate surface and , two important air pollutants.
3.2Surface concentrations of and
Figure compares concentration maps of surface at 00:00 UTC calculated by the full-chemistry model (upper row), the RFR model (middle row) and their ratio (bottom row) after 1, 5, 10 and 30 days of simulation. After 1 day there are only small differences between the full model and the RFR model. However, these differences grow over the period of the simulation as errors accumulate. By the time the model has been run for 10 days, the model has become significantly biased over clean background regions, in particular over the Pacific Ocean. The differences between the reference model and the RFR simulation grow more slowly after 10 days (see also Fig. ), resulting in the model differences between day 10 and day 30 being small relative to the difference between day 1 and day 10. It appears that the RFR model finds a new “chemical equilibrium” for surface on the timescale of a few days. This new equilibrium overestimates in clean background regions such as the tropical Pacific and underestimates in the Arctic.
Figure 7Concentration maps of surface mixing ratio after 1 simulation day (column 1), 5 simulation days (column 2), 10 simulation days (column 3) and 30 simulation days (column 4), as calculated by the full GEOS-Chem model (row 1) and the standard random forest (RF) model with the family treatment (row 2). Row 3 shows the percentage difference between the RF simulation and GEOS-Chem (GC).
[Figure omitted. See PDF]
Figure similarly compares concentration maps of surface . Reflecting the shorter lifetime of , the errors here grow more quickly compared to but level off after 5 days as a new chemical equilibrium is reached. The RFR model shows large differences compared to the GEOS-Chem model in regions where concentrations are low and remote from recent emission, with being highly overestimated in the tropics and underestimated at the poles. This pattern is highly consistent with the ones seen for , suggesting that the relative change in drives the change in , as would also be the case in a full-chemistry model.
Figure 8Concentration maps of surface (NO ) after 1 simulation day (column 1), 5 simulation days (column 2), 10 simulation days (column 3) and 30 simulation days (column 4), as calculated by the full GEOS-Chem model (row 1) and the standard random forest (RF) model with the family treatment (row 2). Row 3 shows the percentage difference between the RF simulation and GEOS-Chem (GC).
[Figure omitted. See PDF]
Figure 9Comparison of surface concentration of at four locations (New York, Delhi, London and Beijing) for the GEOS-Chem reference simulation (black), the RFR model with the N family treatment (red) and a simulation with no chemistry (blue).
[Figure omitted. See PDF]
Figures and show time series
of and mixing ratios at four polluted
locations (New York, Delhi, London and Beijing) as generated by the full-chemistry model (black line), the RFR model (red), and the model with no
chemistry (blue). The RFR model closely follows the full model at these
locations and captures the concentrations patterns with an accuracy of
10 %–20 %. Especially for it is hard to distinguish
the RFR model from the full model, whereas the simulation without any
chemistry shows a distinctly different pattern. These differences are
significantly less than one would expect from running two different chemistry
models for the same period
Comparison of surface concentration of nitrogen oxides ( NO ) at four locations (New York, Delhi, London and Beijing) for the GEOS-Chem reference simulation (black), the RFR model with the N family treatment (red) and a simulation with no chemistry (blue).
[Figure omitted. See PDF]
Although our analysis has not provided a complete analysis of the RFR model performance, we have shown that it is capable of providing a simulation of many key facets of the atmospheric chemistry system (, ) on the timescale of days to weeks. We now discuss future routes to improve the system and some applications.
4 DiscussionWe have shown that a machine learning algorithm, here random forest regression, can simulate the general features of the chemical integrator used to represent the chemistry scheme in an atmospheric chemistry model. This represents the first stage in producing a fully practical methodology. Here we discuss some of the issues we have found with our approach, potential solutions, some limitations and where we think a machine learning model could provide useful applications.
4.1 Speed, algorithms and hardware
The current RFR implementation takes about twice as long to solve the chemistry as the currently implemented integrator approach. While the evaluation of a single tree is fast (average execution time is on the Discover computer system), calculating them all for every forest and for every transported species () in series results in a total average execution time of , which is 85 % slower than the average execution time of using the standard model integrator.
We emphasize that this implementation is a proof of concept. Unlike for the chemical integrator, little work has been undertaken to optimize the algorithm parameters (e.g. optimizing the number of trees or the number of leaves per tree) or the Fortran90 implementation of the forests. For example, random forest have relatively large memory footprints that scale linearly with number of forests and trees. Efficient access of these data through optimal co-location of related information (e.g. grouping memory by branches) could dramatically reduce CPU register loading costs, as could moving from double precision to single precision or even integer maths. In the current implementation, we load all tree data onto every CPU separately without attempts of memory sharing. Thus we believe that different software structures, algorithms and memory management may allow significant increases in the speed achieved.
A fundamental attractiveness of the random forest algorithm is its almost perfect parallel nature, even among species within the same grid cell: the nodes of all trees (and across all forests) solely depend on the initial values of the input features and thus can be evaluated independently of one another (in contrast, the system of coupled ODEs solved by the chemical solver requires coupling between the species). This would readily allow for parallelization of the chemistry operator, which has up to this point not been possible. This may allow other hardware paradigms (e.g. graphical processing units) to be exploited in calculating the chemistry.
We have implemented the replacement for the chemical integrator using the
random forest regression algorithm. Our choice here was based on the
conceptual ease of the algorithm. However, other algorithms are capable of
fulfilling the same function. Neural networks have been used extensively in
many Earth system applications
4.2 Training data
We have trained the random forest regression models on a single month of data. For a more general system the models will need to be trained with a more temporally extensive data set. Models are, however, able to generate large volumes of data. A year's worth of training data over the full extent of the model's atmosphere would result in a potentially very large () training data set. Applying this methodology to spatial scales relevant to air quality applications (on the order of 10 km) will result in even larger data sets (). However, not all items from the training data are of equal value. Much of the atmosphere is made up of chemically similar air masses (e.g. central Pacific, remote free troposphere) which are highly represented in the training data but are not very variable. Most of the interest from an air quality perspective lies in small regions of intense chemistry. If a way can be found to reduce the complete training data set such that the sub-sample represents a statistical description of the full data, the amount of training data can be significantly reduced and thus the time needed to train the system.
The features being used to train the predictors could also be reconsidered. The current selection reflects an initial estimate of the appropriate features. It is evident that different and potentially better choices could be made. For example, we have included all photolysis rates, but these correlate very strongly and so a greatly reduced number of photolysis inputs (potentially from a principal components analysis) could achieve the same results but with a reduced number of features. Including other parameters such as the concentrations of the aerosol tracers may also improve the simulation.
4.3 Conservation laws and error checking
One of the fundamental laws of chemistry is the conservation of atoms. One interpretation of that has been applied here to the prediction of the change in together with predictions for . Since the concentration of changes much more slowly than the change in concentration of either NO or , this approach attempts to improve the prediction of these short-lived nitrogen species, which are difficult to predict. Our results show that this does indeed increase the stability of the system, and it represents a first step towards ensuring the conservation of atoms in machine-learning-based chemistry models. A larger nitrogen family (NO, ,N, , HONO, , etc.) might increase stability further, as could other chemical families such as , which showed significant errors both compared to the validation data and the evaluation of the chemistry model.
The solution space of a chemistry model is constrained by mass-balance requirements, and chemical concentrations tend to mean-revert to the equilibrium concentration implied by the chemical boundary conditions (emissions, deposition rates, sunlight intensity, etc.). A successful machine learning method should have the same qualities in order to prevent runaway errors that can arise from systematic model errors, e.g. if the model constantly over- or under-predicts certain species or if it violates the conservation of mass balance. Because each model prediction feeds into the next one, small errors compound and quickly lead to systematic model errors. Possible solutions for this involve prediction across multiple time steps, which have shown to yield more stable solutions for physical systems , or the use of additional constraints that measure the connectivity between chemical species, e.g. through the consideration of the stoichiometric coefficients of all involved reaction rates.
4.4 Possible implementations
The ability to represent the atmospheric chemistry as a set of individual machine learning models (one for each species) rather than as one simultaneous integration has numerous advantages. In locations where the impact of a (relatively short-lived) molecule is known to be insignificant (for example isoprene over the polar regions or dimethyl sulfide (DMS) over the deserts), the differential equation approach continues to solve the chemistry for all species. However, with this machine learning methodology, there would be no need to call the machine learning algorithm for a species with a concentration below a certain threshold or for certain chemical environments (e.g. nighttime): the chemistry could continue without updating the change in the concentration of these species. Thus it would be easy to implement a dynamical chemistry approach which uses a simple lookup table with predefined threshold rates to evaluate whether the concentration of a compound needs to be updated or not. If it did, the machine learning algorithm could be run; if it did not, the concentration would remain untouched and the evaluation of the random forest emulator is skipped (for this species). This approach could reduce the computational burden of atmospheric chemistry yet further.
The machine learning methodology could also be implemented to work seamlessly with the integrator. For example, the full numerical integrator can be used over regions of particular interest (populated areas for an air quality model or a research domain for a research model), while outside of these regions (over the ocean or in the free troposphere for an air quality model or outside of the research domain for a research model) the machine learning could be used. This would provide a “best of both worlds” approach which provides higher chemical accuracy where necessary and faster but lower-accuracy solutions where appropriate.
Our methodology uses the output from the atmospheric chemistry model to generate the training data set.
Another approach would be to use a series of box model simulations using
initial conditions covering the appropriate chemical concentration ranges to
generate the training data. This could allow the chemical complexity that is
known to exist
4.5 Limitations
This is the first step in constructing a new methodology for the representation of chemistry in atmospheric models. There are a number of limitations that should be explored in future work. Firstly, the machine learning methodology can only be applied within the range of the data used for the training. Applying the algorithm outside of this range would likely lead to inaccurate results. For example, the model here has been trained for the present-day environment. Although the training data set has seen a range of atmospheric conditions, it has only seen a limited range of methane () concentrations or temperatures. Thus applying the model to the pre-industrial period or the future, where the concentration and temperature may be significantly different than in the present day, would likely result in errors. Similarly, exploring scenarios where the emissions into the atmosphere change significantly (for example large changes in emissions vs. VOC emissions) again will likely ask the model to make predictions outside of the range of training data. A simple check would be to evaluate the (surface) ratios observed in the new model and compare them against the ranges used in the training: if the ratios in the updated model are significantly different from the training data, the RFR model likely needs to be retrained.
The same limitations also apply to model resolution: due to the non-linear nature of chemistry, the numerical solution of chemical kinetics is resolution-dependent, and a machine learning algorithm may not capture this. Thus, care should be taken when applying these approaches outside of the range of the training data.
4.6 Potential uses
Despite the limitations discussed here, there are a number of potential, exciting applications for this kind of methodologies.
The meteorological community has successfully exploited ensembles of
predictions to explore uncertainties in weather forecasting
The data assimilation methodology itself could benefit from a machine learning representation of atmospheric chemistry. Data assimilation is often computationally intense, requiring the calculation of the adjoint of the model or running large numbers of ensemble simulations . The ability to run these calculations faster would offer significant advantages.
Another potential application area for machine-learning-based chemistry emulators are chemistry–climate simulations. Unlike air quality applications, which focus on small-scale variations in air pollutants over comparatively short periods of time of days to weeks, chemistry–climate studies require long simulation windows of the order of decades. Because of this, machine learning models used for these applications need to be optimized such that they accurately reproduce the (long-term) response of selected species – e.g. ozone and OH – to key drivers such as temperature, photolysis rates and . The method presented here could be optimized for such an application by simplifying the problem set, with the model trained to reproduce daily or even monthly averaged species concentrations.
5 Conclusions
We have shown that a suitably trained machine-learning-based approach can replace the integration step within an atmospheric chemistry model run on the timescale of days to weeks. The application of some chemical intuition, by which we separate long-lived from short-lived species, and a basic application of the conservation of atoms to the family leads to significant improvements in model performance. The machine learning implementation is slower than the current model, but very little optimization and software development has thus far been applied to the code.
Methodologies similar to this may offer the potential to accelerate the calculation of chemistry for some atmospheric chemistry applications such as ensembles of air quality forecasts and data assimilation. Future work on both the algorithm and the methodology is necessary to produce a useful solution, but this first step shows promise.
Code and data availability
The GEOS-Chem model output used for training and
validation is available in netCDF format via the data repository of the
University York
(10.15124/e291fdb4-f035-419c-948e-c8c7c978f8d6, ). A copy of the random forest training code (written in
Python) and the model emulator (Fortran) is available upon request from Christoph Keller.
GEOS-Chem (
Author contributions
MJE and CAK came up with the concept and together wrote the paper. MJE developed the algorithm and CAK implemented it into the GEOS model. Both authors devised the experiments.
Competing interests
The authors declare that they have no conflict of interest.
Acknowledgements
Christoph A. Keller acknowledges support by the NASA Modeling, Analysis and
Prediction (MAP) Program. Resources supporting the model simulations were
provided by the NASA Center for Climate Simulation at the Goddard Space
Flight Center (
Review statement
This paper was edited by David Topping and reviewed by Johannes Flemming.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2019. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Atmospheric chemistry models are a central tool to study the impact of chemical constituents on the environment, vegetation and human health. These models are numerically intense, and previous attempts to reduce the numerical cost of chemistry solvers have not delivered transformative change.
We show here the potential of a machine learning (in this case random forest regression) replacement for the gas-phase chemistry in atmospheric chemistry transport models. Our training data consist of 1 month (July 2013) of output of chemical conditions together with the model physical state, produced from the GEOS-Chem chemistry model v10. From this data set we train random forest regression models to predict the concentration of each transported species after the integrator, based on the physical and chemical conditions before the integrator. The choice of prediction type has a strong impact on the skill of the regression model. We find best results from predicting the change in concentration for long-lived species and the absolute concentration for short-lived species. We also find improvements from a simple implementation of chemical families (
We then implement the trained random forest predictors back into GEOS-Chem to replace the numerical integrator. The machine-learning-driven GEOS-Chem model compares well to the standard simulation. For ozone (
This proof-of-concept implementation takes 1.8 times more time than the direct integration of the differential equations, but optimization and software engineering should allow substantial increases in speed. We discuss potential improvements in the implementation, some of its advantages from both a software and hardware perspective, its limitations, and its applicability to operational air quality activities.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details

1 NASA Global Modeling and Assimilation Office, Goddard Space Flight Center, Greenbelt, MD, USA; Universities Space Research Association, Columbia, MD, USA
2 Wolfson Atmospheric Chemistry Laboratories, Department of Chemistry, University of York, York, YO10 5DD, UK; National Centre for Atmospheric Sciences, University of York, York, YO10 5DD, UK