Learning Coagulation Processes With Combinatorial

Full text

Turn on search term navigation

Introduction

The solving of population balance equations is key to many disciplines in the geosciences where collections of particles or droplets suspended in a fluid evolve over time. Frequently, the aggregation of particles needs to be considered, called coagulation in aerosol science (Seigneur et al., 1986), collision-coalescence in cloud physics (Gillespie, 1972), or flocculation in marine sciences (Xu et al., 2008). For this paper, we will mainly focus on applications in aerosol science and will use the terminology accordingly.

In the large-number limit, neglecting fluctuations and correlations between particles of different sizes (Gillespie, 1972), the population balance equation of a multi-species aerosol undergoing coagulation is (von Smoluchowski, 1916a, 1916b) [Image Omitted. See PDF]where $\vec{\mu }$ is the vector for the particles' constituent masses ( $\vec{\mu }\in {\mathbb{R}}^{A}$ , with A being the number of chemical species), $n\left(\vec{\mu },t\right)$ is the particle number distribution, and $K\left(\vec{{\mu }_{1}},\vec{{\mu }_{2}}\right)$ is the coagulation rate between particles with constituent masses $\vec{{\mu }_{1}}$ and $\vec{{\mu }_{2}}$ .

This reduces to the well-known one-dimensional version if the system only consists of one species or if only the average composition within particle size ranges needs to be tracked: [Image Omitted. See PDF]where μ is now the total particle mass. Equation 2 has been traditionally solved using the method of moments (Hulburt & Katz, 1964; McGraw, 1997), modal modeling techniques (Whitby et al., 1991), finite volume methods (Jacobson et al., 1994) or stochastic particle-based methods (Riemer et al., 2009).

The goal of our study is to develop a machine learning model that can represent the size distribution evolution of an aerosol that is undergoing coagulation. This is motivated by two particular challenges. First, solving Equation 1 directly is computationally very expensive as it requires particle-resolved methods (Riemer et al., 2019). It is therefore desirable to develop a coarse-grained method that captures the evolution of the particle population without the use of high-detail particle-resolved simulations. Second, both Equations 1 and 2 require knowledge of the coagulation kernel. Unless the aggregation process can be captured by one of the few well-known coagulation kernels (e.g., that due to Brownian motion) the kernel may not be accurately known (e.g., due to phoretic effects, hydrodynamic interactions between particles, electric charges, impacts of turbulent flow), and this introduces considerable uncertainties in the prediction. We propose here a proof-of-concept machine-learning framework that can handle both challenges. This requires the creation of a novel neural network architecture and several related algorithmic advances.

To accomplish our goals, the architecture of the neural net is important. Many recent uses of neural networks in the geosciences focus on embedding invariances, structure, and prior knowledge into the network construction (Reichstein et al., 2019). One such example was demonstrated by Chen et al. (2018) who incorporated ordinary differential equation constraints into the neural network architecture, which parameterizes the derivative of hidden states. Another notable use of structure in neural networks was the development of the ConvLSTM by Shi et al. (2015), a neural network architecture that can take advantage of spatial invariance as well as temporal information to predict precipitation using grid altimeter data.

While certain subgrid processes in atmospheric sciences have been successfully represented with fully connected (FC) neural networks (Brenowitz & Bretherton, 2018; Rasp et al., 2018), Seifert and Rasp (2020) showed that a naive, FC neural network will not work to simulate warm rain processes, which is a closely related problem to the application in our study. Instead, we introduce here the idea of a “Combinatorial Neural Network” (CombNN). Unlike the traditional convolutional neural network, which regulates the learning manifold by using spatial translations of a shared network structure, our CombNN operates on a combinatorial space which uses a shared structure over all combinations of the input elements (in our case, combinations of particle size bins). Specifically, our CombNN differs in two aspects from a fully-connected network that operates on all bins: (a) Recognizing that the architecture should be “sparse” in the sense that the interaction between bins i and j should be symmetric and only impact the output for bins j and j + 1 (details in Section 2.4.1). (b) Re-using the same network for all bin pairs, albeit with the input of the bin “locations” into the network. A key finding of our work is that successfully minimizing the single-step-forecast mean squared error as is common for a traditional neural network model does not guarantee that a successful forecast in the long run, nor is it even necessary. Instead, the key for successfully predicting the size distribution of a coagulating aerosol is to introduce a combinatorial structural constraint.

An important question in this context is the data generation process. Throughout this paper we will use two types of data that were both generated with the stochastic particle-resolved aerosol model PartMC (Riemer et al., 2009): The first type is the statistics of coagulating particles at any given point in time. This kind of data needs to be generated by high-detail models (unless one can experimentally quantify the statistics of coagulation events) and is of use for coarse-graining. It is not of use to learn new kernels from measured data. For reasons that will be explained in more detail in Section 2.1, we call this data “detailed-bin data.” The second type is size-distribution data, that is, observations of the states of the particle population at different points in time. We call this kind of data “state-only data.” It could come from experimental measurements and may therefore be useful to actually learn a kernel.

The paper is organized as follows. Section 2 introduces the CombNN architecture, the training data set, the model construction, and the training process. Results are presented in Section 3, and Section 4 provides a summary and conclusions.

Methods

We introduce a novel CombNN architecture for the coarse-grained simulation of a multi-component aerosol. The “Notation” section at the end of the paper summarizes the notation used throughout. The organization of the rest of this section is as follows: We start by explaining our data set in Section 2.1 and give a high-level overview of our proposed CombNN in Section 2.2. Sections 2.3 and 2.4 cover the CombNN (Detailed-Bin) and CombNN (State-Only) models, respectively. Finally error metrics are detailed in Section 2.5 and comparison models are described in Section 2.6.

Training Data

We generated our training data set via high-fidelity particle-resolved simulations with the PartMC model (Riemer et al., 2009), which represents the aerosol by discrete computational particles. With varying initial conditions, we simulated scenarios that only included coagulation and emission. The latter was included to replenish the particle number concentration and thereby increase the probability of capturing coagulation interactions at each time step. Other processes such as new particle formation and gas-particle partitioning were not included. We next projected our data into 20 logarithmically-spaced size bins between 1 nm and 50 μm. For this paper, we limited our scenarios to only a single particle type with a density of 1,800 kgm⁻³, representative of ammonium sulfate. For the main results of our algorithm we used simulations with a time step of 6 sec, although different time steps were also trained and analyzed in later sections.

We used a training data set ${\mathcal{D}}_{\mathrm{T}}$ which was comprised of 2,000 simulation runs or time series $(\mathcal{T})$ , and a validation data set ${\mathcal{D}}_{\mathrm{V}}$ which was comprised of 100 simulation runs. To construct the set of scenarios, we varied the input parameters for two initial log-normal modes. For each initial mode, the geometric median diameter ranged from 1 nm to 50 μm, the geometric standard deviation ranged from 1.05 to 1.15, and the initial number concentration ranged from 1 × 10⁵ to 1 × 10¹¹ m⁻³. Particle emissions were used to replenish the depleted particle number concentration due to coagulation. Emission fluxes were uniform within the size range that contained particles from the initial condition based on a cutoff of three geometric standard deviations (i.e., ±3σ_g). The total emission flux was scaled based on the initial number concentration of each scenario. The temperature was held constant at 290 K. The simulation time was 10 min in all cases as running for longer only yielded redundant data.

Using the full particle-resolved PartMC simulation output, we generated two types of lower-dimensional (coarse-grained) projected data which will be referred from here on as detailed-bin data and state-only data. Detailed-bin data consists of individual coagulation statistics/events between different size bins for each time step, while state-only data contains only the coarse-grained state data (the binned size distribution) at each time step. Based on these two types of data, we separately trained two different CombNNs.

Concretely, for each time series $\mathcal{T}$ , at time step t, we evolve the system in PartMC and project it into our coarse-grained bin representation yielding state vectors (binned size distribution) n_t, n_t+1, as well as detailed coagulation data X_t. Doing this for each time step yields our detailed-bin data consisting of $\left({X}_{1},{X}_{2},\dots ,{X}_{T-1}\right)\in {X}^{\mathcal{T}}$ and our state-only data consisting of $\left({n}_{1},{n}_{t},\dots ,{n}_{T}\right)\in \mathcal{T}$ where T is the length of time series $\mathcal{T}$ (environment variables are omitted here for conciseness). The detailed-bin data entries, X_t, consist of the per-bin-pair coagulation number concentration ${\Delta}{n}_{t}\in {\mathbb{R}}^{{N}_{\text{bin}}^{3}}$ together with the per-bin number concentration ${n}_{t}\in {\mathbb{R}}^{{N}_{\text{bin}}}$ and the environmental state e_t. In all examples in this paper, we take the environment state to be the temperature, so N_env = 1 and e_t = T_t, where T_t is the temperature at time t. For bins (i, j, k) the entry ${\Delta}{n}_{t}^{(i,j,k)}$ is the number concentration of coagulated particles that are added to bin k from coagulations between particles from bins i and j in time step t. As we will discuss further in Section 2.4.1 we will in practice assume that k = j or k = j + 1 for i ≤ j. We thus have that ${X}_{t}\in {\mathbb{R}}^{{N}_{\text{bin}}^{2}\times 2}\times {\mathbb{R}}^{{N}_{\text{bin}}}\times {\mathbb{R}}^{{N}_{\text{env}}}$ where X_t = (Δn_t, n_t, e_t). The sequence of detailed-bin data entries gives the detailed bin data $\left({X}_{1},{X}_{2},\dots ,{X}_{T-1}\right)\in {X}^{\mathcal{T}}$ . Finally note that for any X_t one can reconstruct the coagulation from n_t to n_t+1 so the detailed-bin data contains all the information of the state-only data. For a schematic representation of this process refer to Figure 1.

$View Image - Figure 1. Schematic of the two different types of training data and their relation to raw PartMC model output. We begin by evolving a system using the particle-resolved model PartMC (Riemer et al., 2009). Then at each time step we project it into a coarse-grained state consisting of the number distribution, nt. Then using the next time step we also obtain the next projected state, nt+1, along with the detailed-bin coagulation data in the projection space, Xt. Then our detailed-bin data consists of X1,X2,…,XT−1∈XT $\left({X}_{1},{X}_{2},\dots ,{X}_{T-1}\right)\in {X}^{\mathcal{T}}$ while our state-only data consists of n1,n2,…,nT∈T $\left({n}_{1},{n}_{2},\dots ,{n}_{T}\right)\in \mathcal{T}$, where T is the length of time series T $\mathcal{T}$ (environment variables are omitted here for conciseness). Note that because with only the coagulation data we have an implied evolution from nt to nt+1, all the information contained in the state-only data is also present in the detailed-bin data.$

Figure 1. Schematic of the two different types of training data and their relation to raw PartMC model output. We begin by evolving a system using the particle-resolved model PartMC (Riemer et al., 2009). Then at each time step we project it into a coarse-grained state consisting of the number distribution, nt. Then using the next time step we also obtain the next projected state, nt+1, along with the detailed-bin coagulation data in the projection space, Xt. Then our detailed-bin data consists of X1,X2,…,XT−1∈XT $\left({X}_{1},{X}_{2},\dots ,{X}_{T-1}\right)\in {X}^{\mathcal{T}}$ while our state-only data consists of n1,n2,…,nT∈T $\left({n}_{1},{n}_{2},\dots ,{n}_{T}\right)\in \mathcal{T}$, where T is the length of time series T $\mathcal{T}$ (environment variables are omitted here for conciseness). Note that because with only the coagulation data we have an implied evolution from nt to nt+1, all the information contained in the state-only data is also present in the detailed-bin data.

Combinatorial Neural Network (CombNN)

Our goal is to predict the forward evolution of the interaction of particles by projecting them onto a fixed binned representation and predicting the coarse-grained interactions between these bins or groups. In this sense, we use the same state information as a binned aerosol model (Jacobson et al., 1994). Our model formulation shares weights along each unique combination of particle groups, allowing for the use of structural priors. This method also provides a strong regulation effect as well as significantly reduces the parameter space, reducing both training time and memory usage.

For this paper, we focused on modeling a particle population containing only a single species as a proof of concept, with a projection onto bins along a single mass axis. However, the following equations can be readily generalized to handle particle populations that contain multiple species, using either one- or multiple-dimensional bins. Formally, at each time t we project the particle population onto N_bin size bins, giving a size distribution vector ${n}_{t}\in {\mathbb{R}}^{{N}_{\text{bin}}}$ . We also consider a vector of N_env environmental variables at each time, so we can write our evolution model as [Image Omitted. See PDF]where f_θ(n_t, e_t) is the prediction of the size distribution n_t+1 at the next time step, θ is the vector of model (neural network) parameters, and e_t is the environment state. Denote the ith element of n_t as ${n}_{t}^{(i)}$ , which is the number concentration in bin i at time t. The collection of state vectors gives the state-only data $\left({n}_{1},{n}_{2},\dots ,{n}_{T}\right)\in \mathcal{T}$ .

To describe the evolution, f_θ, we first recognize that the size distribution n_t+1 can be represented by a sum of the previous size distribution n_t and a function of many bin-bin combinatorial interactions, the set of which we denote as $X=\left\{(i,j)\in {\mathbb{Z}}_{{N}_{\text{bin}}}^{2}\vert i\le j\right\}\times {\mathbb{R}}^{2}\times {\mathbb{R}}^{{N}_{\text{env}}}$ . An element x ∈ X describes all the information in a single bin-bin coagulation interaction. As an example, take x = (1, 3, 1,000, 10,000, 275). This describes an interaction between bin 1 and bin 3, which have number concentrations of 1,000 m⁻³ and 10,000 m⁻³, respectively, with an environmental temperature of 275 K. The learned function, f_θ, is parameterized by a neural network with two outputs: ${f}_{\text{ker},\theta }(x)=\left(\right.{K}_{\theta },{\lambda }_{\theta }$ ). See Sections 2.3 and 2.4 for further details.

CombNN (Detailed-Bin)

Using the detailed-bin data as described in Section 2.1, we can directly learn the coagulation kernel through what we will call the CombNN (Detailed-Bin) model.

Model Construction

Given a state n_t from time series $\mathcal{T}$ we want to be able to extract all possible interactions at time t. To do this we define the set of possible interactions by ${X}_{t,\text{poss}}^{\mathcal{T}}=\left\{\left(i,j,{n}_{t,i},{n}_{t,j},{n}_{t,\text{env}}\right)\in \left({\mathbb{Z}}_{{N}_{\text{bin}}}^{2}\times {\mathbb{R}}^{2}\times {\mathbb{R}}^{{N}_{\text{env}}}\right)\vert i\le j,{n}_{t,i}\ne 0,{n}_{t,j}\ne 0\right\}$ . Then we define a neural network f_ker,θ(x) = (K_θ, ⋅) which takes as input some $x\in {X}_{t,\text{poss}}^{\mathcal{T}}$ and outputs a predicted log-kernel value K_θ. We devise this network such that $1{0}^{{K}_{\theta }}$ represents the kernel. For a schematic of this process, refer to Figure 2.

$View Image - Figure 2. Schematic of our combinatorial neural network (Detailed-Bin). As seen in the left panel we start with the decomposition of nt into the set Xt,possT ${X}_{t,\text{poss}}^{\mathcal{T}}$. Then every x∈Xt,possT $x\in {X}_{t,\text{poss}}^{\mathcal{T}}$ is passed into our neural network which yields the log-predicted kernel Kθ.$

Figure 2. Schematic of our combinatorial neural network (Detailed-Bin). As seen in the left panel we start with the decomposition of nt into the set Xt,possT ${X}_{t,\text{poss}}^{\mathcal{T}}$. Then every x∈Xt,possT $x\in {X}_{t,\text{poss}}^{\mathcal{T}}$ is passed into our neural network which yields the log-predicted kernel Kθ.

For the neural network f_ker,θ we use approximately 4,000 parameters, with the sequence of layers: Linear(3, 32), 1D Batch Normalization, Tanh, Linear(32, 32), 1D Batch Normalization, Tanh, Linear(32, 32), 1D Batch Normalization, Tanh, and Linear(32, 2). While we ran a small ablation study of our neural network to determine the size of our model (See Section 3), we did not extensively explore the space of possible neural network constructions. Such a task can be performed through Neural Architecture Search (Zoph & Le, 2016), which would be necessary for practical use in large-scale applications. Furthermore, many modern methods and algorithms for optimizing inference speed (Neill, 2020) could additionally be applied.

Training

Having access to the individual detailed-bin events per time step allows the direct (noisy) estimation of the kernel corresponding to each coagulation event: [Image Omitted. See PDF]where x = (i, j, n_i, n_j, e) represents coagulation between bins i and j containing number concentrations n_i and n_j with environment state e, and Δn_k(x) is the observed change in bin k, for k = j or k = j + 1, after interaction $x\in {X}^{\mathcal{D}}$ where ${X}^{\mathcal{D}}$ is the set of all detailed-bin interactions that occur in data set $\mathcal{D}$ . In terms of the notation from Section 2.2, ${\Delta}{n}_{k}={\Delta}{n}_{t}^{(i,j,k)}$ from the detailed-bin observation data, X_t. Note that since n_i and n_j are both positive (as a coagulation cannot happen if the number concentration of particles is zero), ${\hat{K}}_{x}$ is always well defined.

From here, we train the neural net to approximate the observed kernel ${\hat{K}}_{x}$ . Specifically, we can define a direct supervision of our kernel function with the detailed-bin loss function: [Image Omitted. See PDF]where for x = (i, j, ⋅, ⋅, e), K_θ = f_ker,θ(i,j,e)⁽¹⁾ is the first index of the output of the neural network. However because of the range of magnitudes in our data, this loss function was extremely unstable in our numerical experiments and unable to converge in practice. Thus, we introduced an adaptive normalization scheme that will appropriately scale our model output to regularize the gradients. This is given in Algorithm 1.

1 AlgorithmAdaptive Normalization Procedure for Detailed Bin Data

[IMAGE OMITTED. SEE PDF.]

This algorithm introduces a new hyperparameter β that controls the smoothness of the normalization kernel. In practice we found that a very high value of 0.99 works well. Training the loss function (Equation 5) without this scheme is highly unstable and weights the loss function to focus on large values of the kernel without paying attention to the smaller values. Furthermore, this problem can not be solved with a log-scale because of the stochastic nature of the data, as analyzed in more depth in Appendix A. Algorithm 1 uses an adaptive linear scaling so that the converged kernel will approximate the average interaction rate throughout the data set.

Inference

Since the CombNN (Detailed-Bin) directly learns the average kernel (or data kernel), it can be used for forward inference with any conventional sectional coagulation method. For the current paper we use the semi-implicit sectional aerosol model of Jacobson et al. (1994) because of its efficiency and accuracy. We run this sectional model using our learned kernel $1{0}^{{K}_{\theta }}$ as input, in place of the analytic brownian kernel.

CombNN (State-Only)

Next we introduce the CombNN (State-Only) model, which, as the name suggests, is trained using only the state-only data described in Section 2.1. Unlike the CombNN (Detailed-Bin) model, which uses a conventional sectional model for inference, CombNN (State-Only) is an end-to-end method that directly performs inference to obtain the next state of the system, as well as allowing extraction of the learned kernel values.

We start by defining combinatorial models ${f}_{\text{gain},\theta }:X\times {\mathbb{Z}}_{{N}_{\text{bin}}}\to \mathbb{R}$ and ${f}_{\text{loss},\theta }:X\times {\mathbb{Z}}_{{N}_{\text{bin}}}\to \mathbb{R}$ . Here $X=\left\{(i,j)\in {\mathbb{Z}}_{{N}_{\text{bin}}}^{2}\vert i\le j\right\}\times {\mathbb{R}}^{2}\times {\mathbb{R}}^{{N}_{\text{env}}}$ is all the information about a single bin-bin combinatorial interaction. The integer vector ${\mathbb{Z}}_{{N}_{\text{bin}}}^{2}$ component of X gives the indexes of the two bins that are interacting, the vector ${\mathbb{R}}^{2}$ component of X gives the information about each bin, and the real vector ${\mathbb{R}}^{{N}_{\text{env}}}$ component of X represents the environmental variables. The combinatorial models f_gain,θ and f_loss,θ take an interaction event in X and a target index, $i\in {\mathbb{Z}}_{{N}_{\text{bin}}}$ , and output the resulting gain and loss in bin i, respectively. As a concrete example with N_env = 1, f_gain,θ((1, 2, 3.2 × 10³ cm⁻³, 1.0 × 10⁴ cm⁻³), 295 K, 3) would output the gain in bin 3 due to the resulting interaction between bins 1 and 2, which have contain number concentrations of 3.2 × 10³ cm⁻³ and 1.0 × 10⁴ cm⁻³, respectively, when the environment has a temperature of 295 K.

For the sake of conciseness let ${f}_{\text{gain},\theta }(x)\in {\mathbb{R}}^{{N}_{\text{bin}}}$ be a vector such that each element f_gain,θ(x)⁽ⁱ⁾ = f_gain,θ(x, i) and similarly for f_loss,θ. Our CombNN (State-Only) model expresses the evolution by adding the predicted gain of number concentration in each bin, and subtracting the corresponding predicted loss vector. This can be represented fully in the following equation: [Image Omitted. See PDF]where ${X}_{t,\text{poss}}^{\mathcal{T}}=\left\{\left(i,j,{n}_{t,i},{n}_{t,j},{n}_{t,\text{env}}\right)\in \left({\mathbb{Z}}_{{N}_{\text{bin}}}^{2}\times {\mathbb{R}}^{2}\times {\mathbb{R}}^{{N}_{\text{env}}}\right)\vert i\le j,{n}_{t,i}\ne 0,{n}_{t,j}\ne 0\right\}$ is the set of all possible interactions at a time step. Note that, as seen in Figure 3, the input vector n_t is first decomposed into all possible interactions at time step t in time series $\mathcal{T}$ , ${X}_{t,\text{poss}}^{\mathcal{T}}$ , then it is input into each of the gain and loss networks, and finally summed. This final difference is then summed with the previous step n_t to attain n_t+1.

$View Image - Figure 3. Schematic of our combinatorial neural network (CombNN) (State-Only). Just like the CombNN (Detailed-Bin) model as shown in Figure 2, we begin by decomposing our nt into Xt,possT ${X}_{t,\text{poss}}^{\mathcal{T}}$. Then fgain,θ and floss,θ as defined in Section 2.4 are neural networks which are constrained by Equation 7. Note that while models fgain,θ and floss,θ would usually also input an environmental variable as apart of x ∈ X, we omitted it here for the sake of conciseness.$

Figure 3. Schematic of our combinatorial neural network (CombNN) (State-Only). Just like the CombNN (Detailed-Bin) model as shown in Figure 2, we begin by decomposing our nt into Xt,possT ${X}_{t,\text{poss}}^{\mathcal{T}}$. Then fgain,θ and floss,θ as defined in Section 2.4 are neural networks which are constrained by Equation 7. Note that while models fgain,θ and floss,θ would usually also input an environmental variable as apart of x ∈ X, we omitted it here for the sake of conciseness.

Every coagulation event consists of losing the two coagulating particles and gaining the single coagulated particle. Thus the total number of particles lost must be twice the number gained. We will enforce the following constraint on the model by requiring: [Image Omitted. See PDF]where the exchange ratio during interactions is 2 (i.e., for each coagulation event, two particles combine into exactly one).

Model Construction

For the problem of aerosol coagulation we formulated our model for the problem of predicting the evolution of a binned number distribution. This classical problem provides a suitable baseline measure of our algorithm and its efficacy. As outlined in Section 2.1, we used N_bin = 20 bins with the state being the number concentration in each bin, together with an environment of N_env = 1 (containing temperature in our tests).

We define our gain model ${f}_{\text{gain},\theta }:X\times {\mathbb{Z}}_{20}\to \mathbb{R}$ with $X=\left\{(i,j)\in {\mathbb{Z}}_{20}^{2}\vert i\le j\right\}\times {\mathbb{R}}_{+}^{3}$ by [Image Omitted. See PDF] [Image Omitted. See PDF]where (K_θ, λ_θ) = f_ker,θ(i, j, e) is the output of neural network ${f}_{\text{ker},\theta }:{\mathbb{R}}^{3}\to {\mathbb{R}}^{2}$ . Here K_θ is the learned logarithmically-scaled coagulation kernel between bins i and j, and λ_θ is the proportion of the number concentration of particles that go to the lower and higher bin. Note that because of the spacing of our bin-grid the only possible destination bins after a coagulation between bins j and i ≤ j are j and j + 1(This will always be true as long as $\sqrt{2}{d}_{j}\le {d}_{j+1}$ where d_j is the diameter of bin j. Finer bin grids could be accommodated by increasing the dimensionality of the neural network output). Trivially, we can then define our loss model ${f}_{\text{loss},\theta }:X\times {\mathbb{Z}}_{20}\to \mathbb{R}$ as [Image Omitted. See PDF]

Furthermore, as seen in the above equation, because our number concentration is perfectly conserved by a factor of two within individual interaction events (for each coagulation event exactly two particles form into one), our gain and loss models are both specified in terms of the single neural network f_ker,θ. This neural network f_ker,θ has the same architecture as specified in Section 2.3.1 for the CombNN (Detailed-Bin) model.

Training

The more practically interesting training situation is when we do not have access to the individual detailed-bin events and instead only have access to the state n_t and environmental variable e_t at each time step. In our study, this state data comes from simulations, but it could also be obtained from measurements in an aerosol chamber.

Equation 6 is inherently non-injective, since many different outputs of f_gain,θ could result in the same f(n_t, e_t). This makes training using state-only data much more difficult than training with detailed-bin data. However, we can still apply a weak supervision by enforcing Equation 6. This gives the loss function: [Image Omitted. See PDF] [Image Omitted. See PDF]

We will also call this loss function the Single-step-forecast Mean Squared Error (SMSE). To train other comparison models (see Section 2.6), such as the conventional FC neural network, we rely solely on Equation 11 as it does not require the structural prior that f_θ can be decomposed into f_gain,θ and f_loss,θ.

Inference

To predict the evolution of n_t with the CombNN (State-Only) model we use Equation 6. Because the CombNN algorithm is a “patched based” neural network approach, in other words it processes data by splitting up the input into combinatoral chunks, it benefits from being highly parallel. Furthermore, this permits the restriction of the input and output dimensions of the network itself (as seen in Section 2.3.1). While this can significantly improve computation speed, it presents a downside in the imposition of constraints. Namely, because the interaction between each combination of two bins is independently calculated, it becomes harder to enforce Equation 7 as we cannot have each combinatorial pair depend on another. This contrasts the CombNN (Detailed-Bin) model which relies on the semi-implicit section aerosol model for coagulation presented by Jacobson et al. (1994), which calculates each bin-pair interaction sequentially.

In fact, because there is no communication between each individual predicted coagulation event, our model might sometimes over-draw from a bin, leading to a negative prediction. To fix this we introduce a balancing algorithm (see Algorithm 2) that loops through all predicted events to ensure that the number concentration of particles removed from each bin cannot be more than the value in the bin. We iteratively check each bin, and if the predicted number concentration is negative, we normalize all predicted events leaving the bin by the number concentration in the bin. This forces the bin to fully empty while preserving the structural constraint (Equation 7).

2 AlgorithmBalancing Algorithm for Forward Inference

[IMAGE OMITTED. SEE PDF.]

Error Metrics

We used three different metrics to evaluate model performance: (a) Single-step-forecast Mean Squared Error (SMSE), (b) Kernel log Mean Squared Error (Kernel Mean Square Error (KMSE)), and (c) multi-step-Forecast Root Mean Squared Error (FRMSE). The SMSE is defined in Equation 11 and is essentially the forward loss of a model, while the KMSE is defined in Equation 14 and is effectively the inverse loss. The former shows how well a model can solve the (single-step) forecast task while the latter shows how well a model was able to learn the underlying dynamics of the system. This notion of KMSE relies on a “data kernel” $\bar{K}$ defined in Equation 13 which is the average observed kernel over the entire data set. [Image Omitted. See PDF]

Using this, we can define the error in the predicted kernel by [Image Omitted. See PDF]where ${K}_{\theta }={f}_{\text{ker},\theta }{\left(i,j,\bar{e}\right)}^{(1)}$ with $\bar{e}$ as the average environment variable (temperature in our case) throughout the entire data set, and N_comb = 210 is the total number of valid bin combinations. We use this definition of KMSE to monitor the progress of our learning algorithm as KMSE is a proxy for L_adaptive in Algorithm 1 (see Appendix B for a more in-depth explanation).

To measure the error propagation of our model we perform a forward forecast, ${\hat{n}}_{t}$ , for each simulation run $\mathcal{T}\in \mathcal{D}$ , starting from the initial state, n₀, of the run. This gives: [Image Omitted. See PDF] [Image Omitted. See PDF]

We then define the forecast error at step t to be [Image Omitted. See PDF]where ${n}_{t}^{k}$ is the state at time t in the kth time series in $\mathcal{D}$ . Note that we used a relative metric due to the large variance of magnitudes between different simulation runs. Using an absolute metric would cause the error to be dominated by simulations with a large number concentration of particles. Furthermore, we choose to use root mean square error rather than MSE as it is conventional in multi-step forecasting tasks.

Comparison Models

To evaluate the performance of our new models, we compare them to a simple persistence model (state is held constant) and a state-of-the-art semi-implicit sectional aerosol model for coagulation (Jacobson et al., 1994), henceforth known as the Sectional Model. Furthermore, to show the advantage of our novel combinatorial priors, we compare the performance of our models to that of a conventional FC neural network and a Long Short-Term Memories (LSTM) neural network, both trained on the SMSE loss function.

The baseline fully-connected network takes in the previous state and appends on the environmental variable (temperature in our case), and then outputs the change in state. The network architecture for the fully-connected network is ${f}_{\text{FC}}:{\mathbb{R}}^{21}\to {\mathbb{R}}^{20}$ , where f_FC is sequentially composed of the layers: Linear(21, 128), Tanh, Linear(128, 256), Tanh, 1D Batch Normalization, Linear(256, 128), Tanh, and Linear(128, 20). This amounts to approximately 60,000 parameters, which is about 15 times the number of parameters in our CombNN (which has about 4 000 parameters). Changing the number of parameters did not have an appreciable affect on the neural network performance, thus we used 60,000 parameters to ensure enough network complexity.

The long short-term memory network takes in the previous four states with the appended environmental variable (temperature in our case), and outputs the change in state. The network architecture for the LSTM network is ${f}_{\text{LSTM}}:{\mathbb{R}}^{\mathit{21x4}}\to {\mathbb{R}}^{20}$ , where f_LSTM is sequentially composed of the layers: LSTM(21, 64), LSTM(64,64), LSTM(64,64), and Linear(64,20), with 90,000 parameters. Note that the LSTM structure (Hochreiter & Schmidhuber, 1997) contains tanh nonlinearities.

Because our data contains simulated runs which vary significantly in magnitude, each simulated run was normalized to mean 0 and standard deviation 1 for training. During our error metric calculation, the values were denormalized to allow a fair benchmark.

The performance of the fully-connected model and LSTM models can only be evaluated with the FRMSE and the SMSE metric. The KMSE metric is not applicable for these models because the lack of structural priors means that they do not explicitly learn a kernel representation.

All models were trained using Adam over 1,000 epochs with a learning rate of 5 × 10⁻⁴ and batchsize of 512. The learning rate was modulated with a decay on plateau with a factor of 0.1 and patience of 10 epochs.

Results Forecast Error

We begin by showing the most important error metric, FRMSE, which measures forward prediction accuracy. The FRMSE values for the baseline and CombNN models are shown in Table 2 and Figure 4. These values are generated from validation data which consists of 100 stochastic simulated runs of length 100 time steps each (more details in Section 2.1), forecasted from the initial condition. The FRMSE is then calculated by averaging the errors of each time step and 95% confidence intervals are calculated as well. We calculated this error metric for the four baseline comparison models: persistence, a standard FC neural network, an LSTM model, and the sectional model.

Table 1 Overview of Different Data Types Used in This Study

Training data	Runs	Steps per run	Datapoints (values) per step
Particle simulation	2,000	100	100,000
Detailed-bin	2,000	100	230
State-only	2,000	100	20
Validation data
Particle simulation	100	100	100,000
Detailed-bin	100	100	230
State-only	100	100	20

Note. The coarse-graining is evident from the decreases in Datapoints (values) per step which gives the number of floating-point scalar values used to represent each time step. Note that the detailed-bin data includes the datapoints of the state-only data, hence totaling 210 combinatorial pairs and the 20 datapoints given by the state-only data. Finally, there are 20 scalars used to denote the state-only data since we coarse-grain the data into 20 bins, with each scalar representing the number concentration of particles in a bin.

Table 2 Multi-Step Forecast Root Mean Squared Error (FRMSE) Versus Simulated Time Step (Mean and 95% Confidence Interval)

	Time step index
Model	1	10	20	30	50	75	99
Fully connected	0.0007	0.0030	0.0066	0.0109	0.0215	0.0484	0.0655
Error	±0.0010	±0.0032	±0.0081	±0.0136	±0.0238	±0.0554	±0.0737
Persistence	0.0009	0.0037	0.0073	0.0111	0.0205	0.0462	0.0621
Error	±0.0012	±0.0034	±0.0072	±0.0108	±0.0182	±0.0514	±0.0672
Long short-term memory	0.0007	0.0026	0.0059	0.0098	0.0192	0.0437	0.0586
Error	±0.0009	±0.0029	±0.0080	±0.0137	±0.0233	±0.0546	±0.0713
Sectional model	0.0003	0.0015	0.0025	0.0033	0.0047	0.0065	0.0068
Error	±0.0002	±0.0016	±0.0022	±0.0024	±0.0034	±0.0044	±0.0040
CombNN (State-only)	0.0003	0.0010	0.0015	0.0020	0.0030	0.0045	0.0040
Error	±0.0004	±0.0010	±0.0012	±0.0014	±0.0024	±0.0040	±0.0066
CombNN (Detailed-bin)	0.0002	0.0003	0.0006	0.0010	0.0020	0.0036	0.0029
Error	±0.0004	±0.0002	±0.0004	±0.0008	±0.0022	±0.0040	±0.0025

View Image - Figure 4. Multi-step Forecast Root Mean Squared Error (FRMSE) for baseline models (persistence model, fully connected neural network, LSTM model, and sectional model) and our new combinatorial neural network models (using the two different types of training data), measured on the validation data set. The corresponding 95% confidence intervals are given in Table 2.

Figure 4. Multi-step Forecast Root Mean Squared Error (FRMSE) for baseline models (persistence model, fully connected neural network, LSTM model, and sectional model) and our new combinatorial neural network models (using the two different types of training data), measured on the validation data set. The corresponding 95% confidence intervals are given in Table 2.

As seen in both Table 2 and Figure 4, the baseline fully-connected neural network performs well in the first few time steps (as it is trained to minimize the single-step SMSE) but quickly accumulates error and becomes worse than the persistence model (which always predicts the initial condition). A large part of this is due to the lack of any structural priors, so there is no conservation enforcement like that in Equation 7. This means that a slight drift will quickly take the predictions into an infeasible domain which the neural network has not been trained on.

The LSTM model also performs well initially because it minimizes SMSE but suffers from error accumulation similarly to the fully-connected network. The inclusion of history information in the LSTM enables it to slightly outperform the persistence model at later times, but it is not able to produce a reasonable estimate of the system evolution.

The CombNN trained with either detailed-bin or state-only data out-performs the baseline fully-connected neural network despite using only one fifteenth of the number of parameters. The CombNN trained with state-only data performs about as well as the reference sectional model, while the CombNN trained with detailed-bin data performs better. This shows that the CombNN was in fact able to more accurately fit the particle-resolved training data than the sectional model, despite the fact that the sectional model was using the analytic kernel. This could be due, for example, to the fact that the sectional model only evaluates the kernel at the midpoint of each bin, while the CombNN training data correctly averages the kernel rates over the entire bin width.

Finally, we can observe that the CombNN model trained with the detailed-bin data outperformed the model trained with state-only data. This is expected, because the detailed-bin training scheme contains around 10 times more information as the state-only data. When training the CombNN with state-only data, the model thus needs to solve a less-well-conditioned inverse problem.

Kernel Learning

In fact this gap between the CombNN (State-Only) and CombNN (Detailed-Bin) points to a fundamental difference in the learned kernel between these two methods. Figure 5 shows how these kernels evolve over time through training. For ease of analysis from here on, we define the final learned kernel from CombNN (State-Only) and CombNN (Detailed-Bin) as K_state-only and K_detailed-bin respectively. Note that this would correspond to the epoch 1,000 images in Figure 5.

$View Image - Figure 5. Evolution of combinatorial neural network (CombNN) (Detailed-Bin) kernel, Kdetailed-bin, and CombNN (State-Only) kernel, Kstate-only, through 1,000 epochs of training on the left and right, respectively. The Kernel Mean Square Error as presented in Equation 14 is given in parentheses above each learned kernel. On the bottom of the figure are the data kernel, K¯ $\bar{K}$, and analytic Brownian kernel, K, respectively.$

Figure 5. Evolution of combinatorial neural network (CombNN) (Detailed-Bin) kernel, Kdetailed-bin, and CombNN (State-Only) kernel, Kstate-only, through 1,000 epochs of training on the left and right, respectively. The Kernel Mean Square Error as presented in Equation 14 is given in parentheses above each learned kernel. On the bottom of the figure are the data kernel, K¯ $\bar{K}$, and analytic Brownian kernel, K, respectively.

Notice that while K_detailed-bin seems to converge to the data kernel $\bar{K}$ , which is the average of $\hat{K}$ defined in Equation 4 over the training data set, K_state-only does not. This points to the fact that SMSE, the loss function that CombNN (State-Only) is minimizing, is not a perfect proxy for KMSE which directly minimizes the error between the learned-kernel and data-kernel.

The main reason for this is a difference in training algorithm and inference. In the true underlying simulation, coagulation events are calculated in series, thus if a coagulation event happens and depletes a singular bin of all material, it can no longer be a target of coagulation in the future. In CombNN (State-Only), because of the structure of our forward inference, we calculate all possible coagulation events in parallel. This means that there may be cases where we are depleting a singular bin past its maximum contents, which would require rebalancing as described in Section 2.4.3. Because optimization of K_state-only directly minimizes SMSE (which uses the parallel forward inference algorithm in Section 2.4.3), it will end up learning a slightly different kernel than $\bar{K}$ . On the other hand, K_detailed-bin does not suffer from this issue because it directly learns from $\hat{K}$ in parallel.

Effect of Different Loss Functions

We confirm that these two networks are not optimizing the same objective function by plotting the loss of the CombNN trained using detailed-bin data and state-only data, respectively, in Figure 6. The two methods do not converge to the same model as is evident by the differences between the two loss functions. When using detailed-bin data, the KMSE is minimized (as it is a proxy for L_adaptive which is directly optimized) while the SMSE is minimized when using state-only data (as it is directly optimized). The reason for this large difference is due to the inaccuracy during the forward evolution of our model. Namely, the forward evolution of CombNN (State-Only) does not perfectly approximate the proper dynamics given a kernel function. Thus, for the results of FRMSE the errors when using state-only data consist of the proper evolution of our CombNN, while the errors when using detailed-bin data consist of using the sectional model on the learned kernel (f_ker,θ) extracted from our neural network. In fact, if we were to instead allow the CombNN (Detailed-Bin) to directly evolve our system, we would find that the FRMSE is actually higher than that with CombNN (State-Only). In practice however, the actual difference in FRMSE over time is somewhat small. Finally, it is notable that the CombNN (State-Only) learns a good approximation to the data kernel, despite needing to solve a noisy inverse problem with highly non-injective data.

View Image - Figure 6. Evolution of the Single-step Mean Square Error (SMSE, upper panel) and Kernel Mean Square Error (KMSE, lower panel) during training of the baseline fully connected, baseline LSTM and combinatorial neural network models on the training data set.

Figure 6. Evolution of the Single-step Mean Square Error (SMSE, upper panel) and Kernel Mean Square Error (KMSE, lower panel) during training of the baseline fully connected, baseline LSTM and combinatorial neural network models on the training data set.

Similar to CombNN (State-Only) and in contrast to CombNN (Detailed-Bin), the baseline FC model and the baseline LSTM model also successfully minimize the SMSE, as expected. This tells us that minimizing SMSE does not guarantee good predictions in the long run. This is particularly evident for the LSTM in the upper panel of Figure 6, which shows that the LSTM approaches the SMSE of the CombNN (State-Only) very well. In fact, minimizing the SMSE is not necessary, as the SMSE for the CombNN (Detailed-Bin) shows. What is a essential is the combinatorial structural constraint (Equation 7), which both CombNN fulfill but not the FC or LSTM models.

Example Simulated Scenario

To visualize the performance of the CombNN models, we show results from a representative 12 hr simulation (7,200 time steps of six seconds each) in Figure 7. Both of the CombNN models (trained on state-only and detailed-bin data) track the evolution of the state accurately for the entire simulation duration. Unlike other neural network-based models (see the “baseline FC” results in Figure 4 and Table 2), the CombNN models do not suffer from error accumulation due to evolving into untrained regions of the state space, because of the enforced constraints in Equation (7).

Figure 7. An example of a forecast run for 12 hr duration (7,200 time steps of six seconds each).

Inference Speed

The CombNN requires more computation than the FC or LSTM networks (in that it must split the input into 210 batches or in general n² complexity). However, pushing the computation into the batch dimension allows us to compute the operation in parallel and, using modern GPUs, we can take advantage of parallel tensor operations. Table 3 shows the computation time (wall clock time) to evaluate the FC, LSTM, and CombNN models on a GTX 3070 with 8 GB video random access memory. As we can see, the CombNN is faster than both the FC model (p = 0.008) and the LSTM model (p < 0.001).

Table 3 Average Time and 95% Confidence Interval to Compute a Single Batch (of Size 256) per Network Averaged Over 100,000 Trials

Model	Computation time	95% CI
Fully connected	0.453	±0.155
Long short-term memory	0.609	±0.197
CombNN	0.405	±0.181

Time-Discretization Error

We expect some difference between K_state-only and K_detailed-bin to be caused by discretization error. For example, consider two different events, x₁ and x₂, for the same pair of bins and assume a six-second time step (Δt). Now suppose that x₁ had lower initial concentrations of particles and so only 2 sec of coagulation were needed before one bin was emptied of particles, while x₂ had high concentrations and the full 6 sec elapsed without emptying out either bin. Because x₁ had two thirds of its time step with no chance of coagulations, we expect that the observed kernel value, ${\hat{K}}_{{x}_{1}}$ , will be lower than that corresponding to x₂, namely ${\hat{K}}_{{x}_{2}}$ . The average estimated kernel, $\bar{K}=\frac{{\hat{K}}_{{x}_{1}}\,+\,{\hat{K}}_{{x}_{2}}}{2}$ would thus tend to be larger than ${\hat{K}}_{{x}_{1}}$ .

Continuing the example, note that K_detailed-bin is the result of optimizing toward the value of $\bar{K}$ , while K_state-only uses a different loss function. Specifically, applying the standard parallel forward inference of the state-only algorithm (see Section 2.4.3 for more details) with x₁ as input would produce a negative bin-value (without rebalancing). This would create a large loss value and instead push the CombNN (State-Only) model to learn a smaller kernel value than $\bar{K}$ , effectively making the optimization targets of CombNN (Detailed-Bin) and CombNN (State-Only) different. Thus we expect that as we reduce the amount of bin-emptying events (by decreasing the time step), the discrepancy between K_state-only and K_detailed-bin should decrease.

We confirm this by plotting the log of the KMSE versus the log of Δt = 1, 6, 15, 30, 60, 90 s, where each step in the simulation is of length Δt, to show the trend of time-discretization error (see Figure 8).

View Image - Figure 8. The log of Kernel Mean Square Error of the data kernel to the analytic kernel shown in Figure 5, the learned kernel to the data kernel, and learned kernel to the analytic kernel respectively. Note that the learned kernel here is the one found by minimizing Equation 11.

Figure 8. The log of Kernel Mean Square Error of the data kernel to the analytic kernel shown in Figure 5, the learned kernel to the data kernel, and learned kernel to the analytic kernel respectively. Note that the learned kernel here is the one found by minimizing Equation 11.

Sensitivity to Amount of Training Data

To better understand the behavior of our models, we performed a sensitivity analysis on the amount of training data, as shown in Figure 9. We trained the CombNN network with 1–2,048 simulated scenarios of 10 min each (earlier results were trained with 2,000 scenarios, as reported in Table 1) and observed the validation data FRMSE at the final time step (step 99). As seen in Figure 9, the state-only training of the CombNN is slightly more data-hungry. Both state-only and detailed-bin training achieve good FRMSE with about 256 training runs, and excellent error with about 1,000 training runs.

View Image - Figure 9. The validation FRMSE at the final time step for combinatorial neural network models (state-only and detailed-bin) trained on varying numbers of simulated scenarios. The error bars show 95% confidence intervals over 10 repeated test/train cycles for each number of training scenarios.

Figure 9. The validation FRMSE at the final time step for combinatorial neural network models (state-only and detailed-bin) trained on varying numbers of simulated scenarios. The error bars show 95% confidence intervals over 10 repeated test/train cycles for each number of training scenarios.

Sensitivity to Model Size (Number of Parameters)

Finally, to understand the sensitivity of our results to the size of the CombNN neural network, we trained different CombNN models with varying numbers of parameters, as shown in Figure 10. To vary the number of parameters we kept the number of neural network layers fixed at three hidden layers (see Section 2.3.1 for details) and varied the width of the layers from 8 to 64 nodes. This resulted in the total number of parameters varying from 242 to 8,930. As we see in Figure 10, the state-only training does not improve substantially beyond about 2,000 parameters, while the detailed-bin data benefits from adding parameters up to about 4,000.

View Image - Figure 10. The validation FRMSE at the final time step for combinatorial neural network models (state-only and detailed-bin) with varying number of parameters. The error bars show 95% confidence intervals over 10 repeated train/test cycles for each number of parameters.

Figure 10. The validation FRMSE at the final time step for combinatorial neural network models (state-only and detailed-bin) with varying number of parameters. The error bars show 95% confidence intervals over 10 repeated train/test cycles for each number of parameters.

Summary and Conclusions

Developing a method for accurate coarse-grained prediction of coagulation processes allows a reduction of reliance on computationally expensive particle-resolved simulations and a way to model aggregation processes that follow non-standard kernels. While previous work in this area used neural networks, they relied on conventional FC networks which did not achieve substantial results. In this paper we introduced a CombNN architecture (CombNN) that can incorporate structural priors of the coarse-grain coagulation problem. Using data generated from PartMC, a high-fidelity particle resolved simulation model, we generated a low dimensional projection into two types of data, “detailed-bin data” which records the coarse-grained interactions between every pair of size bins at each time step, as well as “state-only data” which only records binned size distribution at each time step. Naturally, the detailed-bin data contained all the information present in the state-only data, totaling to approximately 10 times as much information.

We trained CombNN networks using two optimization schemes to obtain CombNN (Detailed-Bin) and CombNN (State-Only) models which were trained using the detailed-bin data and state-only data, respectively. While the training algorithm for the CombNN (State-Only) used a standard stochastic gradient descent optimization algorithm with a fixed loss function, the CombNN (Detailed-Bin) utilized a novel adaptive loss function to accommodate the large range of values in the target kernel. Concurrently, we also trained a baseline conventional FC network and a LSTM using four-step history to predict the change in state-only data for comparing with the CombNNs.

We used three error metrics: single step FRMSE (SMSE), kernel log mean squared error (KMSE), and multistep FRMSE. These metrics showed that the different loss functions led to different optimized behavior for the CombNN (Detailed-Bin) and CombNN (State-Only). An important finding of our work is that minimizing the SMSE does not guarantee a low FRMSE, nor is it necessary for obtaining a low FRMSE. Instead, it is the combinatorial structural constraint that is essential for successfully predicting the evolution of our system. This is the reason why traditional neural network models fail for these problems.

We compared the simulated prediction accuracy (FRMSE) of our models to the baseline FC network, the LSTM as well as to a state-of-the-art semi-implicit sectional model. The CombNN models out-performed both comparison models, with the CombNN (Detailed-Bin) performing somewhat better than the CombNN (State-Only). We also showed that this forecast accuracy persisted for long times (up to 12 hr, 7,200 time steps). We also performed a series of sensitivity studies to understand the effects of time discretization, model size, and training data volume. Using the CombNN models with 6 s timesteps in a regional simulation, where chemistry time steps are typically on the order of 30 s, would require sub-timestepping. It is an open question whether using fewer bins and a more restricted diameter range, as is typical for regional aerosol models (e.g., WRF-Chem with Model for Simulating Aerosol Interactions and Chemistry (Fast et al., 2006) uses eight bins from 39 nm to 10 μm), would allow longer timesteps with high accuracy

Our proposed CombNN architecture provides a proof of concept for using neural networks for solving coarse-grained coagulation processes. While we used our network architecture for single-species aerosols with a Brownian kernel, the potential for this method lies in its application to coagulation with multi-species aerosols, for which existing methods must make simplifying assumptions on the mixing state (Riemer et al., 2019), or with non-standard kernels without known analytical expressions. CombNN models for these systems could be trained on data either from particle-resolved simulations (in the case of multi-species aerosols) or experimental data (for non-standard kernels).

A AppendixMotivation for the Adaptive-Loss Training Algorithm

In this appendix we explain in more detail why Algorithm 1 is needed to learn the highly multi-scale kernel function, rather than a conventional logarithmic scaling. Briefly, it is because we need to learn an average from highly noisy observations. Because our optimization target is the L₂ norm, performing a logarithmic transform would incorrectly shift the minimizer to no longer represent the mean of the data. Specifically, a logarithmic transform discounts the weight of very large observations.

Concretely, one can imagine a sequence of incoming observations from, say, a random variable X with probability mass function $\mathbb{P}[X=100]=\frac{1}{2}$ and $\mathbb{P}[X=10000]=\frac{1}{2}$ . Now imagine that we have a data stream of observations of this variable and we are trying to learn its expectation. Clearly if we perform a logarithmic transformation, then using an L₂-norm loss function would yield an expectation of 1,000, while the true expectation is $\mathbb{E}[X]=5050$ .

This then raises the question: why not just directly optimize without any loss transformations? If we were only optimizing and solving for the value of the kernel for a single bin pair, then indeed direct optimization would be sufficient to learn the expectation. However, because we are learning the entire kernel at once with a single network, and the kernel spans multiple magnitudes (from 10⁻¹⁵ to 10⁻⁸ m³ s⁻¹), the loss function will be skewed to give more weight to larger kernel values. In reality, we want all values of the kernel to be learned with equal relative accuracy, making a non-transformed optimization infeasible.

Thus, we want both large and small kernel values to receive equal relative priority, but we cannot do logarithmic scaling as it will skew the expectation incorrectly. The only way to solve this is to do a linear scaling. However, to do a proper linear scaling, we would need to divide by the underlying kernel, but this is exactly the value we need to learn! Thus, we propose Algorithm 1, which adaptively learns an appropriate per-bin linear scaling that simultaneously produces equal relative accuracy across bins, convergence to the expected value of the kernel, and stable learning.

B AppendixMotivation for the Definition of KMSE

The KMSE (Kernel Mean Square Error) function defined in Equation (14) is a proxy for the adaptive loss function L_adaptive that is directly optimized in Algorithm 1. Because L_adaptive is divided by K_norm, which changes every iteration, it's necessary to create a constant proxy function for understanding the algorithmic performance as in Figure 6.

We chose the definition of the KMSE because it is a first-order approximation to the adaptive loss function. More precisely, assume that the kernel function K_θ at any training epoch is a perturbation around the true mean kernel value $\bar{K}$ , so that ${K}_{\theta ,ij}={\bar{K}}_{ij}+{\Delta}{K}_{ij}$ for a (relatively small) perturbation ΔK_ij in each bin pair (i, j). Then we can write: [Image Omitted. See PDF] [Image Omitted. See PDF] [Image Omitted. See PDF]

Taylor expanding about ΔK_ij = 0 then yields: [Image Omitted. See PDF] [Image Omitted. See PDF] [Image Omitted. See PDF]

We thus see that the KMSE limits exactly to the adaptive loss function as the learned kernel limits to the exact mean kernel. Note that in the above analysis we have neglected overall normalization constants in the expressions.

Notation

A
Dimension of composition space

β
Hyperparameter to control smoothing of approximation kernel

$\mathcal{D}$
Data set of time series $\left({\mathcal{T}}_{1},\dots ,{\mathcal{T}}_{{N}_{\text{data}}}\right)$

e, e_t
Environment value, environment vector at time t respectively (temperature)

f_θ
Abstract model with parameters θ such that ${f}_{\theta }\left({n}_{t},{e}_{t}\right)={\hat{n}}_{t+1}$

f_gain,θ
Gain network denoting number concentration to gain given interaction

f_ker,θ
Neural network such that f_ker,θ(x) = (K_θ, λ_θ)

f_loss,θ
Loss network denoting number concentration to lose given interaction

f_FC
Standard FC neural network such that ${f}_{\text{FC}}\left({n}_{t},{e}_{t}\right)={\hat{n}}_{t+1}$

K
Coagulation kernel

$\bar{K}$
Average kernel of the entire data set (also known as the “data kernel”)

K₀
Initial kernel guess

K_norm,0, K_norm
Initial normalization kernel, normalization kernel

$\tilde{K}$
Noisy estimate of kernel

K_θ
Learned logarithmically-scaled coagulation kernel between bins i and j

${\hat{K}}_{x}$
Estimated kernel from one coagulation event x

K_state-only
Learned kernel from the CombNN (State-Only) algorithm

K_detailed-bin
Learned kernel from the CombNN (Detailed-Bin) algorithm

λ_θ
Scalar denoting how much of a coagulation event goes into bin j and j+1 respectively

$\vec{\mu }$
Vector for particle constituent masses

$\hat{n}$
Predicted state/number concentration vector

n_i
Number concentration of particles in bin i

n_j
Number concentration of particles in bin j

n_t
Vector of number concentration at time t

Δn_k(x)
Change in bin k due to an event x ∈ X

${\Delta}{n}_{t}^{(i,j,k)}$
Number concentration of coagulated particles that are added in bin k from coagulations between particles from bins i and j during time step t

N_bin
Number of discretized bins (20)

N_comb
Number of valid bin combinations (390)

N_env
Number of environmental factors (1)

σ
Sigmoid function where $\sigma (x)=\frac{1}{1\,+\,{e}^{-x}}$

$\mathcal{T}$
Time series $\left(\left({n}_{1},{e}_{1}\right),\dots ,\left({n}_{M},{e}_{{N}_{\text{time}}}\right)\right)$

θ
Parameters of neural network

x
Tuple (i, j, n_i, n_j, e) denoting an interaction event

X
Set of all possible interactions given the problem space of $\mathcal{D}$

${X}^{\mathcal{D}}$
Set of all interactions that occur in $\mathcal{D}$

${X}_{t}^{\mathcal{T}}$
Set of all interactions that occur during time t in time series $\mathcal{T}$

${X}_{t,\text{poss}}^{\mathcal{T}}$
Set of all possible interactions that could occur during time t in time series $\mathcal{T}$

Acknowledgments

JHC, NR, and MW acknowledge funding from DOE ASR Grant DE-SC0019192 and DE-SC0022130.

Data Availability Statement

The PartMC model used to generate the data set, model inputs, output data and machine learning code can be accessed at https://doi.org/10.13012/B2IDB-3904737%5FV1 (Wang et al., 2021).

Word count: 8820

Show less

© 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Simulating the evolution of a coagulating aerosol or cloud of droplets in a key problem in atmospheric science. We present a proof of concept for modeling coagulation processes using a novel combinatorial neural network (CombNN) architecture. Using two types of data from a high-detail particle-resolved aerosol simulation, we show that CombNN models outperform standard neural networks and are competitive in accuracy with traditional state-of-the-art sectional models. These CombNN models could have application in learning coarse-grained coagulation models for multi-species aerosols and for learning coagulation models from observed size-distribution data.

Details

Title

Learning Coagulation Processes With Combinatorial Neural Networks

Author

Wang, Justin L¹

; Curtis, Jeffrey H²; Riemer, Nicole³

; West, Matthew⁴

¹ Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
² Department of Atmospheric Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, USA; Department of Mechanical Science and Engineering, University of Illinois at Urbana Champaign, Urbana, IL, USA
³ Department of Atmospheric Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, USA
⁴ Department of Mechanical Science and Engineering, University of Illinois at Urbana Champaign, Urbana, IL, USA

Section

Research Article

Publication year

2022

Publication date

Dec 2022

Publisher

John Wiley & Sons, Inc.

e-ISSN

19422466

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1029/2022MS003252

ProQuest document ID

2758460640

Learning Coagulation Processes With Combinatorial Neural Networks

Jump to:

Full text

Abstract

Details

Suggested sources