Content area
Abstract
This paper explores Machine Learning (ML) parameterizations for radiative transfer in the ICOsahedral Nonhydrostatic weather and climate model (ICON) and investigates the achieved ML model speed-up with ICON running on graphics processing units (GPUs). Five ML models, with varying complexity and size, are coupled to ICON; more specifically, a multilayer perceptron (MLP), a Unet model, a bidirectional recurrent neural network with long short-term memory (BiLSTM), a vision transformer (ViT), and a random forest (RF) as a baseline. The ML parameterizations are coupled to the ICON code that includes OpenACC compiler directives to enable GPU support. The coupling is done with the PyTorch-Fortran coupler developed at NVIDIA. The most accurate model is the BiLSTM with a physics-informed normalization strategy, a penalty for the heating rates during training, a Gaussian smoothing as postprocessing, and a simplified computation of the fluxes at the upper levels to ensure stability of the ICON model top. The presented setup enables stable aquaplanet simulations with ICON for several weeks at a resolution of about 80 km and compares well with the physics-based default radiative transfer parameterization, ecRad. Our results indicate that the compute requirements of the ML models that can ensure the stability of ICON are comparable to GPU-optimized classical physics parameterizations in terms of memory consumption and computational speed.
Full text
Introduction
The computation of atmospheric radiation is a central part of every Earth system model (ESM). It models the solar energy absorbed by the Earth, the complex interactions between radiation and greenhouse gases, clouds, aerosols, scattering processes, and the energy radiated back as thermal (longwave) radiation. The radiation parameterization in the ICOsahedral Nonhydrostatic weather and climate model (ICON) (Prill et al., 2023) is ecRad (Hogan & Bozzo, 2018). ICON is the new operational weather forecasting model of the Swiss weather service (MeteoSwiss) and German weather service (DWD). EcRad is actively developed at the European Centre for Medium-Range Weather Forecasts (ECMWF), where a graphics processing units (GPUs) port is under development. The general outline of ecRad is as follows: it first computes the gas, aerosol, and cloud optics and passes them to a solver that predicts the atmospheric radiation fluxes; the driving model then computes the convergence of these fluxes to obtain the corresponding heating rates, which are subsequently applied as a temperature update. In ICON, the atmospheric radiation is operationally computed less often than the other physics parameterizations and on a coarser horizontal grid. The large radiation time step, and to a lesser extent the coarser grid considered for radiation, are known to reduce the quality of the prediction (Hogan & Bozzo, 2018). However, this issue may be alleviated by methods such as approximate updates at every time step and grid point (Hogan & Bozzo, 2015).
As climate modelers prepare their codes for kilometer-scale global simulations, reducing the computational cost of the radiation parameterization becomes increasingly relevant. Machine Learning (ML) methods have been presented as a promising approach to accelerate the computation of the radiative fluxes, and thus allow more frequent updates of the radiative fluxes at higher resolution. There has been a wealth of research in recent years to replace physical parameterizations in weather and climate models with data-driven parameterizations (Brenowitz & Bretherton, 2018, 2019; Gentine et al., 2018; Kashinath et al., 2021; O’Gorman & Dwyer, 2018; Yuval et al., 2021) and in the present work, we explore ML parameterizations of simulated radiative transfer in the atmosphere.
ML emulation of radiative transfer has a long history in atmospheric modeling. Previous research includes studies that emulate the entire radiative transfer code (Chevallier et al., 1998; Hafner et al., 2024, 2025; Krasnopolsky et al., 2005; Lagerquist et al., 2021; Pal et al., 2019; Roh & Song, 2020; Ukkonen, 2022), while others focus on replacing individual components of existing radiative schemes (Ukkonen et al., 2020; Veerman et al., 2021), and others use ML to predict the effect on radiation of complex phenomena that are expensive to model physically (Meyer et al., 2022). Some studies use radiative fluxes as a training target (Chevallier et al., 1998; Ukkonen, 2022), while others directly predict the resulting heating rates (Krasnopolsky et al., 2005; Lagerquist et al., 2021; Pal et al., 2019; Roh & Song, 2020). The advantages and disadvantages of emulating the shortwave and longwave radiative fluxes or directly predicting the resulting heating rates are known: direct prediction of the heating rates allows the calculation of the radiative flux convergence to be omitted, thus avoiding numerical stability problems, especially when the predicted flux profile is not smooth. However, the prediction of fluxes seems to be a necessity for an ESM, since the radiative fluxes serve as inputs to other components, such as the land model. It is also a relevant output variable as it serves as an input to impact models (e.g., solar energy production) and can be compared with measurements, which is important for validation. This study employs a single ML model to predict both short- and longwave up- and downward fluxes, using the resulting heating rates as an additional penalty in the loss function during training.
Based on the state of the art, we aim to first deliver a systematic review of the performance of different classes of ML methods (multilayer perceptron (MLP), Unet, bidirectional recurrent neural network with long short-term memory (BiLSTM), vision transformer (ViT) and random forest (RF)) and discuss how physical knowledge can be incorporated into their training and affect their performance. We investigate and discuss specific data preprocessing approaches and architectural design choices. For the systematic review, we chose an idealized aquaplanet simulation for the training, as it appears reasonable for such a comparison to be performed in a controlled and simple environment. We then present a detailed description of how the offline trained ML models, which are developed using PyTorch (Paszke et al., 2017), are incorporated into the Fortran code (Kedward et al., 2022) of ICON (Giorgetta et al., 2022; MPI-M et al., 2024), enabled for GPUs using OpenACC (OpenACC, version 3.2, 2022), and the steps required to ensure efficient performance of the radiation emulator and ICON on GPUs on the Balfrin machine at the Swiss National Supercomputing Centre (CSCS).
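While the paper does not show the coupling code itself, the Python-side handoff to a Fortran host such as the NVIDIA PyTorch-Fortran coupler typically amounts to exporting the trained model as a TorchScript archive that can be loaded without a Python interpreter. The sketch below illustrates this under that assumption, with a hypothetical stand-in model (all names are ours, not from the ICON integration):

```python
import torch

class TinyEmulator(torch.nn.Module):
    """Hypothetical stand-in for the trained radiation emulator."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(10, 4)

    def forward(self, x):
        return self.net(x)

# TorchScript serializes the model graph and weights into a single
# archive that a C++/Fortran host can load without Python.
model = TinyEmulator().eval()
scripted = torch.jit.script(model)
torch.jit.save(scripted, "emulator.pt")

# Round-trip check: the reloaded module reproduces the original outputs.
reloaded = torch.jit.load("emulator.pt")
x = torch.randn(2, 10)
assert torch.allclose(model(x), reloaded(x))
```

On the Fortran side, the coupler would then load `emulator.pt` and call its forward pass on device arrays; the exact API depends on the coupler version used.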
The BiLSTM architecture, with roughly 17 million parameters (Table 2), is the most accurate of the ML models that we tested, and it provides stable climate simulations when coupled to ICON. A modification of the prediction above 25 km height and a smoothing of the predicted fluxes are performed to obtain accurate heating rates at the upper levels. The predictions suffer from a bias at the surface around 30° and 50° latitude. The runtime of the BiLSTM is 0.92% slower than that of the physical solver on 12 GPUs. Smaller BiLSTM models are tested down to roughly 11,000 parameters. While the smallest BiLSTM runs 13 times faster than ecRad and yields a relatively small offline global error, it produces a global mean temperature that decreases unrealistically fast at the upper levels.
The structure of this paper is as follows: Section 2 describes the implementation details for the ML radiative transfer parameterizations. Section 3 presents offline and online results. We conclude the paper with a summary in Section 4.
Methods
We first describe the data set used for training and testing the ML models. We then discuss the architecture choices for the RF and NN emulators, including the normalization of inputs and outputs, the loss function, the Gaussian smoothing of the outputs and a modification to the computation of fluxes above 25 km height for the longwave fluxes and 40 km height for the shortwave fluxes to ensure ICON's stability at height levels with very low pressure.
Data Set
A 2-year ICON aquaplanet simulation is performed with a physics time-step of 3 min and a horizontal resolution of approximately 80 km (ICON grid R02B05), resulting in a total of 81,920 atmospheric columns. The radiation parameterization ecRad is called every time-step and at full spatial resolution. The ecRad scheme is called with its default configuration in ICON (see Appendix C for more details). The simulation uses 70 vertical levels and thus 71 vertical half levels, ranging from index 70 at the surface to index 0 at the top of the atmosphere (TOA) (65 km height). The solar zenith angle is held constant at the equinox; that is, there are no seasons.
The first year of the simulation is considered as a spin-up phase during which no data are stored. Starting at the beginning of the second year, the required inputs and outputs are stored every 3 hr and 3 min (183 min output interval). This strategy allows for a more complete coverage of different Sun positions over time in the data set, which proved beneficial for the training. After 61 simulated days (480 time-steps stored every 183 min), the data set contains all possible angles based on the 3 min time-step interval.
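The coverage claim can be checked with a few lines of arithmetic: since gcd(183, 1440) = 3, successive samples taken 183 min apart cycle through every possible 3-min offset within the 1440-min day, and there are exactly 1440/3 = 480 such offsets. A quick sketch (ours, not from the paper's tooling):

```python
# Verify that storing output every 183 minutes, with a 3-minute model
# time step, visits every possible solar offset within a day.
INTERVAL_MIN = 183   # output interval in minutes
STEP_MIN = 3         # physics time step in minutes
DAY_MIN = 1440       # minutes per day

# Offsets within the day reached by 480 consecutive stored samples
# (480 samples * 183 min = 61 simulated days).
offsets = {(k * INTERVAL_MIN) % DAY_MIN for k in range(480)}

# Every offset is a multiple of the 3-minute time step ...
assert all(o % STEP_MIN == 0 for o in offsets)
# ... and all 1440/3 = 480 of them occur, i.e. every Sun position
# resolvable at the model time step has been sampled.
print(len(offsets))  # 480
```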
The data set is split into two parts: the initial 270 days for model training and the last 88 days for testing. A 7-day gap is included between these data sets to allow the test set distribution to vary from the training set. Within the initial segment, two subsections are established. The first and last 20 days of the training data set serve as a validation data set for the ML models during training, helping determine when to halt the training process using an early stopping criterion. The remaining 230 days are designated as the actual model training data set.
In ICON, the dynamical core, the horizontal diffusion, the tracer advection, the fast physics, and then the slow physics processes are solved sequentially. The radiative transfer parameterization is part of the slow physics processes. It is therefore essential to store the states of the inputs after they have been updated by the fast physics but before the slow physics update. For this reason, we modify the ICON code to extract the states of the input variables to ecRad in the middle of the sequential time splitting. The stored input and output features are given in Table 1: 4 two-dimensional and 6 three-dimensional input fields, and 4 three-dimensional output fields.
Table 1 Inputs and Outputs for the Machine Learning Emulations
| Inputs (2d) | Inputs (3d) | Outputs (3d) |
| Surface temperature | Temperature | Shortwave down |
| Surface pressure | Pressure | Shortwave up |
| Specific humidity at surface | Specific humidity | Longwave down |
| Cosine of solar zenith angle | Cloud cover | Longwave up |
| | Water content | |
| | Ice content | |
Radiation Emulation: ML Architecture
We explore different ML architectures as possible emulators of ecRad. First, an RF emulator composed of 10 trees serves as a baseline model and is used to assess the performance of the NNs. While RFs can achieve high accuracy, their memory footprint quickly exceeds 100 GB, making them prohibitively memory-intensive. For this reason, the RF size is constrained by imposing a minimum leaf size set to a fixed fraction of the training data set size.
Second, four NN architectures are explored. A baseline feed-forward MLP serves to evaluate the performance of more complex architectures. A convolutional NN is constructed, more precisely a Unet, which allows us to reduce the number of parameters significantly. In order to have a comparison with a state-of-the-art architecture with self-attention modules, a ViT is constructed. Lastly, a BiLSTM is constructed, which is the most accurate model. The BiLSTM closely imitates ecRad, which solves the effect of each atmospheric layer sequentially. However, as discussed in Section 3.2, the BiLSTM emulator is significantly slower than the Unet emulator and requires more memory. Four different BiLSTMs are trained with a decreasing number of parameters to test the performance of the BiLSTM architecture as a function of its size. In Table 2, the exact number of layers and neurons for each BiLSTM is listed. To improve the accuracy of the emulators, normalization strategies and custom loss functions are employed, as described next.
Table 2 Description of the Neural Network Architectures Explored
| Model (number of parameters) | Architecture overview |
| MLP (4,764,428) | The 2D features are passed through a dedicated MLP layer with 16 and 32 hidden units, while the 3D features are processed by separate MLP layers for each height, with no weight sharing, each with 16 and 32 hidden units. The 3D features are then flattened and concatenated with the processed 2D features. This combined tensor is fed into an MLP head with 256, 512 and 1,024 hidden units, which maps the last hidden units to a flattened vector that is reshaped into the desired output shape |
| Unet (10,820,619) | This model processes 2D and 3D features by first broadcasting the 2D features and concatenating them with the 3D features along the height axis, creating a single combined feature tensor. This tensor is then passed through a Unet architecture, which consists of convolutional units [128, 256, 512, 1,024] and corresponding pooling layers [1, 2, 5, 7]. After the Unet processing, a 1D convolutional layer is applied along the column axis to promote smoother varying output. The tensor is then processed by a convolutional layer across the height with a single output channel to create the output for the last half-level height. This layer's output is concatenated with its input to form the final output tensor covering all 71 half levels |
| BiLSTM large (16,990,492) | The 2D features are broadcast and concatenated with the 3D input along the height axis (size 70). The combined features are then concatenated with a constant vector of ones, adjusting the height to match the output height (71). Each height level's feature vector is passed to separate MLP layers for each height, with no weight sharing, with units [128, 256]. This is followed by BiLSTM layers with units [128, 256, 512] running along the height axis. Finally, a separate dense layer for each height level is applied to map the hidden units to the output features |
| BiLSTM medium (4,366,620) | Similar to the BiLSTM large model with [64, 128, 256] and [64, 128] hidden units for the BiLSTM and MLP layers |
| BiLSTM small (102,172) | Similar to the BiLSTM large model with [16, 32] and [16, 32] hidden units for the BiLSTM and MLP layers |
| BiLSTM tiny (11,300) | Similar to the BiLSTM large model with 16 BiLSTM units and an MLP layer with shared weights with 16 units |
| ViT (2,128,644) | The 3D features are passed through an embedding layer with shared weights and concatenated with the embedded 2D features along the height axis. After adding positional encoding, the resulting tensor is processed by a transformer architecture. The transformer consists of multiple layers, each containing a self-attention mechanism and a feed-forward network. The attention mechanism helps the model focus on different parts of the input, while the feed-forward network processes the attended features. The transformer has 6 heads and 4 layers with an embedding dimension of 256 |
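The BiLSTM design described in Table 2 can be sketched in PyTorch as follows. The layer sizes, class name, and feature counts here are illustrative (not the authors' exact code), and the input is simplified to 71 levels fed directly, rather than extending a 70-level input with a constant vector as the paper describes:

```python
import torch
import torch.nn as nn

class ColumnBiLSTM(nn.Module):
    """Sketch of a BiLSTM column emulator: per-level MLPs with no
    weight sharing, a bidirectional LSTM along the height axis, and a
    separate dense output head per height level."""

    def __init__(self, n_levels=71, n_feat=10, n_out=4,
                 mlp_units=16, lstm_units=16):
        super().__init__()
        # One small MLP per height level (no weight sharing).
        self.level_mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(n_feat, mlp_units), nn.ReLU())
            for _ in range(n_levels)
        )
        # Bidirectional LSTM running along the height axis.
        self.bilstm = nn.LSTM(mlp_units, lstm_units,
                              batch_first=True, bidirectional=True)
        # One output head per height level (2 * lstm_units: both directions).
        self.heads = nn.ModuleList(
            nn.Linear(2 * lstm_units, n_out) for _ in range(n_levels)
        )

    def forward(self, x):  # x: (batch, n_levels, n_feat)
        h = torch.stack([mlp(x[:, i])
                         for i, mlp in enumerate(self.level_mlps)], dim=1)
        h, _ = self.bilstm(h)  # (batch, n_levels, 2 * lstm_units)
        return torch.stack([head(h[:, i])
                            for i, head in enumerate(self.heads)], dim=1)

model = ColumnBiLSTM()
y = model(torch.randn(8, 71, 10))
print(y.shape)  # torch.Size([8, 71, 4]) -- four flux profiles per column
```

The per-level heads mirror the paper's "separate dense layer for each height level"; sharing them instead (a single `nn.Linear`) would roughly correspond to the tiny variant.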
Physics-Informed Normalization Strategy
For each atmospheric column, the model outputs are normalized to be approximately in the range [0, 1] as in Ukkonen (2022). To this end, each atmospheric column's shortwave fluxes are divided by the cosine of the solar zenith angle multiplied by the solar constant (approx. 1,400 W m−2). Atmospheric columns whose cosine of the solar zenith angle falls below a small positive threshold are not normalized in the shortwave, because these are truncated in ICON for numerical stability reasons at the day-night interface. For columns at night, that is, with a cosine of the solar zenith angle smaller than or equal to 0, the model returns zero upward and downward shortwave fluxes at each height. Following the Stefan-Boltzmann law (Petty, 2006, Chapter 6.1.3) for the emission of a black body, the longwave fluxes are divided by the fourth power of the surface temperature multiplied by the Stefan-Boltzmann constant. We do not divide by the surface emissivity since it is chosen constant and close to 1 for the aquaplanet simulation. Note that the longwave normalization may not scale the outputs exactly to the [0, 1] range since in some atmospheric conditions the surface is not the warmest level, because of temperature inversions for example. This normalization strategy improves the results of all NNs and of the RF.
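A minimal sketch of this output normalization, assuming a solar constant of 1,400 W m−2 and using a hypothetical small cutoff `eps` in place of the day-night threshold, whose exact value the text does not give:

```python
import numpy as np

SOLAR_CONSTANT = 1400.0  # W m-2, approximate value quoted in the text
SIGMA_SB = 5.670374419e-8  # Stefan-Boltzmann constant, W m-2 K-4

def normalize_targets(sw_flux, lw_flux, cos_sza, t_surf, eps=1e-4):
    """Physics-informed normalization of the target fluxes.
    sw_flux, lw_flux: (n_cols, n_half_levels); cos_sza, t_surf: (n_cols,).
    `eps` is a stand-in for the day-night cutoff (assumption)."""
    sw_norm = sw_flux.copy()
    day = cos_sza > eps
    # Shortwave: divide daytime columns by cos(zenith) * solar constant.
    sw_norm[day] = sw_flux[day] / (cos_sza[day, None] * SOLAR_CONSTANT)
    # Longwave: divide by black-body surface emission (Stefan-Boltzmann law).
    lw_norm = lw_flux / (SIGMA_SB * t_surf[:, None] ** 4)
    return sw_norm, lw_norm

cos_sza = np.array([0.5, 0.0])          # one daytime, one nighttime column
t_surf = np.array([290.0, 290.0])       # K
sw = np.array([[350.0, 300.0], [0.0, 0.0]])
lw = np.full((2, 2), 400.0)
sw_n, lw_n = normalize_targets(sw, lw, cos_sza, t_surf)
print(sw_n[0, 0])  # 350 / (0.5 * 1400) = 0.5
```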
Each input field is normalized by subtracting its mean and dividing by its standard deviation computed from the training data set. Note that the input features are normalized across all heights, which means that for each field (temperature, cloud cover, etc.), only a single mean and standard deviation are computed. This approach is adopted because, at higher atmospheric levels, certain fields that should be zero physically end up numerically near, but not equal to zero. Normalizing across all heights prevents the unnatural scaling that would occur if they were normalized independently at each height.
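The single mean/std per input field can be sketched as follows (variable names are ours):

```python
import numpy as np

def fit_field_norm(train_field):
    """One mean and one standard deviation per field, computed over
    all columns AND all heights, so near-zero values at high levels
    are not blown up by per-level scaling.
    train_field: (n_samples, n_levels)."""
    return float(train_field.mean()), float(train_field.std())

def apply_field_norm(field, mean, std):
    return (field - mean) / std

rng = np.random.default_rng(0)
t = rng.normal(250.0, 30.0, size=(1000, 70))  # e.g. temperature profiles, K
mean, std = fit_field_norm(t)
t_n = apply_field_norm(t, mean, std)  # zero mean, unit std over the whole field
```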
Penalty for the Heating Rates, Smoothing, and Simplified Equation at the Top
Obtaining accurate heating rates from the predicted fluxes is essential for updating the thermodynamic equation in ICON. Therefore, the mean squared error of the heating rates calculated from the emulated fluxes is added to the loss function of the NNs. The loss function thus consists of two parts: the mean squared error of the fluxes and the mean squared error of the heating rates. The loss function for an input vector $x$, corresponding to a single atmospheric column, can thus be written as

$$\mathcal{L}(x) = \mathrm{MSE}\big(\hat{F}(x), F(x)\big) + \alpha\, \mathrm{MSE}\big(\hat{H}(x), H(x)\big), \quad (1)$$

where $\hat{F}$ and $F$ are the predicted and target flux profiles, $\hat{H}$ and $H$ the heating rates derived from their vertical convergence, and $\alpha$ a weighting factor.
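In code, the two-term loss might look like the following sketch. The heating-rate formula H = (g/c_p) dF_net/dp is one common discretization, not necessarily ICON's exact one, and `alpha` is a hypothetical weighting factor:

```python
import torch

G = 9.81     # gravitational acceleration, m s-2
CP = 1004.0  # specific heat of dry air, J kg-1 K-1

def heating_rate(net_flux, p_half):
    """Heating rate from the vertical convergence of the net flux,
    H = (g / c_p) * dF_net / dp (one common form; ICON's exact
    discretization may differ). Shapes: (batch, n_half_levels)."""
    d_flux = net_flux[:, 1:] - net_flux[:, :-1]
    dp = p_half[:, 1:] - p_half[:, :-1]
    return (G / CP) * d_flux / dp  # (batch, n_full_levels), K s-1

def flux_loss(pred_flux, true_flux, p_half, alpha=1.0):
    """MSE of the fluxes plus an MSE penalty on the derived heating
    rates; `alpha` is an assumed weighting factor."""
    mse_flux = torch.mean((pred_flux - true_flux) ** 2)
    mse_hr = torch.mean(
        (heating_rate(pred_flux, p_half) - heating_rate(true_flux, p_half)) ** 2
    )
    return mse_flux + alpha * mse_hr

p_half = torch.linspace(10.0, 101325.0, 71).repeat(4, 1)  # Pa, TOA to surface
flux = torch.randn(4, 71)
print(flux_loss(flux, flux, p_half).item())  # 0.0 for a perfect prediction
```

Note that a constant flux offset leaves the heating-rate term untouched (the convergence cancels it), which is exactly why the penalty targets the vertical shape of the profile rather than its absolute level.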
To further improve the accuracy of the heating rates by promoting smoothly varying flux predictions along the height axis, and therefore increasing the numerical stability when the NN is coupled to ICON, a height-dependent Gaussian filter is applied along the vertical flux profile. The smoothed fluxes at height index $i$ are given by

$$\tilde{F}_i = \frac{\sum_j \exp\big(-(i-j)^2 / (2\sigma_i^2)\big)\, F_j}{\sum_j \exp\big(-(i-j)^2 / (2\sigma_i^2)\big)}, \quad (3)$$

with a height-dependent filter width

$$\sigma_i = \frac{\max(0,\, i_c - i)}{d}, \quad (4)$$

where $i_c$ denotes the index at the top of the cloud region and $d$ an empirical constant.
The denominator in Equation 4 is chosen empirically to have an increasingly larger smoothing window above the cloud region and no smoothing below. This mild smoothing of the vertical fluxes' profile helps to avoid spurious spikes in the estimated heating rates by ICON.
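A possible implementation of such a height-dependent Gaussian filter; the `sigma` profile below is illustrative, since the paper defines it empirically:

```python
import numpy as np

def gaussian_smooth_profile(flux, sigma):
    """Height-dependent Gaussian smoothing of one vertical flux profile.
    `sigma[i]` is the filter width at half-level i; sigma = 0 leaves the
    value untouched (no smoothing below the cloud region)."""
    n = len(flux)
    idx = np.arange(n)
    smoothed = flux.copy()
    for i in range(n):
        if sigma[i] <= 0.0:
            continue  # no smoothing at this level
        w = np.exp(-0.5 * ((idx - i) / sigma[i]) ** 2)
        smoothed[i] = np.sum(w * flux) / np.sum(w)
    return smoothed

flux = np.random.default_rng(1).normal(300.0, 5.0, size=71)
# Illustrative sigma: grows toward the model top (index 0 = TOA),
# zero at and below level 30 -- values there pass through unchanged.
sigma = np.maximum(0.0, (30 - np.arange(71)) / 10.0)
sm = gaussian_smooth_profile(flux, sigma)
```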
During online inference, it turned out that a challenge of the ML emulation of ecRad is obtaining accurate heating rates at atmospheric levels above 35 km height. Near the model top, the air mass between two model levels is extremely small, and small errors in the height profile of the fluxes can result in large heating rate errors, which can then break ICON simulations. For this reason, the computation of the fluxes above 25 km height for the longwave fluxes and 40 km for the shortwave fluxes is modified. The fluxes above the chosen level $j_0$ (25 km height for the longwave fluxes, 40 km height for the shortwave fluxes) are constructed by multiplying the fluxes at level $j_0$ by a set of constants $c_j$, which are optimized based on the training set:

$$F_j = c_j\, F_{j_0} \quad \text{for } j < j_0. \quad (5)$$
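Under the assumption that the per-level constants are fitted by least squares on the training set (the text says only that they are "optimized"), a sketch:

```python
import numpy as np

def fit_top_constants(flux_train, j0):
    """Least-squares fit of per-level constants c_j such that
    F_j ~= c_j * F_{j0} for levels j above the anchor level j0
    (index 0 = model top). flux_train: (n_samples, n_half_levels)."""
    anchor = flux_train[:, j0]
    return flux_train[:, :j0].T @ anchor / np.sum(anchor ** 2)

def apply_top_constants(flux, c, j0):
    """Replace the levels above j0 by scaled copies of level j0."""
    out = flux.copy()
    out[:, :j0] = np.outer(flux[:, j0], c)
    return out

# Synthetic check: build profiles where the upper levels are exact
# multiples of the anchor level, then recover the constants.
rng = np.random.default_rng(0)
anchor = rng.uniform(100.0, 400.0, size=500)
c_true = np.linspace(1.05, 1.30, 10)
flux = np.empty((500, 71))
flux[:, :10] = np.outer(anchor, c_true)
flux[:, 10] = anchor
flux[:, 11:] = rng.normal(300.0, 5.0, size=(500, 60))
c_fit = fit_top_constants(flux, j0=10)
```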
Results
We separate the discussion of the results into two parts. First, we discuss the offline performance of the ML models, that is, the ML models are applied to the test set described in Section 2.1 and thus do not interact directly with ICON. We then discuss the online performance, that is, the ML models are coupled to ICON and ICON simulations are performed with ML radiative transfer parameterizations. Unless specified otherwise, the BiLSTM model refers to the large BiLSTM in Table 2.
Offline Performance
Figure 1 shows the mean absolute error (MAE) of the ML models over the test set described in Section 2.1. The RF shows great performance in the upper levels. It is the only model to predict exactly the shortwave downward flux at the TOA. This is possible because the normalized shortwave downward flux at the TOA is equal to 1 if there is daylight and to 0 if it is night. Note that RF outputs, in contrast to the NNs', are not smoothed with Equation 3 and the simplified Equation 5 is not used at the upper levels. Since the RF predicts averages of observations, those techniques are not needed to ensure a smooth flux profile that provides accurate heating rates at the upper levels. The RF is however the least accurate model below 15 km height, where accuracy matters most. The most accurate NN overall is the BiLSTM model. It is only outperformed in the shortwave heating rate by the MLP above 25 km height. Both the Unet and the ViT have significantly larger errors for the heating rates above 15 km height. On pressure coordinates, we observe that the increase in the heating rates' error from levels 30–25 to level 0 at the TOA is concentrated in a small pressure interval from around 100 hPa up to the model top. It also indicates that there is a trade-off between accuracy near the model top and accuracy below approximately 20 hPa. This compares well with the results of the literature shown in pressure coordinates (Chevallier et al., 1998; Liu et al., 2020; Ukkonen, 2022) or height coordinates with a logarithmic scale (Lagerquist et al., 2021), where the heating rates' error increases significantly over a small pressure interval near the model top.
[IMAGE OMITTED. SEE PDF]
In Section 2.4, we discussed techniques to improve the accuracy of the heating rates, in particular at the upper height levels where all models except the RF struggle to produce accurate heating rates. We discuss the effect of adding a penalty term for the heating rates in the loss (Equation 1), the effect of the Gaussian smoothing (Equation 3), and the effect of the simplified Equation 5 used to compute the fluxes in the upper atmospheric levels. In Figure 2, the MAE of the BiLSTM is shown without a penalty for the heating rates in the loss, without Gaussian smoothing and without a simplified equation for the upper levels. We then incrementally add the heating rates' penalty term in the loss, the Gaussian smoothing and the simplified equation to compute the upper levels. The heating rates' penalty does not affect the offline MAE, but as shown in Section 3.2 in Figure 9a, this is in contrast with the online performance, where the heating rates' penalty improves the stability of ICON. The Gaussian smoothing improves the heating rates' accuracy by two orders of magnitude at the model TOA. However, it increases the longwave downward flux error, which is exactly zero for all other models since the longwave downward flux is fixed to be exactly 0 in ICON at the TOA. The simplified Equation 5 further reduces the heating rates' error at the upper levels. The final BiLSTM, which contains all the above techniques, is three orders of magnitude more accurate at the TOA than the BiLSTM that has none of the above. Without the post-processing techniques, the error reaches almost 50 K/day at the TOA for both the shortwave and longwave heating rates, making such ML models unusable for ICON simulations.
[IMAGE OMITTED. SEE PDF]
In preparation for the discussion on the computational speed of the different models, we investigate the performance of the BiLSTM architecture, which provided the best offline results as shown in Figure 1, as a function of its size, that is, its number of trainable parameters. In Figure 3, the offline performance of the BiLSTM with a decreasing number of parameters is shown. The models are described in Table 2. The large BiLSTM, corresponding to the BiLSTM shown in Figure 1, largely outperforms all models for the heating rates above 15 km height. The medium BiLSTM follows closely the large BiLSTM below 15 km height but it has a significantly larger error above. For the longwave heating rate, it is the least accurate model above 25 km height and it has a larger error there than the tiny BiLSTM. The good performance over the shortwave and large error over the longwave may arise because the loss function from Equation 1 penalizes both the shortwave and longwave fluxes simultaneously and nothing prevents the NN from favoring one output over the others. Except for the unexpected larger error of the medium model, the error behaves as expected with increasing model size, resulting in decreasing MAE. From Figure 3 alone, it is unclear whether smaller models can be used instead of the large BiLSTM. The small BiLSTM seems a good contender for the longwave prediction but has a significantly larger error in the shortwave heating rate, while the medium model, as discussed above, matches the performance of the large BiLSTM below 15 km height but under-performs above. As discussed in Figure 8, only the large BiLSTM results in a reasonable global mean temperature evolution.
[IMAGE OMITTED. SEE PDF]
To complement the discussion of the MAE expressed as a function of height, shown in Figures 1–3, we discuss the zonal climatology (Figure 4) and MAE (Figure 5) of the BiLSTM over the test set. The zonal mean in Figure 4 is taken over all time steps and all columns in one-degree latitude intervals. The zonal mean of the BiLSTM is similar, for both fluxes and heating rates, to the zonal mean of ecRad's prediction, and hence the emulator does not produce any mean state bias. Differences appear in the longwave heating rate climatology where the BiLSTM does not capture the slight tropopause warming happening between levels 20 (25 km) and 30 (15 km). Furthermore, the shortwave warming happening between levels 10 (39 km) and 0 (65 km) appears deeper for the BiLSTM and spans over a smaller range of latitude coordinates. In the zonal MAE shown in Figure 5, most of the fluxes' error appears in the tropical region. It is particularly large for the shortwave fluxes, where the error reaches 10 W m−2. In contrast, the zonal MAE for the longwave fluxes never exceeds 4.5 W m−2. The presence of clouds produces a large error above approximately 1 km height for the upward fluxes and a large error below for the downward fluxes. A larger heating rates' error is observed at the surface for both the shortwave and longwave fluxes and at the TOA for the shortwave, as already observed in the MAE profile in Figure 1.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
To better understand the larger heating rates' error of the BiLSTM model at the surface, we investigate the surface climatology (Figure 6) and MAE (Figure 7) over the test set. The surface fluxes produced by the BiLSTM and ecRad have a similar climatology, as seen in Figure 6 (Figure E1 shows the bias). The shortwave heating rate climatology produced by the BiLSTM is rather too uniform: where ecRad produces shortwave heating rates concentrated in the tropics, the BiLSTM's heating rates are close to uniform in the [−50, 50] latitude band. The longwave heating rate underestimates the cooling in the mid-latitudes. These observations are confirmed by the surface MAE in Figure 7. A uniform error is observed in the [−50, 50] latitude interval for the shortwave heating rate, and a large error in the mid-latitudes appears for the longwave heating rate. Concerning the fluxes, the error is concentrated in the tropics for the shortwave and in the polar regions for the longwave. The heating rate bias and MAE are larger at latitudes with higher cloud cover. In those regions, the cooling effect of clouds in the shortwave is overestimated at the surface, as is the longwave surface warming due to clouds (see Figure E1). The negative surface heating rate bias at 30° latitude and the positive bias around 50° latitude could potentially affect the meridional temperature gradient, which could in turn change the dynamics of the jet stream, for example. The mean surface net heating rate bias is close to −0.3 K/day and is not as large at the levels above (see Figure D2). This could potentially inhibit cloud formation and thus affect the dynamics of the global circulation.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Online Performance
Next, we couple the emulator to ICON. The coupling of the ML models, implemented in Python, to the Fortran code of ICON using PyTorch-Fortran is discussed here. In this paper, we experiment with the ICON version running fully on GPUs. ICON simulations are performed over 3 weeks with the different ML emulators. Each simulation is restarted from the end of the 23rd month of the 2-year simulation that produced the training and testing data sets. Recall that the ML models are trained with months 13–21. The online experiments related to this section are performed on the Balfrin CSCS machine with NVIDIA A100 Tensor Core GPUs (NVIDIA, 2022). Since we cannot afford to run every configuration in ensemble mode to test the significance of the differences between the ML and reference runs, we focus the discussion on the stability at the model top and the runtime of the different ML architectures.
In Figure 8, the global mean temperatures of ICON simulations with different radiation emulators are compared to the reference simulation. The ML models correspond to the models in Figure 1 and are thus trained and post-processed with the techniques presented in Section 2.4. The global mean temperature is averaged over three height level intervals: levels 29 to 0 (above 15 km), levels 49 to 30 (between 4 and 16 km) and levels 69 to 50 (below 4 km). Above 15 km height, the global mean temperature of the reference run is decreasing slightly with an almost constant tendency. The RF provides the global mean temperature evolution with the slope closest to the reference run, followed closely by the BiLSTM run. The temperature of the Unet and of the ViT runs decreases faster and is 1.8 and 2 K below the reference run, respectively, after the 3 simulated weeks. Figure 8 suggests that the BiLSTM model is a good candidate for ICON climate simulations. While the other models provide realistic temperature evolution below 16 km height, they tend to accelerate the cooling of the model top.
[IMAGE OMITTED. SEE PDF]
Pursuing the discussion started with Figure 2, Figure 9a shows the effect of the penalty for the heating rates in the loss, of the Gaussian smoothing, and of the simplified Equation 5 on the global mean temperature. The heating rates' penalty, which has close to no effect on the offline MAE as shown in Figure 2, stabilizes the simulation at the TOA for the first 2 simulated weeks, after which the global temperature starts to increase. Furthermore, the heating rates' penalty corrects the positive temperature bias that appears in the simulation without the heating rates' penalty between 4 and 16 km height. In contrast, the Gaussian smoothing, which improves the offline accuracy of the heating rates by 2 orders of magnitude at the TOA as shown in Figure 2, counteracts the stabilizing effect of the heating rates' penalty at the top. This highlights that offline performance cannot be translated one-to-one to online performance. When the fluxes are smoothed, the global temperature increases with a constant slope from the beginning of the simulation. Additionally, solving the shortwave fluxes above level 10 and the longwave fluxes above level 20 with the simplified Equation 5 stabilizes the simulation at the top.
[IMAGE OMITTED. SEE PDF]
We discuss next the online performance of the BiLSTM model with varied numbers of parameters. In Figure 9b, the global mean temperature of the BiLSTM with decreasing size is shown. The BiLSTM models are described in Table 2 and their offline MAE is shown in Figure 3. The large BiLSTM, corresponding to the BiLSTM shown in Figure 8, is the only BiLSTM with a stable global mean temperature at the upper levels. For the smaller BiLSTMs, the global mean temperature decreases in the upper levels. The tiny BiLSTM produces a global temperature that decreases too fast at the middle levels. This results in a 0.3–0.4 K error after 2 weeks of simulation, after which the global temperature remains constant. At the lower levels, the global mean temperature of the tiny model is almost constantly 0.1 K too low, while the small model produces a mean temperature almost constantly 0.1 K too high. The medium model closely follows ecRad's temperature below 16 km, similar to the large model.
Figure 10 presents meridional vertical cross sections of the net heating rate along the prime meridian 2 d, 4 d, and 6 d into the simulation using ecRad and the BiLSTM model. Both simulations are started from the same restart file after 23 months of simulation with ecRad. After 2 days, both parameterizations exhibit nearly identical heating rate profiles. By day 4, differences emerge near the equator. By day 6, they have become more pronounced above 600 hPa, which is expected as small differences between the simulations grow over time, similar to ensemble runs. The mean net heating rates align in both simulations and show no systematic emulation bias.
[IMAGE OMITTED. SEE PDF]
Performance Metrics
In this project, one motivation for using an ML radiative transfer parameterization is to accelerate the computation of the radiation component in ICON. We thus investigate the runtime of the different NN models coupled to ICON. Table 3 shows the run times of the radiation computation for 100 ICON time steps, corresponding to 5 hr of simulated time. The BiLSTM model has a runtime similar to ecRad and requires more than 2 GPUs (Nvidia A100 tensor core GPUs with 80 GB of memory) to run. In comparison, the Unet and MLP models are respectively 5 and 8 times faster than ecRad on 12 GPUs. Reducing the size of the BiLSTM reduces the runtime significantly: the tiny and small BiLSTMs are respectively 13 and 8 times faster than ecRad on 12 GPUs. Note that, for the ICON simulations corresponding to the run times given in Table 3, the radiation parameterization is called every time step and is solved at full resolution.
Table 3 Comparison of Run Time in Seconds for 100 ICON Time-Steps Using ecRad and the BiLSTM, MLP and the Unet Emulators as a Function of the Number of Graphic Processing Unit (GPU) Nodes (Nvidia A100 Tensor Core GPU With 80 GB of Memory)
| Model | 2 GPUs | 4 GPUs | 8 GPUs | 12 GPUs |
| --- | --- | --- | --- | --- |
| MLP | 5 | 4 | 3 | 3 |
| Unet | 17 | 9 | 6 | 5 |
| BiLSTM tiny | 3 | 3 | 2 | 2 |
| BiLSTM small | 10 | 5 | 4 | 3 |
| BiLSTM medium | 76 | 40 | 21 | 15 |
| BiLSTM large | OOM | 76 | 40 | 28 |
| ViT | 110 | 57 | 31 | 22 |
| ecRad | 151 | 75 | 38 | 26 |
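As a quick consistency check, the speed-up factors quoted in the text follow directly from the 12-GPU column of Table 3:

```python
# Run times in seconds for 100 ICON time steps on 12 GPUs (Table 3).
runtime_12_gpus = {
    "MLP": 3, "Unet": 5, "BiLSTM tiny": 2, "BiLSTM small": 3,
    "BiLSTM medium": 15, "BiLSTM large": 28, "ViT": 22, "ecRad": 26,
}

# Speed-up relative to ecRad; values above 1 mean faster than ecRad.
speedup = {name: runtime_12_gpus["ecRad"] / t
           for name, t in runtime_12_gpus.items() if name != "ecRad"}
# Unet -> 5.2x, MLP -> ~8.7x, BiLSTM tiny -> 13x, BiLSTM large -> ~0.93x
```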
The ML models' size could be reduced further with, for example, automatic mixed precision (Carilli, 2024), where some operations are done in half precision instead of single precision, and dynamic quantization (Dynamic Quantization, 2024), which reduces the precision of the model's weights. ICON and PyTorch are written in two different languages (Fortran with OpenACC and C++ with CUDA) which access the same GPU memory space when coupled together. It is yet unclear how both languages interact regarding data access, and further profiling would be required to optimize how PyTorch should share data access with the rest of the ICON code. This is, however, beyond the scope of this study. By comparison, ICON and ecRad are both written in Fortran with OpenACC directives. Note also that in Ukkonen and Hogan (2024), the authors restructured the ecRad code and improved its runtime performance by a factor of up to 3. The code restructuring is designed for CPU usage, but the improved parallelism could also benefit a GPU implementation of ecRad. The runtime performance can be further improved, to a total speed-up factor of up to 12, by using the new spectrally reduced gas optics model ecCKD (Hogan & Matricardi, 2022). Moreover, in our study, ML models running in single precision are compared to ecRad running in double precision, but with the code restructuring presented in Ukkonen and Hogan (2024), ecRad can run fully in single precision. Depending on its performance on GPUs, this optimized version of ecRad could outperform the ML models presented here.
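Both techniques are available directly in PyTorch. The following is a hedged sketch on a stand-in MLP, not one of the paper's models: the layer sizes are illustrative, and CPU bfloat16 autocast is used here only to keep the example portable (on GPUs one would autocast to float16 on the CUDA device).

```python
import torch
import torch.nn as nn

# Stand-in two-layer MLP emulator; sizes are illustrative only.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64)).eval()
x = torch.randn(8, 64)

# Dynamic quantization: weights of the Linear layers are stored as int8
# and dequantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Automatic mixed precision: eligible ops run in a lower-precision dtype
# inside the autocast context.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_amp = model(x)

with torch.no_grad():
    y_q = quantized(x)
```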
Summary
In this study, ML emulations of the ecRad radiative transfer parameterization are built for the GPU-enabled ICON model. Through a series of offline tests, the most accurate ML model has been identified as a bidirectional recurrent NN with long short-term memory layers (BiLSTM), combined with a physics-informed normalization of input and output features and an additional term for the heating rates in the loss function. In this work, a significant technical advancement is made by integrating the ML radiative transfer parameterization into ICON and executing both in tandem on GPUs.
Since the air in the upper atmospheric layers is substantially less dense than in lower layers, a small error in the flux profile results in a large heating-rate error in the upper levels, which during long integrations can cause spurious temperature trends near the model top. This problem is difficult to detect during offline training and testing, but becomes apparent when the ML models are integrated into the driving ESM. Model tops are known to often require additional tuning that offline training does not capture. For example, Brenowitz and Bretherton (2019) removed the model top from training to stabilize the online performance of their ML-based convection scheme. To mitigate the aforementioned problem with large heating rates, which is easily overlooked in offline testing, a Gaussian smoother is applied to the output fluxes and a simplified equation is used to calculate the fluxes in the damping layer of ICON near the model top. This reduces the heating-rate error significantly and keeps the model free from temperature drifts and very close to the original ecRad simulation for several simulated weeks. However, future research is needed to further improve the emulation at the top of the model, in particular for the shortwave radiation, and to increase the reliability of the results at scales beyond weather forecasting. A potential way forward is to train an emulator that infers the TOA and surface shortwave and longwave fluxes plus the heating rates on all levels within the atmosphere, instead of the full flux profiles.
To seamlessly connect the Fortran code with the NNs implemented with Python, the Infero coupler from ECMWF was explored initially (Section Appendix B). Infero requires that the ML model inputs and outputs are provided in the CPU memory. However, the version of ICON examined in this paper operates entirely on GPUs. Consequently, using Infero leads to needless data transfers between CPU and GPU memory, causing notable delays in the ML parameterization compared to ecRad. However, the next generation of hardware, like the Grace Hopper chips (NVIDIA, 2023) chosen for the next CSCS supercomputer Alps (CSCS et al., 2021), reduces the overhead of CPU-GPU copies and can even expose a shared CPU-GPU memory space to the user. This could make Infero more competitive, even in its current form. There appear to be no obstacles preventing the adaptation of Infero for complete GPU utilization, thereby eliminating the need for data transfers between hardware components. As such, those limitations of Infero could potentially be nonexistent in future versions of this software. In this paper, to avoid CPU-GPU copies, the PyTorch-Fortran library (Alexeev, 2022) is adopted as an alternative solution. This approach enables direct processing of ML inputs and outputs on the GPU. The integrated system of GPU-enabled ICON, using OpenACC, and ML emulator implemented with PyTorch is deployed on the Piz Daint and Balfrin systems at CSCS, leveraging GPU capabilities for enhanced performance.
To the best of our knowledge, this is the first time that emulators of the radiation code using different ML architectures have been evaluated and compared online, and that a full-fledged weather and climate model combined with an ML-based parameterization developed in PyTorch has been run completely on GPUs. The performance gain compared with simulations using the original ecRad radiation scheme depends critically on the complexity of the NN architecture, and not all tested NNs are per se faster than the traditional physics-based code. The ML emulators appear to suffer from a trade-off between accuracy and speed similar to that of physical radiation schemes, and a speed-up compared to an optimized radiation code is thus not guaranteed. Performance gains reported in past studies may stem from the fact that the emulated parameterization was originally run on CPUs while the ML emulator was run on GPUs, from comparing an outdated radiation code that makes poor use of the hardware to ML models that use it well, or from the use of fast ML models that are not accurate enough and may lead to instabilities when coupled to a weather or climate model. We cannot rule out the possibility that a more sophisticated tuning of the NN architectures would result in a higher speed-up, but the same holds for the original radiation solver (Ukkonen & Hogan, 2024).
Appendix A - Modification of the Computation of the Fluxes at the Upper Levels
Four different sets of constants corresponding to the shortwave downward, shortwave upward, longwave downward and longwave upward fluxes are constructed. The constants , , are given by
For the upper levels, where the fluxes are approximated with Equation 5, the heating rates at level h for h in , are given by the following equation:
Appendix B - Implementation of the ML Emulators Into ICON
The implementation of the ML models written in Python into the Fortran code of ICON using Infero (Antonino Bonanni & Hawkes, 2022) is discussed here. Infero works with models built with the Tensorflow Python library (Abadi et al., 2015) or with models in the ONNX format (Bai et al., 2019). The RF is trained with the Scikit-learn (Pedregosa et al., 2011) Python library and then ported to the ONNX format. The NNs are implemented with Tensorflow. In Figure B1, we show a schematic of how Infero integrates into ICON running on GPUs. Infero requires the inputs to be in CPU (referred to as host) memory and, after an ML prediction on the GPU (referred to as device), it returns the outputs to CPU memory. This is a limitation for a weather or climate model running on GPUs, as it leads to back-and-forth copies of data between the separate memory spaces. Note, however, that for most models, which run solely on the central processing unit (CPU), this is not an issue, and Infero is then a suitable coupler. Furthermore, Infero is developed by ECMWF, which is increasing its efforts in using ML methods to improve current weather models (ECMWF, 2023). Infero may thus see improvements in the future, allowing for an optimized coupling on GPUs.
[IMAGE OMITTED. SEE PDF]
In this paper, we experiment with the ICON version running fully on GPUs. To use Infero, the data are first copied from the GPU to the CPU (step 1 in Figure B1). Infero can then access the input data (step 2) and compute the outputs with the NNs or RF (steps 3–5). Internally, Infero uses a “Tensorflow for C” backend, which will copy the input data to GPU memory (step 3), run the ML model on the GPUs (step 4) and then copy the output data back to CPU memory (step 5). Finally, the output data are accessed by ICON (step 6) and copied back to GPU memory (step 7). The input and output data are thus copied twice each. This drastically slows down the computation of the radiative fluxes and diminishes the possible speed-up from using an ML model instead of ecRad.
An alternative to Infero is PyTorch-Fortran (Alexeev, 2022). Figure B2 shows how the ML models are integrated into ICON with PyTorch-Fortran. PyTorch-Fortran requires the models to be built with the PyTorch library instead of Tensorflow. The main advantage of PyTorch-Fortran is that both the ML emulator and ICON can run purely on GPUs. PyTorch-Fortran first obtains pointers to the input data, then runs the ML model and returns pointers to the output data to ICON. It therefore completely avoids copying data between CPU and GPU memory, which is a major advantage compared to Infero. A limitation, however, is that PyTorch-Fortran does not work with models in the ONNX format. This renders it incompatible with the Scikit-learn library, which contains a variety of ML models.
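On the Python side, this kind of coupling requires exporting the trained model to a serialized form that non-Python code can load through the PyTorch C++ runtime. The following is a sketch using TorchScript tracing; the stand-in model, feature sizes, and file name are illustrative assumptions, not the paper's setup.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in emulator; the real architectures and feature sizes are given
# in Tables 1 and 2 of the paper.
model = nn.Sequential(nn.Linear(64, 128), nn.Tanh(), nn.Linear(128, 64)).eval()

# Trace into TorchScript so the model can run outside the Python
# interpreter, e.g. driven from Fortran via the PyTorch C++ API.
example_input = torch.randn(1, 64)
scripted = torch.jit.trace(model, example_input)

path = os.path.join(tempfile.mkdtemp(), "emulator.pt")
scripted.save(path)

# Round-trip check: the reloaded module reproduces the original outputs.
reloaded = torch.jit.load(path)
with torch.no_grad():
    same = torch.allclose(model(example_input), reloaded(example_input))
```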
[IMAGE OMITTED. SEE PDF]
Appendix C - EcRad Default Configuration in ICON
The default choices for the optic models and for the solver of ecRad in ICON are given in Table C1. We do not model aerosols in our simulations. EcRad runs in double precision.
Table C1 EcRad Default Choices in ICON
| Gas optic model | RRTMG-IFS |
| Liquid optic model | SOCRATES |
| Ice optic model | Fu-IFS |
| Fluxes solver model | McICA |
Appendix D - Climatology Over Height
Figure D1 shows at each height the global climatology of the BiLSTM model and ecRad. Figure D2 shows the difference at each height between the global climatology of the BiLSTM model and ecRad.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Appendix E - Climatology Difference at the Surface
Figure E1 shows the difference between the surface climatology of the BiLSTM model and the surface climatology of ecRad, both shown in Figure 6.
[IMAGE OMITTED. SEE PDF]
Appendix F - Schematics of the Architectures
Figures F1–F4 are schematics of the different NN architectures detailed in Table 2. The inputs x_2d and x_3d and the output are listed in Table 1.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Acknowledgments
This work was funded through a grant by the Swiss Data Science Center (SDSC Grant C20-03). This research was supported by computation resources provided by the EXCLAIM project funded by ETH Zurich (CSCS project number d121). The Center for Climate Systems Modeling (C2SM) at ETH Zurich is acknowledged for providing technical and scientific support. Sebastian Schemm and Stefan Rüdisühli are supported by the European Research Council, H2020 European Research Council (Grant 848698). From November 2024 onward, Guillaume Bertoli was supported by the SNF postdoc mobility grant P500PN_225397. We thank the anonymous reviewers of the present manuscript for their helpful comments. Open access publishing facilitated by Eidgenossische Technische Hochschule Zurich, as part of the Wiley - Eidgenossische Technische Hochschule Zurich agreement via the Consortium Of Swiss Academic Libraries.
Data Availability Statement
The data were generated using the ICON climate model described in Prill et al. (2023). The software is available from MPI-M et al. (2024). The codes to reproduce the results of this paper are available from Mohebi and Özdemir (2025). Data to reproduce the results of this work are hosted at the ETH Research Collection (Bertoli et al., 2025).
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2015). Tensorflow: Large‐scale machine learning on heterogeneous systems. Retrieved from https://www.tensorflow.org/
Alexeev, D. (2022). PyTorch‐Fortran v0.4. GitHub. Retrieved from https://github.com/alexeedm/pytorch‐fortran
Antonino Bonanni, T. Q., & Hawkes, J. (2022). Infero 0.1.0. GitHub. Retrieved from https://github.com/ecmwf/infero
Bai, J., Lu, F., Zhang, K., Chen, C., Fehlner, A., Ramalingam, G., et al. (2019). ONNX: Open neural network exchange. Retrieved from https://github.com/onnx/onnx
Bertoli, G., Mohebi Ganjabadi, S., Ozdemir, F., Jucker, J., Rüdisühli, S., Perez‐Cruz, F., et al. (2025). Icosahedral nonhydrostatic (ICON) model aquaplanet offline simulation data for radiation solver emulation [Dataset]. ETH Zurich. https://doi.org/10.3929/ethz‐b‐000721647
Brenowitz, N. D., & Bretherton, C. S. (2018). Prognostic validation of a neural network unified physics parameterization. Geophysical Research Letters, 45(12), 6289–6298. https://doi.org/10.1029/2018gl078510
Brenowitz, N. D., & Bretherton, C. S. (2019). Spatially extended tests of a neural network parametrization trained by coarse‐graining. Journal of Advances in Modeling Earth Systems, 11(8), 2728–2744. https://doi.org/10.1029/2019ms001711
Carilli, M. (2024). Automatic mixed precision. PyTorch. Retrieved from https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html
Chevallier, F., Chéruy, F., Scott, N. A., & Chédin, A. (1998). A neural network approach for a fast and accurate computation of a longwave radiative budget. Journal of Applied Meteorology, 37(11), 1385–1397. https://doi.org/10.1175/1520‐0450(1998)037〈1385:annafa〉2.0.co;2
CSCS, NVIDIA, & Entreprise, H. P. (2021). Swiss National Supercomputing Centre, Hewlett Packard enterprise and NVIDIA announce world’s most powerful AI‐capable supercomputer. Retrieved from https://www.cscs.ch/science/computer‐science‐hpc/2021/cscs‐hewlett‐packard‐enterprise‐and‐nvidia‐announce‐worlds‐most‐powerful‐ai‐capable‐supercomputer
Dynamic Quantization. (2024). PyTorch. Retrieved from https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
ECMWF. (2023). Annual report 2022. Retrieved from https://www.ecmwf.int/sites/default/files/elibrary/2023/81359‐annual‐report‐2022.pdf
Gentine, P., Pritchard, M., Rasp, S., Reinaudi, G., & Yacalis, G. (2018). Could machine learning break the convection parameterization deadlock? Geophysical Research Letters, 45(11), 5742–5751. https://doi.org/10.1029/2018gl078202
Giorgetta, M. A., Sawyer, W., Lapillonne, X., Adamidis, P., Alexeev, D., Clément, V., et al. (2022). The ICON‐A model for direct QBO simulations on GPUs (version icon‐cscs:baf28a514). Geoscientific Model Development, 15(18), 6985–7016. https://doi.org/10.5194/gmd‐15‐6985‐2022
Hafner, K., Iglesias‐Suarez, F., Shamekh, S., Gentine, P., Giorgetta, M. A., Pincus, R., & Eyring, V. (2024). Interpretable machine learning‐based radiation emulation for icon. ESS Open Archive. https://doi.org/10.22541/essoar.173169996.65100750/v1
Hafner, K., Iglesias‐Suarez, F., Shamekh, S., Gentine, P., Giorgetta, M. A., Pincus, R., & Eyring, V. (2025). Stable machine learning based radiation emulation for ICON. ESS Open Archive. https://doi.org/10.22541/essoar.174708082.27787580/v1
Hogan, R. J., & Bozzo, A. (2015). Mitigating errors in surface temperature forecasts using approximate radiation updates. Journal of Advances in Modeling Earth Systems, 7(2), 836–853. https://doi.org/10.1002/2015MS000455
Hogan, R. J., & Bozzo, A. (2018). A flexible and efficient radiation scheme for the ECMWF model. Journal of Advances in Modeling Earth Systems, 10(8), 1990–2008. https://doi.org/10.1029/2018ms001364
Hogan, R. J., & Matricardi, M. (2022). A tool for generating fast k‐distribution gas‐optics models for weather and climate applications. Journal of Advances in Modeling Earth Systems, 14(10), e2022MS003033. https://doi.org/10.1029/2022ms003033
Kashinath, K., Mustafa, M., Albert, A., Wu, J.‐L., Jiang, C., Esmaeilzadeh, S., et al. (2021). Physics‐informed machine learning: Case studies for weather and climate modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2194), 20200093. https://doi.org/10.1098/rsta.2020.0093
Kedward, L. J., Aradi, B., Certik, O., Curcic, M., Ehlert, S., Engel, P., et al. (2022). The state of Fortran. Computing in Science & Engineering, 24(2), 63–72. https://doi.org/10.1109/mcse.2022.3159862
Krasnopolsky, V. M., Fox‐Rabinovitz, M. S., & Chalikov, D. V. (2005). New approach to calculation of atmospheric model physics: Accurate and fast neural network emulation of longwave radiation in a climate model. Monthly Weather Review, 133(5), 1370–1383. https://doi.org/10.1175/mwr2923.1
Lagerquist, R., Turner, D., Ebert‐Uphoff, I., Stewart, J., & Hagerty, V. (2021). Using deep learning to emulate and accelerate a radiative‐transfer model. Journal of Atmospheric and Oceanic Technology. https://doi.org/10.1175/jtech‐d‐21‐0007.1
Liu, Y., Caballero, R., & Monteiro, J. M. (2020). RadNet 1.0: Exploring deep learning architectures for longwave radiative transfer. Geoscientific Model Development, 13(9), 4399–4412. https://doi.org/10.5194/gmd‐13‐4399‐2020
Meyer, D., Hogan, R. J., Dueben, P. D., & Mason, S. L. (2022). Machine learning emulation of 3d cloud radiative effects. Journal of Advances in Modeling Earth Systems, 14(3), e2021MS002550. https://doi.org/10.1029/2021ms002550
Mohebi, S., & Özdemir, F. (2025). pidlrad: Physics‐informed deep learning based radiation solver [Software]. Zenodo. https://doi.org/10.5281/zenodo.16760889
MPI‐M, DWD, KIT, DKRZ, & C2SM. (2024). ICON: Icosahedral nonhydrostatic weather and climate model [Software]. ICON. https://icon‐model.org/
NVIDIA. (2022). NVIDIA A100 tensor core GPU. Author. Retrieved from https://www.nvidia.com/content/dam/en‐zz/Solutions/Data‐Center/a100/pdf/nvidia‐a100‐datasheet‐nvidia‐us‐2188504‐web.pdf
NVIDIA. (2023). NVIDIA GH200 grace hopper superchip. Author. Retrieved from https://resources.nvidia.com/en‐us‐grace‐cpu/grace‐hopper‐superchip
O’Gorman, P. A., & Dwyer, J. G. (2018). Using machine learning to parameterize moist convection: Potential for modeling of climate, climate change, and extreme events. Journal of Advances in Modeling Earth Systems, 10(10), 2548–2563. https://doi.org/10.1029/2018ms001351
OpenACC. (2022). OpenACC, version 3.2. Retrieved from https://www.openacc.org/
Pal, A., Mahajan, S., & Norman, M. R. (2019). Using deep neural networks as cost‐effective surrogate models for super‐parameterized E3SM radiative transfer. Geophysical Research Letters, 46(11), 6069–6079. https://doi.org/10.1029/2018gl081646
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. (2017). Automatic differentiation in PyTorch. In NIPS‐W.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit‐learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Petty, G. W. (2006). A first course in atmospheric radiation. Sundog Publishing.
Prill, F., Reinert, D., Rieger, D., & Zängl, G. (2023). ICON tutorial 2023: Working with the ICON model. Deutscher Wetterdienst. https://doi.org/10.5676/DWD_PUB/NWV/ICON_TUTORIAL2023
Roh, S., & Song, H.‐J. (2020). Evaluation of neural network emulations for radiation parameterization in cloud resolving model. Geophysical Research Letters, 47(21), e2020GL089444. https://doi.org/10.1029/2020gl089444
Ukkonen, P. (2022). Exploring pathways to more accurate machine learning emulation of atmospheric radiative transfer. Journal of Advances in Modeling Earth Systems, 14(4), e2021MS002875. https://doi.org/10.1029/2021ms002875
Ukkonen, P., & Hogan, R. J. (2024). Twelve times faster yet accurate: A new state‐of‐the‐art in radiation schemes via performance and spectral optimization. Journal of Advances in Modeling Earth Systems, 16(1), e2023MS003932. https://doi.org/10.1029/2023ms003932
Ukkonen, P., Pincus, R., Hogan, R. J., Nielsen, K. P., & Kaas, E. (2020). Accelerating radiation computations for dynamical models with targeted machine learning and code optimization. Journal of Advances in Modeling Earth Systems, 12(12), e2020MS002226. https://doi.org/10.1029/2020ms002226
Veerman, M. A., Pincus, R., Stoffer, R., van Leeuwen, C. M., Podareanu, D., & van Heerwaarden, C. C. (2021). Predicting atmospheric optical properties for radiative transfer computations using neural networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2194), 20200095. https://doi.org/10.1098/rsta.2020.0095
Yuval, J., O’Gorman, P. A., & Hill, C. N. (2021). Use of neural networks for stable, accurate and physically consistent parameterization of subgrid atmospheric processes with good performance at reduced precision. Geophysical Research Letters, 48(6), e2020GL091363. https://doi.org/10.1029/2020gl091363
© 2025. This work is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).