Introduction
Should global warming occur at the upper end of the range of current projections, the local impacts of unmitigated climate change would be dramatic. Particular concerns relate to the projected sea-level rise, increases in the incidence of extreme events such as heat waves and floods, and changes in the availability of water resources and the occurrence of droughts.
Current climate projections are mostly based on global climate models (GCMs). These models represent the coupled atmosphere–ocean–land system and integrate the governing equations forward in time, for instance for a set of prescribed emissions scenarios. Despite significant progress over recent decades, uncertainties are still large. For example, current estimates of the equilibrium global mean surface warming for doubled greenhouse gas concentrations range between 1.5 and 4.5 °C. On regional scales and in terms of the hydrological cycle, the uncertainties are even larger. Reducing the uncertainties of climate change projections, in order to make optimal mitigation and adaptation decisions, is thus urgent and has a tremendous economic value.
How can the uncertainties of climate projections be reduced? There is overwhelming evidence from the literature that the leading cause of uncertainty is the representation of clouds, largely due to their influence upon the reflection of incoming solar radiation. Horizontal resolutions of current global climate models are typically in the range of 50–200 km. At this resolution, clouds must be parametrized, based on theoretical and semiempirical considerations. Refining the resolution to the kilometer scale would allow the explicit representation of deep convective clouds (thunderstorms and rain showers; e.g., Fig. ). Studies using regional climate models demonstrate that at this resolution, the representation of precipitation is dramatically improved. The representation of shallow cumulus cloud layers, which are common over significant fractions of the tropical oceans, requires even higher resolution. The United States National Academy of Sciences has thus recommended developing “high-end global models that execute efficiently ..., enabling cloud-resolving atmospheric resolutions (2–4 km) and eddy-resolving ocean resolutions (5 km)” in the near future.
Visualization of a baroclinic wave at day 10 of a simulation with 930 m grid spacing. White shading: volume rendering of cloud ice, cloud water, and graupel (g kg⁻¹). Blue shading: isosurface of rain and snow hydrometeors (g kg⁻¹). The white contours denote surface pressure.
[Figure omitted. See PDF]
While the scientific prospects of such an undertaking are highly promising, the computational implications are significant. Increasing the horizontal resolution from 50 to 2 km increases the computational effort by at least a factor of 10 000, since the number of grid columns grows by a factor of 25 in each horizontal direction and the time step has to be reduced accordingly. Such simulations will only be possible on future extreme-scale high-performance computers. Furthermore, power constraints have been driving the widespread adoption of many-core accelerators in leading-edge supercomputers, and the weather and climate community is struggling to migrate its large existing codes to these architectures and use them efficiently.
But what does efficient mean? While concerns about the total cost of ownership of a high-performance computing (HPC) system have shifted the focus from peak floating point performance towards improving power efficiency, it is not clear what the right efficiency metric is for a fully fledged climate model. Today, floating point operations are around 100 times cheaper than data movement in terms of time and 1000 times cheaper in terms of energy, depending on where the data come from. Thus, while focusing on floating point operations was very relevant 25 years ago, it has lost most of that relevance today. Instead, domain-specific metrics may be much more applicable for evaluating and comparing application performance. A metric often used for climate models is the throughput achieved by the simulation, measured in simulated years per wall clock day (SYPD; see for a detailed discussion on metrics). For global atmospheric models, a suitable near-term target is to conduct decade-long simulations and to participate in the Atmospheric Model Intercomparison Project (AMIP) effort.
This is part of the Coupled Model Intercomparison Project (CMIP6; see ).
Such an exercise requires a 36-year simulation for the period 1979–2014, driven by observed atmospheric greenhouse gas concentrations and sea-surface temperatures. Within the context of current climate modeling centers, such a simulation would be feasible for an SYPD greater than or equal to 0.2–0.3; at that rate, the 36-year simulation would take roughly four to six months of wall-clock time. However, domain-specific metrics such as SYPD are very dependent on the specific problem and approximations in the code under consideration and are often hard to compare. Ideally, comparisons would be performed for production-quality global atmospheric models that have been extensively validated for climate simulations and cover the full (non-hydrostatic and compressible) dynamics and the entire suite of model parameterizations. With the SYPD metric alone, it is hard to assess how efficiently a particular computing platform is used. Efficiency of use is particularly important because, on the typical scale of climate simulations, computing resources are very costly and energy intensive. Thus, running high-resolution climate simulations also faces a significant computer science problem when it comes to computational efficiency. As mentioned before, floating point efficiency is often not relevant for state-of-the-art climate codes. Not only does counting floating point operations per second (flop s⁻¹) not reflect the actual (energy) costs well, but the typical climate code has very low arithmetic intensity (the ratio of floating point operations to consumed memory bandwidth). Attempts to increase the arithmetic intensity may increase the floating point rate, but it is not clear whether they improve any of the significant metrics (e.g., SYPD). However, solely focusing on memory bandwidth can also be misleading. Thus, we propose memory usage efficiency (MUE), a new metric that considers the efficiency of the code's implementation with respect to input/output (I/O) complexity bounds as well as the achieved system memory bandwidth.
In summary, the next grand challenge of climate modeling is refining the grid spacing of production model codes to the kilometer scale, as this will allow addressing long-standing open questions and uncertainties on the impact of anthropogenic effects on the future of our planet. Here, we address this grand challenge and demonstrate the first simulations of a production-level atmospheric model at these scales, delivering 0.23 SYPD at a grid spacing of 1.9 km (and 0.043 SYPD at 930 m), the former being sufficient for AMIP-type simulations. Further, we evaluate the efficiency of these simulations using a new memory usage efficiency metric.
Current state of the art
Performing global kilometer-scale climate simulations is an ambitious goal, but a few kilometer-scale landmark simulations have already been performed. While arguably not the most relevant metric, many of the studies have reported sustained floating point performance. In 2007, performed a week-long simulation with a horizontal grid spacing of 3.5 km with the Nonhydrostatic Icosahedral Atmospheric Model (NICAM) on the Earth Simulator, and in 2013 performed a 12 h long simulation at a grid spacing of 870 m on the K computer, achieving 230 Tflop s⁻¹ of double-precision performance.
1 Tflop s⁻¹ = 10¹² flop s⁻¹.
In 2014, performed a 20-day long simulation with a horizontal grid spacing of 3 km with the Model for Prediction Across Scales (MPAS) and later, in 2015, participated in the Next Generation Global Prediction System (NGGPS) model intercomparison project at the same resolution, achieving 0.16 SYPD on the full National Energy Research Scientific Computing Center (NERSC) Edison system. In 2015, simulated several months of an extended aquaplanet channel at a grid spacing of 4 km using the System for Atmospheric Modeling (SAM). were the first to deploy a weather code on the TSUBAME system accelerated using graphics processing units (GPUs). The fully rewritten NICAM model sustained a double-precision performance of 60 Tflop s⁻¹ on 2560 GPUs of the TSUBAME 2.5 supercomputer. In 2016, implemented a fully implicit dynamical core at 488 m grid spacing in a β-plane channel, achieving 7.95 Pflop s⁻¹ on the TaihuLight supercomputer.
1 Pflop s⁻¹ = 10¹⁵ flop s⁻¹.
The optimal numerical approach for high-resolution climate models may depend on the details of the target hardware architecture. For a more thorough analysis, the physical propagation of information in the atmosphere has to be considered. While many limited-area atmospheric models use a filtered set of the governing equations that suppresses sound propagation, these approaches are not precise enough for global applications. Thus, the largest physical group velocity global atmospheric models have to face is the speed of sound, which in the atmosphere amounts to between 280 and 360 m s⁻¹. In a time span of an hour, the minimum distance across which information needs to be exchanged thus amounts to about 1500 km, corresponding to a tiny fraction (1.4 %) of the Earth's surface. However, many numerical schemes exchange information at much larger rates. For instance, the popular pseudo-spectral methodology (e.g., ) requires Legendre and Fourier transforms between the physical grid and the spherical harmonics and thus couples all grid points globally at each time step. Similarly, semi-Lagrangian semi-implicit time-integration methods require the solution of a Helmholtz-type elliptic equation, which implies global communication at each time step. Both methods use long time steps, which may partially mitigate the additional communication overhead. While these methods have enabled fast and accurate solutions at intermediate resolution in the past, they are likely not suited for ultrahigh-resolution models, as the rate of communication typically increases with the number of horizontal grid points. Other approaches use time-integration methods with only locally implicit solvers (e.g., ), which try to retain the advantages of fully implicit methods but only require nearest-neighbor communication.
The main advantage of implicit and semi-implicit approaches is that they allow large acoustic Courant numbers C_s = c_s Δt/Δx, where c_s denotes the speed of sound and Δt and Δx the time step and the grid spacing, respectively. For instance, use an acoustic Courant number of up to 177; i.e., their time step is 177 times larger than in a standard explicit integration (this estimate is based on the 488 m simulation with Δt = 240 s). In their case, such a large time step may be chosen, as sound propagation is not relevant for weather phenomena.
However, although implicit methods are unconditionally stable (stable irrespective of the time step used), there are other limits to choosing the time step. In order to appropriately represent advective processes with typical velocities of up to 100 m s⁻¹ and associated phase changes (e.g., condensation and fallout of precipitation), numerical principles dictate an upper limit to the advective Courant number C_a = u Δt/Δx, where u denotes the largest advective wind speed. The specific limit for C_a depends on the numerical implementation and time-stepping procedures. For instance, semi-Lagrangian schemes may produce accurate results for values of C_a up to 4 or even larger. For most standard implementations, however, there are much more stringent limits, often requiring that C_a ≤ 1. For the recent fully implicit scheme mentioned above, which used a time step of 240 s, the advective Courant number reaches values of up to 4.2 and 17.2 for the 2 km and 488 m simulations, respectively. Depending upon the numerical approximation, such a large Courant number will imply significant phase errors or even a reduction in effective model resolution. In order to produce accurate results, the scheme would require a significantly smaller time step and would require reducing the time step with decreasing grid spacing. For the NGGPS intercomparison, the hydrostatic Integrated Forecasting System (IFS) model used a time step of 120 s at 3.125 km, the regional semi-implicit, semi-Lagrangian, fully non-hydrostatic model MC2 used a time step of 30 s at 3.0 km, and Météo France in their semi-implicit Application of Research to Operations at Mesoscale (AROME) model use time steps of 45 and 60 s for their 1.3 and 2.5 km implementations, respectively. Since the IFS model is a hydrostatic model, we conclude that even for fully implicit, global, convection-resolving climate simulations at 1–2 km grid spacing, a time step larger than 40–60 s cannot be considered a viable option.
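To make the Courant number arithmetic above concrete, the following minimal C++ sketch evaluates C = v·Δt/Δx for the time steps and grid spacings quoted in this section. The speed of sound (360 m s⁻¹) and the advective wind speed (35 m s⁻¹) used below are assumptions chosen for illustration, not values taken from the cited studies.

```cpp
// Minimal sketch: Courant numbers C = v * dt / dx for a few illustrative cases.
// The speed of sound (360 m/s) and the advective wind speed (35 m/s) are
// assumptions chosen for illustration; they are not taken from the cited studies.
#include <cstdio>

static double courant(double v, double dt, double dx) { return v * dt / dx; }

int main() {
    const double c_sound = 360.0;  // assumed speed of sound [m/s]
    const double u_adv   = 35.0;   // assumed advective wind speed [m/s]

    // Acoustic Courant number for a 240 s time step at 488 m grid spacing (~177).
    std::printf("acoustic  C_s = %.0f\n", courant(c_sound, 240.0, 488.0));

    // Advective Courant numbers for the same 240 s time step.
    std::printf("advective C_a (2 km)  = %.1f\n", courant(u_adv, 240.0, 2000.0));
    std::printf("advective C_a (488 m) = %.1f\n", courant(u_adv, 240.0, 488.0));

    // A 40 s time step at 2 km keeps C_a near 1 even for 50 m/s winds.
    std::printf("advective C_a (2 km, dt = 40 s, u = 50 m/s) = %.1f\n",
                courant(50.0, 40.0, 2000.0));
    return 0;
}
```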
In the current study we use the split-explicit time-stepping scheme with an underlying Runge–Kutta time step of the Consortium for Small-Scale Modeling (COSMO) model (see Sect. ). This scheme uses sub-time stepping for the fast (acoustic) modes with a small time step Δτ and explicit time stepping for all other modes with a large time step Δt. Most of the computations are required on the large time step, with the ratio Δt/Δτ depending on the combination of time-integration and advection scheme. In contrast to semi-implicit, semi-Lagrangian, and implicit schemes, the approach does not require solving a global equation and all computations are local (i.e., vertical columns exchange information merely with their neighbors). The main advantage of this approach is that it exhibits – at least in theory – perfect weak scaling.
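The following is a structural sketch, in plain C++, of how such a split-explicit step can be organized; the function names, the trivial stub bodies, and the stage fractions are hypothetical simplifications and do not reproduce the actual COSMO implementation.

```cpp
// Structural sketch of a split-explicit time step: slow tendencies (advection,
// physics) are evaluated on the large step dt, while the fast acoustic modes
// are advanced with cheap explicit sub-steps of size dtau. The three-stage
// loop mimics a Runge-Kutta structure. All names and the trivial stub bodies
// are hypothetical; this is not the actual COSMO code.
#include <cstdio>
#include <vector>

struct State { std::vector<double> prognostic; };

// Stubs standing in for the real operators (assumptions for illustration only).
State compute_slow_tendencies(const State& s) { return s; }
void  advance_fast_modes(State& s, const State& /*slow*/, double dtau) {
    for (double& v : s.prognostic) v += 1e-3 * dtau;   // placeholder update
}
void  halo_exchange(State&) { /* nearest-neighbor exchange only */ }

void split_explicit_step(State& s, double dt, int n_small) {
    const double stage_fraction[3] = {1.0 / 3.0, 0.5, 1.0};  // RK3-like stages
    for (double frac : stage_fraction) {
        State slow = compute_slow_tendencies(s);             // once per stage
        int n_sub = static_cast<int>(n_small * frac + 0.5);
        if (n_sub < 1) n_sub = 1;
        double dtau = frac * dt / n_sub;
        for (int k = 0; k < n_sub; ++k)
            advance_fast_modes(s, slow, dtau);               // acoustic sub-steps
        halo_exchange(s);                                    // local communication only
    }
}

int main() {
    State s{std::vector<double>(10, 0.0)};
    split_explicit_step(s, 12.0, 6);                         // e.g., dt = 12 s, 6 sub-steps
    std::printf("prognostic[0] = %.6f\n", s.prognostic[0]);
    return 0;
}
```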
Weak scaling is defined as how the solution time varies with the number of processing elements for a fixed problem size per processing element. This is in contrast to strong scaling, where the total problem size is kept fixed.
This also applies to the communication load per sub-domain when applying horizontal domain decomposition.
Methods
Model description
Illustration of the computational complexity of the COSMO dynamical core, using a CDAG. The nodes of the graph represent computational kernels (blue ellipses) that can have multiple input and output variables, halo updates (green rectangles), and boundary condition operations (orange rectangles). The edges of the graph represent data dependencies. Since the dynamical core of COSMO has been written using a DSL, the CDAG can be produced automatically using an analysis back end of the DSL compiler. The lengthy serial section in the middle of the figure corresponds to the sound waves sub-stepping in the fast-wave solver. The parallel section in the upper left corresponds to the advection of the seven tracer variables. CDAGs can automatically be produced from C++ code.
[Figure omitted. See PDF]
For the simulations presented in this paper, we use a refactored version 5.0 of the regional weather and climate code developed by the COSMO consortium and – for the climate mode – the Climate Limited-area Modelling (CLM) Community. At kilometer-scale resolution, COSMO is used for numerical weather prediction and has been thoroughly evaluated for climate simulations in Europe. The COSMO model is based on the thermo-hydrodynamical equations describing non-hydrostatic, fully compressible flow in a moist atmosphere. It solves the fully compressible Euler equations using finite difference discretization in space. For time stepping, it uses a split-explicit three-stage second-order Runge–Kutta discretization to integrate the prognostic variables forward in time. For horizontal advection, a fifth-order upwind scheme is used for the dynamic variables and a Bott scheme is used for the moisture variables. The model includes a full set of physical parametrizations required for real-case simulations. For this study, we use a single-moment bulk cloud microphysics scheme with five species (cloud water, cloud ice, rain, snow, and graupel). For the full physics simulations, a radiation scheme, a soil model, and a sub-grid-scale turbulence scheme are additionally switched on.
The COSMO model is a regional model: physical space is discretized in a rotated latitude–longitude–height coordinate system and projected onto a regular, structured, three-dimensional grid (IJK). In the vertical, a terrain-following coordinate supports an arbitrary topography. The spatial discretization applied to solve the governing equations generates so-called stencil computations (operations that require data from neighboring grid points). Due to the strong anisotropy of the atmosphere, implicit methods are employed in the vertical direction, as opposed to the explicit methods applied to the horizontal operators. The numerical discretization yields a large number of mixed compact horizontal stencils and vertical implicit solvers, strongly connected via the data dependencies on the prognostic variables. Figure shows the data dependency graph of the computational kernels of the dynamical core of COSMO used in this setup, where each computational kernel corresponds to a complex set of fused stencil operations designed to maximize the data locality of the algorithm. Each computational kernel typically has multiple input and output fields and thus data dependencies, as indicated by the edges of the computational directed acyclic graph (CDAG) shown in Fig. . Maximizing the data locality of these stencil computations is crucial for optimizing the time to solution of the application.
To enable COSMO to run on hybrid high-performance computing systems with GPU-accelerated compute nodes, we rewrote the dynamical core of the model, which implements the solution of the non-hydrostatic Euler equations, from Fortran to C++. This enabled us to introduce a new C++ template-library-based domain-specific language (DSL), the Stencil Loop Language (STELLA), which provides a performance-portable implementation of the stencil algorithmic motifs by abstracting hardware-dependent optimizations. Specialized back ends of the library produce efficient code for the target computing architecture. Additionally, the DSL supports an analysis back end that records the access patterns and data dependencies of the kernels shown in Fig. . This information is then used to determine the amount of memory accesses and to assess the memory usage efficiency. For GPUs, the STELLA back end is written in CUDA, and other parts of the refactored COSMO implementation use OpenACC directives.
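To illustrate the stencil motif and why fusing dependent stencil stages improves data locality, the following plain C++ sketch compares an unfused Laplacian-of-Laplacian (which stores an intermediate field in memory) with a fused variant that recomputes the intermediate values on the fly. This is an illustration of the general idea only; it does not use the STELLA DSL or its API.

```cpp
// Plain C++ illustration of the stencil motif and of kernel fusion for data
// locality: the unfused variant writes an intermediate field to memory and
// reads it back, the fused variant recomputes it locally and therefore touches
// main memory less. Not STELLA; a generic sketch only.
#include <cstdio>
#include <vector>

using Field = std::vector<double>;
inline int idx(int i, int j, int ni) { return j * ni + i; }

// 2-D 5-point Laplacian at (i, j).
inline double lap(const Field& f, int i, int j, int ni) {
    return f[idx(i + 1, j, ni)] + f[idx(i - 1, j, ni)] +
           f[idx(i, j + 1, ni)] + f[idx(i, j - 1, ni)] - 4.0 * f[idx(i, j, ni)];
}

// Unfused: Laplacian-of-Laplacian via an intermediate field stored in memory.
void laplap_unfused(const Field& in, Field& tmp, Field& out, int ni, int nj) {
    for (int j = 1; j < nj - 1; ++j)
        for (int i = 1; i < ni - 1; ++i) tmp[idx(i, j, ni)] = lap(in, i, j, ni);
    for (int j = 2; j < nj - 2; ++j)
        for (int i = 2; i < ni - 2; ++i) out[idx(i, j, ni)] = lap(tmp, i, j, ni);
}

// Fused: the intermediate values are recomputed on the fly; no tmp-field traffic.
void laplap_fused(const Field& in, Field& out, int ni, int nj) {
    for (int j = 2; j < nj - 2; ++j)
        for (int i = 2; i < ni - 2; ++i) {
            double l_ip = lap(in, i + 1, j, ni), l_im = lap(in, i - 1, j, ni);
            double l_jp = lap(in, i, j + 1, ni), l_jm = lap(in, i, j - 1, ni);
            out[idx(i, j, ni)] = l_ip + l_im + l_jp + l_jm - 4.0 * lap(in, i, j, ni);
        }
}

int main() {
    const int ni = 64, nj = 64;
    Field in(ni * nj, 1.0), tmp(ni * nj, 0.0), a(ni * nj, 0.0), b(ni * nj, 0.0);
    in[idx(32, 32, ni)] = 2.0;                       // a single perturbation
    laplap_unfused(in, tmp, a, ni, nj);
    laplap_fused(in, b, ni, nj);
    std::printf("difference at (32,32): %g\n", a[idx(32, 32, ni)] - b[idx(32, 32, ni)]);
    return 0;
}
```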
Thanks to this refactored implementation of the model and the STELLA DSL, COSMO is the first fully capable weather and climate model to go operational on GPU-accelerated supercomputers . In the simulations we analyze here, the model was scaled to nearly 5000 GPU-accelerated nodes of the Piz Daint supercomputer at the Swiss National Supercomputing Centre.
Hardware description
The experiments were performed on the hybrid partition of the Piz Daint supercomputer, located at the Swiss National Supercomputing Centre (CSCS) in Lugano. At the time our simulations were performed, this supercomputer consisted of a multi-core partition, which was not used in this study, as well as a hybrid partition of 4936 Cray XC50 nodes. These hybrid nodes are each equipped with an Intel Xeon E5-2690 v3 CPU (code name Haswell) and a PCIe version of the NVIDIA Tesla P100 GPU (code name Pascal) with 16 GBytes of second-generation high-bandwidth memory (HBM2).
1 GByte = 10⁹ Bytes.
The nodes of both partitions are interconnected in one fabric (based on Aries technology) in a Dragonfly topology.
Energy measurements
We measure the energy to solution of our production-level runs on Piz Daint using the methodology established and described in detail by . The resource utilization report provided on Cray systems gives the total energy E_node consumed by each application run on the compute nodes. The total energy (which includes the interconnect) is then computed as E = (E_node + 100 W · N_blades · t) / 0.95, where t is the wall time of the application, N_blades is the number of blades used, the 100 W term accounts for the per-blade contribution of the Aries interconnect, and the 0.95 in the denominator adjusts for AC/DC conversion.
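A minimal sketch of this energy accounting is given below; the node energy, the 15.3 h wall time, and the assumption of four nodes per XC50 blade are illustrative inputs, not measurements reported in this study.

```cpp
// Sketch of the energy accounting described above: node energy from the Cray
// resource utilization report plus a 100 W-per-blade interconnect contribution,
// divided by 0.95 for AC/DC conversion. The inputs in main() are made up for
// illustration; four nodes per XC50 blade is also an assumption made here.
#include <cstdio>

double total_energy_kwh(double e_node_kwh, int n_blades, double wall_time_h) {
    const double aries_w_per_blade = 100.0;   // Aries interconnect contribution
    const double ac_dc_efficiency  = 0.95;    // AC/DC conversion factor
    double e_interconnect_kwh = aries_w_per_blade * n_blades * wall_time_h / 1000.0;
    return (e_node_kwh + e_interconnect_kwh) / ac_dc_efficiency;
}

int main() {
    // Hypothetical example: 4888 nodes (assuming 4 nodes per blade), 15.3 h run.
    double e = total_energy_kwh(/*e_node_kwh=*/15000.0, /*n_blades=*/4888 / 4,
                                /*wall_time_h=*/15.3);
    std::printf("total energy ~ %.0f kWh\n", e);
    return 0;
}
```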
Simulation setup and verification
Evolution of a baroclinic wave in a dry simulation with 47 km grid spacing on days 8 and 10: (a) surface pressure and (b) temperature on the 850 hPa pressure level (roughly 1.5 km above sea level).
[Figure omitted. See PDF]
When pushing ahead the development of global high-resolution climate models, there are two complementary pathways. First, one can refine the resolution of existing global climate models. Second, one may alternatively try to expand the computational domain of high-resolution limited-area models towards the global scale. Here we choose the latter and develop a near-global model from the limited-area high-resolution model COSMO.
We perform near-global simulations for a computational domain that extends over a latitude band from 80° S to 80° N, which covers 98.4 % of the surface area of planet Earth. The simulation is inspired by the test case used by the winner of the 2016 Gordon Bell Prize.
The simulations are based on an idealized baroclinic wave test, which can be considered a standard benchmark for dynamical cores of atmospheric models. The test describes the growth of initial disturbances in a dynamically unstable westerly jet stream into finite-amplitude low- and high-pressure systems. The development includes a rapid transition into a nonlinear regime, accompanied by the formation of sharp meteorological fronts, which in turn trigger the formation of complex cloud and precipitation systems.
The setup uses a two-dimensional (latitude–height) analytical description of a hydrostatically balanced atmospheric base state with westerly jet streams below the tropopause, in both hemispheres. A large-scale local Gaussian perturbation is then applied to this balanced initial state which triggers the formation of a growing baroclinic wave in the Northern Hemisphere, evolving over the course of several days (Fig. ). To allow moist processes, the dry initial state is extended with a moisture profile and the parametrization of cloud-microphysical processes is activated.
The numerical problem is discretized on a latitude–longitude grid with up to 36 000 × 16 001 horizontal grid points for the 930 m simulation. In the zonal direction the domain is periodic, and at 80° north/south it is confined by boundary conditions that relax the evolving solution against the initial conditions in a 500 km wide zone. The vertical direction is discretized using 60 stretched model levels, spanning from the surface to the model top at 40 km. The respective layer thickness widens from 20 m at the surface to 1.5 km near the domain top.
Output of the baroclinic wave (day 10): (a, b) dry simulation with 1.9 km grid spacing, (c, d, e) moist simulation with 0.93 km grid spacing, (a, c) surface pressure, (b, d) temperature on the 850 hPa pressure level, and (e) precipitation.
[Figure omitted. See PDF]
For the verification against previous dry simulations, a simulation at 47 km grid spacing is used. The evolution of the baroclinic wave (Fig. ) very closely follows the originally published reference solution.
The evolution in the moist and the dry simulation at very high resolution is shown in Fig. . The high-resolution simulations reveal the onset of a secondary (likely barotropic) instability along the front and the formation of small-scale warm-core vortices with a spacing of up to 200–300 km. The basic structure of these vortices is already present in the dry simulation, but they exhibit considerable intensification and precipitation in the moist case. The formation of a large series of secondary vortices is sometimes observed in maritime cases of cyclogenesis but appears to be a rather rare phenomenon. However, it appears that cases with one or a few secondary vortices are not uncommon, and they may even be associated with severe weather.
The resulting cloud pattern is dominated by the comma-shaped precipitating cloud that forms along the cold and occluded fronts of the parent system (Fig. ). The precipitation pattern is associated with stratiform precipitation in the head of the cloud and small patches with precipitation rates exceeding 5 mm h⁻¹ in its tail, stemming from small embedded convective cells. Looking closely at the precipitation field (Fig. e), it can be seen that the secondary vortices are co-located with small patches of enhanced precipitation.
Efficiency metric
In the past, the prevalent metric to measure the performance of an application was the number of floating point operations executed per second (flop s⁻¹). It used to pay off to minimize the number of required floating point operations in an algorithm and to implement the computing system in such a way that a code would execute many such operations per second. Thus, it made sense to assess application performance with the flop s⁻¹ metric. However, the world of supercomputing has changed. Floating point operations are now relatively cheap: they cost about a factor of 1000 less energy than moving their operands between memory and registers, and they execute several hundred times faster than the latency of a memory operation. Thus, algorithmic optimization today has to focus on minimizing data movement, and it may even pay off to recompute certain quantities if this avoids data movements. In fact, a significant part of the improvements in time to solution in the refactored COSMO code are due to the recomputation of variables that were previously stored in memory – the original code was written for vector supercomputers in the 1990s. On today's architectures, it may even pay off to replace a floating-point-minimal sparse algorithm with a block-sparse algorithm, into which many trivial zero operations have been introduced. Using the flop s⁻¹ metric to characterize application performance could be very misleading in these cases.
A popular method to find performance bottlenecks of both compute- and memory-bound applications is the roofline model. However, assessing the performance of an application simply in terms of sustained memory bandwidth, measured in bytes per second, would be equally deceptive. For example, by storing many variables in memory, the original implementation of COSMO introduces an abundance of memory movements that boost the sustained memory bandwidth but are inefficient on modern processor architectures, since these movements cost much more than recomputing the variables.
Furthermore, full-bandwidth utilization on modern accelerator architectures, such as the GPUs used here, requires very specific conditions. Optimizing for bandwidth utilization can lead the application developer down the wrong path because unaligned, strided, or random accesses can be intrinsic to an underlying algorithm and severely impact the bandwidth. Optimizations to improve alignment or reduce the randomness of memory accesses may introduce unnecessary memory operations that could be detrimental to time to solution or energy efficiency.
Thus, in order to properly assess the quality of our optimizations, one needs to directly consider data movement. Here we propose a method for doing this in practice for the COSMO dynamical core and present the resulting MUE metric.
Our proposal is motivated by the full-scale COSMO runs we perform here, where 74 % of the total time is spent in local stencil computations. The dependencies between stencils in complex stencil programs can be optimized with various inlining and unrolling techniques. Thus, to be efficient, one needs to achieve maximum spatial and temporal data reuse to minimize the number of data movement operations and perform them at the highest bandwidth.
In order to assess the efficiency in memory usage of an implementation on a particular machine, one needs to compare the actual number of data transfers executed,
Here, we define data transfers as load/stores from system memory.
which we denote with D, with the necessary data transfers Q of the algorithm. Q is the theoretical lower bound on the number of memory accesses required to implement the numerical algorithm. The MUE can be intuitively interpreted as how well the code is optimized both for data locality and bandwidth utilization; i.e., if MUE = 1, the implementation reaches the memory movement lower bound of the algorithm and performs all memory transfers at maximum bandwidth. Formally, MUE = (Q / D) · (B / B_max), where B and B_max represent the bandwidth achieved by an implementation and the maximum achievable bandwidth, respectively.
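As a minimal numerical illustration of this definition (with placeholder values rather than measurements), the MUE can be evaluated as follows.

```cpp
// Minimal sketch of the MUE definition above: MUE = (Q / D) * (B / B_max).
// The numbers in main() are placeholders for illustration, not measurements.
#include <cstdio>

double mue(double q_necessary, double d_executed, double bw_achieved, double bw_max) {
    return (q_necessary / d_executed) * (bw_achieved / bw_max);
}

int main() {
    // Hypothetical example: 20% more transfers than the lower bound,
    // 80% of the maximum achievable bandwidth.
    std::printf("MUE = %.2f\n", mue(1.0, 1.2, 0.8, 1.0));  // ~0.67
    return 0;
}
```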
In order to compute the MUE for COSMO, we developed a performance model that combines the theoretical model from , hypergraph properties, and graph partitioning techniques to estimate the necessary data transfers, Q, from the CDAG information of COSMO. By partitioning the CDAG into subcomputations that satisfy certain conditions imposed by the architecture, this model determines the theoretical minimum amount of memory transfers by maximizing the data locality inside the partitions. To approximate the number of actual transfers, D, we use the same technique to evaluate the quality of the current COSMO partitioning. The values B and B_max were measured empirically – the former by profiling our application and the latter with a set of micro-benchmarks.
The details of how to determine the MUE for COSMO are given in Appendix . The MUE metric cannot be used to compare different algorithms; rather, it is a measure of the efficiency of an implementation of a particular algorithm on a particular machine, i.e., how much data locality is preserved and what bandwidth is achieved. It thus complements domain-specific metrics such as SYPD as well as popular hardware metrics such as flop s⁻¹ or memory bandwidth. Compared to the frequently applied roofline model, the MUE metric also accounts for the schedule of operations, not only the efficient use of the memory subsystem. It is thus a stronger but also more complex metric.
Performance results
To establish a performance baseline for global kilometer-scale simulations, we here present a summary of the key performance metrics for the simulations at 930 m, 1.9 km, and 47 km grid spacing, as well as a study of weak and strong scalability.
Weak scalability
Weak scalability on the hybrid P100 Piz Daint nodes, per COSMO time step of the dry simulation.
[Figure omitted. See PDF]
Until this study the GPU version of COSMO had only been scaled up to 1000 nodes of the Piz Daint supercomputer, while the full machine – at the time of the experiment – provides 4932 nodes. The scaling experiments (weak and strong) were performed with the dry simulation setup. Including a cloud microphysics parametrization, as used in the moist simulation (Fig. ), increases time to solution by about 10 %. Since microphysics does not contain any inter-node communication and is purely local to a single column of grid points, we do not expect an adverse impact on either weak or strong scalability.
Figure shows weak scaling for three per-node domain sizes ranging from to grid points in the horizontal, while keeping the size in the vertical direction fixed at 60 grid points. In comparison, on 4888 nodes, the high-resolution simulations at 930 m and 1.9 km horizontal grid spacing correspond to a domain size per node of about and .
The exact domain size of these simulations is slightly different on each node due to the domain decomposition.
The model shows excellent weak scalability properties up to the full machine, which can at least partially be explained by the nearest-neighbor halo-exchange pattern. This property reduces the complexity of COSMO's scalability on Piz Daint to the strong scaling behavior of the code. Essentially, the number of grid points per node determines the achievable time to solution of a given problem.
Strong scalability
Strong scalability on Piz Daint: on P100 GPUs (filled symbols) and on Haswell CPUs using 12 MPI ranks per node (empty symbols).
[Figure omitted. See PDF]
From earlier scaling experiments of the STELLA library and the GPU version of COSMO , it is known that, for experiments in double precision on Tesla K20x, linear scaling is achieved as long as the number of grid points per node exceeds about to grid points per horizontal plane. In comparison, single-precision measurements on Tesla P100 already start to saturate at a horizontal domain size of about grid points per node (Fig. ),
COSMO supports two floating point formats to store numbers – double and single precision – which can be chosen at compile time.
corresponding to about 32 nodes for a 19 km setup and about 1000 nodes for a 3.7 km setup. Since the 930 m and 1.9 km setups are already in, or close to, the linear scaling regime on the full machine, we chose coarser horizontal grid spacings of 3.7 and 19 km for the strong scaling experiments. The lower limit on the number of nodes is given by the GPU memory of 16 GB. In addition to the GPU benchmarks (filled symbols), we measured the performance of the CPU version of COSMO (empty symbols), using 12 MPI (Message Passing Interface) ranks per CPU, i.e., one MPI rank per Haswell core. The CPU measurements were performed on the hybrid partition of Piz Daint as well, since the multi-core partition is much smaller.
Beyond 1000 nodes, execution on CPUs yields a shorter time to solution than on GPUs.
Time to solution
Time compression (SYPD) and energy cost (MWh SY⁻¹) for three moist simulations: at 930 m grid spacing obtained with a full 10-day simulation, at 1.9 km from 1000 steps, and at 47 km from 100 steps.

Grid spacing | no. of nodes | Δt (s) | SYPD | MWh SY⁻¹ | no. of grid points
---|---|---|---|---|---
930 m | 4888 | 6 | 0.043 | 596 | 36 000 × 16 001 × 60
1.9 km | 4888 | 12 | 0.23 | 97.8 |
47 km | 18 | 300 | 9.6 | |
On 4888 nodes, a 10-day long moist simulation at 930 m grid spacing required a wall time of 15.3 h at a rate of 0.043 SYPD (Table ), including a disk I/O load of five 3-D fields and seven 2-D fields,
The standard I/O routines of COSMO require gathering global fields on a single node, which typically does not provide enough memory to hold a global field. To circumvent this limitation, a binary I/O mode, allowing each node to write its own output to the file system, was implemented.
written periodically every 12 h of simulation time. Short benchmark simulations at 1.9 and 47 km, integrated for 1000 and 100 time steps, respectively, yield 0.23 and 9.6 SYPD. As mentioned earlier, the minimum rate required for AMIP-type simulations is 0.2–0.3 SYPD. While for the 47 km setup scalability already starts to saturate, even with the 18 nodes reported (Table ), the high-resolution simulations, with an approximate per-node domain size of and , are still in a regime of good scalability. We have also conducted a 1.9 km simulation with a full set of physical parametrizations switched on; this increases the time to solution by 27 % relative to the moist simulations presented here.
In conclusion, the results show that AMIP-type simulations at a horizontal grid spacing of 1.9 km using a fully fledged atmospheric model are already feasible on Europe's highest-ranking supercomputer. In order to reach a resolution of 1 km, a further reduction of the time to solution by at least a factor of 5 is required.
In the remainder of this section, we attempt a comparison of our results at 1.9 km against the performance achieved by in their 2 km simulation (as their information for some of the other simulations is incomplete). They argue that for an implicit solver the time step can be kept at 240 s independent of the grid spacing and report (see their Figs. 7 and 9) values of 0.57 SYPD for a grid spacing of 2 km on the full TaihuLight system. As explained in Sect. , such large time steps are not feasible for global climate simulations resolving convective clouds (even when using implicit solvers), and a maximum time step of 40–80 s would very likely be needed; this would decrease their SYPD by a factor of 3 to 6. Furthermore, their simulation covers only 32 % of the Earth's surface (18° N to 72° N) but uses twice as many levels; this would further reduce their SYPD by a factor of 1.5. Thus we estimate that their simulation at 2 km would yield 0.124 to 0.064 SYPD when accounting for these differences. In comparison, our simulation at 1.9 km yields 0.23 SYPD; i.e., it is faster by at least a factor of 2.5. Note that this estimate does not account for additional simplifications in their study (neglect of microphysical processes, spherical shape of the planet, and topography). In summary, while a direct comparison with their results is difficult, we argue that our results can be used to set a realistic baseline for production-level GCM performance and represent an improvement by at least a factor of 2 with respect to previous results.
Energy to solution
Based on the power measurements (see Sect. ) we now provide the energy cost of the full-scale simulations (Table ), using the energy cost unit of megawatt hours per simulated year (MWh SY⁻¹). The 10-day long simulation at 930 m grid spacing running on 4888 nodes requires 596 MWh SY⁻¹, while the cost of the simulation at 1.9 km on 4888 nodes is 97.8 MWh SY⁻¹. For comparison, the coarse-resolution 47 km simulation on a reduced number of nodes (18) requires only a small fraction of this energy per simulated year.
A 30-year AMIP-type simulation (with full physics) at a horizontal grid spacing of 1 km on the Piz Daint system would take 900 days to complete, resulting in an energy cost of approximately 22 GWh, which roughly corresponds to the annual electricity consumption of 6500 households.
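A back-of-the-envelope reconstruction of this estimate can be made from the 930 m figures reported in this paper; treating those figures as a proxy for a 1 km configuration is an assumption made here for illustration only.

```cpp
// Back-of-the-envelope reconstruction of the 30-year AMIP estimate above,
// starting from the 930 m figures reported in this paper (0.043 SYPD,
// 596 MWh per simulated year) and the reported 27% overhead of the full-physics
// setup. Using the 930 m numbers as a stand-in for 1 km is an assumption.
#include <cstdio>

int main() {
    const double years          = 30.0;
    const double sypd_moist     = 0.043;   // 930 m moist simulation
    const double physics_factor = 1.27;    // full physics costs ~27% more time
    const double mwh_per_sy     = 596.0;   // energy cost of the 930 m run

    double wall_days  = years / (sypd_moist / physics_factor);
    double energy_gwh = years * mwh_per_sy * physics_factor / 1000.0;

    std::printf("wall time ~ %.0f days, energy ~ %.0f GWh\n", wall_days, energy_gwh);
    // Prints roughly 886 days and 23 GWh, consistent with the ~900 days and
    // ~22 GWh quoted in the text.
    return 0;
}
```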
Again, we attempt a comparison with the simulations performed by . The Piz Daint system reports a peak power draw of 2052 kW when running the high-performance LINPACK (HPL) benchmark. The sustained power draw when running the 930 m simulation amounted to 1059.7 kW, thus 52 % of the HPL value. The TaihuLight system reports a sustained power draw for the HPL benchmark of 15 371 kW. While the authors do not report the power consumption of their simulations, we expect the simulations on Piz Daint to be at least 5 times more power efficient, even when assuming similar achieved SYPD (see above).
We conservatively assume an application power draw of at least 35 % on TaihuLight.
Simulation efficiency
Data transfer cost estimates of the dynamical core for the theoretical lower bound (METIS), the dynamical core implementation using the STELLA library (COSMO), and a non-optimized dynamical core implementation without kernel merging (no-opt).
Level | METIS | COSMO | No-opt |
---|---|---|---|
Registers | |||
Shared memory | 64 800 | 107 600 | 229 120 |
L2 cache | 1023 | 1160 | 2341 |
In view of the energy consumption, it is paramount to consider the efficiency of the simulations executed. We do this here by considering the memory usage efficiency metric introduced in Sect. .
To estimate a solution for the optimization problem described in Appendix , Eq. (), we use the METIS library. The results are presented in Table . The METIS column is the approximation of the lower bound obtained from the performance model using the METIS library. The COSMO column is the evaluation of the current COSMO partitioning in our model. The no-opt column shows the amount of data transfers of COSMO if no data locality optimization techniques are applied, as in the original Fortran version of the code. Since the original CDAG is too large for the minimization problem of the performance model, three different simplified versions of the CDAG, which focus on the accesses to three different layers of the memory hierarchy of the GPU, are studied: registers, shared memory, and L2 cache.
The model shows how efficient COSMO is in terms of data transfers – it generates only 14, 66, and 13 % more data transfers than the lower bound in the register, shared memory, and L2 cache layers, respectively. The sophisticated data locality optimization techniques of the STELLA implementation of the dynamical core of COSMO result in very good data reuse. On the P100, all memory accesses to/from DRAM go through the L2 unit. Therefore, we focus on the efficiency of this unit, such that MUE = (Q_L2 / D_L2) · (B / B_max), where D_L2 and Q_L2 stand for the estimated number of main memory operations and its lower bound, respectively.
Performance model verification results. Measured dynamical core time step execution time for the STELLA implementation optimized for data locality and the non-optimized implementation compared against the corresponding MUE metric.
Metric | Optimized | Not optimized | Ratio |
---|---|---|---|
Time per step | 0.16 s | 0.25 s | 0.64 |
Estimated MUE | 0.67 | 0.44 | 0.65 |
The model can also estimate the efficiency of our optimizations. Assuming that the peak achievable bandwidth is reached if no data locality optimizations are performed (B = B_max), the MUE of such a non-optimized implementation is MUE_no-opt = Q_L2 / D_L2,no-opt ≈ 0.44.
It can be seen that MUE_no-opt / MUE ≈ 0.44 / 0.67 = 0.65, which closely matches the measured ratio of the execution times, 0.16 s / 0.25 s = 0.64 (Table ).
This result shows the importance of data locality optimizations – the optimized implementation is more than 50 % faster than a potential version that achieves peak bandwidth while using no data locality techniques.
To validate the model results, we have conducted single-node runs with and without data locality optimizations. Bandwidth measurements of the non-optimized version are very close to maximum achievable bandwidth (not shown). The fact that the two ratios in the third column of Table agree is testimony to the high precision of the performance model.
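The agreement between the two ratios can be reproduced directly from the numbers reported in the two tables above; the following sketch simply repeats that arithmetic.

```cpp
// Re-doing the arithmetic behind the tables above: L2-level transfer counts
// (lower bound 1023, optimized 1160, non-optimized 2341) and bandwidths
// (276 GB/s achieved on average, 362 GB/s maximum achievable) give the two
// MUE values, whose ratio is close to the measured ratio of execution times.
#include <cstdio>

int main() {
    const double q_lower = 1023.0, d_opt = 1160.0, d_noopt = 2341.0;
    const double bw = 276.0, bw_max = 362.0;

    double mue_opt   = (q_lower / d_opt)   * (bw / bw_max);      // ~0.67
    double mue_noopt = (q_lower / d_noopt) * (bw_max / bw_max);  // ~0.44 (assumes peak bandwidth)

    std::printf("MUE (optimized)     = %.2f\n", mue_opt);
    std::printf("MUE (non-optimized) = %.2f\n", mue_noopt);
    std::printf("MUE ratio           = %.2f  (measured time ratio: 0.16/0.25 = %.2f)\n",
                mue_noopt / mue_opt, 0.16 / 0.25);
    return 0;
}
```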
Conclusions
The work presented here sets a new baseline for fully fledged kilometer-scale climate simulations on a global scale. Our implementation of the COSMO model that is used for production-level numerical weather predictions at MeteoSwiss has been scaled to the full system on 4888 nodes of Piz Daint, a GPU-accelerated Cray XC50 supercomputer at the Swiss National Supercomputing Centre (CSCS). The dynamical core has been fully rewritten in C++ using a DSL that abstracts the hardware architecture for the stencil algorithmic motifs and enables sophisticated tuning of data movements. Optimized back ends are available for multi-core processors with OpenMP, GPU accelerators with CUDA, and for performance analysis.
The code shows excellent strong scaling up to the full machine size when running at a grid spacing of 4 km and below, on both the P100 GPU accelerators and the Haswell CPU. For smaller problems, e.g., at a coarser grid spacing of 47 km, the GPUs run out of parallelism and strong scalability is limited to about 100 nodes, while the same problem continues to scale on multi-core processors to 1000 nodes. Weak scalability is optimal for the full size of the machine. Overall, performance is significantly better on GPUs as compared to CPUs for the high-resolution simulations.
The simulations performed here are based on the idealized baroclinic wave test, which is part of the standard procedure for testing global atmospheric models. Our results are validated against the solution published in the original test case study.
We measured time to solution in terms of simulated years per wall clock day (SYPD) for a near-global simulation on a latitude band from 80° S to 80° N that covers 98.4 % of planet Earth's surface. Running on 4888 P100 GPUs of Piz Daint, currently Europe's largest supercomputer, we measured 0.23 SYPD at a 1.9 km grid spacing. This performance is adequate for numerical weather predictions and 10-year scale climate studies. Simulations at this resolution had an energy cost of 97.8 MWh SY⁻¹.
In a moist simulation with 930 m horizontal grid spacing, we observed the formation of frontal precipitating systems, containing embedded explicitly resolved convective motions, and additional cutoff warm-core cyclonic vortices. The explicit representation of embedded moist convection and of the resolved instabilities exhibits physically different behavior from coarser-resolution simulations. This is testimony to the usefulness of the high-resolution approach, as a much expanded range of scales and processes is simulated. Indeed, it appears that for the current test case the small vortices have not previously been noted, as the test case appears to converge for resolutions down to 25 km, but they clearly emerge at kilometer-scale resolution.
These results serve as a baseline benchmark for global climate model applications. For the 930 m experiment we achieved 0.043 SYPD on 4888 GPU-accelerated nodes, which is approximately one-seventh of the 0.2–0.3 SYPD required to conduct AMIP-type simulations. The energy cost for simulations at this horizontal grid spacing was 596 MWh SY⁻¹.
Scalable global models do not use a regular latitude–longitude grid such as the one used in COSMO, due to the “pole problem” caused by the convergence of meridians towards the poles. Therefore, scalable global models tend to use quasi-uniform grids such as cubed-sphere or icosahedral grids. We believe that our results apply directly to global weather and climate models employing structured grids and explicit, split-explicit, or horizontally explicit, vertically implicit (HEVI) time discretizations (e.g., FV3, NICAM). Global models employing implicit or spectral solvers may have a different scaling behavior.
Our work was inspired by the dynamical core solver that won the 2016 Gordon Bell award at the Supercomputing Conference . The goal of this award is to reward high performance achieved in the context of a realistic computation. A direct comparison with the results reported there is difficult, since the authors were running a research version of a dynamical core solver, whereas COSMO is a fully fledged regional weather and climate model. Nevertheless, our analysis indicates that our benchmarks represent an improvement both in terms of time to solution and energy to solution. As far as we know, our results represent the fastest (in terms of application throughput measured in SYPD at about 1 km grid spacing) simulation of a production-ready, non-hydrostatic climate model on a near-global computational domain.
In order to reach the 3–5 SYPD performance necessary for long climate runs, simulations would be needed that run 100 times faster than the baseline we set here. Given that Piz Daint with NVIDIA P100 GPUs is a multi-petaflop system, the 1 km scale climate runs at 3–5 SYPD performance represent a formidable challenge for exascale systems. As such simulations are of interest to scientists around the globe, we propose that this challenge be defined as a goal post to be reached by the exascale systems that will be deployed in the next decade.
Finally, we propose a new approach to measure the efficiency of memory-bound application codes, like many weather and climate models, running on modern supercomputing systems. Since data movement is the expensive commodity on modern processors, we advocate that a code's performance on a given machine be characterized in terms of data movement efficiency. We note that both a detailed mathematical analysis and the general applicability of our MUE metric are beyond the scope of this paper and require a separate publication. In this work we show a use case of the MUE metric and demonstrate its precision and usefulness in assessing the memory subsystem utilization of a machine, using the I/O complexity lower bound as the necessary data transfers, Q, of the algorithm. With the time to solution and the maximum system bandwidth B_max, it is possible to determine the memory usage efficiency, which captures how well the code is optimized both for data locality and bandwidth utilization. It will be interesting to investigate the MUE metric in future performance evaluations of weather and climate codes on high-performance computing systems.
Code availability
The particular version of the COSMO model used in this study is based on the official version 5.0 with many additions to enable GPU capability and is available under license.
Computing the efficiency metric
Here, we discuss how to determine the memory usage efficiency (MUE) given in Eq. () for an application code such as the COSMO model. We need to determine the necessary data transfers Q, the maximum system bandwidth B_max, and measure the execution time.
Necessary transfers
A natural representation of the data flow and dependencies of an algorithm is a CDAG. This abstraction is widely used for register allocation optimization, scheduling problems, and communication minimization. A CDAG models computations as vertices (V) and communications as edges (E) between them. A CDAG can be used to develop theoretical models that reason about the data movements of an application. However, not all edges of the CDAG account for data transfers, since the data required by a computation might be stored in fast memory (cached), depending on the execution schedule. Finding an execution schedule that minimizes the transaction metric is NP-hard for general CDAGs and therefore an intractable problem.
Minimizing data movement has been the subject of many studies. Two main approaches have been established: (1) finding analytical lower bounds for chosen algorithms on a given machine model and (2) finding optimal graph partitions. The former is designed for particular, highly regular, small algorithms, like sorting or matrix multiplication, and is not suitable for large-scale applications like COSMO. The latter approach is mostly used for minimizing network communication and has not been applied to large applications either. In our performance model, we combine the two into a novel graph-cutting technique. We build on Hong and Kung's 2S partitioning and construct a hypergraph partitioning technique to estimate a memory movement lower bound. We do not consider inter-node communication here. To the best of our knowledge, we are the first to apply these techniques to a real-world parallel application.
The key concept behind estimating Q is to partition the whole CDAG into subcomputations (2S partitions) V_i, such that each requires at most 2S data transfer operations. Then, if h(2S) is the minimal number of 2S partitions for a given CDAG, Hong and Kung showed that the minimal number of memory movement operations for any valid execution of the CDAG is bounded by Q ≥ S · (h(2S) − 1).
Here we outline the key steps of our modeling approach:
-
We reduce Hong and Kung's 2S partitioning definition to a hypergraph cut by relaxing the constraints on the dominator and minimum set sizes. Each hyper-edge contains a vertex from the original CDAG and all its successors.
-
We approximate the minimal hypergraph cut by minimizing the total communication volume.
-
We then express the memory movement lower bound as the minimum, over all valid partitionings, of Σ_v Nbr(v) · size(v), subject to the constraint that the boundary weight of each partition is bounded by the fast memory size S, where Nbr(v) is the number of partitions that vertex v is adjacent to and size(v) is the memory size of vertex v. The objective minimizes the sum of the communication volume across all partitions (assuming we load partitions one after the other), while the constraint ensures that each partition fits in the fast memory at the level for which the optimization is performed. A toy numerical sketch of this communication volume is given after this list.
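The toy C++ sketch below illustrates the communication-volume idea on a tiny hand-made DAG with a fixed two-way partition; it is a simplified stand-in for the model described above and is not the actual METIS-based implementation.

```cpp
// Toy illustration of the communication-volume objective sketched above: for a
// small DAG and a given 2-way partition, count, for each vertex, the partitions
// other than its own that consume its value; each such adjacency implies one
// transfer of size(v). Simplified stand-in only, not the model implementation.
#include <cstdio>
#include <set>
#include <utility>
#include <vector>

int main() {
    // Vertices 0..5; edges are (producer -> consumer) dependencies.
    const int n = 6;
    std::vector<std::pair<int, int>> edges = {{0,2},{1,2},{2,3},{2,4},{3,5},{4,5}};
    std::vector<int> part = {0, 0, 0, 1, 1, 1};   // a fixed 2-way partition of the DAG
    std::vector<double> size(n, 1.0);             // unit memory size per vertex

    double volume = 0.0;
    for (int v = 0; v < n; ++v) {
        std::set<int> remote_parts;               // partitions that read v remotely
        for (auto& e : edges)
            if (e.first == v && part[e.second] != part[v]) remote_parts.insert(part[e.second]);
        volume += remote_parts.size() * size[v];
    }
    std::printf("communication volume of this partition: %.0f\n", volume);  // vertex 2 crosses -> 1
    return 0;
}
```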
COSMO CDAG
COSMO CDAGs at various GPU memory hierarchy levels.
Memory level | Vertex def. | |||
---|---|---|---|---|
Registers | one IJK value | 157 803 | 984 101 | 32 |
Shared memory | one IJ plane | 2649 | 12 137 | 8 |
L2 cache | whole array | 1912 | 9863 | 29 |
Figure shows the data dependency CDAG of the computational kernels of the dynamical core of COSMO, where each kernel corresponds to a complex set of fused stencil operations designed to maximize the data locality of the algorithm. A single time step in COSMO accesses 781 variables, each of which is represented by a three-dimensional array in our 930 m simulation. Some variables are updated multiple times during a time step, which results in a very large total number of variable accesses (CDAG vertices). The resulting large and complex graph makes estimating Q impractical. In order to reduce the complexity, one can coarsen the CDAG by grouping multiple accesses into a single vertex. As an example, Fig. shows the coarsest representation of the CDAG, where each vertex models a full kernel. Each kernel may read and write various output variables, compute multiple stencil operations and boundary conditions, or perform halo exchanges. Thus, in this coarsened version, valuable data dependency information is lost, and one cannot argue about the optimality and possible rearrangement of the operations fused within a kernel.
We now describe how we determine coarsening strategies for the COSMO CDAG for three levels of the memory hierarchy of our target system: registers, shared memory, and L2 cache.
-
Registers (65 536): the COSMO GPU implementation assigns all computations accessing variables with the same IJ coordinate to the same GPU thread and uses registers to reuse values in the K direction. To model this memory hierarchy level, it is only necessary to keep the stencil accesses in the K direction. Thus, all accesses in the IJ plane are represented as a single vertex in the CDAG, which is then simplified to the 781 variables and their dependencies among all 60 levels in K.
-
Shared memory (64 kB): the shared memory of the GPU is used to communicate values between the different compute threads. In order to model this layer, all different accesses in the K direction of a variable are represented as a single vertex in the CDAG, while all accesses in the IJ plane are kept.
-
L2 cache (4 MB): this last cache level before DRAM is used to store whole arrays (fields). In this layer all accesses to a variable in any direction are represented as a vertex in the CDAG. It keeps only the data dependencies among variables, irrespective of the offset and direction of the access.
Bandwidth of the representative stencil benchmarks and GPU STREAM on Tesla P100. All kernels (except for GPU STREAM) operate on a 3-D domain.
[Figure omitted. See PDF]
To generate the whole CDAG, we used the STELLA analysis back end to trace all local memory accesses to all fields. Based on the information from the access offsets and order of operations, we reconstruct the read–write, write–read, and write–write dependencies.
Maximum achievable bandwidth
We now describe how we measure the memory usage efficiency in practice. We start by describing how to determine the maximum achievable bandwidth for COSMO stencils. Even though the Tesla P100 has a theoretical peak memory bandwidth of 720 GB s⁻¹, we argue that this may not be achievable for real applications. A well-established method to measure the maximum achievable bandwidth of GPUs is the GPU STREAM benchmark. Our tests show that the maximum achievable bandwidth for COPY is 557 GB s⁻¹ if at least 30 MB of double-precision numbers are copied (Fig. ). However, stencil codes on multidimensional domains like COSMO require more complex memory access patterns that, even when highly tuned, cannot achieve the same bandwidth as STREAM due to architectural limitations.
We identified the most common patterns and designed and tuned a set of micro-benchmarks that only mimic the memory access patterns, without the computations, to investigate the machine's capability of handling memory accesses for stencils. They include aligned, unaligned, and strided patterns in all dimensions. All benchmarks operate on a 3-D domain of parametric size, on either single- or double-precision numbers. The results of a representative set of four chosen micro-benchmarks are shown in Fig. , together with GPU STREAM, which operates on a 1-D domain. The fastest stencil kernel (double-precision aligned COPY) reaches 510 GB s⁻¹. The slowdown compared to GPU STREAM is due to the more complex access pattern on the 3-D domain. Furthermore, using single-precision numbers further deteriorates the bandwidth on the P100 (COPY (float) reaches 475 GB s⁻¹). Our COSMO 930 m run uses predominantly single-precision numbers on a per-node domain that results in 28.2 MB of data per field. Our measurements show that the maximum achievable bandwidth for this setup is 362 GB s⁻¹ (in the best case of the simple COPY benchmark). We use this upper bound as the maximum system bandwidth B_max. The average measured memory bandwidth across all COSMO real-world stencils is 276 GB s⁻¹, which gives B/B_max ≈ 0.76.
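The bandwidth figures above were obtained with CUDA micro-benchmarks on the GPU; the following CPU-only sketch shows the general measurement pattern (timing the copy of a large single-precision field and dividing the bytes moved by the elapsed time) and is a simplified stand-in, not the benchmark code used in this study.

```cpp
// Simplified, CPU-only stand-in for the bandwidth micro-benchmarks described
// above: time the copy of a large single-precision array and report bytes
// moved per second (one read plus one write per element). The real
// measurements in the paper were taken on the GPU with CUDA kernels.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = size_t(7) * 1000 * 1000;            // ~28 MB per float field
    std::vector<float> src(n, 1.0f), dst(n, 0.0f);

    auto t0 = std::chrono::steady_clock::now();
    const int reps = 20;
    for (int r = 0; r < reps; ++r)
        for (size_t i = 0; i < n; ++i) dst[i] = src[i];   // aligned COPY pattern
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double bytes   = 2.0 * sizeof(float) * double(n) * reps;  // read + write
    std::printf("sustained copy bandwidth: %.1f GB/s\n", bytes / seconds / 1e9);
    std::printf("check value: %f\n", double(dst[0] + dst[n - 1]));  // keep dst live
    return 0;
}
```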
The authors declare that they have no conflict of interest.
Acknowledgements
This work was supported by the Swiss National Science Foundation under Sinergia grant CRSII2_154486/1 and by a grant from the Swiss National Supercomputing Centre and PRACE. We would like to acknowledge the many contributors to the GPU-capable version of COSMO used in this study, among others Andrea Arteaga, Mauro Bianco, Isabelle Bey, Christophe Charpilloz, Valentin Clement, Ben Cumming, Tiziano Diamanti, Tobias Gysi, Peter Messmer, Katherine Osterried, Anne Roches, Stefan Rüdisühli, and Pascal Spörri. Also, the authors would like to acknowledge the Center for Climate Systems Modeling (C2SM) and the Federal Office of Meteorology and Climatology MeteoSwiss for their support. Furthermore, we would like to thank Nils Wedi and Piotr Smolarkiewicz (ECMWF) for useful comments and discussions. Finally, we would like to thank an anonymous reviewer, Rupert W. Ford (STFC), and the topical editor of GMD (Sophie Valcke) for their constructive comments and reviews that significantly improved the clarity and quality of the manuscript. Edited by: Sophie Valcke Reviewed by: Rupert W. Ford and one anonymous referee
© 2018. This work is published under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).
Abstract
The best hope for reducing long-standing global climate model biases is by increasing resolution to the kilometer scale. Here we present results from an ultrahigh-resolution non-hydrostatic climate model for a near-global setup running on the full Piz Daint supercomputer on 4888 GPUs (graphics processing units). The dynamical core of the model has been completely rewritten using a domain-specific language (DSL) for performance portability across different hardware architectures. Physical parameterizations and diagnostics have been ported using compiler directives. To our knowledge this represents the first complete atmospheric model being run entirely on accelerators on this scale. At a grid spacing of 930 m (1.9 km), we achieve a simulation throughput of 0.043 (0.23) simulated years per day and an energy consumption of 596 MWh per simulated year. Furthermore, we propose a new memory usage efficiency (MUE) metric that considers how efficiently the memory bandwidth – the dominant bottleneck of climate codes – is being used.
Details
1 Federal Institute of Meteorology and Climatology, MeteoSwiss, Zurich, Switzerland
2 ITS Research Informatics, ETH Zurich, Switzerland
3 Scalable Parallel Computing Lab, ETH Zurich, Switzerland
4 Institute for Atmospheric and Climate Science, ETH Zurich, Switzerland
5 Institute for Theoretical Physics, ETH Zurich, Switzerland; Swiss National Supercomputing Centre, CSCS, Lugano, Switzerland
6 Swiss National Supercomputing Centre, CSCS, Lugano, Switzerland