We explore the domain-specific Python library GT4Py (GridTools for Python) for implementing a representative physical parametrization scheme and the related tangent-linear and adjoint algorithms from the Integrated Forecasting System (IFS) of ECMWF. GT4Py encodes stencil operators in an abstract and hardware-agnostic fashion, thus enabling more concise, readable, and maintainable scientific applications. The library achieves high performance by translating the application into targeted low-level coding implementations. Here, the main goal is to study the correctness and performance portability of the Python rewrites with GT4Py against the reference Fortran code and a number of automatically and manually ported variants created by ECMWF. The present work is part of a larger cross-institutional effort to port weather and climate models to Python with GT4Py. The focus of the current work is the IFS prognostic cloud microphysics scheme, a core physical parametrization represented by a comprehensive code that takes a significant share of the total forecast model execution time. In order to verify GT4Py for numerical weather prediction (NWP) systems, we put additional emphasis on the implementation and validation of the tangent-linear and adjoint model versions which are employed in data assimilation. We benchmark all prototype codes on three European supercomputers characterized by diverse graphics processing unit (GPU) and central processing unit (CPU) hardware, node designs, software stacks, and compiler suites. Once the application is ported to Python with GT4Py, we find excellent portability, competitive GPU performance, and robust execution in all tested scenarios including with single precision.
1 Introduction
Soon after its first public release in 1957, Fortran became the language of choice for weather and climate models . On the one hand, its procedural programming style and its built-in support for multi-dimensional arrays have granted Fortran large popularity in the whole scientific computing community. On the other, its low-level nature guarantees fast execution of intensive mathematical operations on vector machines and conventional central processing units (CPUs). In the last decades, these characteristics have permitted weather forecasts to be run several times per day under tight operational schedules on high-performance computing (HPC) systems .
In recent years, in response to the simultaneous end of Moore's law and Dennard scaling, and due to the societal challenge to reduce energy consumption, the computer hardware landscape has been undergoing a rapid specialization to prevent unsustainable growth of the power envelope . As a result, most supercomputers nowadays have a heterogeneous node design, where energy-efficient accelerators such as graphics processing units (GPUs) co-exist with traditional CPUs. Because Fortran has been conceived with CPU-centric machines in mind, efficient programming of hybrid HPC platforms using the core Fortran language can be challenging . Indeed, the sustained performance of legacy weather and climate model codes written in Fortran has decreased over the decades , revealing the urgency for algorithmic and software adaptations to remain competitive in the medium and long term .
Compiler directives (or pragmas) are an attractive solution for parallelization, both to spread a workload across multiple CPU threads and to offload data and computations to GPU. The most famous incarnations of this programming paradigm are OpenMP and OpenACC . Because compiler directives accommodate incremental porting and enable non-disruptive software development workflows, they are adopted by many weather and climate modeling groups, who are facing the grand challenge of accelerating large code bases with thousands of source files and millions of lines of code, which stem from decades of scientific discoveries and software developments . In order not to threaten the overall readability of the code by exposing low-level instructions, the annotation of Fortran codes with compiler directives can be automated in the pre-processor step of the compilation process using tools such as the CLAW compiler or the ECMWF source-to-source translation tool Loki.
By contrast, domain-specific languages (DSLs) separate the code describing the science from the code actually executing on the target hardware, thus enabling performance portability, namely application codes that achieve near-optimal performance on a variety of computer architectures . Large portions of many modeling systems are being rewritten using multiple and diverse DSLs, not necessarily embedded in Fortran. For instance, the dynamical core of the weather prediction model from the COnsortium for Small-scale MOdeling (COSMO; ) has been rewritten in C++ using the GridTools library to port stencil-based operators to GPUs . Similarly, HOMMEXX-NH is an architecture-portable C++ implementation of the non-hydrostatic dynamical core of the Energy Exascale Earth System model (E3SM; ) harnessing the Kokkos library to express on-node parallelism . The GungHo project for a new dynamical core at the UK Met Office blends the LFRic infrastructure with the PSyclone code generator . Pace is a Python rewrite of the Finite-Volume Cubed-Sphere Dynamical Core (FV3; ) using GT4Py (GridTools for Python) to accomplish performance portability and productivity. Similarly, various Swiss partners including the Swiss Federal Office of Meteorology and Climatology (MeteoSwiss), ETH Zurich, and the Swiss National Supercomputing Centre (CSCS) are porting the ICOsahedral Non-hydrostatic modeling framework (ICON; ) to GT4Py . In another related project , a next-generation model for the IFS at ECMWF is developed in Python with GT4Py building on the Finite Volume Module (FVM; ).
The focus of the portability efforts mentioned above is the model dynamical core – the part of the model numerically solving the fundamental non-linear fluid-dynamics equations. In the present work, we turn the attention to physical parametrizations – which account for the representation of subgrid-scale processes – and additionally address the associated tangent-linear and adjoint algorithms. Parametrizations are commonly ported to accelerators using OpenACC (e.g., ). Wrappers around low-level legacy physics codes might then be designed to facilitate adoption within higher-level workflows . Lately, first attempts at refactoring physical parametrizations for performance portability have been documented in the literature. For instance, presented a rewrite of the MPAS-Albany Land Ice (MALI) ice-sheet model using Kokkos. Here, we present a Python implementation of the cloud microphysics schemes CLOUDSC and CLOUDSC2 , which are part of the physics suite of the IFS at ECMWF.
As we mention in Sect. , the versions of CLOUDSC and CLOUDSC2 considered in this study correspond to older release cycles of the IFS than the one currently used in production.
Details on the formulation and validation of the schemes are discussed in Sect. . The proposed Python implementations build upon the GT4Py toolchain, and in the remainder of the paper we use the term CLOUDSC-GT4Py to refer to the GT4Py rewrite of CLOUDSC, while the GT4Py ports of the non-linear, tangent-linear, and adjoint formulations of CLOUDSC2 are collectively referred to as CLOUDSC2-GT4Py. The working principles of the GT4Py framework are illustrated in Sect. , where we also advocate the advantages offered by domain-specific software approaches. Section sheds some light on the infrastructure code and how it can enable composable and reusable model components. In Sect. , we compare the performance of CLOUDSC-GT4Py and CLOUDSC2-GT4Py, as measured on three leadership-class GPU-equipped supercomputers, to established implementations in Fortran and C/C++. We conclude the paper with final remarks and future development paths.
2 Defining the targeted scientific applications
Several physical and chemical mechanisms occurring in the atmosphere are active on spatial scales that are significantly smaller than the highest affordable model resolution. It follows that these mechanisms cannot be properly captured by the resolved model dynamics but need to be parametrized. Parametrizations express the bulk effect of subgrid-scale phenomena on the resolved flow in terms of the grid-scale variables. The equations underneath physical parametrizations are based on theoretical and semi-empirical arguments, and their numerical treatment commonly adheres to the single-column abstraction; therefore adjustments can only happen within individual columns, with no data dependencies between columns. The atmospheric module of the IFS includes parametrizations dealing with the radiative heat transfer, deep and shallow convection, clouds and stratiform precipitation, surface exchange, turbulent mixing in the planetary boundary layer, subgrid-scale orographic drag, non-orographic gravity wave drag, and methane oxidation .
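To make the single-column abstraction concrete, the following minimal NumPy sketch (not taken from the IFS or the dwarfs; the array shapes and function names are illustrative) applies a column scheme independently to every column, which is precisely the data parallelism that CLOUDSC and CLOUDSC2 expose.

```python
import numpy as np

def apply_column_scheme(state, column_physics):
    """Apply a single-column parametrization to a (ncols, nlev) state array.

    Because the scheme has no horizontal data dependencies, the loop over
    columns is embarrassingly parallel and can be mapped onto CPU threads,
    SIMD lanes, or GPU threads.
    """
    ncols, nlev = state.shape
    tendencies = np.empty_like(state)
    for c in range(ncols):
        tendencies[c, :] = column_physics(state[c, :])
    return tendencies

# illustrative usage with a dummy column operator acting on 8 columns of 137 levels
tend = apply_column_scheme(np.random.rand(8, 137), lambda column: -0.1 * column)
```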
The focus of this paper is on the cloud microphysics modules of ECMWF: CLOUDSC – used in operational forecasting – and CLOUDSC2 – employed in data assimilation. The motivation is threefold:
- i. Both schemes are among the most computationally expensive parametrizations, with CLOUDSC accounting for up to % of the total execution time of the high-resolution operational forecasts at ECMWF.
- ii. They are representative of the computational patterns ubiquitous in physical parametrizations.
- iii. They already exist in the form of dwarfs. The weather and climate “computational dwarfs”, or simply “dwarfs”, are model components shaped into stand-alone software packages to serve as archetypes of relevant computational motifs and provide a convenient platform for performance optimizations and portability studies .
In recent years, the Performance and Portability team of ECMWF has created the CLOUDSC and CLOUDSC2 dwarfs. The original Fortran codes for both packages, corresponding respectively to the IFS Cycle 41r2 and Cycle 46r1, have been pulled out of the IFS code base, slightly polished, and finally made available in public code repositories. Compared to the original implementations run operationally at ECMWF, the CLOUDSC and CLOUDSC2 dwarf codes do not include (i) all the IFS-specific infrastructure code, (ii) the calculation of budget diagnostics, and (iii) dead code paths.
2.1 CLOUDSC: cloud microphysics for operational forecasting
CLOUDSC is a single-moment cloud microphysics scheme that parametrizes stratiform clouds and their contribution to surface precipitation . It was implemented in the IFS Cycle 36r4 and has been operational at ECMWF since November 2010. Compared to the pre-existing scheme, it accounts for five prognostic variables (cloud fraction, cloud liquid water, cloud ice, rain, and snow) and brings substantial enhancements in different aspects, including the treatment of mixed-phase clouds, the advection of precipitating hydrometeors (rain and snow), physical realism, and numerical stability . For a comprehensive description of the scheme, we refer the reader to and the references therein. For all the coding versions considered in this paper, including the novel Python rewrite, the calculations are validated by direct comparison of the output against serialized language-agnostic reference data provided by ECMWF.
2.2 CLOUDSC2: cloud microphysics in the context of data assimilation
The CLOUDSC2 scheme represents a streamlined version of CLOUDSC, devised for use in the four-dimensional variational assimilation (4D-Var) at ECMWF . The 4D-Var merges short-term model integrations with observations over a 12 h assimilation window to determine the best possible representation of the current state of the atmosphere. This then provides the initial conditions for longer-term forecasts . The optimal synthesis between model and observational data is found by minimizing a cost function, which is evaluated using the tangent-linear of the non-linear forecasting model, while the adjoint model is employed to compute the gradient of the cost function . For the sake of computational economy, the tangent-linear and adjoint operators are derived from a simplified and regularized version of the full non-linear model. CLOUDSC2 is one of the physical parametrizations included in the ECMWF's simplified model, together with radiation, vertical diffusion, orographic wave drag, moist convection, and non-orographic gravity wave activity . In the following, we provide a mathematical and algorithmic representation of the tangent-linear and adjoint versions of CLOUDSC2. For the sake of brevity, in the rest of the paper we will refer to the non-linear, tangent-linear, and adjoint formulations of CLOUDSC2 using CLOUDSC2NL, CLOUDSC2TL, and CLOUDSC2AD, respectively.
Let $\mathcal{N}: \mathbf{x} \mapsto \mathbf{y} = \mathcal{N}(\mathbf{x})$ be the functional description of CLOUDSC2, connecting the input fields $\mathbf{x}$ with the output variables $\mathbf{y}$. The tangent-linear operator $\mathcal{T}$ of $\mathcal{N}$ is derived from the Taylor series expansion

$$\mathcal{N}(\mathbf{x} + \delta\mathbf{x}) = \mathcal{N}(\mathbf{x}) + \mathcal{T}(\mathbf{x})\,\delta\mathbf{x} + O\!\left(\lVert \delta\mathbf{x} \rVert^{2}\right), \qquad (1)$$

where $\delta\mathbf{x}$ and $\delta\mathbf{y} = \mathcal{T}(\mathbf{x})\,\delta\mathbf{x}$ are variations on $\mathbf{x}$ and $\mathbf{y}$, and $\lVert \cdot \rVert$ is a suitable norm. The formal correctness of the coding implementation of $\mathcal{T}$ can be assessed through the Taylor test (also called the “V-shape” test), which ensures that the following condition is satisfied up to machine precision:

$$\lim_{\lambda \to 0} \frac{\lVert \mathcal{N}(\mathbf{x} + \lambda\,\delta\mathbf{x}) - \mathcal{N}(\mathbf{x}) \rVert}{\lVert \lambda\,\mathcal{T}(\mathbf{x})\,\delta\mathbf{x} \rVert} = 1. \qquad (2)$$

The logical steps carried out in the actual implementation of the Taylor test are sketched in Algorithm A1 (Appendix A).

The adjoint operator $\mathcal{T}^{*}$ of $\mathcal{T}$ is defined such that, for the inner product $\langle \cdot, \cdot \rangle$,

$$\langle \mathcal{T}(\mathbf{x})\,\delta\mathbf{x}, \delta\mathbf{y} \rangle = \langle \delta\mathbf{x}, \mathcal{T}^{*}(\mathbf{x})\,\delta\mathbf{y} \rangle. \qquad (3)$$

In particular, Eq. (3) must hold for any $\delta\mathbf{x}$ and for $\delta\mathbf{y} = \mathcal{T}(\mathbf{x})\,\delta\mathbf{x}$:

$$\langle \mathcal{T}(\mathbf{x})\,\delta\mathbf{x}, \mathcal{T}(\mathbf{x})\,\delta\mathbf{x} \rangle = \langle \delta\mathbf{x}, \mathcal{T}^{*}(\mathbf{x})\,\mathcal{T}(\mathbf{x})\,\delta\mathbf{x} \rangle. \qquad (4)$$

The latter condition is at the heart of the so-called symmetry test for $\mathcal{T}^{*}$ (see Algorithm A2 in Appendix A).
3 A domain-specific approach to scientific software development
In scientific software development, it is common practice to conceive a first proof-of-concept implementation of a numerical algorithm in a high-level programming environment like MATLAB/Octave , or Python. Because these languages do not require compilation and support dynamic typing, they provide a breeding ground for fast prototyping. However, the direct adoption of interpreted languages in HPC has historically been hindered by their intrinsic slowness. To squeeze more performance out of the underlying silicon, the initial proof of concept is translated into either Fortran, C, or C++. This leads to the so-called “two-language problem”, where the programming language used for the germinal prototyping is abandoned in favor of a faster language that might be more complicated to use. The lower-level code can be parallelized for shared-memory platforms using OpenMP directives, while distributed-memory machines can be targeted using Message Passing Interface (MPI) libraries. The resulting code can later be migrated to GPUs, which offer outstanding compute throughput and memory bandwidth, especially for single instruction, multiple data (SIMD) applications. GPU porting is accomplished using either OpenACC or OpenMP directives or via a CUDA (Compute Unified Device Architecture) implementation.
The schematic visualization in Fig. a highlights how the above workflow leads to multiple implementations of the same scientific application utilizing different programming models and coding styles. This unavoidably complicates software maintainability: ideally, any modification in the numerical model should be encoded in all implementations, so as to preserve the coherency across the hierarchy. The maintainability problem is exacerbated as the number of lines of code, the pool of platforms to support, and the user base increase. This situation is known as the “software productivity gap” , and we argue that it cannot be alleviated by relying on general-purpose programming paradigms and monolithic code designs. Instead, it calls for a more synergistic collaboration between domain scientists (who here include model developers, weather forecasters, and weather and climate scientists) and computer experts. A path forward is provided by DSLs through separation of concerns (Fig. b) so that domain scientists can express the science using syntactic constructs that are aligned with the semantics of the application domain and hide any architecture-specific detail. The resulting source code is thus hardware-agnostic, more concise, easier to read, and easier to manipulate. A toolchain developed by software engineers then employs automatic code generation techniques to synthesize optimized parallel code for the target computer architecture in a transparent fashion.
Figure 1
Diagrams comparing (a) a well-established workflow in scientific software development and (b) a DSL-based approach resembling the software engineering strategy advocated in this paper. The dashed red-and-blue line in (b) marks the separation-of-concerns between the domain scientists and the computer experts.
3.1 The GT4Py framework
GT4Py is a Python library for expressing and generating high-performance implementations of stencil kernels (a stencil is an operator that computes array elements by accessing a fixed pattern of neighboring items) as found in weather and climate applications. The library is developed and maintained by CSCS, ETH Zurich, and MeteoSwiss and benefits from important contributions by international partners such as the Paul Allen Institute for Artificial Intelligence (AI2). The choice of embedding the GT4Py framework in Python has been mainly dictated by the following factors:
- i. Python is taught in many academic courses due to its clean, intuitive, and expressive syntax; therefore a significant fraction of early-career domain scientists are exposed to the language.
- ii. It admits a powerful ecosystem of open-source packages for building end-to-end applications.
- iii. It is possible to seamlessly interface Python with lower-level languages with minimal overhead and virtually no memory copies.
- iv. Under the thrust of the artificial intelligence and machine learning (AI/ML) community, the popularity and adoption of Python across the whole scientific community is constantly growing, as opposed to Fortran .
The proposed Python implementations of CLOUDSC and CLOUDSC2 are based on the first public release of GT4Py, which only supports Cartesian grids. The latest advancements to support unstructured meshes, contained in a separate sub-package, are not exploited in the present work.
Figure 2
Simplified view of the internal stages carried out by the GT4Py toolchain to generate a high-performance CPU or GPU implementation of the three-dimensional Laplacian stencil starting from its GTScript definition. For the sake of visualization, only two intermediate representations (IRs) are included: the GridTools IR (GTIR) and the implementation IR (IIR).
Figure showcases the main steps undertaken by the GT4Py toolchain to translate the high-level definition of the three-dimensional Laplacian operator into optimized code, which can be directly called from within Python. The stencil definition is given as a regular Python function using the GTScript DSL. GTScript abstracts spatial for-loops away: computations are described for a single point of a three-dimensional Cartesian grid, and the vertical index bounds can be controlled using the interval construct within a computation block.
Any function marked with the gtscript.stencil decorator is processed by the GT4Py toolchain, which generates an optimized implementation through one of the available backends:
- NumPy is the de facto standard for array computing in Python and can be used for debugging and fast prototyping.
- GridTools is a set of libraries and utilities to write performance-portable applications in the area of weather and climate.
- DaCe is a parallel programming framework, which internally uses the Stateful DataFlow multiGraph (SDFG) data-centric intermediate representation to decouple domain science and performance engineering.
The generated code is compiled under the hood, and Python bindings for the resulting executable are automatically produced; therefore the stencil can finally be executed by passing the input and output fields and by specifying the origin and size of the computation domain. GT4Py provides convenient utilities to allocate arrays with an optimal memory layout for any given backend, relying on NumPy for CPU storages and CuPy for GPU storages. Concerning GPU computing, we highlight that GT4Py supports both NVIDIA and AMD GPUs.
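As an illustration of the workflow sketched in Fig. 2, the following minimal example defines and runs a Laplacian-type stencil with GTScript. It is a sketch rather than the exact code used in the paper: it assumes the gt4py.cartesian namespace of recent GT4Py releases (the first public release used in this work exposed the same constructs under a slightly different module path), and the backend string, grid sizes, and field names are arbitrary.

```python
import numpy as np
from gt4py.cartesian import gtscript
from gt4py.cartesian.gtscript import Field, PARALLEL, computation, interval


@gtscript.stencil(backend="numpy")  # alternatives include "gt:cpu_kfirst" and "dace:gpu"
def laplacian(phi: Field[np.float64], lap: Field[np.float64]):
    # computations are written for a single grid point; neighbors are addressed
    # through relative offsets, and the vertical bounds through interval()
    with computation(PARALLEL), interval(1, -1):
        lap = (
            -6.0 * phi[0, 0, 0]
            + phi[-1, 0, 0] + phi[1, 0, 0]
            + phi[0, -1, 0] + phi[0, 1, 0]
            + phi[0, 0, -1] + phi[0, 0, 1]
        )


nx, ny, nz = 32, 32, 16
phi = np.random.rand(nx, ny, nz)  # gt4py.storage also provides backend-optimal allocators
lap = np.zeros_like(phi)

# the compiled stencil is invoked with the fields plus the origin and
# extent of the computation domain
laplacian(phi, lap, origin=(1, 1, 0), domain=(nx - 2, ny - 2, nz))
```

Changing the backend string is all that is required to retarget the same definition to a different architecture.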
A more realistic and pertinent code sample is provided in Listing . It is an abridged GT4Py implementation of the procedure computing the saturation water vapor pressure as a function of air pressure and temperature. The code is extracted from the CLOUDSC2-GT4Py dwarf and highlights two additional features of GTScript: functions and external symbols. Functions can be thought of as macros and can be used to improve composability, reusability, and readability. External symbols are used to encode those scalar parameters (e.g., physical constants) that are kept constant throughout a simulation and might only change between different model setups. External values must be provided at stencil compilation time. The functionalities provided by the package
Listing 1
GTScript (the Python-embedded DSL exposed by GT4Py) functions and stencil computing the saturation water vapor pressure given the air pressure and temperature. Abridged excerpt from the CLOUDSC2-GT4Py dwarf.
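Since Listing 1 itself is not reproduced in this text version, the sketch below illustrates the two GTScript features discussed above, functions and external symbols, on a saturation-pressure-like computation. It is not the dwarf code: the formula is a generic Tetens-type expression, and the constant names and values are illustrative placeholders only loosely inspired by the IFS.

```python
import numpy as np
from gt4py.cartesian import gtscript
from gt4py.cartesian.gtscript import Field, PARALLEL, computation, exp, interval


@gtscript.function
def e_sat(t):
    # GTScript function: expanded inline wherever it is called; the scalar
    # parameters are resolved from the stencil externals at compile time
    from __externals__ import C1, C2, ES0, TMELT
    return ES0 * exp(C1 * (t - TMELT) / (t - C2))


@gtscript.stencil(
    backend="numpy",
    externals={"ES0": 611.21, "C1": 17.5, "C2": 32.2, "TMELT": 273.16},  # illustrative values
)
def saturation(
    t: Field[np.float64],
    ap: Field[np.float64],
    esat: Field[np.float64],
    qsat: Field[np.float64],
):
    with computation(PARALLEL), interval(...):
        esat = e_sat(t)  # saturation water vapor pressure [Pa]
        qsat = 0.622 * esat / (ap - 0.378 * esat)  # saturation specific humidity [kg kg-1]
```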
4 Infrastructure code
All stencils of CLOUDSC-GT4Py and CLOUDSC2-GT4Py are defined, compiled, and invoked within classes that leverage the functionalities provided by the Sympl package . Sympl is a toolset of Python utilities to write self-contained and self-documented model components. Because the components share a common application programming interface (API), they favor modularity, composability, and inter-operability . These aspects are of utmost importance, for instance, when it comes to assessing the impact of process coupling on weather forecasts and climate projections .
Listing 2
A Python class to compute the saturation water vapor pressure given the air pressure and temperature. Abridged excerpt from the CLOUDSC2-GT4Py dwarf.
Sympl components interact through dictionaries whose keys are the names of the model variables (fields) and whose values are xarray DataArray objects, carrying the raw data together with dimensions, coordinates, and attributes such as units.
The bespoke infrastructure code for CLOUDSC-GT4Py and CLOUDSC2-GT4Py is bundled as an installable Python package called
Listing brings a concrete example from CLOUDSC2-GT4Py: a model component leveraging the stencil defined in Listing to compute the saturation water vapor pressure. The class inherits from a Sympl-style diagnostic component base class; a schematic (non-verbatim) sketch of this pattern is given below.
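The following sketch is loosely patterned on Sympl's DiagnosticComponent interface (dictionaries of input and diagnostic properties plus an array_call method operating on raw arrays). The class, field, and dimension names are illustrative, and the sketch does not reproduce the dwarf's actual infrastructure package.

```python
import numpy as np
from sympl import DiagnosticComponent


class Saturation(DiagnosticComponent):
    """Wraps a compiled GT4Py stencil behind Sympl's dict-of-fields interface."""

    @property
    def input_properties(self):
        return {
            "air_pressure": {"dims": ["x", "y", "z"], "units": "Pa"},
            "air_temperature": {"dims": ["x", "y", "z"], "units": "K"},
        }

    @property
    def diagnostic_properties(self):
        return {
            "saturation_water_vapor_pressure": {"dims": ["x", "y", "z"], "units": "Pa"},
        }

    def __init__(self, stencil, grid_shape):
        super().__init__()
        self._stencil = stencil  # e.g. the `saturation` stencil sketched in Sect. 3.1
        self._shape = grid_shape

    def array_call(self, state):
        # `state` holds the raw arrays extracted (and unit-checked) by Sympl
        esat = np.zeros(self._shape)
        qsat = np.zeros(self._shape)
        self._stencil(
            t=state["air_temperature"], ap=state["air_pressure"],
            esat=esat, qsat=qsat,
            origin=(0, 0, 0), domain=self._shape,
        )
        return {"saturation_water_vapor_pressure": esat}
```

Because the component declares names, dimensions, and units of its inputs and outputs, it can be coupled to any host model exposing the same dictionary-based state, which is the interoperability argument made above.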
5 Performance
In this section, we highlight the results from comprehensive performance testing. We compare the developed CLOUDSC-GT4Py and CLOUDSC2-GT4Py codes against reference Fortran versions and various other programming prototypes. The simulations were performed on three different supercomputers:
- i. Piz Daint (https://www.cscs.ch/computers/piz-daint, last access: 23 September 2024), an HPE Cray XC40/XC50 system installed at CSCS in Lugano, Switzerland;
- ii. MeluXina (https://docs.lxp.lu/, last access: 23 September 2024), an ATOS BullSequana XH2000 machine hosted by LuxConnect in Bissen, Luxembourg, and procured by the EuroHPC Joint Undertaking (JU) initiative;
- iii. the HPE Cray EX235a supercomputer LUMI (https://docs.lumi-supercomputer.eu/, last access: 23 September 2024), an EuroHPC pre-exascale machine at CSC – IT Center for Science in Kajaani, Finland.
On each machine, the CLOUDSC and CLOUDSC2 applications are executed on a single hybrid node that sports one or multiple GPU accelerators alongside the host CPU. An overview of the node architectures for the three considered supercomputers can be found in Table .
Table 1 Overview of the node architecture for the hybrid partition of Piz Daint, MeluXina, and LUMI. Only the technical specifications most relevant for the purposes of this paper are reported.
| System | CPU | GPU | RAM | NUMA domains |
|---|---|---|---|---|
| Piz Daint | 1 Intel Xeon E5-2690v3 12c | 1 NVIDIA Tesla P100 16 GB | 64 GB | 1 |
| MeluXina | 2 AMD EPYC Rome 7452 32c | 4 NVIDIA Tesla A100 40 GB | 512 GB | 4 |
| LUMI | 1 AMD EPYC Trento 7A53 64c | 4 AMD Instinct MI250X | 512 GB | 8 |
Besides the GT4Py codes, we involve up to four alternative lower-level programming implementations, which will be documented in an upcoming publication:
- a. The baseline is Fortran enriched with OpenMP directives for multi-threaded execution on CPU.
- b. An optimized GPU-enabled version based on OpenACC, using the single-column coalesced (SCC) loop layout in combination with loop fusion and temporary local array demotion (so-called “k-caching”). While the SCC loop layout yields more efficient access to device memory and increased parallelism, the k-caching technique significantly reduces register pressure and memory traffic. This is achieved via loop fusion, which eliminates most loop-carried dependencies and consequently allows temporaries to be demoted to scalars (a toy sketch of this idea is given after this list).
- c. The currently best-performing Loki-generated and GPU-enabled variant.
- d. An optimized GPU-enabled version of CLOUDSC including k-caching. The code is written in either CUDA or HIP, to target both NVIDIA GPUs (shipped with Piz Daint and MeluXina) and AMD GPUs (available on LUMI).
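The following toy sketch (plain Python/NumPy, unrelated to the actual Fortran/CUDA dwarf code) illustrates the idea behind the k-caching referenced in variant (b): once the two vertical sweeps are fused, the per-level temporary array can be demoted to a scalar that lives in a register.

```python
import numpy as np

def column_two_sweeps(q):
    """Unfused version: a full per-level temporary array is written and re-read."""
    nlev = q.shape[0]
    tmp = np.empty(nlev)
    for k in range(nlev):
        tmp[k] = 0.5 * (q[k] + q[max(k - 1, 0)])  # first sweep fills the temporary
    out = np.empty(nlev)
    for k in range(nlev):
        out[k] = q[k] - tmp[k]                    # second sweep consumes it
    return out

def column_fused_kcached(q):
    """Fused version: the temporary is demoted to a scalar (k-caching)."""
    nlev = q.shape[0]
    out = np.empty(nlev)
    for k in range(nlev):
        tmp_k = 0.5 * (q[k] + q[max(k - 1, 0)])   # lives in a register, not in memory
        out[k] = q[k] - tmp_k
    return out
```

Both functions produce identical results; the fused variant simply avoids the memory traffic associated with the temporary array.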
Table 2 For each coding version of the CLOUDSC and CLOUDSC2 dwarfs considered in the performance analysis, the table reports the compiler suite used to compile the codes on Piz Daint, MeluXina, and LUMI. The codes are compiled with all major optimization options enabled. Those implementations which are either not available or not working are marked with a dash; more details, as well as a high-level description of each coding implementation, are provided in the text.
| System | Implementation | CLOUDSC | CLOUDSC2: non-linear | CLOUDSC2: symmetry test |
|---|---|---|---|---|
| Piz Daint | Fortran: OpenMP (CPU) | Intel Fortran 2021.3.0 | Intel Fortran 2021.3.0 | Intel Fortran 2021.3.0 |
| | Fortran: OpenACC (GPU) | NVIDIA Fortran 21.3-0 | – | – |
| | Fortran: Loki (GPU) | NVIDIA Fortran 21.3-0 | NVIDIA Fortran 21.3-0 | – |
| | C: CUDA (GPU) | NVIDIA CUDA 11.2.67 | – | – |
| | GT4Py: CPU k-first | g++ (GCC) 10.3.0 | g++ (GCC) 10.3.0 | g++ (GCC) 10.3.0 |
| | GT4Py: DaCe (GPU) | NVIDIA CUDA 11.2.67 | NVIDIA CUDA 11.2.67 | NVIDIA CUDA 11.2.67 |
| MeluXina | Fortran: OpenMP (CPU) | NVIDIA Fortran 22.7-0 | NVIDIA Fortran 22.7-0 | – |
| | Fortran: OpenACC (GPU) | NVIDIA Fortran 22.7-0 | – | – |
| | Fortran: Loki (GPU) | NVIDIA Fortran 22.7-0 | NVIDIA Fortran 22.7-0 | – |
| | C: CUDA (GPU) | NVIDIA CUDA 11.7.64 | – | – |
| | GT4Py: CPU k-first | g++ (GCC) 11.3.0 | g++ (GCC) 11.3.0 | g++ (GCC) 11.3.0 |
| | GT4Py: DaCe (GPU) | NVIDIA CUDA 11.7.64 | NVIDIA CUDA 11.7.64 | NVIDIA CUDA 11.7.64 |
| LUMI | Fortran: OpenMP (CPU) | Cray Fortran 14.0.2 | Cray Fortran 14.0.2 | Cray Fortran 14.0.2 |
| | Fortran: OpenACC (GPU) | Cray Fortran 14.0.2 | – | – |
| | Fortran: Loki (GPU) | Cray Fortran 14.0.2 | Cray Fortran 14.0.2 | – |
| | C: HIP (GPU) | Cray C/C++ 15.0.1 | – | – |
| | GT4Py: CPU k-first | Cray C/C++ 15.0.1 | Cray C/C++ 15.0.1 | Cray C/C++ 15.0.1 |
| | GT4Py: DaCe (GPU) | Cray C/C++ 15.0.1 | Cray C/C++ 15.0.1 | Cray C/C++ 15.0.1 |
Table documents the compiler specifications employed for each of the programming implementations on Piz Daint, MeluXina, and LUMI. We consistently apply the most aggressive optimization, ensuring that the underlying code manipulations do not harm validation. For the different algorithms under consideration, validation is carried out as follows:
- For CLOUDSC and CLOUDSC2NL, the results from each coding version are directly compared with serialized reference data produced on the CPU. For each output field, we perform an element-wise comparison using the NumPy function allclose (https://numpy.org/devdocs/reference/generated/numpy.allclose.html, last access: 23 September 2024); a schematic sketch of this check is given after this list. Specifically, the GT4Py rewrites validate on both CPU and GPU, with an absolute and relative tolerance of and when employing double precision. When reducing the precision to 32 bits, the absolute and relative tolerance levels need to be increased to and on CPU and and on GPU. In the latter case, we observe that the field representing the enthalpy flux of ice still does not pass validation. We attribute the larger deviation from the baseline data on the device to the different instruction sets underneath CPUs and GPUs.
- All implementations of CLOUDSC2TL and CLOUDSC2AD are validated using the Taylor test (see Algorithm ) and the symmetry test (see Algorithm ), respectively. In this respect, we emphasize that the GT4Py implementations satisfy the conditions of both tests on all considered computing architectures, regardless of whether double or single precision is employed.
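The element-wise check referenced in the first item amounts to the following sketch; the tolerance values shown are placeholders and do not reproduce the exact thresholds used in the study.

```python
import numpy as np

def validates(field, reference, single_precision=False, on_gpu=False):
    """Compare a model output field against the serialized reference data."""
    if not single_precision:
        atol = rtol = 1e-12   # illustrative FP64 tolerances
    elif not on_gpu:
        atol = rtol = 1e-6    # illustrative FP32 tolerances on CPU
    else:
        atol = rtol = 1e-5    # illustrative, looser FP32 tolerances on GPU
    return np.allclose(field, reference, atol=atol, rtol=rtol)
```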
The source repositories for the CLOUDSC and CLOUDSC2 dwarfs may include multiple variants of each reference implementation, differing in their optimization strategies. In our analysis, we always take into account the fastest variant of each alternative implementation; for the sake of reproducibility, Table contains the strings identifying the coding versions under consideration and the corresponding NPROMA employed in the runs. NPROMA blocking is a cache optimization technique adopted in all Fortran codes considered in this paper. Given a two-dimensional array shaped , this is re-arranged as a three-dimensional array shaped . Commonly, the leading dimension of the three-dimensional array is called “NPROMA”, with being the “NPROMA blocking factor”. Here, we indicate simply as “NPROMA” for the sake of brevity. For further discussion of the NPROMA blocking, we refer the reader to and .
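As a concrete illustration of NPROMA blocking, the sketch below re-arranges a (columns × levels) field into NPROMA blocks with NumPy. The (NPROMA, levels, blocks) layout and the zero-padding of the last block are illustrative of the technique, not a verbatim description of the IFS data structures.

```python
import numpy as np

def to_nproma_blocks(field, nproma):
    """Re-arrange a (ncols, nlev) field into a (nproma, nlev, nblocks) array."""
    ncols, nlev = field.shape
    nblocks = -(-ncols // nproma)                  # ceiling division
    padded = np.zeros((nblocks * nproma, nlev), dtype=field.dtype)
    padded[:ncols] = field
    # split the column dimension into blocks and make NPROMA the leading axis
    blocked = padded.reshape(nblocks, nproma, nlev).transpose(1, 2, 0)
    return np.ascontiguousarray(blocked)

# e.g. 65 536 columns and 137 levels split into blocks of NPROMA = 128
blocked = to_nproma_blocks(np.random.rand(65536, 137), nproma=128)
print(blocked.shape)  # (128, 137, 512)
```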
Similarly, for all Python implementations we consider only the most performant backends of GT4Py: the GridTools C++ CPU backend with k-first memory layout and the DaCe GPU backend.
Table 3 For each reference implementation of the CLOUDSC and CLOUDSC2 dwarfs, the table reports the string identifying the specific variant considered in the performance analysis on Piz Daint, MeluXina, and LUMI. The corresponding NPROMA is provided within parentheses. Those implementations which are either not available or not working are marked with a dash.
| System | Implementation | CLOUDSC | CLOUDSC2: non-linear | CLOUDSC2: symmetry test |
|---|---|---|---|---|
| Piz Daint | Fortran: OpenMP (CPU) | | | |
| | Fortran: OpenACC (GPU) | | – | – |
| | Fortran: Loki (GPU) | | | – |
| | C: CUDA (GPU) | | – | – |
| MeluXina | Fortran: OpenMP (CPU) | | | – |
| | Fortran: OpenACC (GPU) | | – | – |
| | Fortran: Loki (GPU) | | | – |
| | C: CUDA (GPU) | | – | – |
| LUMI | Fortran: OpenMP (CPU) | | | |
| | Fortran: OpenACC (GPU) | | – | – |
| | Fortran: Loki (GPU) | | | – |
| | C: HIP (GPU) | | – | – |
Figure 3
Execution time on a single NUMA domain of a hybrid node of the Piz Daint supercomputer for CLOUDSC (a, d), CLOUDSC2NL (b, e), and the symmetry test for CLOUDSC2TL and CLOUDSC2AD (c, f) using either double-precision (a, b, c) or single-precision (d, e, f) floating point arithmetic. The computational domain consists of 65 536 columns and 137 vertical levels. Displayed are the multi-threaded Fortran baseline using OpenMP (grey); two GPU-accelerated Fortran implementations, using either OpenACC directives (lime) or the source-to-source translation tool Loki (yellow); an optimized CUDA C version (green); and the GT4Py rewrite, using either the GridTools C++ CPU backend with k-first data ordering (blue) or the DaCe GPU backend (orange). All numbers should be interpreted as an average over realizations. The panels only show the code versions available and validating at the time of writing.
For the interpretation of the CPU-versus-GPU performance numbers, we note that host codes are executed on all the cores available on a single non-uniform memory access (NUMA) domain of a compute node, while device codes are launched on the GPU attached to that NUMA domain. In a distributed-memory context, this choice allows the same number of MPI ranks to be fit per node, on either CPU or GPU. Table reports the number of NUMA partitions per node for Piz Daint, MeluXina, and LUMI, with the compute and memory resources being evenly distributed across the NUMA domains. Note that the compute nodes of the GPU partition of LUMI have the low-noise mode activated, which reserves one core per NUMA domain to the operating system; therefore only seven out of eight cores are available to the jobs. Moreover, we highlight that each MI250X GPU consists of two graphics compute dies (GCDs) connected via four AMD Infinity Fabric links but not sharing physical memory. From a software perspective, each compute node of LUMI is equipped with eight virtual GPUs (vGPUs), with each vGPU corresponding to a single GCD and assigned to a different NUMA domain.
Figure 4
As Fig. but for the MeluXina supercomputer.
Figures – visualize the execution times for CLOUDSC (left column), CLOUDSC2NL (center column), and the symmetry test for CLOUDSC2TL and CLOUDSC2AD (right column) for Piz Daint, MeluXina, and LUMI, respectively.
When measuring the performance of the symmetry test, the validation procedure – corresponding to lines 11–23 of Algorithm – is switched off.
All performance numbers refer to a grid size of 65 536 columns, with each column featuring 137 vertical levels. In each figure, execution times are provided for simulations running either entirely in double precision (corresponding to the 64-bit IEEE format and denoted as FP64; top row) or entirely in single precision (corresponding to the 32-bit IEEE format and denoted as FP32; bottom row). Within each panel, the plotted bars reflect the execution time of the various codes, with a missing bar indicating that the corresponding (non-GT4Py) code is either not available or not working properly, specifically as follows:
- The Fortran version of CLOUDSC2AD can only run on a single OpenMP thread on MeluXina (the issue is still under investigation).
- A native GPU-enabled version of CLOUDSC using 32-bit floating point arithmetic does not exist at the time of writing, and no CUDA/HIP implementations are available for CLOUDSC2.
- All Fortran-based implementations of the three formulations of CLOUDSC2 can only use double-precision computations.
- A Loki version of CLOUDSC2TL and CLOUDSC2AD is not available at the time of writing.
Notably, we find the GT4Py rewrite of both CLOUDSC and CLOUDSC2 to be very robust, as the codes execute on every CPU and GPU architecture included in the study and can always employ either double- or single-precision floating point arithmetic. With GT4Py, changing the backend with the respective target architecture, or changing the precision of computations, is as easy as setting a namelist parameter. Moreover, at the time of writing, the GT4Py implementations of the more complex tangent-linear and adjoint formulations of CLOUDSC2 were the first codes enabling GPU execution, again both in double or single precision.
Figure 5
As Fig. but for the LUMI supercomputer.
The performance of the high-level Python with GT4Py compares well against Fortran with OpenACC. The runtimes for GT4Py with its DaCe backend versus OpenACC are similar on Piz Daint, MeluXina, and LUMI. One outlier is the double-precision result on LUMI, for which the OpenACC code appears relatively slow. We suppose this behavior is associated with the insufficient OpenACC support in the HPE Cray compiler. Only the HPE Cray compiler implements GPU offloading capabilities for OpenACC directives on AMD GPUs, meaning that Fortran OpenACC codes require an HPE Cray platform to run on AMD GPUs. In contrast, GT4Py relies on the HIPCC compiler driver developed by AMD to compile device code for AMD accelerators, which guarantees proper functioning irrespective of the machine vendor. We further note that the DaCe backend of GT4Py executes roughly 2 times faster on MeluXina's NVIDIA A100 GPUs than on LUMI's AMD Instinct MI250X GPUs. As mentioned above, from a software perspective, each physical GPU module on LUMI is exposed as two virtual GPUs; therefore the code is actually executed on half of a physical GPU card. Hence we can speculate that, if both dies of an AMD Instinct MI250X GPU were used, performance would be on par with the NVIDIA A100 GPU.
Another interesting result is that both CLOUDSC-GT4Py and CLOUDSC2-GT4Py are consistently faster than the implementations generated with Loki. Loki allows bespoke transformation recipes to be built to apply changes to programming models and coding styles in an automated fashion. Therefore, GPU-enabled code can be produced starting from the original Fortran by, e.g., automatically adding OpenACC directives. However, because not all optimizations are encoded yet in the transformations, the Loki-generated device code cannot achieve optimal performance. Notwithstanding, source-to-source translators such as Loki are of high relevance for enabling GPU execution with large legacy Fortran code bases.
As used in this paper, GT4Py cannot yet attain the performance achieved by manually optimized native implementations with either Fortran on CPU or CUDA/HIP on GPU. Multi-threaded Fortran can be up to 3 times faster than the GridTools CPU backend of GT4Py using the k-first (C-like) memory layout, while the DaCe GPU backend of GT4Py can be up to a factor of 2 slower than CUDA/HIP. On the one hand, so far the development of GT4Py has been focused on GPU execution (see, e.g., ) because this will be the dominant hardware for time-critical applications in the years to come. On the other hand, we stress that the k-caching CUDA and HIP variants of CLOUDSC were semi-automatically generated by performance engineering experts, starting from an automatic Fortran-to-C transpilation of the SCC variants and manually applying additional optimizations that require knowledge about the specific compute patterns in the application. This process is not scalable to the full weather model and not a sustainable code adaptation method. In contrast, no significant performance engineering (by, e.g., loop fusing and reducing the number of temporary fields) has been applied yet with CLOUDSC-GT4Py and CLOUDSC2-GT4Py.
Figure 6
For the GT4Py rewrites of CLOUDSC (a, d), CLOUDSC2NL (b, e), and the symmetry test for CLOUDSC2TL and CLOUDSC2AD (c, f), fraction of the total execution time spent within the stencil computations (full bars) and the Python side of the application (hatched bars) on Piz Daint, MeluXina, and LUMI. Results are shown for the GridTools C++ CPU backend with k-first data ordering (blue) and the DaCe GPU backend (orange), using either double-precision (a, b, c) or single-precision (d, e, f) floating point arithmetic.
To rule out the possibility that the performance gap between the Python DSL and lower-level codes is associated with overhead originating from Python, Fig. displays the fraction of runtime spent within the stencil code generated by GT4Py and the high-level Python code of the application (infrastructure and framework code; see Sect. ). Across the three supercomputers, the Python overhead decreases as (i) the complexity and length of computations increase, (ii) the peak throughput and bandwidth delivered by the hardware underneath decrease, and (iii) the floating point precision increases. On average, the Python overhead accounts for % of the total runtime on GPU and % on CPU; the latter corresponds to % relative to the Fortran execution time.
Finally, we observe a significant sensitivity of the GPU performance with respect to the thread block size (in the Fortran code, the thread block size corresponds to the NPROMA): for values smaller than 128, performance is degraded across all implementations, with the gap between CUDA/HIP and GT4Py+DaCe becoming smaller. This shows that some tuning and toolchain optimizations can still be performed to improve performance with the DSL approach.
6 Conclusions
The CLOUDSC and CLOUDSC2 cloud microphysics schemes of the IFS at ECMWF have served as demonstrators to show the benefits of a high-level domain-specific implementation approach to physical parametrizations. We presented two Python implementations based on the GT4Py framework, in which the scientific code is insulated from hardware-specific aspects. The resulting application provides a productive user interface with enhanced readability and maintainability and can run efficiently and in a very robust manner across a wide spectrum of compute architectures and systems. The approach can be powerful in light of the increasingly complex HPC technology landscape, where general-purpose CPUs are increasingly complemented by domain-specific architectures (DSAs) such as GPUs, tensor processing units (TPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). In addition to the CLOUDSC scheme used in the IFS forecast model, we have presented results with the GT4Py rewrites of the non-linear, tangent-linear, and adjoint formulations of CLOUDSC2 used in data assimilation. In both CLOUDSC-GT4Py and CLOUDSC2-GT4Py, the stencil kernels are encapsulated within model components sharing a minimal and clear interface. By avoiding any assumption on the host model, the interface aims to provide interoperable and plug-and-play physical packages, which can be transferred more easily between different modeling systems with virtually no performance penalty.
We carried out a comprehensive study to assess the portability of the Python codes across three major supercomputers, differing in terms of the vendor, the node architecture, the software stack, and the compiler suites. We showed that the GPU performance of GT4Py codes is competitive against optimized Fortran OpenACC implementations and performs particularly well when compared to the available codes generated with the Loki source-to-source translation tool. Low-level implementations written in either Fortran for CPUs or CUDA/HIP for GPUs, with additional optimizations that possibly require knowledge about the specific compute patterns, can provide better performance but are extremely challenging to create and maintain for entire models. The CPU performance of GT4Py is currently suboptimal, but this is an expected result given the focus on GPUs in the DSL development so far, and we clearly expect this to improve significantly with upcoming and future GT4Py versions. Indeed, since the DSL can accommodate any specific low-level optimization, attaining the same performance as native, expert-written Fortran and CUDA/HIP models is feasible and will be the target of our future efforts.
The presented results, based on a representative physical parametrization and considering tangent-linear and adjoint versions, add to the notion that weather and climate model codes can execute significantly faster on GPUs , and the number of HPC systems with accelerators is steadily increasing.
In the 63rd edition of the TOP500 list published in June 2024, 190 out of the 500 most powerful supercomputers in the world use graphics accelerator technology.
The current study supports our ongoing efforts and plans to port other physical parametrizations to Python with GT4Py. However, we note that GT4Py was originally devised to express the computational motifs of dynamical cores based on grid point methods, so not all patterns found in the parametrizations are natively supported by the DSL. In this respect, current limitations of the DSL exist for higher-dimensional fields (e.g., arrays storing a tensor at each grid point), but again we expect these to be fully supported for HPC with future extensions of GT4Py. In addition, Python offers a variety of alternative libraries that can be employed in conjunction with GT4Py to generate fast machine code.
Appendix A Algorithmic description of the Taylor test and symmetry test for CLOUDSC2
In Sect. , we briefly described the aim and functioning of the Taylor test and the symmetry test for CLOUDSC2. Here, we detail the logical steps performed by the two tests with the help of pseudo-codes encapsulated in Algorithms and .
Algorithm A1 The Taylor test assessing the formal correctness of the coding implementation of the tangent-linear formulation of CLOUDSC2, denoted as CLOUDSC2TL.
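Algorithm A1 is not reproduced in this text version; the following NumPy sketch conveys its logic under simplifying assumptions. The callables nl and tl stand for the non-linear scheme and its tangent-linear, and the number of tested values of λ is illustrative.

```python
import numpy as np

def taylor_test(nl, tl, x, dx, n_scales=10):
    """V-shape test: the ratio below must approach 1 as lambda -> 0, until
    round-off errors start to dominate (hence the characteristic V shape)."""
    y0 = nl(x)
    ty = tl(x, dx)
    deviations = []
    for i in range(1, n_scales + 1):
        lam = 10.0 ** (-i)
        ratio = np.linalg.norm(nl(x + lam * dx) - y0) / np.linalg.norm(lam * ty)
        deviations.append(abs(ratio - 1.0))
        print(f"lambda = {lam:.0e}  ratio = {ratio:.12f}")
    # the test passes if the smallest deviation is close to machine precision
    return min(deviations)
```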
Algorithm A2 The symmetry test assessing the formal correctness of the coding implementation of the adjoint formulation of CLOUDSC2, denoted as CLOUDSC2AD.
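Similarly, Algorithm A2 reduces to a check of Eq. (4), sketched below; the callables tl and ad (tangent-linear and adjoint) and the tolerance factor are illustrative.

```python
import numpy as np

def symmetry_test(tl, ad, x, dx):
    """Check <T(x) dx, T(x) dx> == <dx, T*(x) T(x) dx> up to round-off."""
    tdx = tl(x, dx)                # tangent-linear perturbation T(x) dx
    lhs = np.dot(tdx, tdx)         # <T dx, T dx>
    rhs = np.dot(dx, ad(x, tdx))   # <dx, T* (T dx)>
    tol = 100.0 * np.finfo(float).eps * max(abs(lhs), abs(rhs), 1.0)
    return abs(lhs - rhs) <= tol
```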
Code and data availability
The source codes for
Author contributions
SU ported the CLOUDSC and CLOUDSC2 dwarfs to Python using GT4Py and ran all the numerical experiments presented in the paper, under the supervision of CK and HW. SU further contributed to the development of the infrastructure code illustrated in Sect. , under the supervision of CS, LS, and TCS. MS made relevant contributions to the Fortran and C reference implementations of the ECMWF microphysics schemes. SU and CK wrote the paper, with feedback from all co-authors.
Competing interests
The contact author has declared that none of the authors has any competing interests.
Disclaimer
Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.
Acknowledgements
We would like to thank three anonymous referees for carefully reviewing the manuscript and providing many constructive comments. We acknowledge EuroHPC JU for awarding the project ID 200177 access to the MeluXina supercomputer at LuxConnect and the project ID 465000527 access to the LUMI system at CSC and thank Thomas Geenen and Nils Wedi from Destination Earth for their help. We are grateful to Michael Lange and Balthasar Reuter for discussions and support regarding IFS codes.
Financial support
This study was conducted as part of the Platform for Advanced Scientific Computing (PASC)-funded project KILOS (“Kilometer-scale non-hydrostatic global weather forecasting with IFS-FVM”), which also provided us with computing resources on the Piz Daint supercomputer at CSCS. Christian Kühnlein was supported by the ESiWACE3 project funded by the European High Performance Computing Joint Undertaking (EuroHPC JU) and the European Union (EU) under grant agreement no. 101093054.
Review statement
This paper was edited by Peter Caldwell and reviewed by three anonymous referees.
© 2025. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”).