Introduction
After the Indian Ocean tsunami of 26 December 2004, it was emphasised that one of the greatest contributions of science to society is to serve it purposefully, for instance by providing forecasts that allow communities to respond before a disaster strikes. In the last 12 years, the numerical modelling of tsunamis has made great progress. There is a variety of mathematical models, such as the shallow-water equations, the Boussinesq equations, or the Navier–Stokes equations, and a large number of implementations, primarily targeting individual computer architectures. The use cases of such models are wide ranging, and most rely on high numerical accuracy as well as high computational performance to deliver results – examples include sensitivity analyses, probabilistic tsunami hazard assessments, and more efficient and informed tsunami early warning.
For widespread use, three key ingredients are needed: first, the stability and robustness of the numerical approach, which gives confidence in the results produced; second, the computational performance of the code, which allows the right results to be obtained quickly while efficiently utilising the available computational resources; and third, the ability to integrate into a workflow, allowing for simple preprocessing and post-processing and efficiently supporting the use cases that arise – for example, large numbers of different initial conditions.
In Sect. 2 we discuss a number of codes currently used in production, which as such are trusted and reliable components of a workflow. Yet the computational performance of most of these codes is merely “good enough”; they were written by domain scientists and may have been tuned to one architecture or another, but, for example, GPU support is almost non-existent. In today's and tomorrow's quickly changing hardware landscape, however, “future-proofing” numerical codes is of exceptional importance for continued scientific delivery. Domain scientists cannot be expected to keep up with architectural advances and spend a significant amount of time refactoring code for new hardware. What to compute must be separated from how it is computed – indeed, in a recent paper, leaders in the weather community chart the ways forward and point to domain-specific languages (DSLs) as a potential way to address this issue.
OP2 is such a DSL, embedded in C/C++ and Fortran; it has been in development since 2009. It provides an abstraction for expressing unstructured mesh computations at a high level, together with automated tools that translate scientific code written once into a range of high-performance implementations targeting multicore central processing units (CPUs), graphics processing units (GPUs), and large heterogeneous supercomputers. The original VOLNA model was already discussed and validated in detail – it was used in production for small-scale experiments and modelling but was inadequate for large-scale scenarios and statistical analysis; it was therefore re-implemented on top of OP2. This paper describes the process, challenges, and results of that work.
As VOLNA-OP2 delivered a qualitative leap in terms of possible uses thanks to the high performance it can deliver on a variety of hardware architectures, its users have started integrating it into a wide variety of workflows. One of the key uses is uncertainty quantification: the stochastic inversion problem of the 2004 Sumatra tsunami, the development of Gaussian process emulators that help reduce the number of simulation runs, applications of stochastic emulators to a submarine slide at the Rockall Bank, a study of run-up behind islands, the durability of oscillating wave surge converters when hit by tsunamis, tsunamis in the St. Lawrence Estuary, a study of the generation and inundation phases of tsunamis, and others.
The time dependency in the seabed deformation enables the tsunami to be actively generated. This is a step forward from the common passive mode of tsunami genesis, which assumes an instantaneous rupture. The active mode is particularly important for tsunamigenic earthquakes with long and slow ruptures, e.g. the 2004 Sumatra–Andaman event, and for submerged landslides, e.g. the Rockall Bank event.
These applications present a number of challenges for workflow integration as well as for scalable performance: the need for extracting snapshots of state variables on the full mesh or at a number of specified locations, and for capturing the maximum wave elevation or inundation – all in the context of distributed memory execution.
As the above references indicate, VOLNA-OP2 has already been key in delivering scientific results in a range of scenarios, and through the collaboration of the authors it is now capable of efficiently supporting a number of use cases, making it a versatile tool for the community; we have therefore now publicly released it, and it is freely available (see the code availability statement below).
The rest of the paper is organised as follows: Sect. 2 discusses related work; Sect. 3 presents the OP2 library, upon which VOLNA-OP2 is built; Sect. 4 discusses the VOLNA simulator itself, its structure, and its features; Sect. 5 discusses performance and scalability results on CPUs and GPUs; and finally Sect. 6 draws conclusions.
Related work
Tsunamis have long been a key target for scientific simulations, and the literature gives a detailed look at the various mathematical, numerical, and implementation approaches to past and current tsunami simulations. The most common set of equations solved are the shallow-water equations, and most codes use structured and nested meshes. A popular discretisation is finite differences; such codes include NOAA's MOST, COMCOT, and CENALT. On more flexible meshes, many codes, such as SELFE, TsunAWI, ASCETE, and Firedrake-Fluids, use a finite-element discretisation, or a finite-volume discretisation in the cases of the VOLNA code, GeoClaw, and HySEA. Another model is described by the Boussinesq equations – these equations and their solvers are more complex than shallow-water solvers. Since they are primarily needed only for dispersion, they are used less commonly; examples include FUNWAVE and COULWAVE. Finally, the 3-D Navier–Stokes equations provide the most complete description, but they are significantly more complex than the other models – an example is SAGE.
Most of the codes described above run on CPUs, and while there has been some work on GPU implementations using structured meshes and finite differences or finite volumes, it is unclear whether these are used in production, and they are not open source. Celeris is a Boussinesq solver that uses finite volumes and a structured mesh – it is hand-coded for GPUs using graphics shaders, and its source code is available; however, it can only use a single GPU.
As far as we are aware, only Tsunami-HySEA, which also uses finite volumes, uses GPU clusters in production – that code, however, only supports GPUs and is hand-written in CUDA. Reported performance on a 10-million-point test case shows a strong scaling efficiency between 88 % and 73 % when going from 1 to 12 GPUs (overall, 12 GPUs are 5.88 times faster than 1 GPU) and a speed-up with 1 GPU over an unspecified (likely single-core) CPU implementation. A direct comparison to VOLNA-OP2 is not possible, since Tsunami-HySEA uses (nested) structured meshes and the multi-GPU version is not open source.
Build system with OP2.
[Figure omitted. See PDF]
The OP2 domain-specific language
The OP2 library is a DSL embedded in C and Fortran that allows unstructured mesh algorithms to be expressed at a high level and provides automatic parallelisation along with a number of other features. It offers an abstraction that lets the domain scientist describe a mesh using a number of sets (such as quadrilaterals or vertices), connections among these sets (such as edges to nodes), and data defined on sets (such as x and y coordinates on vertices). Once the mesh is defined, an algorithm can be implemented as a sequence of parallel loops, each over all elements of a given set, applying different “kernel functions” and accessing data either directly on the iteration set or indirectly through, at most, one level of indirection. This abstraction enables the implementation of a wide range of algorithms, such as the finite-volume algorithms that VOLNA uses, but it does require that, for any given parallel loop, the order of execution must not affect the end result (within machine precision) – this precludes the implementation of Gauss–Seidel iterations, for example.
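To make this abstraction concrete, the fragment below declares a toy two-cell mesh and launches a single parallel loop through OP2's C/C++ API. It is an illustrative sketch only: the set, map, and kernel names are hypothetical rather than taken from VOLNA, and the calls follow the publicly documented OP2 interface.

```c
// Illustrative OP2 usage with a toy two-cell mesh (hypothetical names, not VOLNA source).
#include "op_seq.h"

// User kernel applied to one edge: add the edge flux to the two adjacent cells.
void accumulate(const double *flux, double *cellL, double *cellR) {
  cellL[0] -= flux[0];
  cellR[0] += flux[0];
}

int main(int argc, char **argv) {
  op_init(argc, argv, 2);                 // set up OP2 (and MPI when built with it)

  int edge2cell[2] = {0, 1};              // one edge joining cell 0 and cell 1
  double state[2]  = {1.0, 0.0};          // one value per cell
  double flux[1]   = {0.5};               // one value per edge

  op_set cells   = op_decl_set(2, "cells");
  op_set edges   = op_decl_set(1, "edges");
  op_map e2c     = op_decl_map(edges, cells, 2, edge2cell, "edge2cell");
  op_dat d_state = op_decl_dat(cells, 1, "double", state, "state");
  op_dat d_flux  = op_decl_dat(edges, 1, "double", flux,  "flux");

  // Parallel loop over edges; cell data are incremented indirectly through e2c,
  // so OP2 is responsible for avoiding race conditions on parallel back-ends.
  op_par_loop(accumulate, "accumulate", edges,
              op_arg_dat(d_flux,  -1, OP_ID, 1, "double", OP_READ),
              op_arg_dat(d_state,  0, e2c,   1, "double", OP_INC),
              op_arg_dat(d_state,  1, e2c,   1, "double", OP_INC));

  op_exit();
  return 0;
}
```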
OP2 enables its users to write an application only once using its API; the application is then automatically parallelised to utilise multicore CPUs, GPUs, and large supercomputers through the use of MPI, OpenMP, and CUDA. This is carried out in part through a code generator that parses the parallel loop expressions and generates code around the computational kernel to facilitate parallelism and data movement, and in part through different back-end libraries that manage data, including MPI halo exchanges and GPU memory management, as shown in Fig. . Further details are given in the OP2 literature.
Parallelisation approaches in OP2
OP2 takes full responsibility for orchestrating parallelism and data movement – from the user perspective, the code written looks and feels like sequential C code that makes calls to an external library. To utilise clusters and supercomputers, OP2 uses the message passing interface (MPI) to parallelise in a distributed memory environment; once the mesh is defined by the user, OP2 automatically partitions and distributes it among the available resources. It uses the standard owner-compute approach with halo exchanges and overlaps computations with communications. In conjunction with MPI, OP2 uses a number of shared-memory parallelisation approaches, such as CUDA and OpenMP.
A key challenge in the fine-grained parallelisation of unstructured mesh algorithms is the avoidance of race conditions when data are modified indirectly. For example, in a parallel loop over edges that indirectly increments data on vertices, multiple edges may try to increment the same vertex, leading to race conditions. OP2 uses a colouring approach to resolve this: elements of the iteration set are grouped into mini-partitions, and each element within these mini-partitions is coloured so that no two elements of the same colour access the same value indirectly. Subsequently, the mini-partitions themselves are coloured as well. For CUDA, we assign mini-partitions of the same colour to different CUDA thread blocks, and for OpenMP to different threads; there is then a global synchronisation between different mini-partition colours. In the case of CUDA, threads processing elements within each thread block use the first level of colouring to apply increments safely, with block-level synchronisation in between. Code generation suitable for auto-vectorisation by the compilers is also supported; it carries out the packing and unpacking of vector registers. The results obtained on different architectures may differ only due to differences in compiler optimisations (particularly at aggressive levels) and the different order in which partial results are accumulated. Further details, accuracy, and performance comparisons across architectures are available in previous work.
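The following is a simplified sketch of the OpenMP variant of this strategy (illustrative only, not the code OP2 actually generates): blocks of the same colour never touch the same cells, so they can be processed by different threads without atomics, and colours are processed one after another.

```c
// Schematic of colour-based conflict avoidance (simplified; not OP2-generated code).
// Edges are grouped into blocks; blocks of the same colour touch disjoint cells,
// so they can be processed concurrently without atomics.
void apply_fluxes_coloured(int nblocks, const int *block_colour, int ncolours,
                           const int *block_start, const int *block_edges,
                           const int *edge2cell, const double *flux, double *state) {
  for (int c = 0; c < ncolours; c++) {          // colours are processed one after another
    #pragma omp parallel for
    for (int b = 0; b < nblocks; b++) {         // blocks of the current colour run in parallel
      if (block_colour[b] != c) continue;
      for (int i = block_start[b]; i < block_start[b + 1]; i++) {
        int e = block_edges[i];                 // edge handled by this block
        state[edge2cell[2 * e]]     -= flux[e]; // safe: no other block of this colour
        state[edge2cell[2 * e + 1]] += flux[e]; // touches these two cells
      }
    }
    // the implicit barrier of the parallel for acts as the synchronisation between colours
  }
}
```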
Input and output
OP2 supports parallel file I/O through the HDF5 library, which is critically important for its integration into VOLNA's workflow: the input problem is read, and the data required for analysis are written, simultaneously on multiple processes.
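As a sketch, mesh input and result output through this interface look roughly as follows; the routine names follow OP2's HDF5 API as we understand it, and the file and dataset names are hypothetical, so this should be read as indicative rather than as VOLNA's actual I/O code.

```c
// Sketch of OP2's HDF5-based mesh input and result output (file and dataset names are
// hypothetical; routine names follow OP2's HDF5 interface and should be checked
// against the installed OP2 version).
#include "op_seq.h"
#include "op_hdf5.h"

void read_and_write(const char *meshfile) {
  // Each MPI rank reads its share of the mesh collectively through parallel HDF5.
  op_set cells = op_decl_set_hdf5(meshfile, "cells");
  op_set edges = op_decl_set_hdf5(meshfile, "edges");
  op_map e2c   = op_decl_map_hdf5(edges, cells, 2, meshfile, "edge2cell");
  op_dat state = op_decl_dat_hdf5(cells, 4, "double", meshfile, "state");

  // ... run the simulation ...

  // Write a dataset back to an HDF5 file, again collectively over all ranks.
  op_fetch_data_hdf5_file(state, "results.h5");
}
```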
The VOLNA simulator
Model, numerics, and previous validation
The finite-volume framework is the most natural numerical method for solving the non-linear shallow-water equations (NSWEs), in part because of its ability to treat shocks and breaking waves. It belongs to a class of discretisation schemes that are highly efficient in the numerical solution of systems of conservation laws, which are common in compressible and incompressible fluid dynamics. Finite-volume methods are preferred over finite differences, and often over finite elements, because they intrinsically address conservation issues, improving their robustness: total energy, momentum, and mass are conserved exactly, assuming no source terms and appropriate boundary conditions. The code was validated against the classical benchmarks in the tsunami community, as described below.
Numerical model
Following the needs of the target applications, the following non-dispersive NSWEs (in Cartesian coordinates) form the physical model of VOLNA:
\[
\frac{\partial H}{\partial t} + \nabla\cdot(H\mathbf{u}) = 0, \qquad (1)
\]
\[
\frac{\partial (H\mathbf{u})}{\partial t} + \nabla\cdot\!\left(H\,\mathbf{u}\otimes\mathbf{u} + \frac{g}{2}H^{2}\,\mathbb{I}_{2}\right) = g\,H\,\nabla h. \qquad (2)
\]
Here, $h(x,y,t)$ is the time-dependent bathymetry, $\mathbf{u}=(u,v)$ is the horizontal component of the depth-averaged velocity, $g$ is the acceleration due to gravity, and $H$ is the total water depth. Further, $\mathbb{I}_{2}$ is the identity matrix of order 2. The tsunami wave height, or elevation of the free surface, is computed as $\eta = H - h$, where the dynamic bathymetry $h(x,y,t) = h_{0}(x,y) + \Delta h(x,y,t)$ is the sum of the static bathymetry $h_{0}$ and the dynamic seabed uplift $\Delta h$. The static bathymetry $h_{0}$ is usually sourced from bathymetry datasets pertaining to the region of interest (for example, global datasets such as ETOPO1/GEBCO or regional bathymetries). The vertical component of the seabed deformation $\Delta h$ is calculated depending on the physics of tsunami generation, e.g. via co-seismic displacement for finite fault segmentations, submarine sliding, etc.
In addition to employing active generation and the consequent tsunami propagation, VOLNA also models the run-up and run-down (i.e. the final inundation stage of the tsunami). These three functionalities qualify VOLNA to simulate the entire tsunami life cycle. The ability of the NSWEs (1)–(2) to model both the propagation and the run-up and run-down processes has been validated previously. The use of a uniform model for the entire life cycle thus obviates many technical issues, such as the coupling between the seabed deformation and the sea surface deformation and the use of nested grids.
VOLNA uses the cell-centred approach for control volume tessellation, meaning that the degrees of freedom are associated with cell barycentres. In order to improve the spatial accuracy, a second-order extension is employed: a local gradient of the physical variables over each cell is calculated, and then a limited linear projection of the variables at the cell interfaces is used within the numerical flux solver. The limiter used is a restrictive version of a previously proposed scheme; the minimum calculated limiter value of the physical variables within a cell is used in the reconstruction. This limiter ensures that numerical oscillations are constrained in realistic cases. A Harten–Lax–van Leer contact (HLLC) numerical flux, which incorporates the contact discontinuity, is used to ensure that the standard conservation and consistency properties are satisfied: the fluxes from adjacent triangles that share an edge cancel exactly when summed, and the numerical flux with identical state arguments reduces to the true flux of that state. Details of the numerical implementation can be found in the original VOLNA paper.
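For concreteness, the reconstruction and the flux properties described above can be written as follows; this is a generic sketch of a limited MUSCL-type scheme with illustrative symbols, and the exact expressions used in VOLNA are those given in the cited reference. With $\mathbf{w}_i$ the cell-averaged physical variables in cell $i$ and $\mathbf{x}_{ij}$ the midpoint of the face shared by cells $i$ and $j$, the limited linear reconstruction reads
\[
\mathbf{w}_{ij} = \mathbf{w}_i + \alpha_i\,(\nabla \mathbf{w})_i \cdot (\mathbf{x}_{ij} - \mathbf{x}_i), \qquad \alpha_i \in [0,1],
\]
where the single limiter value $\alpha_i$ is the minimum over the physical variables in cell $i$. The numerical flux $\mathcal{F}$ then satisfies conservation and consistency,
\[
\mathcal{F}(\mathbf{w}_{ij}, \mathbf{w}_{ji}, \mathbf{n}_{ij}) = -\,\mathcal{F}(\mathbf{w}_{ji}, \mathbf{w}_{ij}, \mathbf{n}_{ji}), \qquad \mathcal{F}(\mathbf{w}, \mathbf{w}, \mathbf{n}) = \mathbf{F}(\mathbf{w})\cdot\mathbf{n},
\]
with $\mathbf{n}_{ij}$ the outward unit normal of the shared face and $\mathbf{F}$ the physical flux of the NSWEs.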
Validation
The original version of VOLNA was thoroughly validated against the National Tsunami Hazard Mitigation Program (NTHMP) benchmark problems. A brief look at how the new implementation, which utilises the more restrictive limiter, performs on two of these benchmark problems is given below. The reader is referred to the original paper or the NTHMP website for further details on the set-up of the benchmark problems.
Benchmark problem 1 – solitary wave on a simple beach
The run-up of a solitary wave on a sloping beach admits an analytical solution; in this benchmark problem, the simulated results are compared with that analytical solution.
Solitary wave on a simple beach – comparison between the simulated run-up and analytical solution at the shoreline (time , 45, 55, 65 ). Solid line – VOLNA; dashed line – analytical solution; thick line – beach.
[Figure omitted. See PDF]
Solitary wave on a simple beach – comparison between VOLNA and the analytical solution at two locations: (a) the first location, which becomes “dry” during part of the simulation; (b) the second location.
[Figure omitted. See PDF]
Set-up
The beach bathymetry comprises a region of constant depth followed by a plane beach of constant slope. The initial water level is defined as a solitary wave of prescribed height centred at a given distance from the toe of the beach, and the initial wave-particle velocity is proportional to the initial water level. The benchmark also prescribes a fixed ratio between the wave height and the constant depth.
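A commonly quoted form of this initial condition for the Synolakis benchmark, with $d$ the offshore depth, $H$ the wave height, and $X_1$ the initial crest position (symbols introduced here for illustration and taken from the benchmark specification rather than from the present paper), is
\[
\eta_0(x) = H\,\operatorname{sech}^2\!\left(\sqrt{\frac{3H}{4d^3}}\,(x - X_1)\right), \qquad
u_0(x) = \pm\sqrt{\frac{g}{d}}\;\eta_0(x),
\]
with the sign of the velocity chosen so that the wave propagates towards the beach and the ratio $H/d$ fixed by the benchmark.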
Tasks
In order to verify the model, the wave run-up at various times (Fig. ) and the wave height at two locations (Fig. ) are compared to the analytical solution. The test was run on a node of CSD3 Wilkes2 utilising a P100 GPU.
It can be seen from the plots that the agreement between the numerical results and the analytical solution is very good. The new implementation of the model is therefore able to accurately simulate the run-up of the solitary wave.
(a) The incoming water level incident on the input boundary and comparison between VOLNA and laboratory results for benchmark problem 2 at (b)–(d) the three gauges located at (4.521, 1.196), (4.521, 1.696), and (4.521, 2.196) m.
[Figure omitted. See PDF]
Benchmark problem 2 – wave run-up onto a complex 3-D beach
This benchmark problem involves the comparison of laboratory results for a tsunami run-up onto a complex 3-D beach with simulated results. The laboratory experiment reproduces the 1993 Hokkaido–Nansei–Oki tsunami, which struck the island of Okushiri, Japan. The experiment is a 1 : 400 scale model of the bathymetry and topography around a narrow gully and the tsunami is an incident wave fed in as a boundary condition.
Set-up
The computational and laboratory domain corresponds to a 5.49 m by 3.40 m wave tank, and the bathymetry for the domain is given on 0.014 m by 0.014 m grid cells. The incoming wave enters through one boundary of the tank and is defined for the first 22.5 s (Fig. a), after which it is recommended that a non-reflective boundary condition be set. Fully reflective boundaries are defined on the remaining three sides.
Tasks
The validation of the model involves comparing the temporal variation of the moving shoreline, the water height at fixed gauges, and the maximum run-up. As the basis of this brief validation, we compared the water height at three gauges installed in the tank, located at (4.521, 1.196), (4.521, 1.696), and (4.521, 2.196) m.
It can be seen from the gauge plots in Fig. b–d that the first elevation wave arrives between 15 and 25 s. The overall dynamics of this elevation wave – particularly the arrival time and initial amplitude – are accurately captured by the model at all the gauges. Considering the results of the two benchmark tests and the full validation of the original VOLNA code, one can see that the new implementation, which uses a more restrictive limiter, still performs satisfactorily and is consistent with the previous version. The benchmark was run on a 24-core Intel(R) Xeon(R) E5-2620 v2 CPU.
Code structure
The structure of the code is outlined in Algorithm 1: the user provides a configuration file (.vln), which specifies the mesh to be read in from Gmsh files as well as the initial and boundary conditions of the state variables, such as the bathymetry deformation starting the tsunami, which can be defined in various ways (mathematical expressions, files, or a mix of both). We use a variable-time-step third-order (four-stage) Runge–Kutta method for evolving the solution in time. In each iteration, events may be triggered, e.g. further bathymetry deformations, displaying the current simulation time, or outputting simulation data to VTK files for visualisation.
Code structure of VOLNA
- Initialise mesh from Gmsh file
- Initialise state variables
- while the simulation end time has not been reached do
  - Perform pre-iteration events
  - Third-order Runge–Kutta time stepper:
    - Determine local gradients of state variables on each cell
    - Compute a local limiter on each cell
    - Reconstruct state variables, compute boundary conditions, and determine fluxes across cell faces
    - Compute time step
    - Apply fluxes and bathymetric source terms to state variables on cells
  - Perform post-iteration events
- end while
The original VOLNA source code was implemented in C++, utilising libraries such as Boost. This gives a very clear structure, abstracting data management, event handling, and low-level array operations away from the higher-level algorithm – an example is shown in Fig. . While this coding style was good for readability, it had its limitations in terms of performance: there was an excessive amount of data movement, and certain operations – indirect increments with potential race conditions in particular – could not be parallelised. Some features, such as describing the bathymetry lift with a mathematical formula, were implemented with functionality and simplicity, not performance, in mind.
Code snippets from the original and OP2 versions.
[Figure omitted. See PDF]
To better support performance and scalability, and thus allow for large-scale simulations, we have re-engineered the VOLNA code to use OP2 – the overall code structure is kept similar, but data management and parallelism are now entrusted to OP2. To support parallel execution, we separated the preprocessing step from the main body of the simulation: first, the mesh and simulation parameters are parsed into an HDF5 data file; this file is then read in parallel by the main simulation, which also uses HDF5's parallel file I/O to write results to disk.
Performance-critical parts of the code – essentially any operations on the computational mesh – are re-implemented using OP2: they are written with an element-centric approach and grouped for maximal data reuse. Calculations that were previously a sequence of operations, each computing all partial results for the entire mesh, now apply to single elements (such as cells or edges), and OP2 automatically applies these computations to each element – this avoids the use of several temporaries and improves computational density. The process involves outlining the computational kernel to be applied at each set element (cell or edge) into a separate function and writing a call to the OP2 library – a matching code snippet is shown in Fig. .
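Since the figure is not reproduced here, the fragment below sketches the pattern in question: the per-element computation is outlined into a kernel, and OP2's op_par_loop applies it over the set. Variable and kernel names are illustrative assumptions, not VOLNA's actual identifiers.

```c
// Illustrative outlining pattern (names are hypothetical, not VOLNA's identifiers).
// Before: a mesh-wide array operation; after: a per-cell kernel plus an op_par_loop.
#include "op_seq.h"

// Outlined kernel: operate on the state of a single cell.
static inline void update_cell(double *values, const double *increment, const double *dt) {
  for (int i = 0; i < 4; i++)
    values[i] += (*dt) * increment[i];          // explicit update of the 4 state components
}

// Call site: OP2 applies the kernel to every cell, handling partitioning,
// halo exchanges and (on GPUs) data movement behind the scenes.
void advance(op_set cells, op_dat d_values, op_dat d_increment, double dt) {
  op_par_loop(update_cell, "update_cell", cells,
              op_arg_dat(d_values,    -1, OP_ID, 4, "double", OP_RW),
              op_arg_dat(d_increment, -1, OP_ID, 4, "double", OP_READ),
              op_arg_gbl(&dt, 1, "double", OP_READ));
}
```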
The workflow of VOLNA is built around a few sources of information that are created and given as inputs to the code. The first is the merged bathymetry and topography over the whole computational domain, i.e. the sea floor and land elevations over which the flow will propagate. This is given through an unstructured triangular mesh, which is then transformed into a usable input to VOLNA via the volna2hdf5 tool, generating compact HDF5 files. The mesh is also renumbered with the Gibbs–Poole–Stockmeyer algorithm to improve locality.
The second is the dynamic source of the tsunami, which can be an earthquake or a landslide. To describe the temporal evolution of the seabed deformation, either a function or a series of files can be used. When a series of files is used (typically when another numerical model provides the spatio-temporal information of a complex deformation), the frequency of these updates must be defined in the so-called vln generic input file of VOLNA. A recent improvement has been the ability to define these series of files for a sub-region of the computational domain, possibly at lower resolution. Performance is better when using a function for the seabed deformation, since the I/O requirements for files can generate large overheads – VOLNA-OP2 allows the initial bathymetry to be described with an input file and relative deformations to then be specified using arbitrary code that is a function of the spatial coordinates and time. Similarly, one can also define initial conditions for the wave elevation and velocity.
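As an illustration of the kind of user-supplied deformation code this enables, the sketch below evaluates a Gaussian uplift ramped linearly in time; the function name, signature, and all numerical values are hypothetical and merely demonstrate specifying Δh as a function of the spatial coordinates and time rather than through files.

```c
// Hypothetical user-defined seabed deformation: a Gaussian uplift ramped linearly
// over the first 60 s of the simulation (illustrative only; VOLNA's actual
// interface for such expressions is described in its documentation).
#include <math.h>

double bathymetry_deformation(double x, double y, double t) {
  const double x0 = 650.0e3, y0 = 2.70e6;   // uplift centre in mesh coordinates (made up)
  const double sigma = 50.0e3;              // horizontal extent of the uplift (m)
  const double max_uplift = 3.0;            // final vertical displacement (m)
  const double rise_time = 60.0;            // duration of the rupture (s)

  double ramp = t < rise_time ? t / rise_time : 1.0;     // linear rise, then constant
  double r2 = (x - x0) * (x - x0) + (y - y0) * (y - y0);
  return max_uplift * ramp * exp(-r2 / (2.0 * sigma * sigma));
}
```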
The generic input file of VOLNA includes information about the frequency of the updates of the seabed deformation, the virtual gauges at which time series of outputs will be produced, and possibly options to output time series over the whole computational domain, for instance to create animations. These I/O requirements obviously affect performance: the more data to output and the slower the file system, the larger the effect.
Simulating tsunami hazard for a large number of scenarios is computationally expensive, so in past studies VOLNA has been replaced by a statistical emulator, i.e. a cheap surrogate model of the simulator. To build the emulator, input parameters are varied in a design of experiments, and runs are submitted with these inputs to collect input–output relationships. The outputs of interest could, for example, be the waveforms, the free-surface elevation, or the velocity, among others. The increased flexibility in the definition of the region over which the earthquake source of the tsunami is defined reduces the size of the series of files used as inputs; this is really helpful when a set of simulations needs to be run. Similarly, the ability to specify the relative deformation using arbitrary code that is a function of the spatial coordinates and time also reduces the computational and memory overheads when running a set of simulations.
Results
Running VOLNA
A key goal of this paper is to demonstrate that, by utilising the OP2 library, VOLNA delivers scalable high performance on a number of common systems. Therefore we take a test case simulating tsunami propagation in the Indian Ocean and run it on three different machines: NVIDIA P100 graphics processing units (the Wilkes2 machine in Cambridge's CSD3), a classical CPU architecture in the Peta-5–Skylake part of CSD3 (specifically dual-socket Intel Xeon Gold 6142 16-core Skylake CPUs), and Intel's Xeon Phi platform in Peta-5–KNL (64-core Knights Landing-generation chips, configured in cache mode).
There are five key computational stages that make up 90 % of the total runtime: a stage evolving time using the third-order Runge–Kutta scheme (RK), a “gradients” stage computing gradients among cells, a stage that computes the fluxes across the edges of the mesh (“fluxes”), a stage that computes the minimum time step (“dT”), and a stage that applies the fluxes to the cell-centred state variables (“applyFluxes”). Each of these stages consists of multiple steps, but for performance analysis we study them in these groups.
(a) Bathymetry from GEBCO's geodetic grid is mapped onto a Cartesian grid for use in VOLNA. (b) Uplift caused by a uniform slip of 30 m in the four-segment finite fault model (given in Table ).
[Figure omitted. See PDF]
The RK stage is computationally fairly simple (no indirect accesses are made, and cell-centred state variables are updated using other cell-centred state variables); therefore parallelism is easy to exploit, and the limiting factor for performance is the speed at which we can move data: the achieved bandwidth. Both the gradients and the fluxes stages are computationally complex and involve accessing large amounts of data indirectly through cell-to-cell and edge-to-cell mappings. The dT stage moves a significant amount of data to compute the appropriate time step for each cell, triggering an MPI halo exchange as well, and then carries out a global reduction to calculate the minimum – particularly over MPI this can be an expensive operation, but overall it is limited by bandwidth. The applyFluxes stage, while computationally simple, is complex due to its indirect increment access patterns; per-edge values have to be added onto cell-centred values, and in parallelising this operation OP2 needs to avoid race conditions. The performance of this loop is limited by the irregular accesses and the resulting control flow in hardware. For an in-depth study of the individual computational loops and their performance, we refer the reader to our previous work.
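Throughout this section, the achieved bandwidth of a stage is the usual estimate (a standard metric, stated here for clarity rather than taken from the original text):
\[
\mathrm{BW}_{\mathrm{achieved}} = \frac{\text{bytes read} + \text{bytes written}}{\text{runtime of the stage}},
\]
so values above a platform's STREAM Triad figure indicate data reuse in cache rather than a violation of the hardware limit.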
Details of the non-uniform (NU) triangular meshes.

Mesh | Vertices | Edges | Triangles | Mesh size at source | Mesh size at coast
---|---|---|---|---|---
NU 53.7M | 26 863 692 | 80 564 925 | 53 701 234 | 12.5 km | 125 m
NU 13.8M | 6 931 758 | 20 771 822 | 13 840 065 | 25 km | 250 m
NU 3.6M | 1 812 073 | 5 414 155 | 3 602 083 | 50 km | 500 m
NU 0.95M | 485 453 | 1 435 017 | 949 565 | 100 km | 1000 m
Finite fault parameters of the four-segment tsunamigenic earthquake source.

Segment | Length (km) | Down-dip width (km) | Longitude (° E) | Latitude (° N) | Depth (km) | Strike (°) | Dip (°) | Rake (°)
---|---|---|---|---|---|---|---|---
1 | 220 | 150 | 65.23 | 24.50 | 10 | 263 | 6 | 90
2 | 188 | 150 | 63.08 | 24.23 | 10 | 263 | 7 | 90
3 | 199 | 150 | 61.25 | 24.00 | 5 | 281 | 8 | 90
4 | 209 | 150 | 59.32 | 24.32 | 5 | 286 | 9 | 90
Non-uniform meshes corresponding to the test cases (see Table ).
[Figure omitted. See PDF]
Tsunami waveforms at virtual gauges located at Gwadar and Karachi.
[Figure omitted. See PDF]
Tsunami demonstration case
For performance and scaling analysis, we employ the Makran subduction zone as the tsunamigenic source for the numerical simulations. Our region of interest extends from 55 to 79° E and up to 30° N. The bathymetry (Fig. a) is obtained from GEBCO and mapped from its geodetic grid onto a Cartesian grid for use in VOLNA; the tsunami is generated by a uniform slip of 30 m on the four-segment finite fault model given in Table .
Performance and scaling on classical CPUs
As the most commonly used architecture, we first evaluate performance on the classical CPUs in the Cambridge CSD3 supercomputer: dual-socket nodes with Intel Xeon Gold 6142 CPUs (16 cores each), supporting the AVX-512 instruction set. We test a plain MPI configuration (32 processes per node) as well as a hybrid MPI+OpenMP configuration with two MPI processes per node (one per socket) and 16 OpenMP threads each, with process and thread binding enabled.
Performance scaling on (a) Peta-5–CPU (Intel Xeon CPU) and (b) Peta-5–KNL (Intel Xeon Phi) at different mesh sizes with pure MPI (solid) and MPI+OpenMP (dashed).
[Figure omitted. See PDF]
We use OP2's vectorised code generation capabilities. The RK stage performs the same in both variants; however, the fluxes and dT stages saw significant performance gains – the compiler did not automatically vectorise these computations and had to be forced to do so. The applyFluxes stage could not be vectorised due to a compiler issue.
On a single node with pure MPI, running the largest mesh, 9 % of time was spent in the RK stage, achieving 182 GB s⁻¹ throughput on average; 40 % of time was spent in the gradients stage, achieving 108 GB s⁻¹; 25 % of time was spent in the fluxes stage, achieving 142 GB s⁻¹; 12 % of time was spent in the dT phase, achieving 65 GB s⁻¹; and 12 % of time was spent in the applyFluxes stage, achieving 221 GB s⁻¹ thanks to a high degree of data reuse. The maximum bandwidth on this platform is 189 GB s⁻¹ as measured by STREAM Triad. The time spent in MPI communications ranged from 23 % on the smallest mesh to 10 % on the largest mesh.
When scaling to multiple nodes with pure MPI, as shown in Fig. a, it is particularly evident on the smallest problem that the problem size per node needs to remain reasonable, otherwise MPI communications dominate the runtime: for the smallest NU mesh at 32 nodes, 251 s out of the 308 s total (81 %) are spent in communications. This can be characterised by the strong scaling efficiency: when doubling the number of computational resources (nodes), what percentage of the ideal speedup is achieved. For small node counts these values remain above a reasonable 85 %, but for the smaller problems in particular the runtimes eventually become worse. It is evident that on the Peta-5–Skylake cluster the interconnect used for MPI communications becomes a bottleneck for scaling – this overhead is significantly lower on Archer, for example; on the largest mesh at 32 nodes this overhead is only 32 %.
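Written as a formula, the strong scaling efficiency used here when doubling the node count from N to 2N is
\[
E(N \to 2N) = \frac{T(N)}{2\,T(2N)},
\]
where $T(N)$ is the runtime on $N$ nodes; 100 % corresponds to the runtime halving when the resources are doubled.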
We have also evaluated execution with a hybrid MPI+OpenMP approach, as shown with the dashed lines in Fig. a. However, on this platform it failed to outperform the pure MPI configuration.
Performance and scaling on the Intel Xeon Phi
Second, we evaluate Intel's latest many-core chip, the Xeon Phi 7210, which integrates 64 cores, each equipped with AVX-512 vector processing units and supporting four threads, together with 16 GB of on-package high-bandwidth memory, here used as a cache for off-chip DDR4 memory. The chips were configured in the quadrant cluster mode, with all 16 GB used as cache. We evaluate a pure MPI approach (128 processes) as well as four MPI processes, one per quadrant, with 32 OpenMP threads each. The bandwidth achieved, as measured by STREAM Triad, is 448 GB s⁻¹. Vectorisation on this platform is paramount for achieving high performance – every stage with the exception of applyFluxes was vectorised, the latter not being vectorised due to compiler issues.
On a single node with pure MPI, the straightforward computations of the RK stage can utilise the available high bandwidth very efficiently: only 8.3 % of the time is spent here, achieving 194 GB s⁻¹. The gradients stage takes 42 % of the time, achieving 82 GB s⁻¹; the fluxes stage takes 25 % of the time and achieves 104 GB s⁻¹; dT takes 11.4 % and achieves 46 GB s⁻¹; and the applyFluxes stage takes 11.6 % and achieves 165 GB s⁻¹. On the largest mesh, the Xeon Phi system is 21 % slower than a single node of the classical CPU system.
Performance when scaling to multiple nodes with pure MPI is shown in Fig. b: scaling is clearly worse than on the classical CPU architecture for smaller problem sizes – the Xeon Phi requires a considerably larger problem size per node to operate efficiently. Strong scaling efficiency is particularly poor on the smallest mesh, and even on the largest mesh it is only between 63 % and 92 %. Similar to the classical CPU system, the interconnect becomes a bottleneck to scaling. Running with a hybrid MPI+OpenMP configuration on the Xeon Phi does improve scaling significantly, as shown in Fig. b – this is due to having to exchange far fewer (but larger) messages. Strong scaling efficiency on the largest problem remains above 82 %. At scale, at least on this cluster, the Xeon Phi can outperform the classical CPU system on a node-for-node basis.
Performance and scaling on P100 GPUs
Third, we evaluate performance on GPUs – an architecture that has continually been increasing its market share in high-performance computing thanks to its efficient parallel architecture. The P100 GPUs are hosted in the Wilkes2 system, with four GPUs per node connected via the PCI-e bus. Each chip contains 56 streaming multiprocessors with 64 CUDA cores each, giving a total of 3584 cores, along with 16 GB of high-bandwidth memory on package with a bandwidth of 497 GB s⁻¹. To utilise these devices, we use CUDA code generated by OP2 and compiled with CUDA 9. Similar to Intel's Xeon Phi, high vector efficiency is required for good performance on the GPU.
On a single GPU, running the second-largest mesh, NU 13.8M (the largest mesh, NU 53.7M, does not fit in memory), 8.3 % of the runtime is spent in the RK stage, achieving 342 GB s⁻¹; gradients takes 50 % of the time, achieving only 136 GB s⁻¹ due to its high complexity; fluxes takes 15 %, achieving 379 GB s⁻¹ thanks to a high degree of data reuse in indirect accesses; dT takes 4.4 % and achieves 382 GB s⁻¹; and finally applyFluxes takes 20 % of the time, achieving 204 GB s⁻¹. Indeed, this last phase has the most irregular memory access patterns, which are commonly known to degrade performance on GPUs. Nevertheless, even a single GPU outperforms a classical CPU node by a factor of 1.5, and the Xeon Phi by 1.85.
Performance when scaling to multiple GPUs is shown in Fig. ; similar to the Xeon Phi, GPUs are also sensitive to the problem size and the overhead of MPI communications. However, given that there are four GPUs in one node, the overhead of communications is significantly lower. For the smallest problem, efficiency drops from 78 % to 58 %, and for the largest problem it drops from 95 % to 89 %. Thanks to the much better scaling (due to the lower MPI overhead), 32 GPUs are faster than 32 nodes of Xeon CPUs or Xeon Phis.
Running costs and power consumption
Ultimately, when deciding which platform to run these simulations on, a key factor aside from time to solution is cost to solution. In the analysis above, aside from discussing absolute performance metrics, we have reported speed-up numbers relative to other platforms – which, from a performance benchmarking perspective, is not strictly fair. However, such relative performance figures, combined with the cost of access, do help in the decision.
Performance scaling on Wilkes2 (P100 GPU) at different mesh sizes.
[Figure omitted. See PDF]
Admittedly, the cost of buying hardware as well as the cost of core hours or GPU hours varies significantly; therefore we do not look at specific prices here. However, energy consumption is an indicator of pricing. A dual-socket CPU configuration consumes up to 260 W, which is roughly tripled when looking at the whole node due to memory, disks, networking, etc. In comparison, the Intel Xeon Phi has a thermal design power of 215 W, roughly 750 W for the node. A P100 GPU has a TDP of 300 W but has to be hosted in a CPU system – the more GPUs in a single machine, the better this cost is amortised: the TDP of a GPU node in Wilkes2 is around 1.8 kW (the four GPUs plus roughly 800 W for the rest of the system), which averages to about 450 W per GPU. Thus, in terms of power efficiency, GPUs are by far the best choice for VOLNA. Nevertheless, a key benefit of VOLNA-OP2 is that it can efficiently utilise any high-performance hardware commonly available.
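The per-GPU figure quoted above simply amortises the node power over its four GPUs:
\[
\frac{1.8\ \mathrm{kW}}{4} \approx 450\ \mathrm{W\ per\ GPU}.
\]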
Conclusions
In this paper we have introduced and described the VOLNA-OP2 code: a tsunami simulator built on the OP2 library, enabling execution on CPUs, GPUs, and heterogeneous supercomputers. By building on OP2, the science code of VOLNA itself is written only once using a high-level abstraction, capturing what to compute but not how to compute it. This approach enables OP2 to take control of the data structures and parallel execution; VOLNA is then automatically translated to use sequential execution, OpenMP, or CUDA, and, by linking with the appropriate OP2 back-end library, these are combined with MPI. This approach also future-proofs the science code: as new architectures come along, the developers of OP2 will update the back-ends and the code generators, allowing VOLNA to make use of them without further effort. This ease of use and portability makes VOLNA-OP2 unique among tsunami simulation codes. Through performance scaling and analysis of the code on traditional CPU clusters, as well as GPUs and Intel's Xeon Phi, we have demonstrated that VOLNA-OP2 indeed delivers high performance on a variety of platforms and, depending on problem size, scales well to multiple nodes.
We have described the key features of VOLNA, the discretisation of the underlying physical model (i.e. NSWE) in the finite-volume context and the third-order Runge–Kutta time stepper, as well as the input–output features that allow the integration of the simulation step into a larger workflow; initial conditions, and bathymetry in particular, can be specified in a number of ways to minimise I/O requirements, and parallel output is used to write out simulation data on the full mesh or specified points.
There is still a need for even more streamlined and efficient workflows. For instance, we could integrate the finite fault source model for the slip with some assumptions on the rupture dynamics within VOLNA. We could also integrate the bathymetry-based meshing (the mesh needs to be tailored to the depth and gradients of the bathymetry to optimally reduce computational time). Indeed, there would be even fewer exchanges of files and more efficient computations, especially in the context of uncertainty quantification tasks such as emulation or inversion.
In the end, the gain in computational efficiency will allow higher-resolution modelling, such as using 2 m topography and bathymetry collected from lidar, i.e. a greater capability. It will allow greater capacity by enabling more simulations to be performed. Both of these enhancements will subsequently lead to better warnings more tailored to the actual impact on the coast as well as better urban planning since hazard maps will gain in precision geographically and probabilistically due to the possibility of exploring a larger number of more realistic scenarios.
The code is available at
All authors contributed to the writing of the paper. IZR performed the majority of coding, with all other co-authors contributing with extensions and the numerical aspects of the code. DG and DG designed and evaluated the test cases. IZR ran the scalability studies on various supercomputers. MBG, SG, and FD provided overall supervision of the code design, evaluation, and writing.
The authors declare that they have no conflict of interest.
Acknowledgements
We would like to thank Endre László, formerly of PPCU ITK, who worked on the initial port of VOLNA to OP2. István Z. Reguly was supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences. Project no. PD 124905 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the PD_17 funding scheme. The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work (10.5281/zenodo.22558). Serge Guillas gratefully acknowledges support through the NERC grants PURE (Probability, Uncertainty and Risk in the Natural Environment) NE/J017434/1 and "A demonstration tsunami catastrophe risk model for the insurance industry" NE/L002752/1. Serge Guillas and Devaraj Gopinathan acknowledge support from the NERC project (NE/P016367/1) under the Global Challenges Research Fund: Building Resilience programme. Devaraj Gopinathan acknowledges support from the Royal Society, UK, and the Science and Engineering Research Board (SERB), India, for the Royal Society–SERB Newton International Fellowship (NF151483). Daniel Giles acknowledges support by the Irish Research Council's Postgraduate Scholarship Programme.
Edited by: Simone Marras
Reviewed by: two anonymous referees
© 2018. This work is published under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).
Abstract
In this paper, we present the VOLNA-OP2 tsunami model and implementation: a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved by keeping the scientific code separate from the various parallel implementations, enabling easy maintainability. It has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 shows an ability to deliver productivity as well as performance and portability to its users across a number of platforms.
Author affiliations
1 Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Prater u 50/a, 1088 Budapest, Hungary
2 School of Mathematics and Statistics, University College Dublin, Dublin, Ireland
3 Department of Statistical Science, University College London, London, UK
4 Centre de Mathématiques et de Leurs Applications (CMLA), Ecole Normale Supérieure, Paris-Saclay, Centre National de la Recherche Scientifique, Université Paris-Saclay, 94235 Cachan, France
5 Computer, Electrical and Mathematical Science and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
6 Math Institute, University of Oxford, Oxford, UK