1 Introduction
Optimal performance of a computational science model requires efficient numerical methods that are facilitated by the computational resources of the high performance computing (HPC) platform. For each calculation in the model, the operating system (OS) must provide sufficient access to the data so that the calculation can proceed without interruption. This is particularly true in highly parallelized models on HPC cluster systems, where the calculations are distributed across multiple compute nodes, often with strong data dependencies between the individual processes. operations represent such a bottleneck, where one must manage the access of potentially large datasets by many processes while also relying on the available interfaces, typically provided by a Linux operating system to a POSIX parallel (or cluster) filesystem such as Lustre and through to distributed storage arrays. A poorly designed model can be limited by the data speed of an individual disk, or a poorly configured kernel may lack a parallel filesystem that is able to distribute the data transfer across multiple disks.
Datasets in climate modelling at the highest practical resolutions are typically on the order of gigabytes in size per numerical field, and dozens of such fields may be required to define the state of the model. For example, a double-precision floating point variable of an ocean model over a grid of approximately 0.1 horizontal resolution and 75 vertical levels will typically require over 5 GiB of memory per field. Over 20 such fields may be necessary to capture the model state and preserve bitwise reproducibility, and the periodic storage of model output may involve a similar number of variables per diagnostic time step. A typical 1-year simulation can require reading hundreds of gigabytes of input data and can produce terabytes of model output. For disk speeds of 350 MB s, a serial transfer of each terabyte would take approximately 1 h and can severely burden the model runtime. For such models, efficient parallelization is a critical requirement, and the increase in future scalability requires further improvements in efficiency. Parallel can describe any skilful decomposition of the reading and writing of data across multiple threads, processes, compute nodes or physical storage. Many climate models, and ocean models in particular, can be characterized as hyperbolic partial differential equation (PDE) solvers, which are naturally decomposed into numerically solvable subdomains with only local data dependencies (Webb, 1996; Webb et al., 1997), and it is natural to consider parallel operations which follow a similar decomposition.
In short, there are four fundamental approaches to model , each with its respective tradeoffs, which are outlined in Table 1. The first three approaches are common when using a single file per process, although multiple problems can arise regardless of whether the operation is single-threaded or distributed (Shan and Shalf, 2007).
Table 1
Comparison of write pattern between serial and parallel .
Number of | Postproces- | ||
---|---|---|---|
Write pattern | output files | Run time | sing time |
Single-threaded, single file | 1 | Long | None |
Distributed , single file per domain | domains | Moderate | Long |
Distributed , single file per PE | PEs | Short | Long |
Parallel , single shared file | 1 | Scalable | None |
In the simplest and most extreme case, the field is fully decomposed to match the computational decomposition of the model, so that the data used by each process element (PE), such as an MPI rank or an OpenMP thread, are associated with a separate file, i.e. “distributed , single file per PE in Table 1”. An example decomposition is shown in Fig. 1, where the numbered black squares denote the computational domain of each PE. operations in this case are fully parallelized. But this can also require an increasing number of concurrent operations, which can produce an abnormal load on the OS and its target filesystems when such a model is distributed over thousands of PEs (Shan and Shalf, 2007). It can also result in datasets which are distributed over thousands of files, which may require significant effort to either analyse or reconstruct into a single file.
Figure 1A representative decomposition of a global domain. Black squares denote the computational domains of each process, and yellow boundaries denote the collection of computation domains into a larger domain. The global domain is denoted by the red boundary.
[Figure omitted. See PDF]
At the other extreme, it is possible to associate the data of all PEs with a single file, denoted by the red border in Fig. 1. One method for handling single-file is to allow all PEs to directly write to this file. Although POSIX permits concurrent writes to a single file, it can often compound the issues raised in the previous case, where resource contentions in the filesystem must now be resolved alongside any contentions associated with the writing of the data itself. Such methods are rarely scalable without considerable attention to the underlying resource management, and hence we do not consider this method in the paper.
A more typical approach for single-file is to assign a master PE which gathers data from all ranks and then serially writes the data to the output file. That is, “single-threaded, single file” in Table 1. While this approach avoids the issues of filesystem resourcing outlined above, it also requires either an expensive collective operation and the storage of the entire field into memory or a separation of the work into a sequence of multiple potentially expensive collectives and writes. These two options represent the traditional tradeoff of memory usage versus computational performance, and both are limited to serial write speeds.
In order to balance the desire for parallel performance while also limiting the number of required files, one can use a coarser decomposition of the grid which groups the local domains of several PEs into a larger “ domain”, i.e. “distributed , single file per domain” in Table 1. A representative domain decomposition, with domains delineated by the yellow borders, is shown in Fig. 1. Within each domain, one PE is nominated to be responsible for the gathering and writing of data. This has the effect of reducing the number of processes to the number of domains, while still permitting some degree of scalability from the concurrent . Several models and libraries provide implementations of domains, including the model used in this study (Maisonnave et al., 2017; Dennis et al., 2011). A similar scheme for rearranging data from compute tasks to selective tasks is proposed and implemented in the PIO (parallel ) library, which can be regarded as an alternative implementation of domains (Edwards et al., 2019).
Because the domain decomposition will produce fields that are fragmented across multiple files, this often requires some degree of preprocessing. For example, any model change which modifies the domain layout, such as an increase in the number of CPUs, will often require that any fragmented input fields be reconstructed as single files. A typical 0.25 global simulation can require approximately 30 min of postprocessing time to reconstruct its fields as single files; for global 0.1 simulations, this time can be on the order of several hours, often exceeding the runtime of the model which produced the output.
One solution, presented in this paper, is to use a parallel library with sufficient access to the OS and its filesystem, which can optimize performance around such limitations and provide efficient parallel within a single file, i.e. “parallel , single shared file” in Table 1. For example, a library based on MPI-IO can use MPI message passing to coordinate data transfer across processes and can reshape data transfers to optimally match the available bandwidth and number of physical disks provided by a parallel filesystem such as Lustre (Howison et al., 2010). This eliminates the need for writer PEs to allocate large amounts of memory and also avoids any unnecessary postprocessing of fragmented datasets into single files, while also presenting the possibility of efficient, scalable performance when writing to a parallel filesystem.
In this paper, we focus on a parallel implementation for the Modular Ocean Model (MOM), the principal ocean model of the Geophysical Fluid Dynamics Laboratory (GFDL) (Griffies, 2012). As MOM and its Flexible Modelling System (FMS) provide an implementation of domains, it is an ideal platform to assess the performance of these different approaches in a realistic model simulation. For this study, we focus on the MOM5 release, although the work remains relevant to the more recent and dynamically distinct MOM6 model, which uses the same FMS framework.
We present a modified version of FMS which supports parallel in MOM by using the parallel netCDF API, and we test two different netCDF implementations: the PnetCDF library (Li et al., 2003) and the pHDF5-based implementation of netCDF-4 (Unidata, 2015). When properly configured to accommodate the model grid and the underlying Lustre filesystem, both libraries demonstrate significantly greater performance when compared to serial , without the need to distribute the data across multiple domains.
In order to achieve the satisfied parallel performance, it is necessary to determine the optimal settings across the hierarchy of stack, including the user code, high-level libraries, middleware layer and parallel filesystem. There is a large number of parameters at each layer of the stack, and the right combination of parameters is highly dependent on the application, HPC platform, problem size and concurrency. Designing and conducting the tuning benchmark is the key task of this work. It is of particular relevance to MOM/FMS users who experience a bottleneck caused by performance. But given the ubiquity of in the HPC domain, the findings will be of interest to most researchers and members of the general scientific community.
The paper is outlined as follows. We first describe the basic implementation of the FMS library and summarize our changes required to support parallel . The benchmark process and tuning results are described and presented in the following section. Finally, we verify the optimal parameter values by applying them to an -intensive MOM simulation at higher resolution.
2Parallel implementations in FMS
The MOM source code, which is primarily devoted to numerical calculation, will rarely access any files directly and instead relies on FMS functions devoted to specific tasks, such as the saving of diagnostic variables or the reading of an existing input file. Generic operations for opening and reading of file data occur exclusively within the FMS library, and all tasks in MOM can be regarded as FMS tasks.
Within FMS, all operations over datasets are handled as parallel
operations and are accessed by using the mpp module, which manages the
MPI operations of the model across ranks. The API resembles most POSIX-based
interfaces, and the most important operations are the
Files are created or opened using the
The
When used with distributed datasets,
The
When operations have been completed,
Because FMS provides access to distributed datasets as well as a mechanism for collecting the data into larger domains for writing to disk, we concluded that FMS already contained much of the functionality provided by existing parallel libraries and that it would be more efficient to generalize the domain for both writing to files and passing data to a general-purpose IO libraries such as netCDF. By using FMS directly, there is no need to set up a dedicated server with extra PEs, as done in other popular parallel libraries such as XIOS (XIOS, 2020).
The major code changes relevant to the parallel implementations are outlined below:
-
All implementations are fully integrated into FMS and are written in a way to take advantage of existing FMS functionality.
-
netCDF files are now handled in parallel by invoking the
nc_create_par andnc_open_par functions in the FMS file handler,mpp_open . -
All fields are opened with collective read/write operations, via the
NF_COLLECTIVE tag. This is a requirement for accessing variables with unlimited time axis and also a necessary setting to achieve good performance. When possible, the pre-filling of variables is disabled to shorten the file initialization time. -
Infrastructure for configuring
MPI_Info settings has been added to allow fine tuning of the performance at the MPI-IO level. -
The root PEs of domains, which we identify as PEs, are grouped into a new communicator via FMS subroutines and used to access the shared files in parallel.
-
The FMS subroutine
write_record is modified to specify the correct start position and size of data blocks in the domain for each PE. -
New FMS namelist statements have been introduced to enable parallel support and features. An example namelist group is shown below.
Parallel performance benchmark
On large-scale platforms, performance optimization relies on many factors at the architecture level (filesystem), the software stack (high-level libraries) and the application (access patterns). Moreover, external noise from application interference and the OS can cause performance variability, which can mask the effect of an optimization.
Obtaining good parallel performance on a diverse range of HPC platforms is a major challenge, in part because of complex interdependencies between middleware and hardware. The parallel software stack is comprised of multiple layers to support multiple data abstractions and performance optimizations, such as the high-level library, the middleware layer and a parallel filesystem (Lustre, GPFS, etc.). A high-level library translates data structures of the application into a structured file format, such as netCDF-3 or netCDF-4. Specifically, PnetCDF and parallel HDF5 are the parallel interfaces to the netCDF-3 and netCDF-4 file formats, respectively, and they are built on top of MPI-IO. The middleware layer, which in our case is an MPI-IO implementation, handles the organization and access optimization from many concurrent processes. The parallel filesystem handles any accesses to files stored on the storage hardware in data blocks.
While each layer exposes tuneable parameters for improving performance, there is little guidance for application developers on how these parameters interact with each other and how they affect the overall performance. To address this, we select combinations of tuneable parameters at multiple layers covering parallelization scales, application layout, high-level libraries, netCDF formats, data storage layouts, MPI-IO and the Lustre filesystem. Although there is a large space of tunable parameters at all layers of the parallel stack, many parameters interact with each other, and only the leading ones need to be investigated.
3.1parameter space
With over 20 tunable parameters across the parallel stack, it can
become intractable to independently tune every parameter for a realistic
ocean simulation. In order to simplify this process, we conduct a
pre-selection process by executing a stand-alone FMS program
(
-
Application. As described in the introduction, the
io_layout parameter is used to define the distribution of domains in FMS. In the original distributed pattern, multiple PEs are grouped into a single domain within which a root PE collects data from the other PEs and writes them into a separate file. In our parallel implementation, the domain concept is preserved in that data are still gathered from each domain onto its root PE. The main difference is that these PEs now direct their data to the MPI-IO library, which controls how the data are gathered and written to a single shared file. Retaining the domain structures allows the application to reorganize data in memory prior to any operations and enables more contiguous access to the file. -
High-level library. In general, the data storage layout should match the application access patterns in order to achieve significant performance gains. The data layout of netCDF-3 is contiguous, whereas netCDF-4 permits more generalized layouts using blocks of contiguous subdomains (or “chunks”). To simplify the tuning, we use the default chunking layout of netCDF-4 files, so that we can focus on the impact of other parameters. We consider the impact of chunking on performance in the high-resolution benchmark.
-
MPI-IO. There are many parameters in the MPI-IO layer that could dramatically affect the performance. MPI-IO distinguishes between two fundamental styles of : independent and collective. We only consider collective in this work as it is required for accessing netCDF variables with unlimited dimensions (typically the time axis). All configurable settings for independent functions are thus excluded. The collective functions require process synchronization, which provides an MPI-IO implementation the opportunity to coordinate processes and rearrange the requests for better performance. For example, as the high-performance portable implementation packaged in MPICH and OpenMPI, ROMIO has two key optimizations, data sieving and collective buffering, which have demonstrated significant performance improvements over uncoordinated . However, even with these improvements, the shared file performance is still far below the single-file-per-process approach. Part of the reason is that shared file incurs higher overhead due to filesystem locking, which can never happen if a file is only accessed by a unique process. In order to reduce such overhead, it is necessary to tune the collective operations. By reorganizing the data access in memory, collective buffering assigns a subset of client PEs as aggregators. These aggregators gather smaller, non-contiguous accesses into a larger contiguous buffer and then write the buffer to the filesystem (Liao and Choudhary, 2008). Both aggregators and collective buffer size can be set through MPI info objects (Thakur et al., 1999). For example, the number of aggregators per node is controlled by the MPI-IO hint
cb_config_list , and the total number of aggregators is specified incb_nodes . To simplify the benchmark configuration, we always setcb_nodes to the total number of PEs and leavecb_config_list to control the actual aggregator distribution over all nodes. The collective buffer size,cb_buffer_size , is the size of the intermediate buffer on an aggregator for collective . We initially set the value to 64 kB in the lower-resolution model and then evaluate its impact on the performance of the higher-resolution model. -
Lustre Filesystem. The positioning of files on the disks can have a major impact on performance. On the Lustre filesystem, this can be controlled by striping the file across different OSTs (object storage targets). The Lustre stripe count,
striping_factor , specifies the number of OSTs over which a file is distributed, and the stripe size, striping_unit, specifies the number of bytes written to an OST before cycling to the next OST. As there is limit of 165 stripes for a shared file on our Lustre filesystem, we set a range of stripe counts up to 165 to align the number of nodes. The stripe size should generally match the data block size of operations (Turner and McIntosh-Smith, 2017); we find that the stripe size had limited effects on the write performance, and the default 1 MiB gave satisfactory performance in our preselection process.
The preselected parameters at all layers of software stack.
Layer | Parameter | Value |
---|---|---|
Application | io_layout | iox 32, 16, 8, 4, 2, 1 |
() | ioy 30, 15, 5, 3 | |
High-level library | Data storage layout | netCDF-3: contiguous |
netCDF-4: default chunking | ||
cb_buffer_size | 64 kB | |
MPI-IO | cb_nodes | number of PEs |
cb_config_list | 1, 2, 4, 8 | |
Lustre | striping_unit | 1 MB |
striping_factor | 15, 30, 60, 120, 165(max) |
The parallel performance benchmark configurations are set up as shown in Table 3.
Table 3
The parallel performance benchmark configurations.
Parameters | Description | |
---|---|---|
1 d simulations with diagnostic output enabled. | ||
Model | Configurations | 0.25 model () for performance tuning |
0.1 model () for validating performance | ||
Output | Diagnostic | Diagnostic fields: , , , , |
Diagnostic file write frequency: | ||
30 min interval for 0.25, 48 steps, 70 GB | ||
5 min interval for 0.1, 288 steps, 2.7 TB | ||
Benchmark | PEs | 240 & 960 for 0.25 model; 720 & 1440 for 0.1 model |
Domain layout | for 240 PEs; for 960 PEs (0.25) | |
for 720 PEs; for 1440 PEs (0.1) | ||
library/format | NetCDF v4.6.1 with the following libraries/formats: | |
HDF5 v1.8.20/netCDF-4 | ||
HDF5 v1.10.2/netCDF-4 & netCDF-4 classic | ||
PnetCDF v1.9.0/netCDF-3 (64-bit offsets) |
Project size details: we run a suite of 1 d simulations of the 0.25 global MOM-SIS model for each of the parameters in Table 3. We then apply these results to a 1 d simulation of 0.1 models and validate the parallel performance benefits. Each simulation is initialized with prescribed temperature and salinity fields and is forced by prescribed surface fields. The compute domain is represented by the horizontal grid sizes of and for the 0.25 and 0.1 models, respectively. Both configurations use a common 50-level vertical grid. Model output consists of several restart files in double-precision format and a diagnostic output file in single-precision format. In order to produce significant loads for such a short run, diagnostic output is saved after every time step. In the 0.25 configuration, the model writes 70 GB of data to the diagnostic file over 48 time steps with the 0.25 configuration and writes 2.7 TB of data over 288 steps with the 0.1 configuration model. Multiple independent runs are repeated, and the shortest time is shown for each case.
Domain layout: domain layout depends on the total number of PEs in use. Two distinct CPU configurations, 240 and 960 PEs, are considered for the 0.25 model. The domain layout is for 240 PEs and for 960 PEs. In 0.1 model, grids are distributed over 720 and 1440 PEs with the domain layout of and , respectively. PEs are equally assigned in node majority along the direction of the domain layout.
High-level libraries and netCDF formats: the netCDF library provides parallel access to netCDF-4 formatted files based on the HDF5 library and netCDF-3 formatted files via the PnetCDF library. HDF5 maintains two version tracks, 1.8 series and 1.10 series, in order to maintain the file format compatibility and the enabling of new features, such as the collective metadata or virtual datasets. We are interested in checking the performance to access different formats via various libraries as listed in Table 3.
We rely on the FMS timers to measure the time metrics on opening
(
Experiments are carried out on the NCI Raijin supercomputing platform. Each compute node consists of two Intel Xeon (Sandy Bridge) E5-2670 processors with a nominal clock speed of 2.6 GHz and containing eight cores, or 16 cores per compute node. Standard compute nodes have 64 GB of memory shared between the two processors. A Lustre filesystem having 40 OSSs (object storage servers) and 360 OSTs is mounted as the working directory via 56 Gb FDR InfiniBand connections.
4 Benchmark results4.1
Single-threaded single-file of the 0.25 model
The single-threaded single-file pattern of MOM5 is chosen as the reference to compare its time with the parallel methods. As with parallel , this method creates a single output file, and no postprocessing is required. The operation times and total execution times for our target libraries and PE configurations are shown in Table 4.
Table 4Serial single-file time in MOM5 by using 240 and 960 PEs.
0.25 Model | 240 PEs | 960 PEs | |||
---|---|---|---|---|---|
Time (s) | netCDF-3 | netCDF-4 | netCDF-3 | netCDF-4 | |
(PnetCDF 1.9.0) | (HDF5 1.10.2) | (PnetCDF 1.9.0) | (HDF5 1.10.2) | ||
Total runtime | 637.82 | 687.20 | 629.33 | 671.95 | |
7.46 | 6.39 | 15.62 | 14.97 | ||
3.90 | 3.73 | 6.16 | 4.88 | ||
4.58 | 4.15 | 2.37 | 2.43 | ||
545.50 | 592.39 | 576.92 | 616.35 | ||
0.65 | 0.96 | 1.23 | 2.37 |
We can see that all benchmarks are intensive and they are driven by file
initialization and writing operations. Specifically, writing a 4D dataset into
the diagnostic file takes about 85 % of total elapsed time. All other
times are notably shorter than
The time used in writing data into netCDF-4 formatted files is about 10 % longer than creating netCDF-3 formatted files. This reflects the fact that in serial , the root PE holding the global domain data tends to write the file contiguously, and it matches the contiguous data layout of netCDF-3 better than the default block chunking layout of netCDF-4.
Most operations excluding
Parallel performance tuning of the 0.25 model
4.2.1layout
As outlined in the introduction, layout specifies the topology of domains to which the global domain is mapped. In our parallel implementation, we adapt the layouts in FMS to define subdomains of parallel activity. Only the root PE of each domain is involved in accessing the shared output file via MPI-IO. A skilful selection of layout can help to control the contentions on opening and writing of files. layout is not involved in reading input files; all PEs access the input files independently when reading the grid and initialization data.
In this section we explore how layouts affect the performance. For each layout, we adjust the number of stripe counts and aggregators to approach the shortest time.
In the 240 PE benchmark, the domain PEs are distributed over a 2D grid of 16 PEs in the direction and 15 PEs in the direction, denoted as . On our platform, this corresponds to 16 PEs per node over 15 nodes. The experimental subdomain is similarly defined as , where , 2, 4, 8, 16 and , 5, 15. On our platform, which uses 16 CPUs per node, we can interpret as the number of PEs per node and as the number of nodes. A schematic diagram of PE domains and domains in 240 PE benchmark is shown in Fig. 2. For the 960 PE benchmark, the PE layout is , which utilizes 960 CPU cores over 60 nodes. The experimental layout is set as the combination of , 2, 4, 8, 16, 32 and , 30. Note that in the case of , there are nodes and one rank per node. For all other cases in the 960 PE benchmark, there are two nodes and PEs per node.
Figure 2A schematic diagram of computation domain (grid of yellow-outlined squares) and domain (grid of blue-outlined squares) with 12 PEs (labelled with filled-in yellow squares) in a 240 PE benchmark. The index of each PE is labelled.
[Figure omitted. See PDF]
The time metrics associated with different layouts by using 240 and 960 PEs are measured and compared. Each benchmark result is classified based on its library/format and the layout, and we report the shortest observed time in each category.
In all benchmarks, the elapsed times for writing files in netCDF-4 and netCDF classic formats are very similar, as both are produced by utilizing the HDF5 1.10.2 library. We will thus report performance among three libraries, i.e. HDF5 1.8.20, HDF5 1.10.2 and PnetCDF 1.9.0.
The mpp_open metric measures both the opening time of input
files and the creation time of output files. Its runtime versus layout
at 240 and 960 PE benchmarks is shown in Fig. 3. In all of the
experiments, PnetCDF has shorter
[Figure omitted. See PDF]
The
[Figure omitted. See PDF]
The majority of time is due to
[Figure omitted. See PDF]
The
[Figure omitted. See PDF]
The total elapsed time versus layout for all libraries are plotted in
Fig. 7. The HDF5 1.8.20 takes more time than HDF5 1.10.2 to produce the
netCDF-4 files, due to longer
Total elapsed time versus layout for different libraries and PE numbers. Higher contention at 960 PEs can overwhelm the overall performance trends observed at 240 PEs. The layout together with its PE distribution in [PE per node nodes] are labelled on the axis.
[Figure omitted. See PDF]
The impact of layout on each component time indicates that excessive parallelism can give rise to high contention within the file server and can diminish performance. We could thus set up the delegated processes to reduce the contention that is also detailed in other work (Nisar et al., 2008). The best performance is achieved by using a moderate number of PEs per node, such as eight PEs per node in the 240 PE or two PEs per node in the 960 PE benchmark. Each PE collects data from other PEs within the same domain and forms more contiguous data blocks to be written to disk. In the next section, we use the best-performing layouts, for 240 PE and for 960 PE, to explore the optimal settings of Lustre stripe count and MPI-IO aggregator.
4.2.2 Stripe count and aggregatorsThe Lustre stripe count and the number of MPI-IO aggregators can be set as
MPI-IO hints when creating or opening a file and are the two major MPI-IO
parameters affecting performance. The MPI-IO hint
240 PEs
The variations in each time metric versus the number of aggregators and stripe counts for each library are plotted in Fig. 8 for the 240 PE experiments.
Figure 8
The performance of 240 PE benchmarks with different library/format bindings regarding to the number of aggregators and stripe counts.
[Figure omitted. See PDF]
The
The
The optimal
The
The total runtime shows similar dependences on stripe count and aggregators
with
960 PEs
The variations in each time metric on the number of aggregator and stripe count in all library/format bindings are plotted in Fig. 9 for 960 PE experiments.
Figure 9The performance of different library/format bindings with a variety of aggregators and stripe counts by using 960 PEs.
[Figure omitted. See PDF]
The metrics for the 960 PE benchmarks show a similar trend to the 240 PE
benchmarks. Both
In both the 240 and 960 PE experiments, the best performance occurs when the Lustre stripe count matches the number of aggregators. Using a larger stripe count may degrade the performance, since each aggregator process must communicate with many OSTs and must contend with reduced memory cache locality when the network buffer is multiplexed across many OSTs (Bartz et al., 2015; Dickens and Logan, 2008; Yu et al., 2007).
4.2.3implementation profiling analysis
The above benchmark results show performance variances among different libraries and formats. In order to explore the source of differences in performance, we have developed an profile to capture function calls at multiple layers of the parallel stack, including netCDF, MPI-IO and POSIX , without requiring source code modifications. It provides a passive method for tracing events through the use of dynamic library preloading. It intercepts netCDF function calls issued by the application and reroutes them to the tracer, where the timestamp, library function name, target file name and netCDF variable name along with function arguments are recorded. The original library function is then called after these details have been recorded. It is applied similarly at the MPI-IO and POSIX layers. We have disabled profiling of HDF5 and PnetCDF libraries, as both are intermediate layers. Profiling overheads were measured to be negligible in comparison to the total time.
We apply the profiler described above to the 240 PE benchmark experiments, using the optimal parameters from the previous analysis. The profiling results are plotted in call path flowcharts for each library as shown in Figs. 10–12. The accumulated maximum PE time is presented within each function node and above call path links. The number of PEs involved in each call path is also given in the brackets. Call paths with trivial elapsed time have been omitted.
Figure 10The call path flow of a tuned 240 PE benchmark with HDF5 1.8.20/netCDF-4. It is classified into three layers, i.e. netCDF, MPI-IO and system functions. The maximum PE time together with the total number of PEs from the invoker are labelled above each path line, and the maximum PE time on each function are labelled within the node block.
[Figure omitted. See PDF]
Figure 11The call path flow of tuned 240 PE benchmark with HDF5 1.10.2/netCDF-4. It is classified into three layers, i.e. netCDF, MPI-IO and system functions. The maximum PE time together with the total number of PEs from the invoker are labelled above each path line, and the maximum PE time on each function are labelled within the node block.
[Figure omitted. See PDF]
Figure 12The call path flow of tuned 240 PE benchmark with PnetCDF. It is classified into three layers, i.e. netCDF, MPI-IO and system functions. The maximum PE time together with the total number of PEs from the invoker are labelled above each path line, and the maximum PE time on each function are labelled within the node block.
[Figure omitted. See PDF]
As shown in Fig. 10,
Aside from the metadata operations, reading and writing netCDF variables are
conducted collectively via
In the HDF5 1.10. track, collective was introduced to improve the performance of metadata operations. Collective metadata can improve performance by allowing the library to perform optimizations when reading the metadata, by having one rank read the data and broadcasting it to all other ranks. It can improve metadata write performance through the construction of an MPI-derived data type that is then written collectively in a single call. The call path flow of tuned 240 PE benchmark with HDF5 1.10.2/netCDF-4 is shown in Fig. 11.
It shows that
The call path flow of the tuned 240 PE benchmark with PnetCDF is shown in
Fig. 12. Due to the simpler file structure of netCDF-3, the
Load balance is another factor which may strongly affect performance. In Fig. 13 we compare the time distribution over PEs in three layers of the major write call path between HDF5 1.10.2 and PnetCDF.
Figure 13
Time distribution over PEs of major write call path functions, i.e. nc_put_vara_double for netCDF, MPI_File_write_at_all for MPI-IO and POSIX call write. The benchmark is running on 240 ranks with an layout of .
[Figure omitted. See PDF]
In the benchmark of the HDF5 1.10.2, both
As indicated in the above benchmark experiments, the write performance is
optimized by choosing an appropriate number of PEs, aggregators and
Lustre stripe counts. In contrast to
As noted earlier, the serial
Figure 14
The 960 PE benchmarks with layout and naggr1 by using serial read (sread) and parallel read (pread) with the HDF5 1.10.2 and PnetCDF libraries. Serial read times are overall more efficient over a range of stripe counts.
[Figure omitted. See PDF]
The
performance validation of 0.1 model
The tuning results from the 0.25 model suggests that the best parallel performance could be achieved with the following settings:
-
a parallel write with
-
a moderate number PEs per node to access the file, as defined by layout;
-
one or two aggregators per node, as defined by MPI-IO hints;
-
a stripe count matching the number of aggregators, as defined by MPI-IO hints.
-
-
a serial read on input files with the same stripe count as parallel write.
The domain layouts of the 720 and 1440 PE runs are and , respectively. We choose layouts of and for 720 and 1440 PEs, respectively, so there is one PE per node. The number of aggregators is also configured to one per node, and the stripe count is set to the total number of aggregators, i.e. 45 and 90 for the 720 and 1440 PE runs, respectively. For all benchmark experiments, we use serial independent reads and parallel writes. The measured time metrics in 720 and 1440 PE runs for the HDF5 1.10.2 and PnetCDF libraries are shown in Table 5. The timings of the original single-threaded single file (SIO) pattern in 720 and 1440 PE runs are also listed for comparison.
Table 5The time metrics of 0.1 model in 720 and 1440 PE runs with HDF5 1.10.2/netCDF-4 and PnetCDF 1.9.0/netCDF-3. SIO represents the original serial read and single-threaded write; PIO represents the serial read and parallel shared write. All values are taken from the maximum wall time among all PEs.
Library/format | HDF5 1.10.2 (netCDF-4) | PnetCDF 1.9.0 (netCDF-3) | |||||||
---|---|---|---|---|---|---|---|---|---|
PEs | 720 (45 nodes) | 1440 (90 nodes) | 720 (45 nodes) | 1440 (90 nodes) | |||||
pattern | SIO | PIO | SIO | PIO | SIO | PIO | SIO | PIO | |
Total runtime (s) | 21 689 | 1624 | 000 | 889 | 19 726 | 1387 | 000 | 782 | |
8 | 51 | 90 | 9 | 16 | 81 | ||||
25 | 11 | 11 | 15 | 11 | 14 | ||||
20 826 | 705 | 349 | 18 839 | 526 | 290 | ||||
8 | 37 | 59 | 0 | 0 | 1 | ||||
Non- time (s) | 828 | 820 | 380 | 860 | 834 | 396 |
As shown in Table 5, the original serial pattern requires a very long time (about 6 h) to create a large diagnostic file (2.7 TB) and multiple restart files (75 GB) in 720 PE runs. The serial 1440 PE runs exceeded the platform job time limit of 5 h and could not be completed, but the lack of scalability of serial indicated by 0.25 model (Table 4) suggests that the total time would be comparable to the 720 PE runs. We noticed that the PnetCDF timings are 20 % faster than the HDF5 times, as also observed in the 0.25 model benchmarks. Both libraries have similar non- times at each level of PE count, which comprise less than 5 % of total runtime, demonstrating that the benchmarks are intensive and that different libraries have no impact on the computation time.
The values of
The PnetCDF library shows better write performance than HDF5 in both serial
and parallel , as well as a much shorter time in
All HDF5 performance results used the default block chunking layout, where
the chunk size is close to 4 MB with a roughly equal number of chunks
along each axis. We repeated these tests by customizing the chunk layout
while keeping all other parameters unchanged. The chunk layout, (
Performance of 720 PE runs with customized chunk layouts in HDF5/netCDF-4. The default chunk layout of HDF5/netCDF-4 and contiguous layout of PnetCDF/netCDF-3 are shown as references.
[Figure omitted. See PDF]
The chunk layout of (1, 1) defines the whole file as a single chunk. In this
case, it occupies the same contiguous data layout with PnetCDF. Not
surprisingly, the
The
Choosing a good chunk layout depends strongly on the layout settings. Using a single chunk in the netCDF-4 file is unnecessary as it resembles the same data layout as the netCDF-3 format. Adopting an layout as the chunk shape is sufficient for achieving optimal performance if our intention is to create netCDF-4 formatted output files and to utilize more advanced features, such as compression and filtering operations.
Although benchmark tests in this work are highly intensive to explore the performance of parallel , the general simulation with less workloads could also benefit from parallel . To demonstrate it we conducted 8 d simulations of 0.1 model with frequencies in every 1 and 4 d. The time and total runtime in each simulation from 720 and 1440 PEs runs are listed in Table 6. The produced ocean diagnostic files are 73 and 19 GB for 1 and 4 d frequencies, respectively.
Table 6The time metrics of 0.1 model in 720 and 1440 PE runs
with less frequencies, i.e. write per 1 and 4 d in 8 d
simulations. SIO represents the original single-threaded write; PIO
represents parallel shared write. The time composes of contributions
from
pattern & format | SIO in netCDF4_classic | PIO in netCDF-4 | PIO in netCDF-3 | ||||||
---|---|---|---|---|---|---|---|---|---|
frequency | 1 d | 4 d | 1 d | 4 d | 1 d | 4 d | |||
Total runtime (s) | 8114 | 7817 | 7685 | 7569 | 7666 | 7469 | |||
720 PEs | time/ |
494/453 | 302/265 | 75/40 | 62/27 | 57/17 | 49/11 | ||
ratio | 6.09 % | 3.87 % | 0.98 % | 0.82 % | 0.74 % | 0.66 % | |||
Total runtime (s) | 4118 | 3743 | 3547 | 3578 | 3518 | 3549 | |||
1440 PEs | time/ |
452/421 | 269/238 | 59/24 | 48/14 | 51/14 | 40/7 | ||
time ratio | 10.98 % | 7.18 % | 1.67 % | 1.35 % | 1.45 % | 1.14 % |
For 720 PEs the time takes 6.09 % of total runtime for 1 d
frequency, and it reduces to 3.87 % of total runtime for a lower frequency
of 4 d. These are regarded as typical workloads of
normal model simulations at 5 % (Koldunov et al., 2019). The parallel
scheme could reduce the weight to be less than 1 % of total run time
in both netCDF-4 and netCDF-3 formats. It is noticed that total overheads
from those one-time operations such as
We have implemented parallel netCDF into the FMS framework of the MOM5 ocean model and presented results which demonstrate the performance gains relative to single-threaded single-file . We present a procedure for tuning the relevant parameters, which begins with identifying the parameters that are sensitive to overall performance by using a light-weight benchmark program. We then systematically measure the impact of this reduced list of parameters by running the MOM5 model at a lower (0.25) resolution and determining the optimal values for these parameters. This is followed by a validation of the results in the higher (0.1) resolution configuration.
Several rules for tuning the parameters across multiple layers of the stack are established to maintain the contiguous access patterns and achieve the optimal performance. At the user application layer, domains were defined to retain more contiguous access patterns by mapping the scattered grid data to a smaller number of PEs. We achieve the best performance when there is at least one PE per node, and there can be additional benefits to using multiple PEs per node, although an excessive number of PEs per node can impede performance.
At the MPI and Lustre levels of the stack, it was found that the number of aggregators used in collective MPI- operations and the number of Lustre stripe counts needed to be consistently restricted to no more than two per node in order to facilitate contiguous access and reduce the number of contentions between PEs.
An profiling tool has been developed to explore overall timings and load balance of individual functions across the stack. It was determined that the MPI implementation of particular operations in the HDF5 1.8.20 library used by netCDF-4 caused significant overhead when accessing metadata and that these issues were largely mitigated in HDF5 1.10.2. Additional profiling of the PnetCDF 1.9.0 library showed that it did not suffer from such overhead, due to the simpler structure of the netCDF-3 format.
High-resolution MOM5 benchmarks using the 0.1 configuration were able to confirm that the parallel implementations can dramatically reduce the write time of diagnostic and restart files. Using parallel enables the scaling of operations in pace with the compute time and improves the overall performance of MOM5, especially when running an -intensive configuration resembling our benchmark. The parallel implementation proposed in this paper provides an essential solution that removes any potential bottlenecks in MOM5 at higher resolutions in the future.
Although this work is applied to a model with a fixed regular grid, these results could be applied to a model with an unstructured mesh. Much of the work required to populate the domains and to define chunked regions is required to produce contiguous streams of data which are passed to the library. If the data are already stored as contiguous 1D arrays, then the task of dividing the data across servers could be trivial. If more complex data structures are used, such as linked lists, then the buffering of data into contiguous arrays could add significant overhead to parallel .
An investigation of data compression is not a part of this work, as traditionally it can only be used in serial . We note that the more recent version of HDF5, 1.10.2, introduced support for parallel compression, and it is expected that the netCDF library will soon follow. As the layout generally picks up one to two PE per compute node, it may produce chunks which are too large (i.e. too small a number of chunks) for efficient parallel compression. In this sense, the default chunk layout of netCDF4 should also be considered as it gains acceptable write performance and has suitable chunk sizes more suitable for parallel compression. Finally, it is explored that parallel could not only largely accelerate intensive model simulations but also prompt the scalability of a general case with typical workloads.
Code availability
The source code of parallel enabled FMS is available from 10.5281/zenodo.3700099 (Ward and Yang, 2020). The MOM5 code used in the work is
available at
Author contributions
RY and MW developed the parallel code contributions to FMS. RY carried out all model simulations, as well as performance profiling and analysis. RY and MW wrote the initial draft of the article. All coauthors contributed to the final draft of the article. BE supervised the project.
Competing interests
The authors declare that they have no conflict of interest.
Acknowledgements
This work used supercomputing resources provided by National Computational Infrastructure (NCI), The Australian National University.
Review statement
This paper was edited by Olivier Marti and reviewed by Nikolay V. Koldunov and Michael Kuhn.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2020. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
We present an implementation of parallel
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details

1 National Computational Infrastructure, The Australian National University, Canberra, ACT 0200, Australia
2 National Computational Infrastructure, The Australian National University, Canberra, ACT 0200, Australia; now at: Geophysics Fluid Dynamics Laboratory, National Oceanic & Atmospheric Administration, Princeton, NJ 08540-6649, USA