
Abstract

Modern High-Performance Computing (HPC) systems in the exascale era enable solutions to grand-scale problems in domains such as energy, molecular dynamics, and medicine. The compute and scaling needs of applications constantly grow as these domains evolve, demanding more resources than ever. To meet this demand, hardware vendors continually introduce new CPU architectures, interconnects, and accelerators to achieve scale-up and scale-out performance, and these components form the overall architecture of modern HPC systems. While modern HPC architectures provide the compute and bandwidth required for large-scale applications, distributing work among the processes of a system depends heavily on the performance of communication operations. The presence of multi- and many-core processors with deep memory hierarchies and advanced interconnects within and between nodes challenges communication runtimes: they must be architecture-aware both within a node (intra-node) and across nodes connected by a network (inter-node), while also remaining aware of the communication patterns used by applications.

Obtaining good scale-up and scale-out performance requires solutions to various bottlenecks in communication runtimes. Introducing new topology-aware design patterns for newer architectures requires significant effort. HPC applications that run at high core counts often experience contention between processes or threads on shared resources, such as caches, leading to degraded performance. Application compute and communication patterns can also affect intra-node performance through varying message sizes and load imbalance, leading to idle time. The advent of the in-network computing paradigm has enabled network switches to perform operations such as reductions without involving the CPU, while also reducing data-movement costs within the network. These solutions often scale very well, but the resources available on the switch are limited. Another bottleneck in scaling out is often the amount of data that must be exchanged over the network. State-of-the-art MPI libraries do not minimize latency or maximize bandwidth in such cases, leaving performance improvement opportunities unexploited. In this thesis, we analyze bottlenecks in state-of-the-art MPI libraries and optimize MPI communication by bringing architecture awareness to point-to-point and collective operations from two angles: (1) multi/many-core CPU and GPU architectures, and (2) in-network compute and network architectures. We tackle optimizations across message sizes, both latency-bound (small messages) and bandwidth-bound (large messages), and outperform state-of-the-art communication libraries in multiple micro-benchmarks and application use cases. Our intra-node topology-aware designs achieve improvements of up to 7.8x at the micro-benchmark level and 15% for applications over the state-of-the-art intra-node collective communication designs employed by production MPI libraries. The cache-aware MPI_Alltoall designs proposed in this thesis outperform state-of-the-art MPI libraries by up to 10x at the micro-benchmark level and by 22.2% in CPU time per loop for distributed Fast Fourier Transforms (FFTs) using P3DFFT. Our application load-imbalance-aware designs outperform state-of-the-art MPI libraries by 50.71% in 3D stencil-based communication benchmarks and by up to 37.65% in the miniAMR application. For scale-out performance, our proposed in-network-aware reduction collective designs outperform state-of-the-art MPI libraries by up to 5.1x for small messages and 89% for large messages, and our pipelined point-to-point compression designs improve throughput in bandwidth-constrained scenarios, outperforming state-of-the-art compression-based point-to-point designs by 49.7% in bandwidth benchmarks and 16.5% for 3D stencil-based communication patterns.
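To illustrate the kind of micro-benchmark comparison referenced above, the following is a minimal sketch of an MPI_Alltoall latency loop in the style of common OSU-type benchmarks. It is not the dissertation's benchmark code; the message-size sweep, iteration counts, and timing scheme are assumptions chosen for brevity.

/* Illustrative MPI_Alltoall latency sketch (not the dissertation's code).
 * Sizes, iteration counts, and averaging are assumptions for brevity. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int iters = 100, warmup = 10;

    /* Sweep per-process message sizes from 1 B (latency-bound) to 1 MiB
     * (bandwidth-bound), mirroring the small/large split in the abstract. */
    for (size_t bytes = 1; bytes <= (1 << 20); bytes <<= 1) {
        char *sendbuf = malloc(bytes * size);
        char *recvbuf = malloc(bytes * size);

        double start = 0.0;
        for (int i = 0; i < iters + warmup; i++) {
            if (i == warmup) {
                MPI_Barrier(MPI_COMM_WORLD);
                start = MPI_Wtime();
            }
            MPI_Alltoall(sendbuf, (int)bytes, MPI_CHAR,
                         recvbuf, (int)bytes, MPI_CHAR, MPI_COMM_WORLD);
        }
        double local = (MPI_Wtime() - start) / iters;

        /* Report the slowest rank's average time per collective call. */
        double worst;
        MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%8zu bytes  %10.2f us\n", bytes, worst * 1e6);

        free(sendbuf);
        free(recvbuf);
    }

    MPI_Finalize();
    return 0;
}

A loop of this form, built with mpicc and launched across the desired number of ranks, is the typical setting in which collective designs such as the cache-aware MPI_Alltoall mentioned above are compared against production MPI libraries.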

Details

Title
Designing High-Performance Architecture-Aware Communication Middleware for Modern HPC Systems
Number of pages
183
Publication year
2025
Degree date
2025
School code
0168
Source
DAI-A 87/5(E), Dissertation Abstracts International
ISBN
9798297962286
Committee member
Qin, Feng; Teodorescu, Radu
University/institution
The Ohio State University
Department
Computer Science and Engineering
University location
United States -- Ohio
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32384090
ProQuest document ID
3267939077
Document URL
https://www.proquest.com/dissertations-theses/designing-high-performance-architecture-aware/docview/3267939077/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic