High Performance Computing (HPC) has become the backbone of scientific discovery and engineering innovation, yet increasing system complexity and scale amplify challenges in fault tolerance, time synchronization, and performance understanding. This dissertation presents an integrated study that bridges these three critical dimensions to enhance resilience, coordination, and efficiency across scalable architectures and applications.
Fault tolerance for heterogeneous HPC systems is explored through CUDA kernel interruption using NVIDIA Multi-Process Service (MPS) and POSIX threads, as well as transparent checkpoint/restart with Application Binary Interface (ABI) portability. A taxonomy of consensus mechanisms adapted to MPI and HPC is introduced, classifying synchronous, asynchronous, and partially synchronous models, and aligning them with crash and Byzantine fault models.
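To make the interruption mechanism concrete, the sketch below shows one simplified realization of cooperative CUDA kernel interruption: a POSIX thread on the host raises a cancellation flag in mapped (zero-copy) memory, which the running kernel polls and honors. This is an illustrative stand-in, not the dissertation's MPS-based implementation, and all identifiers are hypothetical.

```cuda
// Minimal sketch of cooperative kernel interruption via a host-mapped flag.
// Build: nvcc -o interrupt interrupt.cu
#include <cstdio>
#include <pthread.h>
#include <unistd.h>
#include <cuda_runtime.h>

// Long-running kernel that polls a device-visible cancellation flag.
__global__ void long_running_kernel(volatile int *cancel, float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int iter = 0; iter < 1000000; ++iter) {
        if (*cancel) return;                 // host requested interruption
        if (i < n) data[i] = data[i] * 1.0001f + 1.0f;
    }
}

// POSIX thread that raises the flag after letting the kernel run briefly.
static void *interrupter(void *arg) {
    volatile int *cancel = (volatile int *)arg;
    sleep(1);
    *cancel = 1;
    return NULL;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped host memory

    int *cancel_host, *cancel_dev;
    cudaHostAlloc(&cancel_host, sizeof(int), cudaHostAllocMapped);
    *cancel_host = 0;
    cudaHostGetDevicePointer(&cancel_dev, cancel_host, 0);

    int n = 1 << 20;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    pthread_t tid;
    pthread_create(&tid, NULL, interrupter, cancel_host);

    long_running_kernel<<<(n + 255) / 256, 256>>>(cancel_dev, data, n);
    cudaDeviceSynchronize();                 // returns once the kernel exits
    pthread_join(tid, NULL);
    printf("kernel returned (interrupted or completed)\n");

    cudaFree(data);
    cudaFreeHost(cancel_host);
    return 0;
}
```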
Time synchronization is addressed through the adaptation of the HUYGENS algorithm for HPC, yielding a lightweight, software-based clock correction method that requires no specialized hardware. The HPC-HUYGENS approach employs ring-based timestamp probes and Support Vector Machine (SVM) classification to minimize clock skew and drift. Experimental evaluation demonstrates that this method improves the predictability of collectives and reduces synchronization overhead compared with the traditional MPI_Barrier approach for reductions.
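As a rough illustration of the probing step, the sketch below exchanges timestamped probes between neighboring MPI ranks arranged in a ring and keeps the least-delayed round trip as the offset estimate. The min-RTT filter is a deliberate simplification standing in for the SVM boundary classification of HPC-HUYGENS, the even/odd pairing assumes an even number of ranks, and all names are illustrative.

```cpp
// Minimal sketch of ring-based timestamp probing between neighboring ranks.
// Run with an even number of ranks, e.g.: mpirun -np 4 ./probe_ring
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int next = (rank + 1) % size, prev = (rank + size - 1) % size;

    const int PROBES = 256;
    if (rank % 2 == 0) {                       // probing side
        double best_rtt = 1e30, best_offset = 0.0;
        for (int p = 0; p < PROBES; ++p) {
            double t1 = MPI_Wtime();
            MPI_Send(&t1, 1, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
            double ts[2];                      // remote recv/send timestamps
            MPI_Recv(ts, 2, MPI_DOUBLE, next, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            double t4 = MPI_Wtime();
            double rtt = (t4 - t1) - (ts[1] - ts[0]);
            if (rtt < best_rtt) {              // keep the least-delayed probe
                best_rtt = rtt;
                // NTP-style offset estimate from the four timestamps
                best_offset = ((ts[0] - t1) + (ts[1] - t4)) * 0.5;
            }
        }
        printf("rank %d -> %d: offset %.3e s (min RTT %.3e s)\n",
               rank, next, best_offset, best_rtt);
    } else {                                   // reflecting side
        for (int p = 0; p < PROBES; ++p) {
            double t1, ts[2];
            MPI_Recv(&t1, 1, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            ts[0] = MPI_Wtime();               // local receive time
            ts[1] = MPI_Wtime();               // local send-back time
            MPI_Send(ts, 2, MPI_DOUBLE, prev, 1, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```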
Performance understanding is advanced through fine-grained profiling of communication patterns using Caliper, Benchpark, and Thicket. By instrumenting halo exchanges and collective regions in representative benchmarks (AMG2023, Kripke, Laghos), communication bottlenecks are identified, scaling inefficiencies are quantified, and performance trade-offs are assessed across architectures such as Intel Sapphire Rapids (Dane) and AMD MI250X (Tioga).
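As an example of the kind of region-level annotation this profiling relies on, the sketch below marks a toy one-dimensional halo exchange with Caliper's C++ annotation macros; the region names and the exchange itself are hypothetical rather than drawn from AMG2023, Kripke, or Laghos.

```cpp
// Minimal sketch of Caliper region annotations around a toy halo exchange.
#include <caliper/cali.h>
#include <mpi.h>
#include <vector>

void halo_exchange(std::vector<double> &field, int left, int right,
                   MPI_Comm comm) {
    CALI_CXX_MARK_FUNCTION;                   // times the whole exchange
    CALI_MARK_BEGIN("halo_pack");
    double send_lo = field.front(), send_hi = field.back();
    CALI_MARK_END("halo_pack");

    CALI_MARK_BEGIN("halo_comm");
    double recv_lo, recv_hi;                  // would fill ghost cells
    MPI_Sendrecv(&send_hi, 1, MPI_DOUBLE, right, 0,
                 &recv_lo, 1, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&send_lo, 1, MPI_DOUBLE, left, 1,
                 &recv_hi, 1, MPI_DOUBLE, right, 1, comm, MPI_STATUS_IGNORE);
    CALI_MARK_END("halo_comm");
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::vector<double> field(64, (double)rank);
    halo_exchange(field, (rank + size - 1) % size, (rank + 1) % size,
                  MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```

With such annotations in place, a per-region timing profile can be collected at run time (for example via Caliper's CALI_CONFIG=runtime-report setting) and then aggregated and compared across runs and machines with Thicket.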
The findings establish a conceptual framework for resilient and efficient large-scale HPC. By bridging fault tolerance, time synchronization, and performance analysis, this work demonstrates that resilience, coordination, and efficiency are not isolated goals but mutually reinforcing pillars for next-generation HPC systems.