
Abstract

High Performance Computing (HPC) has become the backbone of scientific discovery and engineering innovation, yet increasing system complexity and scale amplify challenges in fault tolerance, time synchronization, and performance understanding. This dissertation presents an integrated study that bridges these three critical dimensions to enhance resilience, coordination, and efficiency across scalable architectures and applications.

Fault tolerance for heterogeneous HPC systems is explored through CUDA kernel interruption using NVIDIA Multi-Process Service (MPS) and POSIX threads, as well as transparent checkpoint/restart with Application Binary Interface (ABI) portability. A taxonomy of consensus mechanisms adapted to MPI and HPC is introduced, classifying synchronous, asynchronous, and partially synchronous models, and aligning them with crash and Byzantine fault models.

Time synchronization is addressed through the adaptation of the HUYGENS algorithm for HPC, yielding a lightweight, software-based clock correction method that requires no specialized hardware. The HPC-HUYGENS approach employs ring-based timestamp probes and Support Vector Machine (SVM) classification to minimize skew and drift. Experimental evaluation demonstrates that this method improves collective predictability and reduces synchronization overhead compared with the traditional use of MPI_Barrier for reductions.
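As a rough illustration of software-only clock correction from timestamp probes, the following sketch estimates the offset and drift between one pair of clocks. It is not the dissertation's HPC-HUYGENS method: it substitutes a plain least-squares line fit for the SVM step, assumes symmetric network delay, and covers a single pair of neighbors rather than a full ring. All function names and the probe layout `(t1, t2, t3, t4)` are hypothetical.

```python
from statistics import mean


def probe_offset(t1, t2, t3, t4):
    """Estimate (clock_b - clock_a) from one two-way probe.

    t1: send time on A's clock, t2: receive time on B's clock,
    t3: reply time on B's clock, t4: receive time on A's clock.
    Assumes the forward and return delays are equal (an assumption).
    """
    return ((t2 - t1) + (t3 - t4)) / 2.0


def fit_offset_and_drift(probes):
    """Fit offset(t) = offset0 + drift * t through the per-probe offsets.

    Uses ordinary least squares in place of the SVM classification
    mentioned in the abstract. Returns (offset0, drift).
    """
    xs = [p[0] for p in probes]
    ys = [probe_offset(*p) for p in probes]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    drift = sxy / sxx
    return y_bar - drift * x_bar, drift


def to_reference_time(t_b, offset0, drift):
    """Map a timestamp taken on B's clock onto A's (reference) clock."""
    return (t_b - offset0) / (1.0 + drift)
```

In a ring arrangement, each rank would presumably run such probes against its neighbor and chain the pairwise corrections back to a reference rank; that composition step is omitted here.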

Performance understanding is advanced through fine-grain profiling of communication patterns using Caliper, Benchpark, and Thicket. By instrumenting halo exchanges and collective regions in representative benchmarks (AMG2023, Kripke, Laghos), communication bottlenecks are identified, scaling inefficiencies quantified, and performance trade-offs assessed across architectures such as Intel Sapphire Rapids (Dane) and AMD MI250X (Tioga).
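One common way to quantify scaling inefficiency from profiled timings is strong-scaling efficiency together with the fraction of runtime spent in communication. The sketch below shows those two textbook metrics only; the function names are hypothetical, and nothing here reflects the Caliper, Benchpark, or Thicket APIs or the dissertation's actual measurements.

```python
def strong_scaling_efficiency(t_base, p_base, t_p, p):
    """Efficiency relative to a baseline run on a fixed problem size.

    t_base: runtime on p_base processes, t_p: runtime on p processes.
    A value of 1.0 is ideal scaling; lower values indicate lost efficiency.
    """
    return (t_base * p_base) / (t_p * p)


def comm_fraction(comm_time, total_time):
    """Share of total runtime spent in communication regions."""
    return comm_time / total_time
```

For example, a run that takes 100 s on 1 process and 30 s on 4 processes has an efficiency of 100 / 120, or about 0.83; these numbers are placeholders chosen for the arithmetic, not measured results.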

The findings establish a conceptual framework for resilient and efficient large-scale HPC. By bridging fault tolerance, time synchronization, and performance analysis, this work demonstrates that resilience, coordination/synchronization, and efficiency are not isolated goals but mutually reinforcing pillars for next-generation HPC systems.

Details

1010268
Title: Bridging Fault Tolerance, Time Synchronization, and Performance Understanding Across Scalable Architectures and Applications
Number of pages: 134
Publication year: 2025
Degree date: 2025
School code: 0390
Source: DAI-B 87/6(E), Dissertation Abstracts International
ISBN: 9798270238964
Committee member: Altarawneh, Amani; Bangalore, Purushotham V.; Ghafoor, Sheikh; Nine, Zulkar MD. S
University/institution: Tennessee Technological University
Department: Computer Science
University location: United States -- Tennessee
Degree: Ph.D.
Source type: Dissertation or Thesis
Language: English
Document type: Dissertation/Thesis
Dissertation/thesis number: 32284112
ProQuest document ID: 3285411607
Document URL: https://www.proquest.com/dissertations-theses/bridging-fault-tolerance-time-synchronization/docview/3285411607/se-2?accountid=208611
Copyright: Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database: ProQuest One Academic