Abstract

High-performance computing (HPC) drives modern scientific and engineering discovery, yet the ever-growing scale and density of contemporary architectures make them increasingly susceptible to transient bit flips that evade hardware countermeasures. When such soft errors escape detection, they may manifest as silent data corruptions (SDCs), compromising result integrity. Conventional defences—error-correcting codes or checkpoint/restart—struggle to scale gracefully to exascale workloads, leaving a widening resilience gap.

This dissertation bridges that gap by combining large-language-model (LLM) analytics with interactive visualisation to create an end-to-end workflow that explains, predicts, and contextualises resilience in real HPC codes. First, the VISILIENCE framework overlays multiple resilience metrics on a program’s control-flow graph, enabling developers to trace error-propagation paths and prioritise mitigation without wading through raw logs. Building on those insights, the modular HAPPA predictor segments long kernels, embeds each segment with Transformer models, and aggregates them via mean, max, LSTM, or attention pooling, achieving state-of-the-art SDC-rate prediction accuracy. Its parameter-efficient successor, eHAPPA, adopts low-rank adaptation to cut trainable parameters by more than 99% while lowering mean-squared prediction error to 0.055—a 25% improvement over prior baselines. Finally, a loop-level study maps 52 benchmark applications to the thirteen “dwarfs” of parallel computation; guided by prompt-engineered GPT-4, it classifies loop semantics automatically and, through large-scale fault injection, uncovers pronounced dwarf-specific vulnerability patterns, with N-Body loops exceeding 30% SDC incidence and MapReduce loops remaining below 5%.

Together these contributions form a scalable tool chain that converts low-level fault data into actionable insight, enabling selective hardening and advancing the dependability of next-generation HPC systems.
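
The segment-and-aggregate step that the abstract attributes to HAPPA can be illustrated with a small sketch. This is not the dissertation's implementation: the toy two-dimensional embeddings, the function names, and the fixed `query` vector are all assumptions made for illustration, and the LSTM variant is omitted for brevity. In a real predictor the embeddings would come from a Transformer encoder and the attention query would be learned.

```python
import math

def mean_pool(segments):
    """Average each embedding dimension across all segments."""
    n = len(segments)
    return [sum(col) / n for col in zip(*segments)]

def max_pool(segments):
    """Take the per-dimension maximum across all segments."""
    return [max(col) for col in zip(*segments)]

def attention_pool(segments, query):
    """Softmax-weighted sum of segments.

    Scores are dot products with `query`, a hypothetical stand-in
    for a learned attention vector.
    """
    scores = [sum(s * q for s, q in zip(seg, query)) for seg in segments]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(segments[0])
    return [sum(w * seg[i] for w, seg in zip(weights, segments)) for i in range(dim)]

# Toy embeddings for three segments of one kernel (assumed values).
segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

kernel_mean = mean_pool(segments)
kernel_max = max_pool(segments)
# A zero query gives every segment the same weight, so the result matches mean pooling.
kernel_attn = attention_pool(segments, query=[0.0, 0.0])
```

Each function reduces a variable number of segment vectors to one fixed-length kernel vector, which is what lets a downstream regressor predict an SDC rate for kernels of any length.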

Details

1010268
Title
A Resilience Study of Soft Errors in High-Performance Computing Applications: Visualization and LLM-Based Modeling
Number of pages
185
Publication year
2025
Degree date
2025
School code
0101
Source
DAI-B 87/1(E), Dissertation Abstracts International
ISBN
9798288836077
Advisor
Committee member
Jin, Ruoming; Yu, Gang; Shen, Tao; Lian, Xiang
University/institution
Kent State University
Department
College of Arts and Sciences / Department of Computer Science
University location
United States -- Ohio
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32208662
ProQuest document ID
3230447782
Document URL
https://www.proquest.com/dissertations-theses/resilience-study-soft-errors-high-performance/docview/3230447782/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic