Abstract

Ionizing radiation remains an obstacle to bringing graphics processing units (GPUs) to space. Because radiation-hardened GPU chips are not technically feasible at present, emphasis has been placed on adapting commercial off-the-shelf (COTS) GPUs to the space domain. Current GPU error detection methods require redundant computation. This thesis explores the use of hardware performance counters, special registers that monitor internal GPU hardware events, for lightweight, symptom-based error detection. Hardware performance counters are successfully used to detect anomalous single-event upsets in the L0 instruction cache, the load-store unit, the arithmetic logic unit, the fused multiply-add pipeline, and the address divergence unit of a GPU. These upsets are detected using both supervised and unsupervised shallow machine learning models. The results indicate a viable alternative to redundancy-based computational methods for detecting and handling single-event upsets in a subset of the components of a GPU architecture.

Details

Title
Towards a Spaceworthy COTS Graphics Processing Unit: Hardware Performance Counter-Based Symptomatic Fault Detection
Author
Teijeiro, Antonio Emilio
Publication year
2023
Publisher
ProQuest Dissertations & Theses
ISBN
9798381440027
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
2919539104
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.