Title
Optimizing software-directed instruction replication for GPU error detection.
Abstract
Application execution on safety-critical and high-performance computer systems must be resilient to transient errors. As GPUs become more pervasive in such systems, they must supplement ECC/parity for major storage structures with reliability techniques that cover more of the GPU hardware logic. Instruction duplication has been explored for CPU resilience; however, it has never been studied in the context of GPUs, and it is unclear whether the performance and design choices it presents make it a feasible GPU solution. This paper describes a practical methodology to employ instruction duplication for GPUs and identifies implementation challenges that can incur high overheads (69% on average). It explores GPU-specific software optimizations that trade fine-grained recoverability for performance. It also proposes simple ISA extensions with limited hardware changes and area costs to further improve performance, cutting the runtime overheads by more than half to an average of 30%.
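The core idea of instruction duplication mentioned in the abstract can be illustrated with a toy model. This is only a minimal sketch, not the paper's GPU implementation: the names `duplicated_op`, `flip_bit`, and `TransientErrorDetected` are hypothetical, and the `fault` hook is an assumed stand-in for a transient error in the execution logic. It shows an operation executed twice, with the two results compared before the value is used.

```python
# Toy model of software instruction duplication.
# Illustrative only: all names here are hypothetical, and this is not
# the compiler-level GPU transformation the paper describes.
from typing import Callable, Optional


class TransientErrorDetected(Exception):
    """Raised when the original and shadow results disagree."""


def duplicated_op(op: Callable[[int, int], int], a: int, b: int,
                  fault: Optional[Callable[[int], int]] = None) -> int:
    """Run `op` twice and compare the results before using the value.

    `fault` is an assumed hook that corrupts only the first result,
    standing in for a transient error in the execution logic.
    """
    primary = op(a, b)
    if fault is not None:
        primary = fault(primary)   # inject the simulated error
    shadow = op(a, b)              # redundant copy of the same operation
    if primary != shadow:          # verification step
        raise TransientErrorDetected(f"{primary} != {shadow}")
    return primary


def flip_bit(bit: int) -> Callable[[int], int]:
    """Return a fault function that flips one bit of an integer result."""
    return lambda value: value ^ (1 << bit)
```

Calling `duplicated_op` without a fault returns the ordinary result, while passing `fault=flip_bit(0)` raises `TransientErrorDetected`. The redundant execution and the comparison are the kind of extra work behind the runtime overheads the abstract reports.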
Year: 2018
DOI: 10.1109/SC.2018.00070
Venue: SC
Keywords: Graphics processing units, Hardware, Registers, Instruction sets, Runtime, Redundancy
Field: Psychological resilience, Supercomputer, Instruction set, Computer science, Parallel computing, Error detection and correction, Software, Redundancy (engineering), Fault tolerance, Embedded system, Overhead (business)
DocType: Conference
ISBN: 978-1-5386-8384-2
Citations: 11
PageRank: 0.62
References: 19
Authors: 5
Name | Order | Citations | PageRank
Abdulrahman Mahmoud | 1 | 17 | 2.44
S. K. S. Hari | 2 | 384 | 20.20
Michael Sullivan | 3 | 313 | 18.05
Timothy K. Tsai | 4 | 647 | 56.27
Stephen W. Keckler | 5 | 3404 | 201.71