Title
Relaxed Replication for Energy Efficient and Resilient GPU Computing
Abstract
Power and reliability are two intertwined challenges in GPU-accelerated large-scale computing. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but incurs significant time and energy overhead. Managing power and resilience is challeng...
Year
2021
DOI
10.1109/FTXS54580.2021.00009
Venue
2021 IEEE/ACM 11th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
DocType
Conference
ISBN
978-1-6654-2059-4
Citations
1
PageRank
0.35
References
0
Authors
3
Name | Order | Citations | PageRank
Zheng Miao | 1 | 1 | 0.68
Jon C. Calhoun | 2 | 3 | 3.41
Rong Ge | 3 | 1 | 0.68