Abstract |
---|
Power and reliability are two intertwined challenges in GPU-accelerated large-scale computing. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but incurs significant time and energy overhead. Managing power and resilience is challeng... |
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/FTXS54580.2021.00009 | 2021 IEEE/ACM 11th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) |

DocType | ISBN | Citations |
---|---|---|
Conference | 978-1-6654-2059-4 | 1 |

PageRank | References | Authors |
---|---|---|
0.35 | 0 | 3 |

Name | Order | Citations | PageRank |
---|---|---|---|
Zheng Miao | 1 | 1 | 0.68 |
Jon C. Calhoun | 2 | 3 | 3.41 |
Rong Ge | 3 | 1 | 0.68 |