Title
FPGA Checkpointing for Scientific Computing
Abstract
FPGAs are becoming increasingly popular for computational workloads because they are more flexible than ASICs and consume less power than GPUs and CPUs. However, scientific applications run for long periods of time, and the hardware is always subject to failures caused by soft or hard errors. It is therefore important to protect these long-running jobs with fault tolerance mechanisms. Checkpoint-restart is a popular technique in high-performance computing that allows large-scale applications to cope with frequent failures. In this work we approach the fault tolerance of heterogeneous CPU-FPGA applications from a high level, using the OmpSs@FPGA environment and a multi-level checkpointing library. We analyse the performance of several applications to understand what overheads can be expected from checkpointing computational workloads running on FPGAs. Our results demonstrate overheads as low as 0.16% and 0.66% when checkpointing very frequently, indicating that the technique is efficient and does not add significant overhead to the system. In addition, we showcase a proof of concept for checkpointing partial data of the FPGA task itself. This can prove useful for workloads that offload most of their data to the FPGA memory at once and do not constantly move all of it between the accelerator and the CPU.
Year
2021
DOI
10.1109/IOLTS52814.2021.9486693
Venue
2021 IEEE 27th International Symposium on On-Line Testing and Robust System Design (IOLTS)
Keywords
FPGA, FTI, fault tolerance, accelerator, resilience, checkpointing, reliability
DocType
Conference
ISSN
1942-9398
ISBN
978-1-6654-3371-6
Citations
0
PageRank
0.34
References
0
Authors
3
Name | Order | Citations | PageRank
Marc Perelló Bacardit | 1 | 0 | 0.34
Leonardo Bautista-Gomez | 2 | 6 | 2.08
Osman Unsal | 3 | 164 | 14.33