Title | ||
---|---|---|
Error Vulnerabilities and Fault Recovery in Deep-Learning Frameworks for Hardware Accelerators |
Abstract | ||
---|---|---|
Hardware accelerators such as GP-GPUs, Tensor Cores, and Deep-Learning Accelerators (DLA) are increasingly being used in real-time settings such as autonomous vehicles (AVs). In such deployments, any software errors and process failures in hardware systems can lead to critical faults in AVs. Therefore, assessing and mitigating hardware accelerator faults are critical requirements for safety-critical systems. Past work on this subject focused on simulated and injected software and hardware faults to understand and analyze the behavior of the software stack and the entire system. However, programming errors and process failures that arise when using software frameworks must also be considered. In this paper, we present experiments which show that widely used deep-learning frameworks are vulnerable to programming mistakes and errors. We first focus on memory-related programming errors caused by applications using deep-learning frameworks that facilitate high-performance inferencing. We next find that a reset to recover from any fault imposes significant time penalties in reloading a pre-trained deep neural network model. To reduce these fault recovery times, we propose fault recovery mechanisms that checkpoint and resume the network based on the inference stage when an error is detected. Finally, we substantiate the practical feasibility of our approach and evaluate the improvement in recovery times. We use a case study with real-world applications on an Nvidia GeForce GTX 1070 GPU and an Nvidia Xavier embedded platform, which is commonly used by multiple automotive OEMs. A demo video clip demonstrating our recovery algorithm has been uploaded to YouTube: https://www.youtube.com/watch?v=xwUYdJdA5oM. |
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/RTCSA50079.2020.9203738 | 2020 IEEE 26th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA) |
Keywords | DocType | ISBN |
fault detection,fault recovery,deep learning framework,checkpoint,hardware accelerators | Conference | 978-1-7281-4403-0 |
Citations | PageRank | References |
1 | 0.36 | 0 |
Authors | ||
6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Iljoo Baek | 1 | 1 | 1.38 |
Zhihao Zhu | 2 | 3 | 5.13 |
Sourav Panda | 3 | 1 | 0.36 |
Nandha Kishore Srinivasan | 4 | 1 | 0.36 |
Soheil Samii | 5 | 1 | 0.36 |
Ragunathan Raj Rajkumar | 6 | 2 | 2.06 |