Title
Error Vulnerabilities and Fault Recovery in Deep-Learning Frameworks for Hardware Accelerators
Abstract
Hardware accelerators such as GP-GPUs, Tensor Cores, and Deep-Learning Accelerators (DLA) are increasingly being used in real-time settings such as autonomous vehicles (AVs). In such deployments, any software errors and process failures in hardware systems can lead to critical faults in AVs. Therefore, assessing and mitigating hardware accelerator faults are critical requirements for safety-critical systems. Past work on this subject focused on simulated and injected software and hardware faults to understand and analyze the behavior of the software stack and the entire system. However, programming errors and process failures caused when using software frameworks must also be considered. In this paper, we present experiments which show that widely used deep-learning frameworks are vulnerable to programming mistakes and errors. We first focus on memory-related programming errors caused by applications using deep-learning frameworks that facilitate high-performance inferencing. We next find that a reset to recover from any fault imposes significant time penalties in reloading a pre-trained deep neural network model. To reduce these fault recovery times, we propose fault recovery mechanisms that checkpoint and resume the network based on the inference stage when an error is detected. Finally, we substantiate the practical feasibility of our approach and evaluate the improvement in recovery times. (A demo video clip demonstrating our recovery algorithm has been uploaded to YouTube: https://www.youtube.com/watch?v=xwUYdJdA5oM) We use a case-study with real-world applications on an Nvidia GeForce GTX 1070 GPU and an Nvidia Xavier embedded platform, which is commonly used by multiple automotive OEMs.
Year
2020
DOI
10.1109/RTCSA50079.2020.9203738
Venue
2020 IEEE 26th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)
Keywords
fault detection, fault recovery, deep learning framework, checkpoint, hardware accelerators
DocType
Conference
ISBN
978-1-7281-4403-0
Citations
1
PageRank
0.36
References
0
Authors
6
Name                        Order  Citations  PageRank
Iljoo Baek                  1      1          1.38
Zhihao Zhu                  2      3          5.13
Sourav Panda                3      1          0.36
Nandha Kishore Srinivasan   4      1          0.36
Soheil Samii                5      1          0.36
Ragunathan Raj Rajkumar     6      2          2.06