Title
Making Convolutions Resilient Via Algorithm-Based Error Detection Techniques
Abstract
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and high-performance computing systems. As such systems require high levels of resilience to errors, CNNs must execute correctly in the presence of hardware faults. Full duplication provides the needed assurance but incurs a prohibitive 100 percent overhead. In this article, we focus on algorithmically verifying convolutions, the most resource-demanding operations in CNNs. We use checksums to verify convolutions. We identify the feasibility- and performance-related challenges that arise in algorithmically detecting errors in convolutions in optimized CNN inference deployment platforms (e.g., TensorFlow or TensorRT on GPUs) that fuse multiple network layers and use reduced-precision operations, and demonstrate how to overcome them. We propose and evaluate variations of the algorithm-based error detection (ABED) techniques that offer implementation complexity, runtime overhead, and coverage trade-offs. Results show that ABED can detect all transient hardware errors that might otherwise corrupt output with low runtime overheads (6-23 percent). ABED leaves only about 1.4 percent of the total computations in a CNN unprotected, and these can be duplicated for full CNN protection. ABED for the compute-intensive convolutions and duplicating the rest can offer at least 1.6× the throughput of full duplication.
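The abstract does not spell out how the checksums are formed. As a minimal sketch, assuming a filter-checksum formulation (an assumption, not taken from the text): convolution is linear in the filters, so convolving the input with the element-wise sum of all filters should equal the sum of all output channels. The helper names `conv2d` and `checked_conv2d`, the tensor shapes, and the `fault` injection parameter are hypothetical and exist only for illustration. Integer arithmetic is used so the comparison is exact, loosely mirroring the reduced-precision setting the abstract mentions.

```python
import numpy as np

def conv2d(x, w):
    """Direct 'valid' 2D cross-correlation with stride 1.

    x: (C, H, W) input; w: (K, C, R, S) filters.
    Returns a (K, H-R+1, W-S+1) output in int64.
    """
    C, H, W = x.shape
    K, _, R, S = w.shape
    out = np.zeros((K, H - R + 1, W - S + 1), dtype=np.int64)
    for k in range(K):
        for i in range(H - R + 1):
            for j in range(W - S + 1):
                out[k, i, j] = np.sum(x[:, i:i + R, j:j + S] * w[k])
    return out

def checked_conv2d(x, w, fault=None):
    """Convolve, then verify the result with a filter checksum.

    fault: optional (k, i, j, delta) tuple that corrupts one output
    element, standing in for a transient hardware error (hypothetical).
    Returns (output, ok) where ok is False if the checksum mismatches.
    """
    x = x.astype(np.int64)
    w = w.astype(np.int64)
    out = conv2d(x, w)
    if fault is not None:
        k, i, j, delta = fault
        out[k, i, j] += delta

    # Checksum filter: element-wise sum of all K filters (can be precomputed).
    w_checksum = w.sum(axis=0, keepdims=True)   # shape (1, C, R, S)
    expected = conv2d(x, w_checksum)[0]          # one extra output channel
    observed = out.sum(axis=0)                   # sum over output channels
    return out, bool(np.array_equal(expected, observed))

rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=(3, 6, 6))     # int8-range activations
w = rng.integers(-128, 128, size=(4, 3, 3, 3))  # int8-range weights
_, clean_ok = checked_conv2d(x, w)
_, faulty_ok = checked_conv2d(x, w, fault=(2, 1, 1, 1 << 10))
print(clean_ok, faulty_ok)  # True False
```

In this sketch the check costs one extra output channel plus a reduction over the outputs, which is small next to the K channels of the original convolution. With floating-point data the exact equality test would have to become a tolerance check.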
Year
2022
DOI
10.1109/TDSC.2021.3063083
Venue
IEEE Transactions on Dependable and Secure Computing
Keywords
Resilience, hardware error detection, convolutional neural networks
DocType
Journal
Volume
19
Issue
4
ISSN
1545-5971
Citations
1
PageRank
0.35
References
9
Authors
4
Name                Order  Citations  PageRank
S. K. S. Hari       1      384        20.20
Michael Sullivan    2      313        18.05
Timothy K. Tsai     3      647        56.27
Stephen W. Keckler  4      3404       201.71