Title
SuDoku: Tolerating High-Rate of Transient Failures for Enabling Scalable STTRAM
Abstract
Conventionally, systems have relied on technology scaling to provide smaller cells, which helps in increasing the capacity of on-chip and off-chip structures. Unfortunately, scaling technology to smaller nodes causes increased susceptibility to faults. We study the problem of efficiently tolerating transient failures using scalable Spin-Transfer Torque RAM (STTRAM) as an example. At smaller feature sizes, the energy required to flip a STTRAM cell reduces, which makes these cells more susceptible to random failures caused by thermal noise. Such failures can be tolerated by periodic scrubbing and provisioning each line with Error Correction Code (ECC). However, to tolerate the desired bit-error-rate, the cache needs ECC-6 (six bit error correction) per line, incurring impractical storage overheads. Ideally, we want to tolerate these faults without relying on multi-bit ECC. We propose SuDoku, a design that provisions each line with ECC-1 and a strong error detection code, and relies on a region-based RAID-4 to perform correction of multi-bit errors. Unfortunately, simply having such a RAID-4 based architecture is ineffective at tolerating a high-rate of transient faults and provides an MTTF in the order of only a few seconds. We describe a novel data resurrection scheme that can repair multiple faulty lines in a RAID-4 region to increase the MTTF to several hours. We propose an extension of SuDoku, which hashes a given line into two regions of RAID-4 to significantly enhance reliability and increase the MTTF to trillions of hours. Our evaluations show that SuDoku provides 874x higher reliability than ECC-6, incurs 30% less storage than ECC-6, and performs within 0.1% of an ideal fault-free baseline.
Year
DOI
Venue
2019
10.1109/DSN.2019.00048
2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Keywords
DocType
ISSN
STTRAM,Reliability,Transient-Failure
Conference
1530-0889
ISBN
Citations 
PageRank 
978-1-7281-0058-6
0
0.34
References 
Authors
35
3
Name
Order
Citations
PageRank
Prashant J. Nair134615.74
Bahar Asgari2276.61
Moinuddin K. Qureshi32639110.61