Title
A Holistic Solution for Reliability of 3D Parallel Systems
Abstract
Monolithic 3D technology is emerging as a promising solution that can bring massive opportunities, but the gains can be hindered due to the reliability issues exaggerated by high temperature. Conventional reliability solutions focus on one specific feature and assume that the other required features would be provided by different solutions. Hence, this assumption has resulted in solutions that are proposed in isolation of each other and fail to consider the overall compatibility and the implied overheads of multiple isolated solutions for one system. This article proposes a holistic reliability management engine, R2D3, for post-Moore’s M3D parallel systems that have low yield and high failure rate. The proposed engine, comprising a controller, reconfigurable crossbars, and detection circuitry, provides concurrent single-replay detection and diagnosis, fault-mitigating repair, and aging-aware lifetime management at runtime. This holistic view enables us to create a solution that is highly effective while achieving a low overhead. Our solution achieves 96% coverage of defect; reduces Vth degradation by 53%, leading to a 78% performance improvement on average over 8 years for an eight-core system; and ultimately yields a 2.16× longer mean-time-to-failure (MTTF) while incurring an overhead of 7.4% in area, 6.5% in power, and an 8.2% decrease in frequency.
Year: 2022
DOI: 10.1145/3488900
Venue: ACM Journal on Emerging Technologies in Computing Systems
DocType: Journal
Volume: 18
Issue: 1
ISSN: 1550-4832
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name | Order | Citations | PageRank
Javad Bagherzadeh | 1 | 0 | 0.34
Aporva Amarnath | 2 | 39 | 5.18
Jielun Tan | 3 | 3 | 2.41
Subhankar Pal | 4 | 0 | 0.34
Ronald G. Dreslinski | 5 | 0 | 0.68