Title
Maestro: orchestrating lifetime reliability in chip multiprocessors
Abstract
As CMOS feature sizes venture deep into the nanometer regime, wearout mechanisms including negative-bias temperature instability and time-dependent dielectric breakdown can severely reduce processor operating lifetimes and performance. This paper presents an introspective reliability management system, Maestro, to tackle reliability challenges in future chip multiprocessors (CMPs) head-on. Unlike traditional approaches, Maestro relies on low-level sensors to monitor the CMP as it ages (introspection). Leveraging this real-time assessment of CMP health, runtime heuristics identify wearout-centric job assignments (management). By exploiting the complementary effects of the natural heterogeneity (due to process variation and wearout) that exists in CMPs and the diversity found in system workloads, Maestro composes job schedules that intelligently control the aging process. Monte Carlo experiments show that Maestro significantly enhances lifetime reliability through intelligent wear-leveling, increasing the expected service life of a population of 16-core CMPs by as much as 38% compared to a naive, round-robin scheduler. Furthermore, in the presence of process variation, Maestro's wearout-centric scheduling outperformed both performance counter and temperature sensor based schedulers, achieving an order of magnitude more improvement in lifetime throughput – the amount of useful work done by a system prior to failure.
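The abstract does not spell out Maestro's heuristics, so the following is only a toy sketch of the general idea of wearout-centric job assignment, contrasted with a round-robin baseline. It is not the authors' algorithm. The names `core_health`, `jobs`, the per-job stress numbers, and the linear wear model are all hypothetical and chosen purely for illustration. The sketch assumes each core exposes a single sensor-derived health margin and pairs the most stressful job with the healthiest core.

```python
def round_robin_assign(job_names, num_cores, offset=0):
    """Baseline: job i goes to core (i + offset) mod num_cores, ignoring health."""
    return {(i + offset) % num_cores: name for i, name in enumerate(job_names)}


def wearout_aware_assign(jobs, core_health):
    """Greedy pairing: most stressful job onto the healthiest core.

    jobs        -- dict mapping job name to a hypothetical stress value
    core_health -- list of remaining health margin per core (hypothetical
                   stand-in for low-level wearout sensor readings)
    """
    cores_by_health = sorted(
        range(len(core_health)), key=lambda c: core_health[c], reverse=True
    )
    jobs_by_stress = sorted(jobs, key=jobs.get, reverse=True)
    return dict(zip(cores_by_health, jobs_by_stress))


def run_epoch(core_health, assignment, jobs):
    """Apply one scheduling epoch with an assumed linear wear model:
    each core loses health equal to the stress of the job it ran."""
    return [
        h - jobs[assignment[c]] if c in assignment else h
        for c, h in enumerate(core_health)
    ]


# Hypothetical 4-core example: core 1 is already badly worn.
core_health = [0.9, 0.4, 0.7, 0.6]
jobs = {"fp_heavy": 0.05, "mem_bound": 0.01, "int_mix": 0.03, "idle_loop": 0.002}

aware = wearout_aware_assign(jobs, core_health)
naive = round_robin_assign(list(jobs), len(core_health))

health_aware = run_epoch(core_health, aware, jobs)
health_naive = run_epoch(core_health, naive, jobs)
```

In this toy setting the weakest core receives the lightest job under the wearout-aware policy, so the minimum health across cores after one epoch is higher than under round-robin. A real system would have to account for process variation, multiple wearout mechanisms, and performance constraints, none of which are modeled here.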
Year
2010
DOI
10.1007/978-3-642-11515-8_15
Venue
HiPEAC
Keywords
maestro composes job schedule,cmp health,negative-bias temperature instability,lifetime throughput,process variation,lifetime reliability,reliability challenge,chip multiprocessors,system workloads,16-core cmps,introspective reliability management system,service life,job scheduling,management system,real time,intelligent control
Field
Population,Scheduling (computing),Computer science,Parallel computing,Chip,Real-time computing,CMOS,Schedule,Heuristics,Process variation,Throughput,Embedded system
DocType
Conference
Volume
5952
ISSN
0302-9743
ISBN
3-642-11514-4
Citations
21
PageRank
0.80
References
17
Authors
4
Name            Order  Citations  PageRank
Shuguang Feng   1      306        12.96
Shantanu Gupta  2      390        16.39
Amin Ansari     3      361        15.88
Scott Mahlke    4      4811       312.08