Title
Architectural Support for Fault Tolerance in a Teradevice Dataflow System.
Abstract
The high parallelism of future Teradevices, which are going to contain more than 1,000 complex cores on a single die, requires new execution paradigms. Coarse-grained dataflow execution models are able to exploit such parallelism, since they combine side-effect-free execution and reduced synchronization overhead. However, the terascale transistor integration of such future chips makes them orders of magnitude more vulnerable to voltage fluctuation, radiation, and process variations. This means dynamic fault-tolerance mechanisms have to be an essential part of such future systems. In this paper, we present a fault-tolerant architecture for a coarse-grained dataflow system, leveraging the inherent features of the dataflow execution model. In detail, we provide methods to dynamically detect and manage permanent, intermittent, and transient faults during runtime. Furthermore, we exploit the dataflow execution model for a thread-level recovery scheme. Our results show that redundant execution of dataflow threads can efficiently make use of underutilized resources in a multi-core system, while the overhead in a fully utilized system stays reasonable. Moreover, thread-level recovery incurs moderate overhead, even in the case of high fault rates.
Year: 2016
DOI: 10.1007/s10766-014-0312-y
Venue: International Journal of Parallel Programming
Keywords: Coarse-grained dataflow, Fault tolerance, Fault detection, Recovery, Reliability
Field: Architectural support, Fault detection and isolation, Computer science, Parallel computing, Dataflow, Fault tolerance
DocType: Journal
Volume: 44
Issue: 2
ISSN: 1573-7640
Citations: 10
PageRank: 0.55
References: 40
Authors
6
Name               Order  Citations  PageRank
Sebastian Weis     1      69         7.15
Arne Garbade       2      64         5.34
Bernhard Fechner   3      78         12.18
Avi Mendelson      4      517        55.88
R. Giorgi          5      123        16.60
Theo Ungerer       6      1262       136.24