Title
Dealing with Transient Faults in the Interconnection Network of CMPs at the Cache Coherence Level
Abstract
The importance of transient faults is predicted to grow as current technology trends increase the scale of integration. One of the components that will be significantly affected by transient faults is the interconnection network of chip multiprocessors (CMPs). To deal with these faults efficiently, and unlike other authors, we propose fault-tolerant cache coherence protocols that ensure the correct execution of programs even when not all messages are correctly delivered. We describe the extensions made to a directory-based cache coherence protocol to provide fault tolerance, and we provide a modified set of token counting rules that are useful for designing fault-tolerant token-based cache coherence protocols. We compare the directory-based fault-tolerant protocol with a token-based fault-tolerant one. We also show how to adjust the fault tolerance parameters to achieve the desired level of fault tolerance, and we measure the overhead incurred to support very high fault rates. Simulation results using a set of scientific, multimedia, and commercial applications show that the fault tolerance measures have virtually no impact on execution time with respect to a non-fault-tolerant protocol. Additionally, our protocols can support very high rates of transient faults at the cost of slightly increased network traffic.
Year: 2010
DOI: 10.1109/TPDS.2009.148
Venue: IEEE Trans. Parallel Distrib. Syst.
Keywords: token-based fault-tolerant, directory-based cache coherence protocol, fault-tolerant token-based cache coherence, fault-tolerant cache coherence protocol, directory-based fault-tolerant protocol, cache coherence level, fault tolerance parameter, transient fault, transient faults, fault tolerance measure, interconnection network, high fault rate, fault tolerance, cache coherence, electromagnetic radiation, fault tolerant, protocols, time measurement, electromagnetic interference
Field: Directory, Computer science, Electromagnetic interference, Software fault tolerance, Chip, Real-time computing, Fault tolerance, Interconnection, Security token, Distributed computing, Cache coherence
DocType: Journal
Volume: 21
Issue: 8
ISSN: 1045-9219
Citations: 1
PageRank: 0.36
References: 21
Authors: 4
Name                Order  Citations  PageRank
Ricardo Fernández   1      27         5.24
Jose M. Garcia      2      145        13.30
M. E. Acacio        3      419        41.45
Jose Duato          4      893        54.65