Title
An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors
Abstract
This paper focuses on the problem of fault tolerance in shared memory multiprocessors, and describes an architecture designed for transparently tolerating processor failures. The Recoverable Shared Memory (RSM) is the novel component of this architecture, providing a hardware supported backward error recovery mechanism which minimizes the propagation of recovery when a processor fails. The RSM permits a shared memory multiprocessor to be constructed using standard caches and cache coherence protocols, and does not require any changes to be made to applications software. The performance of the recovery scheme supported by the RSM is evaluated and compared with other schemes that have been proposed for fault tolerant shared memory multiprocessors. The performance study has been conducted by simulation using address traces collected from real parallel applications.
Year
1996
DOI
10.1109/12.543705
Venue
IEEE Trans. Computers
Keywords
shared memory multiprocessors, recovery scheme, error recovery mechanism, applications software, tolerating processor failure, tolerating processor failures, shared-memory multiprocessors, fault tolerant, performance study, shared memory multiprocessor, fault tolerance, memory multiprocessors, operating systems, computer architecture, application software, protocols, performance, shared memory, hardware, computational modeling, simulation
Field
Uniform memory access, Shared memory, Computer science, Parallel computing, Cache-only memory architecture, Distributed memory, Real-time computing, Multiprocessing, Fault tolerance, Systems architecture, Embedded system, Cache coherence
DocType
Journal
Volume
45
Issue
10
ISSN
0018-9340
Citations
17
PageRank
1.65
References
25
Authors
5
Name               Order   Citations   PageRank
Michel Banâtre     1       305         63.45
Alain Gefflaut     2       176         24.33
Philippe Joubert   3       17          2.33
Christine Morin    4       226         26.78
Peter A. Lee       5       74          13.70