Title
A fault tolerance infrastructure for dependable computing with high-performance COTS components
Abstract
The failure rates of current COTS processors have dropped to 100 FITs (failures per 10^9 hours), indicating a potential MTTF of over 1100 years. However, our recent study of Intel P6 family processors has shown that they have very limited error detection and recovery capabilities and contain numerous design faults (“errata”). Other limitations are susceptibility to transient faults and uncertainty about “wearout” that could increase the failure rate over time. Because of these limitations, an external fault tolerance infrastructure is needed to assure the dependability of a system with such COTS components. The paper describes a fault-tolerant “infrastructure” system of fault tolerance functions that makes possible the use of low-coverage COTS processors in a fault-tolerant, self-repairing system. The custom hardware supports transient recovery, design fault tolerance, and self-repair by sparing and replacement. Fault tolerance functions are implemented by four types of hardware processors of low complexity that are fault-tolerant. High error detection coverage, including design faults, is attained by diversity and replication.
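The MTTF figure quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only and is not taken from the paper: it assumes a constant failure rate (so MTTF is the reciprocal of the rate) and a year of 8760 hours, and the function name `fit_to_mttf_years` is hypothetical.

```python
# Assumption: a non-leap year of 24 * 365 = 8760 hours.
HOURS_PER_YEAR = 24 * 365


def fit_to_mttf_years(fit: float) -> float:
    """Convert a FIT rate (failures per 1e9 device-hours) to MTTF in years.

    Assumes a constant failure rate, so MTTF = 1 / rate.
    """
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    mttf_hours = 1e9 / fit  # 100 FITs -> 1e7 hours
    return mttf_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    print(f"MTTF at 100 FITs: {fit_to_mttf_years(100):.1f} years")
```

Under these assumptions, 100 FITs corresponds to 10^7 hours, or roughly 1141.6 years, which agrees with the abstract's "over 1100 years".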
Year
2000
DOI
10.1109/ICDSN.2000.857581
Venue
DSN
Keywords
design fault, failure rates, error detection, intel p6 family processors, fault tolerant computing, uncertainty, recovery capabilities, design fault tolerance, low-coverage cots processor, error detection coverage, high-performance cots components, failure rate, replication, computational complexity, fault tolerance infrastructure, transient fault, numerous design fault, dependable computing, external fault tolerance infrastructure, transient faults, fault tolerance function, self-repairing system, cots component, current cots processor, fault detection, logic design, semiconductor devices, fault tolerance, software design, error correction, hardware
Field
Logic synthesis, Mean time between failures, Dependability, Software design, Computer science, Fault detection and isolation, Software fault tolerance, Failure rate, Real-time computing, Fault tolerance, Reliability engineering, Embedded system
DocType
Conference
ISBN
0-7695-0707-7
Citations
14
PageRank
1.24
References
6
Authors
1
Name
Algirdas Avizienis
Order
1
Citations
3116
PageRank
351.14