Title
System-Level Fault-Tolerance in Large-Scale Parallel Machines with Buffered Coscheduling
Abstract
As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed petaflops systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component count will inevitably lead to frequent failures. It is therefore of paramount importance to develop new software solutions to deal with the unavoidable reality of hardware faults. In this paper we will first describe the nature of the failures of current large-scale machines, and extrapolate these results to future machines. Based on this preliminary analysis we will present a new technology that we are currently developing, buffered coscheduling, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency (requiring no changes to user applications). Preliminary results show that this is attainable with current hardware.
Year
2004
DOI
10.1109/IPDPS.2004.1303239
Venue
IPDPS
Keywords
operating systems, fault-tolerance, checkpointing, large-scale parallel computers, communication protocols, failure characterization, computer aided manufacturing, resource allocation, mean time between failure, communication protocol, application software, concurrent computing, hardware, parallel computer, fault tolerant, operating system
Field
Mean time between failures, Computer-aided manufacturing, Coscheduling, Computer science, Parallel computing, Fault tolerance, Resource allocation, Concurrent computing, Application software, Scalability, Distributed computing
DocType
Conference
Citations
10
PageRank
0.75
References
10
Authors
3
Name                Order  Citations  PageRank
Fabrizio Petrini    1      2050       165.82
Kei Davis           2      10         0.75
José Carlos Sancho  3      382        29.97