Title
Software Schemes of Reconfiguration and Recovery in Distributed Memory Multicomputers Using the Actor Model
Abstract
Ideally, a multicomputer system should cope with a processor failure by reconstructing itself, and the application running on it, so as to maintain the available computational power of the remaining processors. We discuss the continuance of running applications through permanent processor failures. We take advantage of the characteristics of the actor model of parallel computation and dynamically checkpoint the activity of the application. Consequently, the runtime system is able to continue an application through multiple nonconcurrent processor failures. We have implemented our techniques through modifications of the runtime system of the parallel language Charm on an Intel iPSC/s hypercube. After discussing the theory and implementation, we give measurements of the overhead due to fault tolerance for a number of applications and demonstrate continuance of the applications after injection of one or more faults.
Year: 1995
DOI: 10.1109/FTCS.1995.466950
Venue: FTCS
Keywords: actor model, multicomputer system, permanent processor failure, parallel computation, remaining processor, parallel language charm, intel ipsc, processor failure, multiple nonconcurrent processor failure, runtime system, software schemes, reconfiguration, distributed computing, reliability, computational modeling, concurrent computing, recovery, parallel processing, fault tolerant, fault tolerance, parallel computer, overhead, software maintenance
Field: Parallel language, Computer science, Distributed memory, Fault tolerance, Actor model, Control reconfiguration, Intel iPSC, Fault injection, Runtime system, Distributed computing
DocType: Conference
ISSN: 0731-3071
Citations: 3
PageRank: 0.67
References: 15
Authors: 2
Name          Order  Citations  PageRank
M. Peercy     1      8          1.79
Banerjee, P.  2      66         9.21