Title
Crash Management for Distributed Parallel Systems
Abstract
As parallel architectures grow more complex, the probability of system failures increases as well. One approach to coping with this problem is self-healing, one of the self-x properties of organic computing. In this context, self-healing means that computer clusters should detect and handle failures automatically. This paper presents a checkpointing-based self-healing mechanism that keeps a cluster operational even if some sites, or the connections between them, fail. The proposed method has been implemented and tested on the Self Distributing Virtual Machine (SDVM).
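The abstract gives no implementation details, so the following is only a minimal Python sketch of the general idea of checkpoint-based self-healing, not the SDVM mechanism described in the paper. All names (`Task`, `Site`, `Cluster`, `checkpoint_interval`, `heartbeat_timeout`) are hypothetical, and crash detection by heartbeat timeout and reassignment to the least-loaded surviving site are assumptions chosen for illustration.

```python
# Minimal sketch of checkpoint-based self-healing in a simulated cluster.
# ASSUMPTIONS (not from the SDVM paper): tasks are step counters, the checkpoint
# store is a dict standing in for copies kept on other sites, crashes are
# detected by a heartbeat timeout, and orphaned tasks go to the least-loaded
# surviving site. All names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    target: int        # total work steps the task needs
    progress: int = 0  # work steps completed so far


@dataclass
class Site:
    name: str
    tasks: list = field(default_factory=list)
    crashed: bool = False    # simulated ground truth; the detector never reads it
    last_heartbeat: int = 0  # time of the last heartbeat this site sent


class Cluster:
    def __init__(self, sites, checkpoint_interval=2, heartbeat_timeout=2):
        self.sites = {s.name: s for s in sites}
        self.checkpoint_interval = checkpoint_interval
        self.heartbeat_timeout = heartbeat_timeout
        self.checkpoints = {}  # task name -> progress at the last checkpoint
        self.failed = set()    # sites already declared crashed

    def tick(self, now):
        """Run one step of work on every live site, then check for failures."""
        for site in self.sites.values():
            if site.crashed:
                continue  # a crashed site does no work and sends no heartbeat
            site.last_heartbeat = now
            for task in site.tasks:
                if task.progress < task.target:
                    task.progress += 1
                    if task.progress % self.checkpoint_interval == 0:
                        self.checkpoints[task.name] = task.progress
        self._heal(now)

    def _responsive(self, site, now):
        return now - site.last_heartbeat <= self.heartbeat_timeout

    def _heal(self, now):
        """Declare silent sites crashed and restart their tasks elsewhere."""
        survivors = [s for s in self.sites.values()
                     if s.name not in self.failed and self._responsive(s, now)]
        for site in self.sites.values():
            if site.name in self.failed or self._responsive(site, now):
                continue
            self.failed.add(site.name)
            if not survivors:
                raise RuntimeError("no surviving site to take over the tasks")
            for task in site.tasks:
                # Roll back to the last checkpoint (0 if none was taken yet).
                restored = Task(task.name, task.target,
                                self.checkpoints.get(task.name, 0))
                min(survivors, key=lambda s: len(s.tasks)).tasks.append(restored)
            site.tasks = []

    def finished(self):
        return all(t.progress >= t.target
                   for s in self.sites.values() if s.name not in self.failed
                   for t in s.tasks)


# Usage: site A crashes at time 4; its task t1 resumes on C from checkpoint 2.
a = Site("A", [Task("t1", 6)])
b = Site("B", [Task("t2", 6)])
c = Site("C")
cluster = Cluster([a, b, c])
now = 0
while not cluster.finished():
    now += 1
    if now == 4:
        a.crashed = True
    cluster.tick(now)
print(now, sorted(cluster.failed), [(t.name, t.progress) for t in c.tasks])
```

Taking a checkpoint every `checkpoint_interval` steps bounds the work lost per crashed task to at most `checkpoint_interval - 1` steps; in this run, task t1 rolls back from step 3 to step 2. A shorter interval loses less work but costs more checkpoint traffic.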
Year: 2004
Venue: GI-Jahrestagung
Keywords: parallel systems
Field: Crash, Computer science, Parallel computing, Bulk synchronous parallel, Distributed computing
DocType: Conference
Citations: 1
PageRank: 0.39
References: 4
Authors: 2
Name            Order  Citations  PageRank
Jan Haase       1      16         6.08
Frank Eschmann  2      20         3.56