Title
Migol: A fault-tolerant service framework for MPI applications in the grid
Abstract
For the sciences in particular, massively parallel CPU capacity is one of the most attractive features of a grid. A major challenge in a distributed, inherently dynamic grid is fault tolerance: the more resources and components are involved, the more complicated and error-prone the system becomes. In a grid with potentially thousands of interconnected machines, the reliability of individual resources cannot be guaranteed. The benefit of the grid is that, in case of a failure, an application can be migrated to another site and restarted there from a checkpoint file. This approach requires a service infrastructure that handles the necessary activities transparently. In this article, we present Migol, a fault-tolerant and self-healing grid middleware for MPI applications. Migol is based on open standards and extends the services of the Globus Toolkit to support the fault tolerance of grid applications. Furthermore, the Migol framework itself is designed with a special focus on fault tolerance: for example, it replicates critical services and uses a ring-based replication protocol to achieve data consistency.
Year
2008
DOI
10.1016/j.future.2007.03.007
Venue
PVM/MPI
Keywords
migration, fault tolerant, mpi, fault tolerance, grid computing
DocType
Journal
Volume
24
Issue
2
ISSN
Future Generation Computer Systems
ISBN
3-540-29009-5
Citations
22
PageRank
1.24
References
27
Authors (2)
Name             Order   Citations   PageRank
André Luckow     1       84          10.58
Bettina Schnor   2       142         26.36