Title
Proactive process-level live migration and back migration in HPC environments
Abstract
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the higher the benefit of back migration.
Year: 2012
DOI: 10.1016/j.jpdc.2011.10.009
Venue: J. Parallel Distrib. Comput.
Keywords: process migration, migration approach, continued execution, process level, mpi execution environment, hpc environment, migration mechanism, proactive process-level live migration, reactive ft, proactive ft, live process migration, outstanding execution, operating system, fault tolerant, high performance computing, fault tolerance
Field: Virtualization, Resubmission, Supercomputer, Live migration, Computer science, Process migration, Parallel computing, Fault tolerance, Technical report, Operating system, Distributed computing
DocType: Journal
Volume: 72
Issue: 2
ISSN: 0743-7315
Citations: 19
PageRank: 0.71
References: 51
Authors: 4
Name                 Order  Citations  PageRank
Chao Wang            1      404        27.12
Frank Mueller        2      3497       219.77
Christian Engelmann  3      953        60.46
Stephen L. Scott     4      748        70.99