Title
Design of Fault Tolerant Pwrake Workflow System Supported by Gfarm File System.
Abstract
We have been developing a lightweight workflow system called Pwrake to execute data-intensive many-task workflows using the high-performance parallel I/O of the Gfarm file system. This paper discusses the design of the fault tolerance mechanism implemented in Pwrake. To avoid aborting a workflow when a worker node fails, Pwrake detects a node failure based on the result of a task retry. To avoid losing files when a worker node fails, we make use of the automatic file replication of the Gfarm file system. To resume an interrupted workflow correctly, we introduce a Pwrake option to rename or remove the output file of a failed task. In the experiment, we confirmed that the overhead of Gfarm automatic file replication in workflow execution time is less than 10%, and that the workflow continues and returns correct results even after an artificial failure is induced in a worker node.
Year
2016
DOI
10.1109/MTAGS.2016.7
Venue
MTAGS@SC
Keywords
Scientific Workflow System, Fault Tolerance, Distributed File System, Many-Task Computing
DocType
Conference
ISBN
978-1-5090-5213-4
Citations
0
PageRank
0.34
References
0
Authors
2
Name, Order, Citations, PageRank
Masahiro Tanaka, 1, 0, 0.34
Osamu Tatebe, 2, 309, 42.94