Abstract | ||
---|---|---|
We have been developing Pwrake, a lightweight workflow system that executes data-intensive many-task workflows using the high-performance parallel I/O of the Gfarm file system. This paper discusses the design of the fault-tolerance mechanism implemented in Pwrake. To avoid aborting a workflow when a worker node fails, Pwrake detects the node failure from the result of a task retry. To avoid losing files when a worker node fails, we use the automatic file replication of the Gfarm file system. To resume an interrupted workflow correctly, we introduce a Pwrake option that renames or removes the output file of a failed task. In our experiments, we confirmed that Gfarm automatic file replication adds less than 10% overhead to workflow execution time, and that the workflow continues and returns correct results even after an artificial failure occurs in a worker node. |
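
The two workflow-side mechanisms named in the abstract (failure detection through a task retry, and renaming or removing the output of a failed task) can be illustrated with a minimal Python sketch. This is not Pwrake code and does not reflect its actual implementation. The function names `run_with_retry` and `quarantine_output`, the retry-once policy, the rule that a successful retry on another node marks the first node as failed, and the `.failed` suffix are all assumptions made for illustration.

```python
import os
import tempfile
from typing import Callable, List, Set


def run_with_retry(task: Callable[[str], bool],
                   workers: List[str],
                   failed_nodes: Set[str]) -> bool:
    """Run `task` on a healthy worker and retry once on another worker.

    Assumed policy (hypothetical, for illustration only):
    - if the retry succeeds on a different worker, the first worker is
      considered failed and is excluded from later scheduling;
    - if the retry also fails, the task itself is considered failed and
      no worker is blamed.
    """
    healthy = [w for w in workers if w not in failed_nodes]
    if not healthy:
        raise RuntimeError("no healthy workers left")

    first = healthy[0]
    if task(first):
        return True

    if len(healthy) < 2:
        # No other worker to retry on, so the cause cannot be narrowed down.
        return False

    second = healthy[1]
    if task(second):
        # The same task succeeded elsewhere: blame the first node.
        failed_nodes.add(first)
        return True
    return False


def quarantine_output(path: str, remove: bool = False) -> None:
    """Rename or remove the output file of a failed task.

    A partially written output could otherwise look up to date when the
    workflow is resumed, so the task would be skipped incorrectly.
    """
    if not os.path.exists(path):
        return
    if remove:
        os.remove(path)
    else:
        # Hypothetical naming convention: keep the file for inspection.
        os.replace(path, path + ".failed")
```

In this sketch a caller would invoke `quarantine_output` for every output file of a task for which `run_with_retry` returned `False`, then rerun the workflow so that the missing output is rebuilt.
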
Year | DOI | Venue |
---|---|---|
2016 | 10.1109/MTAGS.2016.7 | MTAGS@SC |
Keywords | DocType | ISBN |
Scientific Workflow System, Fault Tolerance, Distributed File System, Many-Task Computing | Conference | 978-1-5090-5213-4 |
Citations | PageRank | References |
0 | 0.34 | 0 |
Authors | ||
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Masahiro Tanaka | 1 | 0 | 0.34 |
Osamu Tatebe | 2 | 309 | 42.94 |