Title
Lineage stash: fault tolerance off the critical path
Abstract
As cluster computing frameworks such as Spark, Dryad, Flink, and Ray are being deployed in mission-critical applications and on larger and larger clusters, their ability to tolerate failures is growing in importance. These frameworks employ two broad approaches for fault tolerance: checkpointing and lineage. Checkpointing exhibits low overhead during normal operation but high overhead during recovery, while lineage-based solutions make the opposite tradeoff. We propose the lineage stash, a decentralized causal logging technique that significantly reduces the runtime overhead of lineage-based approaches without impacting recovery efficiency. With the lineage stash, instead of recording the task's information before the task is executed, we record it asynchronously and forward the lineage along with the task. This makes it possible to support large-scale, low-latency (millisecond-level) data processing applications with low runtime and recovery overheads. Experimental results for applications in distributed training and stream processing show that the lineage stash provides task execution latencies similar to checkpointing alone, while incurring a recovery overhead as low as traditional lineage-based approaches.
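The mechanism the abstract describes (record lineage asynchronously and forward it along with the task) can be illustrated with a small sketch. This is not the authors' implementation: the names `TaskRecord`, `LineageStore`, `Node`, and `demo` are hypothetical, the "asynchronous" flush is modelled as an explicit call, and failure handling is omitted. It only shows that submitting a task does not wait on a write to the durable log, while the receiver still ends up holding the lineage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    """Lineage for one task: enough information to re-execute it."""
    task_id: str
    function: str
    input_ids: tuple = ()


class LineageStore:
    """Hypothetical stand-in for a durable, global lineage log."""

    def __init__(self):
        self.records = {}

    def write(self, record):
        self.records[record.task_id] = record


class Node:
    """Hypothetical worker that keeps a local stash of unflushed lineage."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.stash = {}     # task_id -> TaskRecord not yet written to the store
        self.executed = []  # ids of tasks run on this node, in order

    def submit(self, record, receiver):
        # Record the lineage locally instead of blocking on a store write.
        self.stash[record.task_id] = record
        # Forward the task together with all lineage that is not yet durable.
        receiver.execute(record, dict(self.stash))

    def execute(self, record, forwarded_lineage):
        # Keep the forwarded lineage so it survives a failure of the sender.
        self.stash.update(forwarded_lineage)
        self.executed.append(record.task_id)

    def flush(self):
        # A real system would do this in the background; here it is explicit.
        for record in self.stash.values():
            self.store.write(record)
        self.stash.clear()


def demo():
    store = LineageStore()
    a, b = Node("a", store), Node("b", store)
    a.submit(TaskRecord("t1", "load"), b)
    a.submit(TaskRecord("t2", "train", input_ids=("t1",)), b)
    durable_before_flush = len(store.records)  # nothing written on submit
    b.flush()                                  # b persists what a forwarded
    return durable_before_flush, sorted(store.records), b.executed
```

In this sketch, if node `a` were lost right after submitting, node `b` would still hold the records for `t1` and `t2` and could write them to the store, which is the property that lets the write be taken off the task's critical path.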
Year: 2019
DOI: 10.1145/3341301.3359653
Venue: Proceedings of the 27th ACM Symposium on Operating Systems Principles
Field: Computer science, Fault tolerance, Critical path method, Distributed computing
DocType: Conference
ISBN: 978-1-4503-6873-5
Citations: 2
PageRank: 0.36
References: 0
Authors: 7
Name              Order  Citations  PageRank
Stephanie Wang    1      13         3.93
John Liagouris    2      72         9.04
Robert Nishihara  3      88         5.84
Philipp Moritz    4      677        27.91
Ujval Misra       5      3          1.04
Alexey Tumanov    6      554        24.61
I. Stoica         7      21406      1710.11