Title
Pado: A Data Processing Engine for Harnessing Transient Resources in Datacenters.
Abstract
Datacenters are under-utilized, primarily due to unused resources on over-provisioned nodes of latency-critical jobs. Such idle resources can be used to run batch data analytic jobs to increase datacenter utilization, but these transient resources must be evicted whenever latency-critical jobs require them again. Resource evictions often lead to cascading recomputations, which are usually handled by checkpointing intermediate results on stable storage of eviction-free reserved resources. However, checkpointing incurs substantial overhead from transferring data back and forth. In this work, we step away from such approaches and focus on observing the job structure and the relationships between computations of the job. We carefully mark the computations that are most likely to cause a large number of recomputations upon evictions, and run them reliably on reserved resources. This lets us retain the corresponding intermediate results without any additional checkpointing. We design Pado, a general data processing engine, which carries out our idea with several optimizations that minimize the number of additional reserved nodes. Evaluation results show that Pado outperforms Spark 2.0.0 by up to 5.1×, and checkpoint-enabled Spark by up to 3.8×.
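The marking idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the rule Pado uses, so the heuristic below is an assumption made for illustration only: an operator that consumes a many-to-many edge is placed on reserved nodes, since losing its output would force many parent tasks to rerun. The function name `place_operators`, the edge labels, and the example job are all hypothetical.

```python
from collections import defaultdict

# Hypothetical edge kinds; these labels are illustrative, not from the paper.
NARROW = "one-to-one"
WIDE = "many-to-many"


def place_operators(edges):
    """Return {operator: "reserved" | "transient"} for a job DAG.

    edges: list of (parent, child, kind) tuples.
    Assumed heuristic (not the paper's stated algorithm): an operator with
    at least one incoming wide edge runs on reserved nodes, because
    recomputing it after an eviction would involve many parent tasks.
    """
    incoming = defaultdict(list)
    operators = set()
    for parent, child, kind in edges:
        incoming[child].append(kind)
        operators.update((parent, child))
    return {
        op: "reserved" if WIDE in incoming[op] else "transient"
        for op in sorted(operators)
    }


# A made-up four-operator job with a single shuffle-like edge.
job = [
    ("read", "map", NARROW),
    ("map", "reduce", WIDE),
    ("reduce", "format", NARROW),
]
placement = place_operators(job)
```

In this toy job only `reduce` is marked for reserved nodes, so its intermediate result survives evictions without a separate checkpoint step, while the remaining operators can use transient resources.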
Year: 2017
DOI: 10.1145/3064176.3064181
Venue: EuroSys
Field: Data processing, Spark (mathematics), Idle, Computer science, Real-time computing, Operating system, Computation, Distributed computing
DocType: Conference
Citations: 10
PageRank: 0.62
References: 17
Authors: 8
Name              Order  Citations  PageRank
Youngseok Yang    1      14         1.70
Geon-Woo Kim      2      10         0.62
Won Wook Song     3      11         0.98
Yunseong Lee      4      15         2.72
Andrew Chung      5      44         3.57
Zhengping Qian    6      350        17.04
Brian Cho         7      199        15.57
Byung-Gon Chun    8      3832       234.37