Title
Accelerating Big Data Analytics Using Scale-Up/Out Heterogeneous Clusters
Abstract
Production data analytic workloads typically consist of a majority of jobs with small input data sizes and a small number of jobs with large input data sizes. Recent works advocate scale-up/scale-out heterogeneous clusters (Hybrid clusters for short) to handle these heterogeneous workloads, since scale-up machines (i.e., machines built by adding more resources to a single machine) can process small jobs faster than a cluster that is simply scaled out with cheap machines. However, implementing such a Hybrid cluster raises several challenges for job placement and data placement. In this paper, we propose a job placement strategy and a data placement strategy to address these challenges. The job placement strategy places each job on either scale-up or scale-out machines based on the job's characteristics, and migrates jobs from scale-up machines to under-utilized scale-out machines to achieve load balance. The data placement strategy allocates data replicas across the two types of machines accordingly to increase data locality in the Hybrid cluster. We implemented a Hybrid cluster on Apache YARN and evaluated its performance using a Facebook production workload. With our proposed strategies, a Hybrid cluster can reduce the makespan of the workload by up to 37% and the median job completion time by up to 60%, compared to traditional scale-out clusters with state-of-the-art schedulers.
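The job placement idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which job characteristics or thresholds are used, so the input-size cutoff (`SMALL_JOB_MAX_INPUT_GB`), the slot-based capacity model, the `low_water` utilization bound, and the names `Pool`, `place`, and `rebalance` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical cutoff: the abstract says placement depends on "the job's
# characteristics" but gives no threshold, so this value is an assumption.
SMALL_JOB_MAX_INPUT_GB = 32.0


@dataclass
class Pool:
    """A group of machines of one type (scale-up or scale-out)."""
    name: str
    slots: int                                  # assumed capacity, in concurrent jobs
    jobs: list = field(default_factory=list)    # ids of jobs placed on this pool

    def utilization(self) -> float:
        return len(self.jobs) / self.slots


def place(job_id: str, input_gb: float, scale_up: Pool, scale_out: Pool) -> str:
    """Send small-input jobs to scale-up machines and the rest to scale-out."""
    target = scale_up if input_gb <= SMALL_JOB_MAX_INPUT_GB else scale_out
    target.jobs.append(job_id)
    return target.name


def rebalance(scale_up: Pool, scale_out: Pool, low_water: float = 0.5) -> list:
    """Migrate jobs from an overloaded scale-up pool to an under-utilized scale-out pool."""
    moved = []
    while scale_up.utilization() > 1.0 and scale_out.utilization() < low_water:
        job_id = scale_up.jobs.pop()    # take the most recently placed job
        scale_out.jobs.append(job_id)
        moved.append(job_id)
    return moved
```

Under these assumptions, a caller would invoke `place` as each job arrives and run `rebalance` periodically. A real scheduler would also have to account for the data locality that the data placement strategy is meant to improve, which this sketch ignores.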
Year
2019
DOI
10.1109/ICCCN.2019.8847060
Venue
2019 28th International Conference on Computer Communication and Networks (ICCCN)
Keywords
single machine,cheap machines,job placement strategy,data placement strategy,scale-out machines,data replicas,data locality,Facebook production workload,median job completion time,traditional scale-out clusters,big data analytics,heterogeneous clusters,production data analytic workloads,heterogeneous workloads,scaleup machines,short hybrid clusters,hybrid cluster,Apache YARN
Field
Small number,Cluster (physics),Locality,Job shop scheduling,Yarn,Workload,Computer science,Load balancing (computing),Big data,Distributed computing
DocType
Conference
ISSN
1095-2055
ISBN
978-1-7281-1857-4
Citations
0
PageRank
0.34
References
9
Authors
3
Name            Order   Citations   PageRank
Zhuozhao Li     1       69          11.61
Haiying Shen    2       1355        126.34
Lee Ward        3       49          6.70