Abstract | ||
---|---|---|
Co-locating computation as close as possible to the data is an important consideration in current data-intensive systems. This is known as the data locality problem. In this paper, we analyze the impact of data locality on YARN, which is the new version of Hadoop. We investigate the behavior of the YARN delay scheduler with respect to data locality for a variety of workloads and configurations. We address three problems related to data locality. First, we study the trade-off between data locality and job completion time. Second, we observe that there is an imbalance in resource allocation when data locality is considered, which may leave the cluster under-utilized. Third, we address the redundant I/O operations that occur when different YARN containers request input data blocks on the same node. Additionally, we propose the YARN Locality Simulator (YLocSim), a tool that simulates the interactions between YARN components in a real cluster and reports data locality percentages in real time. We validate YLocSim against a real cluster setup and use it in our study. |
Year | DOI | Venue |
---|---|---|
2015 | 10.1109/BigDataCongress.2015.33 | BigData Congress |
Keywords | Field | DocType |
---|---|---|
Hadoop, Data Locality, YARN, Simulation, Scheduling | Resource management, Data mining, Locality, Yarn, Scheduling (computing), Computer science, Bandwidth (signal processing), Resource allocation, Benchmark (computing), Computation, Distributed computing | Conference |
ISSN | Citations | PageRank |
---|---|---|
2379-7703 | 2 | 0.38 |
References | Authors |
---|---|
8 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yehia Elshater | 1 | 11 | 1.74 |
Patrick Martin | 2 | 148 | 18.22 |
D. Rope | 3 | 7 | 1.98 |
Mike McRoberts | 4 | 2 | 0.72 |
Craig Statchuk | 5 | 8 | 3.40 |