Title
A Study of Data Locality in YARN
Abstract
Co-locating computation as close as possible to the data is an important consideration in current data-intensive systems. This is known as the data locality problem. In this paper, we analyze the impact of data locality on YARN, the new version of Hadoop. We investigate the behavior of the YARN delay scheduler with respect to data locality for a variety of workloads and configurations. We address three problems related to data locality. First, we study the trade-off between data locality and job completion time. Second, we observe an imbalance in resource allocation when data locality is considered, which may leave the cluster under-utilized. Third, we address the redundant I/O operations that occur when different YARN containers request input data blocks on the same node. Additionally, we propose the YARN Locality Simulator (YLocSim), a simulator tool that simulates the interactions between YARN components in a real cluster and reports the data locality percentages in real time. We validate YLocSim against a real cluster setup and use it in our study.
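To make the notion of "data locality percentages" concrete, the following is a minimal sketch of how such percentages could be computed from a list of container assignments. It is not YLocSim's code or API: the function names, the `rack_of` mapping, and the three locality levels (node-local, rack-local, off-rack) are assumptions chosen for illustration, following the usual Hadoop terminology.

```python
from collections import Counter

# Hypothetical locality levels, ordered from best to worst.
LEVELS = ("node-local", "rack-local", "off-rack")


def classify(assigned_node, replica_nodes, rack_of):
    """Return the locality level of one container assignment.

    assigned_node: node on which the container was placed
    replica_nodes: set of nodes holding a replica of the input block
    rack_of:       dict mapping node name -> rack name (assumed topology)
    """
    if assigned_node in replica_nodes:
        return "node-local"
    replica_racks = {rack_of[n] for n in replica_nodes}
    if rack_of[assigned_node] in replica_racks:
        return "rack-local"
    return "off-rack"


def locality_percentages(assignments, rack_of):
    """Return the percentage of assignments at each locality level."""
    counts = Counter(
        classify(node, replicas, rack_of) for node, replicas in assignments
    )
    total = sum(counts.values())
    if total == 0:
        return {level: 0.0 for level in LEVELS}
    return {level: 100.0 * counts[level] / total for level in LEVELS}


# Toy topology: two racks with two nodes each (illustrative only).
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}

# Each entry: (node the container ran on, nodes holding the block).
assignments = [
    ("n1", {"n1", "n3"}),  # replica on the same node
    ("n2", {"n1"}),        # replica on the same rack only
    ("n3", {"n1"}),        # replica on another rack
    ("n4", {"n4"}),        # replica on the same node
]

result = locality_percentages(assignments, rack_of)
print(result)
```

A simulator that reports these figures in real time would recompute them as each assignment is made; the batch form above only illustrates the classification itself.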
Year
2015
DOI
10.1109/BigDataCongress.2015.33
Venue
BigData Congress
Keywords
Hadoop, Data Locality, YARN, Simulation, Scheduling
Field
Resource management, Data mining, Locality, Yarn, Scheduling (computing), Computer science, Bandwidth (signal processing), Resource allocation, Benchmark (computing), Computation, Distributed computing
DocType
Conference
ISSN
2379-7703
Citations
2
PageRank
0.38
References
8
Authors
5
Name | Order | Citations | PageRank
Yehia Elshater | 1 | 11 | 1.74
Patrick Martin | 2 | 148 | 18.22
D. Rope | 3 | 7 | 1.98
Mike McRoberts | 4 | 2 | 0.72
Craig Statchuk | 5 | 8 | 3.40