S3: An Efficient Shared Scan Scheduler on MapReduce Framework - Citegraph

Paper Info

Title
S3: An Efficient Shared Scan Scheduler on MapReduce Framework

Abstract
Hadoop, an open-source implementation of MapReduce, has been widely used for data-intensive computing. To improve performance, multiple jobs operating on a common data file can be processed as a batch to eliminate redundant scanning. In practice, however, jobs often do not arrive at the same time, and batching them means longer waiting times for jobs that arrive earlier. In this paper, we propose S3, a novel Shared Scan Scheduler for Hadoop, which allows multiple jobs that may arrive at different times to share the scan of a common file. Under S3, a job is split into a sequence of independent sub-jobs, each operating on a different portion of the data file. Multiple sub-jobs (from different jobs) that access a common portion of a data file can be processed as a batch to share the scan of the accessed data. S3 operates as follows. At any time, the system may be processing a batch of sub-jobs that access the same portion of data, while other sub-jobs wait in a job queue. As a new job arrives, its sub-jobs can be aligned with the waiting jobs in the queue. Once the current batch of sub-jobs completes processing, the next batch of sub-jobs (which may include sub-jobs from newly arrived jobs) can be initiated for processing. In this way, an arriving job does not need to wait long to be processed. We have implemented S3 in Hadoop, and our experimental results on a cluster of over 40 nodes show that S3 outperforms the naive no-sharing scheme and the file-based shared-scan approach.
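
The batching behaviour described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' Hadoop implementation: the class `SharedScanScheduler`, its methods, and the segment-level model are hypothetical names introduced here. It also assumes that the scan wraps around the file (suggested by the "round-robin data scan" keyword rather than stated in the abstract), so a job that joins mid-file finishes after it has seen every segment once.

```python
class SharedScanScheduler:
    """Toy model of a shared scan: one file, scanned segment by segment."""

    def __init__(self, num_segments):
        self.num_segments = num_segments
        self.next_segment = 0   # segment the next batch will scan
        self.remaining = {}     # job id -> segments (sub-jobs) still to run

    def submit(self, job_id):
        # A job is split into one sub-job per segment. It joins the scan at
        # whichever segment comes next (assumption: the scan wraps around).
        self.remaining[job_id] = self.num_segments

    def run_batch(self):
        """Scan one segment for every active job.

        Returns (segment, jobs in the batch, jobs finished by this batch),
        or None when no job is waiting.
        """
        if not self.remaining:
            return None
        segment = self.next_segment
        batch = sorted(self.remaining)
        finished = []
        for job_id in batch:
            self.remaining[job_id] -= 1
            if self.remaining[job_id] == 0:
                finished.append(job_id)
        for job_id in finished:
            del self.remaining[job_id]
        self.next_segment = (segment + 1) % self.num_segments
        return segment, batch, finished


# Job "B" arrives while job "A" is already one segment into the file.
scheduler = SharedScanScheduler(num_segments=3)
scheduler.submit("A")
log = [scheduler.run_batch()]
scheduler.submit("B")
while (result := scheduler.run_batch()) is not None:
    log.append(result)

for segment, batch, finished in log:
    print(f"segment {segment}: batch={batch} finished={finished}")
```

In this toy run the two jobs share the scans of segments 1 and 2, so four segment scans serve both jobs instead of the six that two independent full scans would need, and job "B" starts at the next batch boundary rather than waiting for "A" to finish.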

Year
2011

DOI
10.1109/ICPP.2011.42

Venue
ICPP

Keywords
public domain software, current batch, mapreduce framework, round-robin data scan, efficient shared scan scheduler, mapreduce, long time, job processing, scheduling, shared scan scheduler, different time, multiple sub-jobs, data analysis, data file, s3 approach, common data file, redundant scanning elimination, accessed data, multiple job, data-intensive computing, next batch, hadoop, open-source implementation

Field
Computer science, Scheduling (computing), Parallel computing, Queue, Job scheduler, Job queue, Data file, Operating system, Public domain software, Distributed computing

DocType
Conference

ISSN
0190-3918

E-ISBN
978-0-7695-4510-3

ISBN
978-0-7695-4510-3

Citations
4

PageRank
1.05

References
9

Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Lei Shi	1	4	1.05
Xiaohui Li	2	11	5.42
Kian-Lee Tan	3	6962	776.65
