Title
iHadoop: Asynchronous Iterations for MapReduce
Abstract
MapReduce is a distributed programming framework designed to ease the development of scalable data-intensive applications for large clusters of commodity machines. Most machine learning and data mining applications involve iterative computations over large datasets, such as Web hyperlink structures and social network graphs. Yet the MapReduce model does not efficiently support this important class of applications. The architecture of MapReduce, most critically its dataflow techniques and task scheduling, is completely unaware of the nature of iterative applications: tasks are scheduled according to a policy that optimizes the execution of a single iteration, which wastes bandwidth, I/O, and CPU cycles when compared with an optimal execution for a consecutive set of iterations. This work presents iHadoop, a modified MapReduce model, and an associated implementation, optimized for iterative computations. The iHadoop model schedules iterations asynchronously. It connects the output of one iteration to the next, allowing both to process their data concurrently. iHadoop's task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing fast local data transfer. For those iterative applications that must satisfy certain criteria before termination, iHadoop runs the check concurrently with the execution of the subsequent iteration to further reduce the application's latency. This paper also describes our implementation of the iHadoop model and evaluates its performance against Hadoop, the widely used open source implementation of MapReduce. Experiments using different data analysis applications over real-world and synthetic datasets show that iHadoop performs better than Hadoop for iterative algorithms, reducing the execution time of iterative applications by 25% on average. Furthermore, integrating iHadoop with HaLoop, a variant Hadoop implementation that caches invariant data between iterations, reduces execution time by 38% on average.
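The concurrent termination check mentioned in the abstract can be illustrated with a toy, single-machine sketch. This is not the iHadoop implementation or any Hadoop API: the functions `step`, `converged`, and `run_async` are hypothetical names, the "iteration" is a Newton update for the square root of 2 rather than a MapReduce job, and a thread pool merely stands in for a cluster. The sketch only shows the idea of starting iteration k+1 speculatively while the check for iteration k is still running.

```python
from concurrent.futures import ThreadPoolExecutor


def step(x):
    # Hypothetical stand-in for one iteration: a Newton update for sqrt(2).
    return 0.5 * (x + 2.0 / x)


def converged(prev, curr, tol=1e-12):
    # Hypothetical termination check comparing two consecutive results.
    return abs(curr - prev) < tol


def run_async(x0, max_iters=50):
    # Assumption: two workers, so the check and the next iteration can overlap.
    with ThreadPoolExecutor(max_workers=2) as pool:
        prev, curr = x0, step(x0)
        iters = 1
        while iters < max_iters:
            check = pool.submit(converged, prev, curr)  # check for iteration k
            nxt = pool.submit(step, curr)               # iteration k+1, speculative
            if check.result():
                # Converged: the speculative result is simply ignored.
                nxt.cancel()
                break
            prev, curr = curr, nxt.result()
            iters += 1
        return curr, iters


result, iters = run_async(1.0)
```

In this sketch the speculative iteration is wasted work only once, at termination; in every other round the check adds no waiting time before the next iteration starts.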
Year
2011
DOI
10.1109/CloudCom.2011.21
Venue
CloudCom
Keywords
asynchronous iterations,execution time,distributed programming framework,mapreduce,caches invariant data,large datasets,iterative computations,dataflow techniques,mapreduce model,i/o cycles,scheduling,data analysis applications,parallel data processing,commodity machines,inter-iteration data locality,iterative algorithms,learning (artificial intelligence),synthetic datasets,asynchronous,ihadoop model schedule,single iteration,data analysis,distributed programming,social network graphs,software performance evaluation,iterative application,data concurrently,fast local data,electronic data interchange,data mining application,data flow analysis,bandwidth cycles,cpu cycles,hadoop implementation,open source implementation,optimal execution,data mining,performance evaluation,different data analysis application,graph theory,very large databases,ihadoop model,social networking (online),machine learning,cluster,web hyperlink structures,hypermedia,data mining applications,task scheduling,iterative computation,scalable data-intensive applications,iterative methods,data transfer,schedules,data processing,fault tolerant,programming,satisfiability,iteration method,social network,computer model,computational modeling,iterative algorithm,fault tolerance,fault tolerant system,learning artificial intelligence
Field
Asynchronous communication,Scheduling (computing),Iterative method,Computer science,Parallel computing,Data-flow analysis,Dataflow,Schedule,Instruction cycle,Distributed computing,Scalability
DocType
Conference
ISBN
978-1-4673-0090-2
Citations
24
PageRank
0.84
References
30
Authors
3
Name, Order, Citations, PageRank
Eslam Elnikety, 1, 58, 4.37
Tamer Elsayed, 2, 326, 36.39
Hany E. Ramadan, 3, 172, 7.56