Title
A platform for scalable one-pass analytics using MapReduce
Abstract
Today's one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the data set to be fully loaded into the cluster before running analytical queries. This paper examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely-used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent key based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows that our new platform significantly improves the progress of map tasks, allows the reduce progress to keep up with the map progress, with up to 3 orders of magnitude reduction of internal data spills, and enables results to be returned continuously during the job.
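The abstract contrasts sort-merge with hash-based in-memory processing and mentions a frequent-key technique for large key-state spaces. The following is a minimal illustrative sketch of that general idea, not the paper's implementation. It assumes a sum aggregate, uses a Misra-Gries style counter scheme as a stand-in for whatever frequent-key policy the platform actually uses, and models disk spills with an in-memory list. The names `FrequentKeyAggregator`, `capacity`, and `spilled` are hypothetical.

```python
class FrequentKeyAggregator:
    """Hash-based incremental sum with a bounded in-memory key table.

    Assumption: eviction follows a Misra-Gries style counter scheme;
    this is an illustrative choice, not the method from the paper.
    """

    def __init__(self, capacity):
        self.capacity = capacity  # max number of keys held in memory
        self.state = {}           # key -> running sum (in-memory state)
        self.counts = {}          # key -> frequency counter
        self.spilled = []         # stand-in for on-disk spill of (key, value)

    def add(self, key, value):
        if key in self.state:
            # Hot path: update in-memory state, no sorting or merging.
            self.state[key] += value
            self.counts[key] += 1
        elif len(self.state) < self.capacity:
            # Room left: start tracking this key in memory.
            self.state[key] = value
            self.counts[key] = 1
        else:
            # Table full: spill the pair and decay all counters.
            self.spilled.append((key, value))
            for k in list(self.counts):
                self.counts[k] -= 1
                if self.counts[k] == 0:
                    # Evict an infrequent key, spilling its partial sum.
                    self.spilled.append((k, self.state.pop(k)))
                    del self.counts[k]

    def snapshot(self):
        """Partial results for in-memory keys, available at any time."""
        return dict(self.state)

    def finalize(self):
        """Combine in-memory state with spilled partial sums."""
        totals = dict(self.state)
        for k, v in self.spilled:
            totals[k] = totals.get(k, 0) + v
        return totals


agg = FrequentKeyAggregator(capacity=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]:
    agg.add(key, value)

partial = agg.snapshot()   # results for keys still held in memory
totals = agg.finalize()    # exact sums after merging the spill
```

In this sketch the frequent key stays in memory and can be reported continuously through `snapshot()`, while infrequent keys are spilled and only resolved in `finalize()`. Because the aggregate is a sum, spilled partial states can be recombined without loss.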
Year
2011
DOI
10.1145/1989323.1989426
Venue
SIGMOD Conference
Keywords
hadoop-based mapreduce system,in-memory processing,scalable one-pass analytics,new data analysis platform,one-pass analytics application,internal data spill,parallel processing,batch processing,mapreduce model,one-pass analytics,incremental one-pass analytics,data analysis,batch process,state space,programming model
Field
Data mining,Architectural design,Programming paradigm,Computer science,Parallel processing,Hash function,Batch processing,Analytics,Database,Scalability
DocType
Conference
Citations
86
PageRank
3.01
References
21
Authors
5
Name, Order, Citations, PageRank
Boduo Li, 1, 202, 8.65
Edward Mazur, 2, 102, 4.10
Yanlei Diao, 3, 2234, 108.95
Andrew Mcgregor, 4, 1340, 64.31
Prashant J. Shenoy, 5, 6386, 521.30