Title
Towards Scalable One-Pass Analytics Using MapReduce
Abstract
An integral part of many data-intensive applications is the need to collect and analyze enormous datasets efficiently. Concurrent with such application needs is the increasing adoption of MapReduce as a programming model for processing large datasets using a cluster of machines. Current MapReduce systems, however, require the dataset to be loaded into the cluster before analytical queries can run, which incurs long delays before query processing starts. Furthermore, existing systems are geared towards batch processing. In this paper, we seek to answer a fundamental question: what architectural changes are necessary to bring the benefits of the MapReduce computation model to incremental, one-pass analytics, i.e., to support stream processing and online aggregation? To answer this question, we first conduct a detailed empirical performance study of current MapReduce implementations, including Hadoop and MapReduce Online, using a variety of workloads. By doing so, we identify several drawbacks of existing systems for one-pass analytics. Based on the insights from our study, we conclude by listing key design requirements, arguing for architectural changes to MapReduce systems that overcome their current limitations and fully embrace incremental one-pass analytics, and showing promising preliminary results.
Year: 2011
DOI: 10.1109/IPDPS.2011.251
Venue: IPDPS Workshops
Keywords: current mapreduce, mapreduce computation model, stream processing, architectural change, batch processing, mapreduce system, mapreduce online, towards scalable one-pass, one-pass analytics, query processing, current mapreduce system, data processing, distributed processing, batch process, sorting, parallel processing, benchmark testing, computational modeling, programming model, fault tolerance, computer model, data analysis
Field: Data science, Programming paradigm, Computer science, Implementation, Batch processing, Online aggregation, Stream processing, Analytics, Database, Computation, Scalability
DocType: Conference
Citations: 5
PageRank: 0.52
References: 30
Authors: 4
Name
Order
Citations
PageRank
Edward Mazur11024.10
Boduo Li22028.65
Yanlei Diao31174.49
Prashant J. Shenoy46386521.30