Title
Toward intersection filter-based optimization for joins in MapReduce
Abstract
MapReduce has become a dominant model for processing large-scale datasets. However, the model is not designed to directly support operations with multiple inputs, such as joins. Many join algorithms for MapReduce, including the Bloom join, have been studied, but they still generate too much non-joining data and transmit it over the network. This research aims to eliminate the problem by providing an intersection filter, based on probabilistic models, that removes most of the disjoint elements between two datasets. Specifically, three ways of building the intersection Bloom filter are proposed. To apply the filter to joins, the corresponding MapReduce job is adjusted in a consistent way without increasing related costs. We then consider two-way joins and join cascades and analyze their costs. Thanks to the high accuracy of the intersection filter, join processing can minimize disk I/O and communication costs. Finally, a cost-based comparison of joins using different approaches shows the proposed approach to be more effective than existing solutions.
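The idea of an intersection filter can be illustrated with a minimal single-process sketch. The abstract does not detail the three construction methods, so the sketch below assumes one common approach: build a Bloom filter per input with the same size and hash functions, then take the bitwise AND of the two bit vectors. The names `BloomFilter`, `intersect`, and `filtered_join` are hypothetical and are not taken from the paper, and the "map" and "reduce" steps are plain Python stand-ins for a real MapReduce job.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter whose bit vector is stored in a Python int."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def intersect(self, other):
        # Assumption: the intersection filter is the bitwise AND of two
        # filters that share the same size and hash functions.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share size and hash functions")
        result = BloomFilter(self.num_bits, self.num_hashes)
        result.bits = self.bits & other.bits
        return result


def filtered_join(left, right, num_bits=1024, num_hashes=3):
    """Join two lists of (key, value) pairs after intersection filtering."""
    bf_left = BloomFilter(num_bits, num_hashes)
    bf_right = BloomFilter(num_bits, num_hashes)
    for key, _ in left:
        bf_left.add(key)
    for key, _ in right:
        bf_right.add(key)
    both = bf_left.intersect(bf_right)

    # "Map" side: keep only tuples whose key may occur in both inputs.
    left_kept = [(k, v) for k, v in left if k in both]
    right_kept = [(k, v) for k, v in right if k in both]

    # "Reduce" side: an ordinary hash join over the surviving tuples.
    index = {}
    for k, v in left_kept:
        index.setdefault(k, []).append(v)
    joined = [(k, lv, rv) for k, rv in right_kept for lv in index.get(k, [])]
    return joined, left_kept, right_kept
```

Because a key present in both inputs sets the same bit positions in both filters, the AND keeps all of its bits, so no joining tuple is dropped. False positives remain possible, which means some non-joining tuples may still pass the filter; the join result itself stays exact.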
Year
2013
DOI
10.1145/2501928.2501932
Venue
Cloud-I
Keywords
communication cost, probabilistic model, disjoint element, high accuracy intersection filter, intersection filter-based optimization, dominant model, intersection bloom filter, different approach, cost-based comparison, corresponding mapreduce job, large-scale datasets, bloom filter, data analysis, cloud computing
Field
Bloom filter, Joins, Disjoint sets, Computer science, Sort-merge join, Probabilistic logic, Distributed computing, Cloud computing
DocType
Conference
Citations
4
PageRank
0.44
References
21
Authors
3
Name | Order | Citations/PageRank
Thuong Cang Phan | 1 | 54.85
Laurent d'Orazio | 2 | 8115.38
Philippe Rigaux | 3 | 444110.71