Title
Improving load balancing for MapReduce-based entity matching.
Abstract
The effectiveness and scalability of MapReduce-based implementations of data-intensive tasks depend on how data is assigned from map tasks to reduce tasks. A robust assignment strategy is crucial for handling skewed data and distributing the workload evenly among all reduce tasks. For the entity matching problem in the Big Data context, we propose BlockSlicer, a MapReduce-based approach that supports blocking techniques to reduce the entity matching search space. The approach uses a preprocessing MapReduce job to analyze the data distribution and improves load balancing by applying an efficient block-slicing strategy together with a well-known optimization algorithm to assign the generated match tasks. We evaluate the approach on a real cloud infrastructure against an existing approach that addresses the same problem. The results show that our approach significantly improves the performance of distributed entity matching by reducing the amount of data generated in the map phase and shortening the overall execution time.
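The pipeline outlined in the abstract (analyze block sizes, slice oversized blocks into match tasks, then assign the tasks to reduce tasks) can be illustrated with a minimal Python sketch. The abstract names neither the slicing rule nor the optimization algorithm, so everything here is an assumption: large blocks are cut into fixed-size sub-blocks, and tasks are assigned with a greedy "largest task first, least-loaded reducer" heuristic as a stand-in. The names `slice_block`, `build_tasks`, `assign_tasks`, and `max_pairs` are hypothetical and do not come from the paper.

```python
import heapq
import math


def pair_count(n):
    """Number of pairwise comparisons inside a block of n entities."""
    return n * (n - 1) // 2


def slice_block(block_id, size, max_pairs):
    """Split one block into match tasks of at most max_pairs comparisons.

    Assumed slicing rule (not taken from the paper): cut the block into
    sub-blocks of s = floor(sqrt(max_pairs)) entities, then emit one task
    per sub-block and one task per pair of sub-blocks. Assumes max_pairs >= 1.
    """
    if pair_count(size) <= max_pairs:
        tasks = [((block_id, 0, 0), pair_count(size))]
    else:
        s = max(1, math.isqrt(max_pairs))
        sub_sizes = [min(s, size - start) for start in range(0, size, s)]
        tasks = []
        for i, size_i in enumerate(sub_sizes):
            # Comparisons among entities of the same sub-block.
            tasks.append(((block_id, i, i), pair_count(size_i)))
            # Comparisons between two different sub-blocks.
            for j in range(i + 1, len(sub_sizes)):
                tasks.append(((block_id, i, j), size_i * sub_sizes[j]))
    # Drop tasks that contain no comparisons at all.
    return [task for task in tasks if task[1] > 0]


def build_tasks(block_sizes, max_pairs):
    """Turn a block-size distribution into a flat list of match tasks."""
    tasks = []
    for block_id, size in block_sizes.items():
        tasks.extend(slice_block(block_id, size, max_pairs))
    return tasks


def assign_tasks(tasks, num_reducers):
    """Greedy assignment: largest task first, to the least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    for name, cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        load, reducer = heapq.heappop(heap)
        assignment[reducer].append(name)
        heapq.heappush(heap, (load + cost, reducer))
    loads = {reducer: load for load, reducer in heap}
    return assignment, loads


# Hypothetical skewed distribution: one large block and two small ones.
block_sizes = {"a": 100, "b": 10, "c": 4}
tasks = build_tasks(block_sizes, max_pairs=400)
assignment, loads = assign_tasks(tasks, num_reducers=4)
print(loads)
```

In this sketch, slicing bounds the cost of any single match task, which is what lets the greedy assignment keep the reducers' loads close to each other even when one block dominates the input.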
Year
2013
DOI
10.1109/ISCC.2013.6755016
Venue
ISCC
Keywords
parallel programming, indexes, big data, load balancing, cloud computing, pattern matching, resource allocation, data handling, optimization, programming
DocType
Conference
ISSN
1530-1346
Citations
8
PageRank
0.52
References
7
Authors
2
Name | Order | Citations | PageRank
Demetrio Gomes Mestre | 1 | 36 | 5.44
Carlos Eduardo Santos Pires | 2 | 57 | 10.68