Abstract |
---|
The effectiveness and scalability of MapReduce-based implementations for data-intensive tasks depend on how data is assigned from map to reduce tasks. A robust assignment strategy is crucial for handling skewed data and distributing the workload evenly among all reduce tasks. For the entity matching problem in the Big Data context, we propose BlockSlicer, a MapReduce-based approach that supports blocking techniques to reduce the entity matching search space. The approach uses a preprocessing MapReduce job to analyze the data distribution and improves load balancing by applying an efficient block-slicing strategy together with a well-known optimization algorithm to assign the generated match tasks. We evaluate the approach against an existing one that addresses the same problem on a real cloud infrastructure. The results show that our approach significantly improves the performance of distributed entity matching by reducing the amount of data generated in the map phase and lowering the overall execution time. |
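The load-balancing idea the abstract outlines — slice oversized blocks into sub-blocks and then assign the resulting match tasks to reducers with an optimization heuristic — can be sketched roughly as below. This is an illustrative reconstruction, not the paper's actual BlockSlicer algorithm: the slicing threshold, the quadratic cost model (pairwise comparisons within a block slice), and the greedy longest-processing-time (LPT) assignment are all assumptions.

```python
import heapq

def slice_blocks(block_sizes, max_size):
    """Split each block larger than max_size into sub-block match tasks.

    block_sizes: dict mapping block key -> number of entities in the block.
    Returns a list of (block_key, slice_size) match tasks.
    """
    tasks = []
    for key, size in block_sizes.items():
        full, rest = divmod(size, max_size)
        tasks += [(key, max_size)] * full   # full-sized slices
        if rest:
            tasks.append((key, rest))       # remainder slice, if any
    return tasks

def assign_tasks(tasks, num_reducers):
    """Greedy LPT: always give the next-largest task to the least-loaded reducer.

    Task cost is modeled as slice_size**2, since entity matching compares
    entity pairs within a slice (an assumption, not the paper's exact model).
    """
    ordered = sorted(tasks, key=lambda t: t[1] ** 2, reverse=True)
    # Min-heap of (current_load, reducer_id, assigned_tasks).
    heap = [(0, r, []) for r in range(num_reducers)]
    heapq.heapify(heap)
    for key, size in ordered:
        load, r, assigned = heapq.heappop(heap)
        assigned.append((key, size))
        heapq.heappush(heap, (load + size ** 2, r, assigned))
    return sorted(heap, key=lambda entry: entry[1])  # order by reducer id
```

For example, with blocks `{"a": 10, "b": 3, "c": 7}` and a slice limit of 5, block `a` yields two slices and block `c` yields slices of 5 and 2, so no single reducer is stuck with the cost of an entire skewed block.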
Year | DOI | Venue |
---|---|---|
2013 | 10.1109/ISCC.2013.6755016 | ISCC |
Keywords | DocType | ISSN |
parallel programming,indexes,big data,load balancing,cloud computing,pattern matching,resource allocation,data handling,optimization,programming | Conference | 1530-1346 |
Citations | PageRank | References |
8 | 0.52 | 7 |
Authors |
---|
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Demetrio Gomes Mestre | 1 | 36 | 5.44 |
Carlos Eduardo Santos Pires | 2 | 57 | 10.68 |