Title: Distributed Data Deduplication
Abstract: Data deduplication refers to the process of identifying tuples in a relation that refer to the same real-world entity. The complexity of the problem is inherently quadratic in the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques are used to divide the tuples into blocks, and only tuples within the same block are compared. However, even with blocking, data deduplication remains costly for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of our proposed strategy through extensive experiments on both synthetic datasets with varying block-size distributions and real-world datasets.
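The blocking idea the abstract describes can be sketched in a few lines. This is not code from the paper; it is a minimal, generic illustration in which the blocking key (here, a hypothetical first-letter key on toy string records) and the data are invented for the example. It shows how blocking shrinks the quadratic candidate space to within-block pairs only.

```python
from itertools import combinations

def block_by_key(records, key_fn):
    """Group records into blocks keyed by a blocking function."""
    blocks = {}
    for rec in records:
        blocks.setdefault(key_fn(rec), []).append(rec)
    return blocks

def candidate_pairs(blocks):
    """Yield only pairs within the same block, skipping cross-block pairs."""
    for block in blocks.values():
        yield from combinations(block, 2)

# Hypothetical toy records that may refer to the same real-world entities.
records = ["john smith", "jon smith", "alice wu", "john smyth", "alicia wu"]

# Naive deduplication would compare all C(5, 2) = 10 pairs.
# Blocking on the first letter yields two blocks (3 and 2 records),
# so only C(3, 2) + C(2, 2) = 4 candidate pairs remain.
blocks = block_by_key(records, key_fn=lambda r: r[0])
pairs = list(candidate_pairs(blocks))
```

In the full problem, a similarity function is then evaluated on each candidate pair; Dis-Dedup's contribution is how such blocks are distributed across shared-nothing workers so that the maximum per-worker comparison load is minimized.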
Year: 2016
DOI: 10.14778/2983200.2983203
Venue: PVLDB
DocType: Journal
Volume: 9
Issue: 11
ISSN: 2150-8097
Citations: 12
PageRank: 0.52
References: 25
Authors: 3
Name               Order  Citations  PageRank
Xu Chu             1      144        7.13
Ihab F. Ilyas      2      2907       117.27
Paraschos Koutris  3      347        26.63