Title: Distributed Data Deduplication
Abstract: Data deduplication refers to the process of identifying tuples in a relation that refer to the same real-world entity. The complexity of the problem is inherently quadratic in the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques are used to divide the tuples into blocks, and only tuples within the same block are compared. However, even with blocking, data deduplication remains costly for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of our proposed strategy through extensive experiments on both synthetic datasets with varying block-size distributions and real-world datasets.
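The blocking idea the abstract describes can be sketched in a few lines. This is not code from the paper; it is a minimal, generic illustration in which the blocking key (here, a hypothetical first-letter key on toy string records) and the data are invented for the example. It shows how blocking shrinks the quadratic candidate space to within-block pairs only.

```python
from itertools import combinations

def block_by_key(records, key_fn):
    """Group records into blocks keyed by a blocking function."""
    blocks = {}
    for rec in records:
        blocks.setdefault(key_fn(rec), []).append(rec)
    return blocks

def candidate_pairs(blocks):
    """Yield only pairs within the same block, skipping cross-block pairs."""
    for block in blocks.values():
        yield from combinations(block, 2)

# Hypothetical toy records that may refer to the same real-world entities.
records = ["john smith", "jon smith", "alice wu", "john smyth", "alicia wu"]

# Naive deduplication would compare all C(5, 2) = 10 pairs.
# Blocking on the first letter yields two blocks (3 and 2 records),
# so only C(3, 2) + C(2, 2) = 4 candidate pairs remain.
blocks = block_by_key(records, key_fn=lambda r: r[0])
pairs = list(candidate_pairs(blocks))
```

In the full problem, a similarity function is then evaluated on each candidate pair; Dis-Dedup's contribution is how such blocks are distributed across shared-nothing workers so that the maximum per-worker comparison load is minimized.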
Year: 2016
DOI: 10.14778/2983200.2983203
Venue: PVLDB
DocType: Journal
Volume: 9
Issue: 11
ISSN: 2150-8097
Citations: 12
PageRank: 0.52
References: 25
Authors: 3
Name               Order  Citations  PageRank
Xu Chu             1      144        7.13
Ihab F. Ilyas      2      2907       117.27
Paraschos Koutris  3      347        26.63