Title
Study and Implementation of Record De-duplication Algorithms
Abstract
Record de-duplication is the process of identifying and removing duplicates from a given dataset in a data warehouse environment. The term record linkage is used in the same context; the difference is that de-duplication applies when duplicates are to be removed from a single dataset, while record linkage applies when duplicates are to be removed across several datasets that refer to the same entity. Both processes are important during the data profiling stage of a data warehouse: they assure data quality by eliminating repetition, which in turn leads to better decision making. This research focuses on record de-duplication. The efficiency of record de-duplication depends on several criteria, such as the number of comparisons needed, the time and cost of comparison, the accuracy of de-duplication, and the time and space complexity of identifying true duplicates. In this paper we explore several indexing techniques that aim to reduce the number of comparisons needed to identify duplicates in a given dataset. Peter Christen has surveyed and experimentally evaluated six indexing techniques [1]: Sorted Neighborhood indexing, Suffix Array indexing, Q-Gram based indexing, Canopy Clustering, Threshold based indexing, and String Map based indexing. We study and implement Sorted Neighborhood based de-duplication techniques in detail. In this implementation, adaptive and non-adaptive Sorted Neighborhood Methods (SNM) are tested and validated. Accumulative Adaptive SNM (AASNM) and Incrementally Adaptive SNM (IASNM) [16] are adaptive versions of SNM, while Duplicate Count Strategy (DCS) [4] is a non-adaptive SNM. A Group based Accumulative Adaptive Method (GAASNM) is proposed to minimize record comparisons.
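To illustrate why Sorted Neighborhood indexing reduces the number of comparisons, the following is a minimal sketch of the basic fixed-window SNM only. It is not the authors' implementation and does not cover AASNM, IASNM, DCS, or the proposed GAASNM. The function name `sorted_neighborhood`, the sorting key, the window size, and the similarity threshold are all assumptions chosen for illustration, and `difflib.SequenceMatcher` stands in for whatever similarity measure a real system would use.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Basic fixed-window Sorted Neighborhood Method (illustrative sketch).

    records   : list of strings to de-duplicate
    key       : function that builds the sorting key for a record (assumed)
    window    : number of consecutive sorted records in one window (assumed)
    threshold : similarity at or above which a pair is flagged (assumed)

    Returns (pairs, comparisons): the set of candidate duplicate index
    pairs and the number of record comparisons performed.
    """
    # Sort record indices by key so that similar records end up adjacent.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))

    pairs = set()
    comparisons = 0
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records.
        for j in order[pos + 1 : pos + window]:
            comparisons += 1
            similarity = SequenceMatcher(None, records[i], records[j]).ratio()
            if similarity >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs, comparisons


if __name__ == "__main__":
    data = [
        "john smith", "peter pan", "mary jones",
        "john smith", "alan turing", "grace hopper",
    ]
    found, n_cmp = sorted_neighborhood(data, key=str.lower, window=3)
    all_pairs = len(data) * (len(data) - 1) // 2
    print("candidate duplicate pairs:", sorted(found))
    print("comparisons:", n_cmp, "of", all_pairs, "possible pairs")
```

With a window of size w over n sorted records, the sketch performs on the order of n·(w − 1) comparisons instead of the n·(n − 1)/2 needed for exhaustive pairwise matching. Adaptive variants such as those named in the abstract change the window size during the scan, which this fixed-window sketch does not attempt.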
Year: 2016
DOI: 10.1145/2905055.2905063
Venue: ICTCS
DocType: Conference
Citations: 0
PageRank: 0.34
References: 4
Authors: 3
Name | Order | Citations | PageRank
Vaishali Wangikar | 1 | 0 | 0.34
Sachin N. Deshmukh | 2 | 0 | 1.01
Sunil G. Bhirud | 3 | 15 | 3.11