Title
Improved suffix blocking for record linkage and entity resolution.
Abstract
Record linkage is the problem that identifies the different records that represent the same real-world object. Entity resolution is the problem that ensures that a real-world object is represented by a single record. The incremental versions of record linkage and entity resolution address the respective problems after the insertion of a new record in the dataset. Record linkage, entity resolution and their incremental versions are of paramount importance and arise in several contexts such as data warehouses, heterogeneous databases and data analysis. Blocking techniques are usually utilized to address these problems in order to avoid comparing all record pairs. Suffix blocking is one of the most efficient and accurate blocking techniques. In this paper, we consider the non-incremental variation of record linkage and present a method that is more than five times faster and achieves similar accuracy to the current state-of-the-art suffix-based blocking method. Then, we consider the incremental variation of record linkage and propose a novel incremental suffix-based blocking mechanism that outperforms existing incremental blocking methods in terms of blocking accuracy and efficiency. Finally, we consider incremental entity resolution and present two novel techniques based on suffix blocking that are able to handle the tested dataset in a few seconds (while a current state-of-the-art technique requires more than eight hours). Our second technique proposes a novel method that keeps a history of the deleted records and the merging process. Thus, we are able to discover alternative matches for the inserted record that are not possible for existing methods and improve the accuracy of the algorithm. We have implemented and extensively experimentally evaluated all our methods. We offer two implementations of our proposals. The first one is memory-based and offers the best efficiency while the second one is disk-based and scales seamlessly to very large datasets.
Year
DOI
Venue
2018
10.1016/j.datak.2018.07.005
Data & Knowledge Engineering
Keywords
Field
DocType
Database integration,Data warehouse and repository
Data warehouse,Record linkage,Data mining,Name resolution,Information retrieval,Suffix,Computer science,Implementation,Blocking techniques,Merge (version control)
Journal
Volume
Issue
ISSN
117
1
0169-023X
Citations 
PageRank 
References 
2
0.37
28
Authors
3
Name
Order
Citations
PageRank
Amin Allam1321.59
Spiros Skiadopoulos2113965.60
Panos Kalnis33297141.30