Title
Unsupervised Blocking Key Selection for Real-Time Entity Resolution
Abstract
Real-time entity resolution (ER) is the process of matching query records in sub-second time with records in a database that represent the same real-world entity. Indexing is a major step in the ER process, aimed at reducing the search space by bringing similar records closer to each other using a blocking key criterion. Selecting these keys is crucial for the effectiveness and efficiency of the real-time ER process. Traditional indexing techniques require domain knowledge for optimal key selection. However, to make the ER process less dependent on human domain knowledge, automatic selection of optimal blocking keys is required. In this paper we propose an unsupervised learning technique that automatically selects optimal blocking keys for building indexes that can be used in real-time ER. We specifically learn multiple keys to be used with multi-pass sorted neighbourhood, one of the most efficient and widely used indexing techniques for ER. We evaluate the proposed approach using three real-world data sets, and compare it with an existing automatic blocking key selection technique. The results show that our approach learns optimal blocking/sorting keys that are suitable for real-time ER. The learnt keys significantly increase the efficiency of query matching while maintaining the quality of matching results.
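The abstract describes multi-pass sorted neighbourhood indexing only in words, so a minimal sketch of the general idea may help: records are sorted by a key, and a query is compared only with records that fall within a small window around its sorted position, with one pass per key. This is an illustration of the standard technique, not the key selection method proposed in the paper. The sample records, the two sorting keys, the window size, and the function names `build_index` and `candidates` are all hypothetical choices made for this example.

```python
from bisect import bisect_left

def build_index(records, key_fn):
    """Return (sorting key, record id) pairs sorted by key (hypothetical helper)."""
    return sorted((key_fn(rec), rid) for rid, rec in records.items())

def candidates(index, query_key, window=1):
    """Ids of records within `window` positions of where the query key would be inserted."""
    pos = bisect_left(index, (query_key,))
    lo = max(0, pos - window)
    hi = min(len(index), pos + window)
    return {rid for _, rid in index[lo:hi]}

# Toy database; values are made up for illustration only.
records = {
    1: {"first": "peter", "last": "smith", "city": "sydney"},
    2: {"first": "pete", "last": "smyth", "city": "sydney"},
    3: {"first": "anna", "last": "jones", "city": "perth"},
    4: {"first": "mary", "last": "brown", "city": "darwin"},
    5: {"first": "john", "last": "young", "city": "hobart"},
}

# Two assumed sorting keys, one per pass.
key_fns = [
    lambda r: r["last"] + r["first"],
    lambda r: r["city"] + r["first"],
]

# Indexes are built once; each query only needs a binary search per pass.
indexes = [build_index(records, key_fn) for key_fn in key_fns]

query = {"first": "peter", "last": "smyth", "city": "sydney"}
per_pass = [candidates(idx, key_fn(query)) for idx, key_fn in zip(indexes, key_fns)]
result = set().union(*per_pass)
print(per_pass, result)
```

In this toy data the first pass alone misses record 1 because the surname spelling differs, while the second pass, sorted by a different key, places it next to the query. That is the motivation for using several keys, and it is why the choice of keys affects both the quality of the candidate set and the amount of comparison work.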
Year
2015
DOI
10.1007/978-3-319-18032-8_45
Venue
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PART II
Keywords
Record linkage, Unsupervised learning, Automatic blocking, Key selection, Sorted neighbourhood indexing
Field
Data mining, Data set, Name resolution, Domain knowledge, Computer science, Search engine indexing, Sorting, Unsupervised learning, Artificial intelligence, Machine learning
DocType
Conference
Volume
9078
ISSN
0302-9743
Citations
4
PageRank
0.40
References
19
Authors
2
Name            Order  Citations  PageRank
Banda Ramadan   1      18         1.50
Peter Christen  2      1697       107.21