Title
Context-Aware Approximate String Matching for Large-Scale Real-Time Entity Resolution
Abstract
Techniques for approximate string matching have been studied extensively for several decades. They are required in many applications, including entity resolution, spell checking, similarity joins, and biological sequence comparison. Most existing approximate string matching techniques used in entity resolution consider only the two strings being compared. They neglect contextual information such as how often strings occur in a database, the likelihood of the character edits between strings, or how many other similar strings a database contains. In this paper we investigate whether incorporating such contextual information into edit-distance-based approximate string matching can improve matching quality for real-time entity resolution. In this application, query records must be matched in sub-second time to records in a large database that refer to the same entity. We evaluate our approach on two large real data sets and compare it to several baseline approaches. Our results show that considering edit frequency and the neighborhood size of a string can improve matching results, while taking string frequencies into account can actually make results worse.
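To illustrate how the likelihood of character edits could enter an edit distance calculation, the following is a minimal sketch and not the method evaluated in the paper. It assumes a hypothetical table sub_cost that assigns a lower cost to substitutions that occur frequently in a database; the function name, the table, and the cost values are all illustrative assumptions.

```python
from typing import Dict, Optional, Tuple


def weighted_edit_distance(
    s: str,
    t: str,
    sub_cost: Optional[Dict[Tuple[str, str], float]] = None,
    default_cost: float = 1.0,
) -> float:
    """Edit distance where substitutions may have context-dependent costs.

    sub_cost is a hypothetical lookup table: (char_in_s, char_in_t) -> cost.
    Pairs that are missing from the table fall back to default_cost, which
    is also the cost of every insertion and deletion.
    """
    sub_cost = sub_cost or {}
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [j * default_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * default_cost]
        for j, b in enumerate(t, start=1):
            if a == b:
                sub = 0.0  # matching characters cost nothing
            else:
                sub = sub_cost.get((a, b), default_cost)
            curr.append(
                min(
                    prev[j] + default_cost,      # delete a from s
                    curr[j - 1] + default_cost,  # insert b into s
                    prev[j - 1] + sub,           # substitute a with b
                )
            )
        prev = curr
    return prev[-1]


# Assumed example: 'i' -> 'y' is taken to be a frequent edit, so it is cheap.
frequent_edits = {("i", "y"): 0.3}
print(weighted_edit_distance("smith", "smyth"))
print(weighted_edit_distance("smith", "smyth", frequent_edits))
```

With an empty table the function reduces to the standard unit-cost edit distance, so the effect of any assumed edit costs can be compared directly against that baseline.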
Year
2015
DOI
10.1109/ICDMW.2015.152
Venue
ICDM Workshops
Keywords
Edit distance, data matching, real-time matching, string databases, similarity calculation
Field
String searching algorithm, Edit distance, String-to-string correction problem, Data mining, Joins, Data set, Commentz-Walter algorithm, Computer science, Approximate string matching, String metric
DocType
Conference
Citations
0
PageRank
0.34
References
15
Authors (2)
Peter Christen (Order: 1, Citations: 1697, PageRank: 107.21)
Ross W. Gayler (Order: 2, Citations: 71, PageRank: 8.19)