Title
BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution.
Abstract
Identifying records that refer to the same entity is a fundamental step for data integration. Since it is prohibitively expensive to compare every pair of records, blocking techniques are typically employed to reduce the complexity of this task. These techniques partition records into blocks and limit the comparison to records co-occurring in a block. Generally, to deal with highly heterogeneous and noisy data (e.g. semi-structured data of the Web), these techniques rely on redundancy to reduce the chance of missing matches. Meta-blocking is the task of restructuring blocks generated by redundancy-based blocking techniques, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features. In this paper, we demonstrate how "loose" schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract this loose information by adopting an LSH-based step for efficiently scaling to large datasets. We experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art unsupervised meta-blocking approaches, and, in many cases, also the supervised one.
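As a rough illustration of the two ideas the abstract describes, the Python sketch below builds redundant blocks with schema-agnostic token blocking and then applies a very simple meta-blocking step: each candidate pair is weighted by the number of blocks it shares, and pairs below the mean weight are pruned. This is not Blast's algorithm. It uses no loose schema information and no LSH step, and the function names (`token_blocking`, `meta_block`) and sample records are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Schema-agnostic token blocking: one block per token, ignoring attribute names."""
    blocks = defaultdict(set)
    for rid, attrs in records.items():
        for value in attrs.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # A block holding a single record yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def meta_block(blocks):
    """Weight each pair by its number of shared blocks; keep pairs at or above the mean."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    if not weights:
        return set()
    mean = sum(weights.values()) / len(weights)
    return {pair for pair, w in weights.items() if w >= mean}


# Hypothetical, heterogeneous records (attribute names differ on purpose).
records = {
    1: {"name": "John Smith", "city": "Boston"},
    2: {"full_name": "Smith John", "location": "Boston MA"},
    3: {"name": "Jane Doe", "city": "Boston"},
    4: {"title": "Jane Doe", "place": "Chicago"},
}

blocks = token_blocking(records)
candidates = meta_block(blocks)
print(sorted(candidates))
```

In this toy input the "boston" block alone suggests three comparisons, but the pairs (1, 3) and (2, 3) share only that one block and are pruned, while (1, 2) and (3, 4) share several blocks and survive.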
Year: 2016
DOI: 10.14778/2994509.2994533
Venue: PVLDB
Field: Data integration, Data mining, Noisy data, Name resolution, Computer science, Blocking techniques, Redundancy (engineering), Schema (psychology), Big data, Database, Speedup
DocType: Journal
Volume: 9
Issue: 12
ISSN: 2150-8097
Citations: 13
PageRank: 0.61
References: 13
Authors: 3
Name | Order | Citations, PageRank
Giovanni Simonini | 1 | 3111.55
Sonia Bergamaschi | 2 | 1240297.26
H. V. Jagadish | 3 | 111412495.67