Title
Active Learning for Large-Scale Entity Resolution.
Abstract
Entity resolution (ER) is the task of identifying different representations of the same real-world object across datasets. Designing and tuning ER algorithms is an error-prone, labor-intensive process, which can significantly benefit from data-driven, automated learning methods. Our focus is on "big data" scenarios where the primary challenges include 1) identifying, out of a potentially massive set, a small subset of informative examples to be labeled by the user, 2) using the labeled examples to efficiently learn ER algorithms that achieve both high precision and high recall, and 3) executing the learned algorithm to determine duplicates at scale. Recent work on learning ER algorithms has employed active learning to partially address the above challenges by aiming to learn ER rules in the form of conjunctions of matching predicates, under precision guarantees. While successful in learning a single rule, prior work has been less successful in learning multiple rules that are sufficiently different from each other, thus missing opportunities for improving recall. In this paper, we introduce an active learning system that learns, at scale, multiple rules each having significant coverage of the space of duplicates, thus leading to high recall, in addition to high precision. We show the superiority of our system on real-world ER scenarios of sizes up to tens of millions of records, over state-of-the-art active learning methods that learn either rules or committees of statistical classifiers for ER, and even over sophisticated methods based on first-order probabilistic models.
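The abstract describes ER rules as conjunctions of matching predicates, and argues that several sufficiently different rules together improve recall. The sketch below illustrates only that rule form; it is not the paper's system and contains no active learning. The field names (`name`, `zip`, `phone`), the predicates, the 0.5 threshold, and the sample records are all hypothetical, chosen purely for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Predicate = Callable[[Record, Record], bool]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical matching predicates over hypothetical fields.
def similar_name(r: Record, s: Record) -> bool:
    return jaccard(r["name"], s["name"]) >= 0.5  # threshold is an assumption


def same_zip(r: Record, s: Record) -> bool:
    return r["zip"] == s["zip"]


def same_phone(r: Record, s: Record) -> bool:
    return r["phone"] == s["phone"]


def rule_matches(rule: List[Predicate], r: Record, s: Record) -> bool:
    """A rule is a conjunction: every predicate must hold."""
    return all(p(r, s) for p in rule)


def is_duplicate(rules: List[List[Predicate]], r: Record, s: Record) -> bool:
    """A pair is a duplicate if any rule fires (a disjunction of rules)."""
    return any(rule_matches(rule, r, s) for rule in rules)


# Two deliberately different rules, each covering pairs the other misses.
RULES: List[List[Predicate]] = [[similar_name, same_zip], [same_phone]]

a = {"name": "Jon Smith", "zip": "95120", "phone": "555-0100"}
b = {"name": "Jon A Smith", "zip": "95120", "phone": "555-0199"}
c = {"name": "J. Smyth", "zip": "95120", "phone": "555-0100"}
```

With these sample records, the pair (a, b) is caught by the name-and-zip rule and the pair (a, c) only by the phone rule, which is the sense in which a second, different rule adds recall that a single rule would miss.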
Year: 2017
DOI: 10.1145/3132847.3132949
Venue: CIKM
Keywords: Entity Resolution, Large-Scale Data Cleansing
Field: Name resolution, Active learning, Information retrieval, Computer science, Artificial intelligence, Predicate (grammar), Probabilistic logic, Big data, Recall, Machine learning, Database
DocType: Conference
ISBN: 978-1-4503-4918-5
Citations: 6
PageRank: 0.41
References: 28
Authors: 3
Name | Order | Citations | PageRank
Kun Qian | 1 | 8 | 2.81
Ling-ling Yan | 2 | 1273 | 70.78
Prithviraj Sen | 3 | 837 | 38.24