Title
Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial Applications
Abstract
Entity matching -- the task of clustering duplicated database records to underlying entities -- has become an increasingly critical component in modern data integration management. Amperity provides a platform for businesses to manage customer data that utilizes a machine-learning approach to entity matching, resolving billions of customer records on a daily basis. We face several challenges in deploying entity matching to industrial applications at scale, and they are less prominent in the literature. These challenges include: (1) Providing not just a single entity clustering, but supporting clusterings at multiple confidence levels to enable downstream applications with varying precision/recall trade-off needs. (2) Many customer record attributes may be systematically missing from different sources of data, creating many pairs of records in a cluster that appear to not match due to incomplete, rather than conflicting information. Allowing these records to connect transitively without introducing conflicts is invaluable to businesses because they can acquire a more comprehensive profile of their customers without incorrect entity merges. (3) How to cluster records over time and assign persistent cluster IDs that can be used for downstream use cases such as A/B tests or predictive model training; this is made more challenging by the fact that we receive new customer data every day and clusters naturally evolving over time still require persistent IDs that refer to the same entity. In this work, we describe Amperity's entity matching framework, Fusion, and how its design provides solutions to these challenges. In particular, we describe our pairwise matching model based on ordinal regression that permits a well-defined way to produce entity clusterings at different confidence levels, a novel clustering algorithm that separates conflicting record pairs in clusters while allowing for pairs that may appear dissimilar due to missing data, and a persistent ID generation algorithm which balances stability of the identifier with ever-evolving entities.
Year
DOI
Venue
2020
10.1145/3318464.3386143
SIGMOD/PODS '20: International Conference on Management of Data Portland OR USA June, 2020
DocType
ISBN
Citations 
Conference
978-1-4503-6735-6
0
PageRank 
References 
Authors
0.34
0
4
Name
Order
Citations
PageRank
Yan Yan100.34
Stephen Meyles200.34
Aria Haghighi300.34
Dan Suciu496251349.54