Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial Applications - Citegraph

Paper Info

Title
Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial Applications

Abstract
Entity matching -- the task of clustering duplicated database records to underlying entities -- has become an increasingly critical component in modern data integration management. Amperity provides a platform for businesses to manage customer data that utilizes a machine-learning approach to entity matching, resolving billions of customer records on a daily basis. We face several challenges in deploying entity matching to industrial applications at scale that are less prominent in the literature. These challenges include: (1) Providing not just a single entity clustering, but clusterings at multiple confidence levels to support downstream applications with varying precision/recall trade-off needs. (2) Handling customer record attributes that are systematically missing from different sources of data, which creates many pairs of records in a cluster that appear not to match due to incomplete, rather than conflicting, information. Allowing these records to connect transitively without introducing conflicts is invaluable to businesses because they can acquire a more comprehensive profile of their customers without incorrect entity merges. (3) Clustering records over time and assigning persistent cluster IDs that can be used for downstream use cases such as A/B tests or predictive model training; this is made more challenging by the fact that we receive new customer data every day, and clusters that naturally evolve over time still require persistent IDs that refer to the same entity. In this work, we describe Amperity's entity matching framework, Fusion, and how its design provides solutions to these challenges.
In particular, we describe our pairwise matching model based on ordinal regression that permits a well-defined way to produce entity clusterings at different confidence levels, a novel clustering algorithm that separates conflicting record pairs in clusters while allowing for pairs that may appear dissimilar due to missing data, and a persistent ID generation algorithm which balances stability of the identifier with ever-evolving entities.
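
To make the first two ideas in the abstract concrete, here is a minimal sketch of how ordinal pairwise scores might yield clusterings at several confidence levels while keeping conflicting records apart. It is not Fusion's algorithm, which the abstract does not specify. The function name `cluster_at_threshold`, the 0-4 score scale, the greedy merge order, and the toy records are all assumptions made for illustration.

```python
def cluster_at_threshold(records, scored_pairs, conflicts, threshold):
    """Greedy sketch (hypothetical, not the paper's method).

    records      -- iterable of record IDs
    scored_pairs -- (a, b, score) tuples; score is an assumed ordinal
                    match level (higher = more confident match)
    conflicts    -- pairs of records known to disagree on some attribute
    threshold    -- minimum ordinal score at which a pair may be merged
    """
    cluster_of = {r: frozenset([r]) for r in records}
    conflict_set = {frozenset(p) for p in conflicts}

    # Consider the strongest evidence first.
    for a, b, score in sorted(scored_pairs, key=lambda t: -t[2]):
        if score < threshold:
            break
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue
        # Refuse the merge if it would co-locate any conflicting pair.
        if any(frozenset((x, y)) in conflict_set for x in ca for y in cb):
            continue
        merged = ca | cb
        for r in merged:
            cluster_of[r] = merged
    return set(cluster_of.values())


records = ["a", "b", "c", "d", "e"]
# "a" and "c" share no scored pair (for example, no attributes in common),
# yet they can still connect transitively through "b".
scored_pairs = [("a", "b", 4), ("b", "c", 3), ("c", "d", 2), ("d", "e", 2)]
conflicts = [("a", "d")]  # "a" and "d" carry contradictory information

high = cluster_at_threshold(records, scored_pairs, conflicts, threshold=4)
mid = cluster_at_threshold(records, scored_pairs, conflicts, threshold=3)
low = cluster_at_threshold(records, scored_pairs, conflicts, threshold=2)
```

In this toy data, raising the threshold gives finer, higher-precision clusters, and lowering it gives coarser, higher-recall ones. The conflict between "a" and "d" blocks the merge that the "c"-"d" pair would otherwise trigger at the lowest threshold.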

Field	Value
Year	2020
DOI	10.1145/3318464.3386143
Venue	SIGMOD/PODS '20: International Conference on Management of Data, Portland, OR, USA, June 2020
DocType	Conference
ISBN	978-1-4503-6735-6
Citations	0
PageRank	0.34
References	0
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Yan Yan	1	0	0.34
Stephen Meyles	2	0	0.34
Aria Haghighi	3	0	0.34
Dan Suciu	4	9625	1349.54
