Title
Normalization of Duplicate Records from Multiple Sources
Abstract
Data consolidation is a challenging issue in data integration. The usefulness of data increases when it is linked and fused with other data from numerous (Web) sources. The promise of Big Data hinges on addressing several big data integration challenges, such as record linkage at scale, real-time data fusion, and integration of the Deep Web. Although much work has been conducted on these problems, there is limited work on creating a uniform, standard record from a group of records corresponding to the same real-world entity. We refer to this task as record normalization. Such a record representation, coined normalized record, is important for both front-end and back-end applications. In this paper, we formalize the record normalization problem and present an in-depth analysis of normalization granularity levels (e.g., record, field, and value-component) and of normalization forms (e.g., typical versus complete). We propose a comprehensive framework for computing the normalized record.
The proposed framework includes a suite of record normalization methods, ranging from naive ones, which use only the information gathered from the records themselves, to complex strategies, which globally mine the group of duplicate records before selecting a value for an attribute of the normalized record. We conducted extensive empirical studies with all the proposed methods, indicate the weaknesses and strengths of each, and recommend the ones to be used in practice.
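To make the task concrete, the sketch below illustrates one naive field-level normalization strategy of the kind the abstract alludes to: for each attribute, pick the most frequent non-empty value across the duplicate records (majority voting). This is an illustrative assumption, not necessarily the paper's own method; the function name and sample records are hypothetical.

```python
from collections import Counter

def normalize_records(duplicates):
    """Field-level normalization sketch: for each attribute, choose the
    most frequent non-empty value among the duplicate records.
    Majority voting is only one naive baseline; the paper's framework
    also considers more complex, globally mined strategies."""
    normalized = {}
    # Collect every attribute that appears in any duplicate record.
    fields = {f for rec in duplicates for f in rec}
    for field in fields:
        values = [rec[field] for rec in duplicates if rec.get(field)]
        if values:
            # most_common(1) returns [(value, count)]; take the value.
            normalized[field] = Counter(values).most_common(1)[0][0]
    return normalized

# Three duplicate records describing the same entity, with value variations.
dups = [
    {"title": "record normalization", "venue": "TKDE"},
    {"title": "Record Normalization", "venue": "TKDE"},
    {"title": "Record Normalization", "venue": "IEEE TKDE"},
]
print(normalize_records(dups))
```

Under this voting rule, the normalized record takes the majority spelling of each field; ties and semantic variants (e.g., "TKDE" vs. "IEEE TKDE") are exactly the cases where the more sophisticated strategies studied in the paper become necessary.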
Year
2019
DOI
10.1109/TKDE.2018.2844176
Venue
IEEE Transactions on Knowledge and Data Engineering
Keywords
Data integration, Standards, Task analysis, Databases, Google, Data mining, Terminology
Field
Data integration, Record linkage, Data mining, Normalization (statistics), Terminology, Task analysis, Computer science, Sensor fusion, Big data, Empirical research
DocType
Journal
Volume
31
Issue
4
ISSN
1041-4347
Citations
2
PageRank
0.43
References
0
Authors (3)
Name | Order | Citations | PageRank
Yongquan Dong | 1 | 2 | 0.77
Eduard Constantin Dragut | 2 | 201 | 21.55
W. Meng | 3 | 542 | 49.10