Title
Normalization of Duplicate Records from Multiple Sources
Abstract
Data consolidation is a challenging issue in data integration. The usefulness of data increases when it is linked and fused with other data from numerous (Web) sources. The promise of Big Data hinges on addressing several big data integration challenges, such as record linkage at scale, real-time data fusion, and integration of the Deep Web. Although much work has been conducted on these problems, there is limited work on creating a uniform, standard record from a group of records corresponding to the same real-world entity. We refer to this task as record normalization. Such a record representation, coined normalized record, is important for both front-end and back-end applications. In this paper, we formalize the record normalization problem and present an in-depth analysis of normalization granularity levels (e.g., record, field, and value-component) and of normalization forms (e.g., typical versus complete). We propose a comprehensive framework for computing the normalized record.
The proposed framework includes a suite of record normalization methods, ranging from naive ones, which use only the information gathered from the records themselves, to complex strategies, which globally mine the group of duplicate records before selecting a value for an attribute of the normalized record. We conducted extensive empirical studies with all the proposed methods, indicate the weaknesses and strengths of each, and recommend the ones to be used in practice.
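To make the task concrete, the sketch below illustrates one naive field-level normalization strategy of the kind the abstract alludes to: for each attribute, pick the most frequent non-empty value across the duplicate records (majority voting). This is an illustrative assumption, not necessarily the paper's own method; the function name and sample records are hypothetical.

```python
from collections import Counter

def normalize_records(duplicates):
    """Field-level normalization sketch: for each attribute, choose the
    most frequent non-empty value among the duplicate records.
    Majority voting is only one naive baseline; the paper's framework
    also considers more complex, globally mined strategies."""
    normalized = {}
    # Collect every attribute that appears in any duplicate record.
    fields = {f for rec in duplicates for f in rec}
    for field in fields:
        values = [rec[field] for rec in duplicates if rec.get(field)]
        if values:
            # most_common(1) returns [(value, count)]; take the value.
            normalized[field] = Counter(values).most_common(1)[0][0]
    return normalized

# Three duplicate records describing the same entity, with value variations.
dups = [
    {"title": "record normalization", "venue": "TKDE"},
    {"title": "Record Normalization", "venue": "TKDE"},
    {"title": "Record Normalization", "venue": "IEEE TKDE"},
]
print(normalize_records(dups))
```

Under this voting rule, the normalized record takes the majority spelling of each field; ties and semantic variants (e.g., "TKDE" vs. "IEEE TKDE") are exactly the cases where the more sophisticated strategies studied in the paper become necessary.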
Year
2019
DOI
10.1109/TKDE.2018.2844176
Venue
IEEE Transactions on Knowledge and Data Engineering
Keywords
Data integration, Standards, Task analysis, Databases, Google, Data mining, Terminology
Field
Data integration, Record linkage, Data mining, Normalization (statistics), Terminology, Task analysis, Computer science, Sensor fusion, Big data, Empirical research
DocType
Journal
Volume
31
Issue
4
ISSN
1041-4347
Citations
2
PageRank
0.43
References
0
Authors (3)
Name | Order | Citations | PageRank
Yongquan Dong | 1 | 2 | 0.77
Eduard Constantin Dragut | 2 | 201 | 21.55
W. Meng | 3 | 542 | 49.10