Title
A Unified Record Linkage Strategy for Web Service Data
Abstract
Record linkage, also known as duplicate detection, is a key process for ensuring the quality of Web service data. Given two lists of records, record linkage consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined in terms of domain-specific similarities over the individual attributes that constitute the record. In this paper, we present a unified framework for recognizing clusters of near-duplicate records in multi-language data, especially Chinese/English mixed Web data. The key ideas are: (1) pre-processing multi-language data using Chinese word segmentation and Chinese named entity recognition techniques; (2) a pair-wise comparison method based on domain-specific similarities, in particular the string kernel method; (3) a priority queue of duplicate clusters with a representative-records strategy that adapts to the data scale. Experiments on real databases show that the proposed record linkage strategy is efficient and effective.
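Ideas (2) and (3) of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pre-processing step (Chinese word segmentation and named entity recognition) is omitted, a normalized character n-gram (spectrum) kernel stands in for the paper's string kernel, and the first record of each cluster is assumed to be its representative. The names `spectrum_kernel`, `record_similarity`, `link_records`, and the parameters `weights`, `threshold`, and `queue_size` are hypothetical.

```python
from collections import Counter, deque
from math import sqrt


def spectrum_kernel(a: str, b: str, n: int = 2) -> float:
    """Normalized character n-gram kernel in [0, 1] (stand-in for a string kernel)."""
    def grams(s: str) -> Counter:
        s = s.lower()
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    ga, gb = grams(a), grams(b)
    dot = sum(ga[g] * gb[g] for g in ga.keys() & gb.keys())
    norm = sqrt(sum(v * v for v in ga.values()) * sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Overall similarity as a weighted average of per-attribute similarities."""
    total = sum(weights.values())
    return sum(w * spectrum_kernel(r1[k], r2[k]) for k, w in weights.items()) / total


def link_records(records, weights, threshold=0.8, queue_size=4):
    """Single pass over the records using a bounded queue of recent clusters.

    Each record is compared only with the representative (assumed here to be
    the first member) of each queued cluster, so the cost per record is
    bounded by queue_size regardless of how many records have been seen.
    """
    queue = deque()   # most recently touched cluster sits at the front
    clusters = []     # every cluster ever created, in creation order
    for rec in records:
        hit = None
        for i, cluster in enumerate(queue):
            if record_similarity(rec, cluster[0], weights) >= threshold:
                hit = cluster
                del queue[i]          # will be re-inserted at the front
                break
        if hit is None:
            hit = []
            clusters.append(hit)
            if len(queue) >= queue_size:
                queue.pop()           # drop the lowest-priority cluster
        hit.append(rec)
        queue.appendleft(hit)
    return clusters
```

Under these assumptions, calling `link_records(records, {"name": 2.0, "city": 1.0})` on a list of dictionaries with `name` and `city` fields returns a list of clusters, each a list of records. A small `queue_size` keeps the number of comparisons low on large inputs at the risk of missing duplicates that are far apart in the input order, which is why such schemes are usually run over sorted data.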
Year: 2010
DOI: 10.1109/WKDD.2010.134
Venue: WKDD
Keywords: Ignore
Field: Record linkage, Data mining, Duplicate detection, Segmentation, Computer science, Priority queue, Artificial intelligence, String kernel, Web service, Named-entity recognition, Machine learning
DocType: Conference
ISBN: 978-1-4244-5398-6
Citations: 0
PageRank: 0.34
References: 11
Authors: 4
Name          Order  Citations  PageRank
Kan Q         1      0          0.34
Yang Yu-Jiu   2      89         19.30
Zhen S        3      0          0.34
Liu W.        4      209        22.42