Entity matching for semistructured data in the Cloud - Citegraph

Paper Info

Title
Entity matching for semistructured data in the Cloud

Abstract
The amount of available information, on the Web or inside companies, is expanding rapidly. With Cloud infrastructure maturing (including tools for parallel data processing, text analytics, clustering, etc.), there is more interest in integrating data to produce higher-value content. New challenges notably include entity matching over large volumes of heterogeneous data. In this paper, we describe an approach for entity matching over large amounts of semistructured data in the Cloud. The approach combines ChuQL [4], a recently proposed extension of XQuery with MapReduce, and a blocking technique for entity matching that can be executed efficiently on top of MapReduce. We illustrate the proposed approach by applying it to automatically extract and enrich references in Wikipedia, and we report on an experimental evaluation of the approach.
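
The abstract names blocking on top of MapReduce but gives no detail, so the following is only a minimal sketch of the general idea, not the paper's method and not ChuQL. It assumes records are dictionaries with hypothetical "id" and "title" fields, uses the first three letters of the title as an illustrative blocking key, and simulates the map, shuffle, and reduce steps in plain Python. The similarity measure (difflib's SequenceMatcher) and the 0.8 threshold are likewise assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lowercased title.
    return record["title"].strip().lower()[:3]


def map_phase(records):
    # Emit (blocking key, record) pairs, as a mapper would.
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    # Group records by key; this stands in for the MapReduce shuffle step.
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def reduce_phase(blocks, threshold=0.8):
    # Compare records only within the same block, never across blocks.
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = SequenceMatcher(
                None, a["title"].lower(), b["title"].lower()
            ).ratio()
            if score >= threshold:
                matches.append((a["id"], b["id"]))
    return matches


def match_entities(records, threshold=0.8):
    # Run the three simulated phases in sequence.
    return reduce_phase(shuffle(map_phase(records)), threshold)
```

Because comparisons happen only inside a block, the number of pairwise comparisons depends on block sizes rather than on the size of the whole collection. The trade-off is that two records with different blocking keys are never compared, so the choice of key affects recall.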

Year	DOI	Venue
2012	10.1145/2245276.2245363	SAC
Keywords	Field	DocType
experimental evaluation, large volume, heterogeneous data, available information, cloud infrastructure, entity matching, large amount, semistructured data, parallel data processing, privacy, data processing, performance, cloud computing	Data mining, Data processing, Text mining, Information retrieval, Computer science, Cluster analysis, Cloud computing, XQuery	Conference
Citations	PageRank	References
1	0.39	8
Authors
5

Authors (5 rows)

Name	Order	Citations	PageRank
Marcus Paradies	1	82	10.36
Susan Malaika	2	76	14.01
Jérôme Siméon	3	1515	210.75
Shahan Khatchadourian	4	102	8.57
Kai-Uwe Sattler	5	1144	126.81
