Title
A cluster-based approach to XML similarity joins
Abstract
A natural consequence of the widespread adoption of XML as a standard for information representation and exchange is the redundant storage of large amounts of persistent XML documents. Compared to relational data tables, data represented in XML format can potentially be even more sensitive to data quality issues because structure, besides textual information, may cause variations in XML documents representing the same information entity. Therefore, correlating XML documents that are similar in content and structure is a fundamental operation. In this paper, we present an effective, flexible, and high-performance XML-based similarity join framework. We exploit structural summaries and clustering concepts to produce compact and high-quality XML document representations: our approach outperforms previous work in terms of both performance and accuracy. In this context, we explore different ways to weigh and combine evidence from textual and structural XML representations. Furthermore, we address user interaction, when the similarity framework is configured for a specific domain, and updatability of clustering information, when new documents enter the datasets under consideration. We present a thorough experimental evaluation to validate our techniques in the context of a native XML DBMS.
Year
2009
DOI
10.1145/1620432.1620451
Venue
IDEAS
Keywords
persistent xml document, xml format, correlating xml document, xml similarity, information entity, structural xml representation, information representation, clustering information, native xml dbms, xml document, cluster-based approach, high-quality xml document representation, entity resolution, xml database, clustering, xml databases, xml, data quality, relational data
Field
Data mining, XML framework, XML Encryption, Efficient XML Interchange, Streaming XML, Information retrieval, Computer science, XML validation, Document Structure Description, XML schema, Database, XML Schema Editor
DocType
Conference
Citations
5
PageRank
0.44
References
32
Authors
3
Name                   Order   Citations   PageRank
Leonardo A. Ribeiro    1       18          1.02
Theo Härder            2       1132        307.12
Fernanda S. Pimenta    3       10          1.63