Title
Clustering XML documents by patterns.
Abstract
Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, cluster analysis is one area that could benefit greatly from in-depth analysis of XML’s semi-structured nature. Most XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms that use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which tries to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework’s distinguishing feature is the combination of pattern clustering and document-cluster assignment, which allows documents to be grouped according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match other XML clustering approaches in terms of accuracy, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.
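The four steps named in the abstract can be illustrated with a small Python sketch. This is not the authors' PathXP implementation. The pattern definition (root-to-node tag paths), the Jaccard-based greedy merging of patterns, the overlap-based assignment rule, and all function names are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def paths(xml_text):
    """Step 1 (assumed pattern definition): every root-to-node tag path."""
    found = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        found.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return found


def mine_maximal_frequent_paths(doc_paths, min_support):
    """Step 2: paths occurring in at least min_support documents that are
    not a proper prefix of another frequent path."""
    support = defaultdict(set)
    for doc_id, ps in enumerate(doc_paths):
        for p in ps:
            support[p].add(doc_id)
    frequent = {p: d for p, d in support.items() if len(d) >= min_support}
    return {
        p: d for p, d in frequent.items()
        if not any(len(q) > len(p) and q[:len(p)] == p for q in frequent)
    }


def cluster_patterns(patterns, k):
    """Step 3 (assumed strategy): repeatedly merge the two profiles whose
    supporting document sets have the highest Jaccard similarity."""
    profiles = [({p}, set(d)) for p, d in sorted(patterns.items())]
    while len(profiles) > k:
        best = None
        for i in range(len(profiles)):
            for j in range(i + 1, len(profiles)):
                a, b = profiles[i][1], profiles[j][1]
                sim = len(a & b) / len(a | b)
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = (profiles[i][0] | profiles[j][0],
                  profiles[i][1] | profiles[j][1])
        profiles = [p for n, p in enumerate(profiles) if n not in (i, j)]
        profiles.append(merged)
    return [pats for pats, _ in profiles]


def assign_documents(doc_paths, profiles):
    """Step 4 (assumed rule): a document joins the profile with which it
    shares the most patterns."""
    return [
        max(range(len(profiles)), key=lambda c: len(ps & profiles[c]))
        for ps in doc_paths
    ]


# Toy input: two book-like and two article-like documents.
docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><isbn/></book>",
    "<article><title/><journal/></article>",
    "<article><title/><journal/></article>",
]
doc_paths = [paths(d) for d in docs]
patterns = mine_maximal_frequent_paths(doc_paths, min_support=2)
profiles = cluster_patterns(patterns, k=2)
labels = assign_documents(doc_paths, profiles)
```

In this toy run, no document is compared directly with another document: documents are grouped only through the pattern profiles they match, and each profile (a set of tag paths) doubles as a readable cluster representative.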
Year: 2016
DOI: 10.1007/s10115-015-0820-0
Venue: Knowledge and Information Systems
Keywords: XML, Semi-structured document analysis, Pattern-based clustering, Pattern definition
Field: Data mining, XML, Information retrieval, Computer science, Cluster analysis
DocType: Journal
Volume: 46
Issue: 1
ISSN: 0219-3116
Citations: 3
PageRank: 0.40
References: 32
Authors
3
Name
Order
Citations
PageRank
Maciej Piernik1132.60
Dariusz Brzezinski221311.28
Tadeusz Morzy3487282.62