Title
A bottom-up approach for XML documents classification
Abstract
Extensible Markup Language (XML) is a simple and flexible text format derived from SGML [1]. It has been widely accepted as a crucial component of many information-retrieval-related applications, such as XML databases and web services. One reason for its wide acceptance is its customizable format for data transmission and data storage. Classification is an important data mining task that aims to assign unknown objects to the classes that best characterize them. In this paper, we propose a method to classify XML documents under the assumption that they do not share a common schema, and that a schema may or may not be available. Our method is similarity-based. Its main characteristic is the way it handles the roles played by text and structural information. Unlike most existing methods, we use a bottom-up approach, i.e., we start from the text and then embed the structural information. This is based on the observation that, in XML documents with diversified tag structures, the most informative content is carried by the terms in the text. Our experiments show that this strategy can achieve better performance than existing methods for documents from sources that exhibit heterogeneous structures.
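The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea it describes: build features from the terms in the text first, then attach structural information to them, and classify by similarity. Everything specific here is an assumption made for illustration and not the authors' method: the function names, the `structure_weight` parameter, the choice of a parent-tag/term feature as the structural signal, cosine similarity, and the 1-nearest-neighbor decision rule.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def features(xml_text, structure_weight=0.5):
    """Bag of terms from text nodes, plus weaker tag/term features (assumed scheme)."""
    root = ET.fromstring(xml_text)
    vec = Counter()
    for elem in root.iter():
        for term in (elem.text or "").lower().split():
            vec[term] += 1.0                               # text first
            vec[f"{elem.tag}/{term}"] += structure_weight  # then structure
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def classify(xml_text, labeled_docs):
    """Return the label of the most similar labeled document (1-nearest neighbor)."""
    query = features(xml_text)
    return max(labeled_docs, key=lambda pair: cosine(query, features(pair[0])))[1]


# Hypothetical toy data: two labeled documents with unrelated tag structures.
TRAINING = [
    ("<book><title>database query index</title></book>", "databases"),
    ("<article><head>protein gene sequence</head></article>", "biology"),
]

# The query shares no tags with the training documents, only vocabulary.
print(classify("<paper><abstract>query index tuning</abstract></paper>", TRAINING))
```

In this toy setting the query shares no tag names with either training document, so the plain term features alone determine the match, which is the situation the abstract argues for when tag structures are heterogeneous.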
Year
2008
DOI
10.1145/1451940.1451960
Venue
IDEAS
Keywords
information retrieval,existing method,customized format,structural information,xml databases,xml documents classification,important data mining task,informative information,data transmission,data storage,xml document,bottom-up approach,data mining,discretization,bottom up,classification,extensible markup language,mutual information,xml database,web service
Field
Data mining,Efficient XML Interchange,Streaming XML,SGML,XML,Information retrieval,XML validation,Computer science,Document Structure Description,XML schema,Database,XML Schema Editor
DocType
Conference
Citations
3
PageRank
0.42
References
15
Authors
2
Name | Order | Citations | PageRank
Junwei Wu | 1 | 3 | 0.42
Jian Tang | 2 | 526148.30