Title
Finding Frequent Structural Features among Words in Tree-Structured Documents
Abstract
Many electronic documents, such as SGML/HTML/XML files and LaTeX files, have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large amounts of plain text. To extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in such documents. Let $k \geq 2$ be an integer and $(W_1, W_2, \ldots, W_k)$ a list of words sorted in lexicographical order. A consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ is a sequence $\langle t_1; t_2; \ldots; t_{k-1} \rangle$ of labeled rooted ordered trees such that, for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the pair $(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are one and which are labeled with $W_i$ and $W_{i+1}$, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. We then report experimental results showing that the algorithm is efficient for extracting structural features from tree-structured documents.
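The definition above allows two shapes for each tree in the sequence, so a list of k words admits 2^(k-1) candidate consecutive path patterns. The following Python sketch only enumerates those candidates; it is an illustration of the definition and not the mining algorithm from the paper. The names `ONE_NODE`, `TWO_NODES`, and `candidate_patterns` are hypothetical, and the abstract does not say which of the two nodes in case (2) is the root, so the sketch leaves that unspecified.

```python
from itertools import product

# Hypothetical encodings of the two shapes the definition allows for each tree t_i.
ONE_NODE = "one_node"    # case (1): a single node labeled with the pair (W_i, W_{i+1})
TWO_NODES = "two_nodes"  # case (2): two nodes joined by an edge, labeled W_i and W_{i+1}


def candidate_patterns(words):
    """Yield every sequence <t_1; ...; t_{k-1}> permitted by the definition.

    Each tree t_i is encoded as (shape, (W_i, W_{i+1})). This encoding is an
    assumption made for illustration; the source gives no data structure.
    """
    if len(words) < 2:
        raise ValueError("the definition requires k >= 2 words")
    if list(words) != sorted(words):
        raise ValueError("words must be sorted in lexicographical order")
    # Consecutive word pairs (W_i, W_{i+1}) for i = 1, ..., k-1.
    pairs = list(zip(words, words[1:]))
    # Choose one of the two shapes independently for each position i.
    for shapes in product((ONE_NODE, TWO_NODES), repeat=len(pairs)):
        yield tuple(zip(shapes, pairs))


patterns = list(candidate_patterns(["data", "mining", "tree"]))
print(len(patterns))  # 4, i.e. 2 ** (k - 1) for k = 3
```

Deciding which of these candidates are frequent requires a notion of a pattern occurring in a document, which the abstract does not spell out, so the sketch stops at enumeration.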
Year: 2004
DOI: 10.1007/978-3-540-24775-3_43
Venue: ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS
Keywords: lexicographic order, tree structure
Field: Integer, File format, Discrete mathematics, SGML, XML, Computer science, Electronic document, Information extraction, Tree structure, Lexicographical order
DocType: Conference
Volume: 3056
ISSN: 0302-9743
Citations: 4
PageRank: 0.46
References: 6
Authors: 3
Name | Order | Citations | PageRank
Tomoyuki Uchida | 1 | 255 | 35.06
Tomonori Mogawa | 2 | 4 | 0.46
Yasuaki Nakamura | 3 | 105140.45