
Paper Info

Title
Finding Frequent Structural Features among Words in Tree-Structured Documents

Abstract
Many electronic documents such as SGML/HTML/XML files and LaTeX files have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large plain texts. In order to extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in tree-structured documents. Let k ≥ 2 be an integer and (W_1, W_2, ..., W_k) a list of words which are sorted in lexicographical order. A consecutive path pattern on (W_1, W_2, ..., W_k) is a sequence <t_1; t_2; ...; t_{k-1}> of labeled rooted ordered trees such that, for i = 1, 2, ..., k-1, (1) t_i consists of only one node having the pair (W_i, W_{i+1}) as its label, or (2) t_i has just two nodes whose degrees are one and which are labeled with W_i and W_{i+1}, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. Then, by reporting experimental results on our algorithm, we show that our algorithm is efficient for extracting structural features from tree-structured documents.
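
The definition of a consecutive path pattern in the abstract can be illustrated with a small validity check. This is only a sketch of the definition, not the paper's mining algorithm, which the abstract does not describe. The names PairNode, TwoNodeTree and is_consecutive_path_pattern are hypothetical, and two assumptions are made: Python's default string ordering stands in for "lexicographical order", and either node of a two-node tree may be the root, since the abstract does not fix the orientation.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


@dataclass(frozen=True)
class PairNode:
    """Case (1): a single node labeled with the pair (W_i, W_{i+1})."""
    label: Tuple[str, str]


@dataclass(frozen=True)
class TwoNodeTree:
    """Case (2): two nodes of degree one, i.e. a root with exactly one child."""
    root: str
    child: str


Tree = Union[PairNode, TwoNodeTree]


def is_consecutive_path_pattern(words: Sequence[str], trees: Sequence[Tree]) -> bool:
    """Check the abstract's definition for a word list and a tree sequence."""
    k = len(words)
    # The word list needs k >= 2 entries in sorted order
    # (assumption: Python's default string comparison).
    if k < 2 or list(words) != sorted(words):
        return False
    # The pattern is a sequence of exactly k - 1 trees.
    if len(trees) != k - 1:
        return False
    for i, tree in enumerate(trees):
        w, w_next = words[i], words[i + 1]
        if isinstance(tree, PairNode):
            if tree.label != (w, w_next):
                return False
        elif isinstance(tree, TwoNodeTree):
            # Assumption: either orientation (root/child) is accepted.
            if sorted([tree.root, tree.child]) != [w, w_next]:
                return False
        else:
            return False
    return True


words = ["data", "mining", "tree"]
trees = [PairNode(("data", "mining")), TwoNodeTree("mining", "tree")]
print(is_consecutive_path_pattern(words, trees))
```

Here the word list has k = 3 entries, so the pattern consists of two trees: the first uses case (1) and the second uses case (2).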

Year	2004
DOI	10.1007/978-3-540-24775-3_43
Venue	ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS
Keywords	lexicographic order, tree structure
Field	Integer, File format, Discrete mathematics, SGML, XML, Computer science, Electronic document, Information extraction, Tree structure, Lexicographical order
DocType	Conference
Volume	3056
ISSN	0302-9743
Citations	4
PageRank	0.46
References	6
Authors	3

Authors (3 rows)

Name	Order	Citations	PageRank
Tomoyuki Uchida	1	255	35.06
Tomonori Mogawa	2	4	0.46
Yasuaki Nakamura	3	105	140.45
