Abstract | ||
---|---|---|
Many electronic documents such as SGML/HTML/XML files and LaTeX files have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large plain texts. In order to extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in tree-structured documents. Let $k \geq 2$ be an integer and $(W_1, W_2, \ldots, W_k)$ a list of words which are sorted in lexicographical order. A consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ is a sequence $\langle t_1; t_2; \ldots; t_{k-1} \rangle$ of labeled rooted ordered trees such that, for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the pair $(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are one and which are labeled with $W_i$ and $W_{i+1}$, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. Then, by reporting experimental results on our algorithm, we show that our algorithm is efficient for extracting structural features from tree-structured documents. |
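
The definition of a consecutive path pattern in the abstract can be illustrated with a minimal Python sketch. It only models the two allowed shapes of each tree in the sequence and checks a candidate sequence against the definition; it is not the mining algorithm of the paper. The names `PairNode`, `TwoNodes`, `is_consecutive_path_pattern`, and `enumerate_shapes` are hypothetical, and two simplifications are assumed: lexicographical order is taken to be Python's default string ordering, and for the two-node case the sketch does not model which node is the root.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, List, Sequence, Union


@dataclass(frozen=True)
class PairNode:
    """Case (1): a single node labeled with the pair (W_i, W_{i+1})."""
    left: str
    right: str


@dataclass(frozen=True)
class TwoNodes:
    """Case (2): two nodes labeled W_i and W_{i+1}, respectively.

    Assumption: which of the two nodes is the root is not modeled here.
    """
    first: str
    second: str


Tree = Union[PairNode, TwoNodes]


def is_consecutive_path_pattern(words: Sequence[str], trees: Sequence[Tree]) -> bool:
    """Check a candidate sequence of trees against the abstract's definition."""
    k = len(words)
    # The word list needs k >= 2 entries, sorted (assumed: default str order).
    if k < 2 or list(words) != sorted(words):
        return False
    # The sequence has exactly k - 1 trees, one per consecutive word pair.
    if len(trees) != k - 1:
        return False
    for i, tree in enumerate(trees):
        if isinstance(tree, PairNode):
            labels = (tree.left, tree.right)
        elif isinstance(tree, TwoNodes):
            labels = (tree.first, tree.second)
        else:
            return False
        # The i-th tree must carry the i-th consecutive pair of words.
        if labels != (words[i], words[i + 1]):
            return False
    return True


def enumerate_shapes(words: Sequence[str]) -> Iterator[List[Tree]]:
    """Yield every shape combination: each tree is either case (1) or case (2)."""
    for choice in product((PairNode, TwoNodes), repeat=len(words) - 1):
        yield [cls(words[i], words[i + 1]) for i, cls in enumerate(choice)]


if __name__ == "__main__":
    example_words = ["data", "mining", "tree"]
    for pattern in enumerate_shapes(example_words):
        print(pattern, is_consecutive_path_pattern(example_words, pattern))
```
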
Year | DOI | Venue |
---|---|---|
2004 | 10.1007/978-3-540-24775-3_43 | ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS |
Keywords | Field | DocType |
lexicographic order,tree structure | Integer,File format,Discrete mathematics,SGML,XML,Computer science,Electronic document,Information extraction,Tree structure,Lexicographical order | Conference |
Volume | ISSN | Citations |
3056 | 0302-9743 | 4 |
PageRank | References | Authors |
0.46 | 6 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Tomoyuki Uchida | 1 | 255 | 35.06 |
Tomonori Mogawa | 2 | 4 | 0.46 |
Yasuaki Nakamura | 3 | 105 | 140.45 |