Title
A New Frequent Similar Tree Algorithm Motivated by Dom Mining - Using RTDM and its New Variant - SiSTeR.
Abstract
The importance of recognizing repeating structures in web applications has generated a large body of work on algorithms for mining the HTML Document Object Model (DOM). A restricted tree edit distance metric, called the Restricted Top Down Metric (RTDM), is most suitable for DOM mining as well as less computationally expensive than the general tree edit distance. Given two trees with input size n(1) and n(2), the current methods take time O(n(1) center dot n(2)) to compute RTDM. Consider, however, looking for patterns that form subtrees within a web page with n elements. The RTDM must be computed for all subtrees, and the running time becomes O(n(4)). This paper proposes a new algorithm which computes the distance between all the subtrees in a tree in time O(n(2)), which enables us to obtain better quality as well as better performance, on a DOM mining task. In addition, we propose a new tree edit-distance - SiSTeR (Similar Sibling Trees aware RTDM). This variant of RTDM allows considering the case were repetitious (very similar) subtrees of different quantity appear in two trees which are supposed to be considered as similar.
Year
Venue
Keywords
2011
KDIR 2011: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND INFORMATION RETRIEVAL
Frequent trees,Tree edit distance,RTDM,DOM,Web mining,Web data records
Field
DocType
Citations 
Data mining,Web page,Computer science,Top-down and bottom-up design,Algorithm,Sister,Artificial intelligence,Tree edit distance,Document Object Model,Web application,Machine learning
Conference
0
PageRank 
References 
Authors
0.34
14
3
Name
Order
Citations
PageRank
Omer Barkol11027.78
Ruth Bergman2457.05
Shahar Golan3575.72