STEM: a suffix tree-based method for web data records extraction - Citegraph

Paper Info

Title
STEM: a suffix tree-based method for web data records extraction

Abstract
To automatically extract data records from Web pages, a data record extraction algorithm must be robust and efficient. However, most existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of Web pages. Then, a suffix tree is built on top of this sequence, and four refining filters are proposed to screen out data regions that might not contain data records. To evaluate model performance, we define an evaluation metric called pattern similarity and perform rigorous experiments on five real data sets. The experimental results demonstrate that the proposed STEM is superior to state-of-the-art algorithms such as MDR, TPC and CTVS with respect to precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear in the total number of HTML tags contained in Web pages, which indicates the potential applicability of STEM to a wide range of Web-scale data record extraction applications. © 2017, Springer-Verlag London.
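
The first two steps of the abstract (tag paths mapped to an identifier sequence, then a search for repeated patterns in that sequence) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `TagPathCollector`, `tag_path_identifiers` and `longest_repeated_run` are hypothetical, the four refining filters are not reproduced, and a naive sorted-suffix comparison stands in for the suffix tree, so it does not have the linear time complexity the paper reports.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed subset, for illustration only).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag path of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.paths = []   # one tag path per start tag encountered

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_path_identifiers(html):
    """Map each distinct tag path to a small integer, keeping document order."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    ids = {}
    return [ids.setdefault(path, len(ids)) for path in collector.paths]


def longest_repeated_run(seq):
    """Longest contiguous run that occurs at least twice (occurrences may overlap).

    Naive stand-in for a suffix tree: sort all suffixes and compare neighbours.
    """
    suffixes = sorted(range(len(seq)), key=lambda i: seq[i:])
    best = []
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while a + k < len(seq) and b + k < len(seq) and seq[a + k] == seq[b + k]:
            k += 1
        if k > len(best):
            best = seq[a:a + k]
    return best


page = "<ul><li><a>x</a><span>1</span></li><li><a>y</a><span>2</span></li></ul>"
sequence = tag_path_identifiers(page)
pattern = longest_repeated_run(sequence)
```

On this toy page, the two `li` items share the same tag paths, so the repeated run in the identifier sequence corresponds to one candidate record pattern. A real suffix tree would expose all repeated patterns, not only the longest one.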

Year	DOI	Venue
2018	10.1007/s10115-017-1062-0	Knowledge and Information Systems
Keywords	Field	DocType
Web data extraction,Suffix tree,HTML tag path,Data Record pattern	HTML element,Data mining,Data set,Web page,Identifier,Computer science,Suffix tree,Time complexity,Compressed suffix array,Data records	Journal
Volume	Issue	ISSN
55	2	0219-1377
Citations	PageRank	References
0	0.34	50
Authors
5

Authors (5 rows)

Name	Order	Citations	PageRank
Yixiang Fang	1	227	23.06
Xiaoqin Xie	2	18	10.36
Zhang Xiaofeng	3	101	18.32
Reynold Cheng	4	3069	154.13
Zhang Zhiqiang	5	0	0.34
