An unsupervised technique to extract information from semi-structured web pages

Paper Info

Title
An unsupervised technique to extract information from semi-structured web pages

Abstract
We propose a technique that takes two or more web pages generated by the same server-side template and tries to learn a regular expression that represents it and helps extract relevant information from similar pages. Our experimental results on real-world web sites demonstrate that our technique outperforms others in terms of both effectiveness and efficiency and is not affected by HTML errors.
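To make the idea in the abstract concrete, the following minimal Python sketch learns a regular expression from two pages that share a template and then applies it to a third page. It is not the authors' algorithm, which the abstract does not describe in detail. As a stand-in, it aligns the token sequences of the two pages with the standard-library `difflib.SequenceMatcher`, keeps the shared tokens as literal template text, and turns every differing region into a capture group. The function names `tokenize`, `learn_pattern`, and `extract` are hypothetical, and the sketch handles only two input pages, whereas the abstract allows two or more.

```python
import difflib
import re


def tokenize(page):
    """Split a page into tags and text runs, dropping whitespace-only tokens."""
    tokens = re.findall(r"<[^>]+>|[^<]+", page)
    return [t.strip() for t in tokens if t.strip()]


def learn_pattern(page_a, page_b):
    """Build a regex from two pages assumed to share one template.

    Illustrative assumption: tokens common to both pages belong to the
    template, and tokens that differ are the data to extract.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared tokens become literal template text.
            parts.extend(re.escape(tok) for tok in a[i1:i2])
        else:
            # Differing regions become lazy capture groups.
            parts.append("(.*?)")
    # Tolerate arbitrary whitespace between consecutive tokens.
    return re.compile(r"\s*".join(parts), re.DOTALL)


def extract(pattern, page):
    """Return the captured data fields, or None if the page does not match."""
    match = pattern.search(page)
    return [g.strip() for g in match.groups()] if match else None
```

Given two product pages that differ only in a title and a price, `learn_pattern` yields a pattern with two capture groups, and `extract` applied to a third page from the same template returns that page's title and price. Because this sketch matches the template tokens literally, it does not reproduce the robustness to HTML errors that the abstract reports.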

Year	DOI	Venue
2012	10.1007/978-3-642-35063-4_46	WISE
Keywords	Field	DocType
semi-structured web page,web page,real-world web site,server-side template,similar page,html error,relevant information,unsupervised technique,regular expression,unsupervised learning	Data mining,Regular expression,Information retrieval,Web page,Computer science,Website Parse Template,Unsupervised learning	Conference
Citations	PageRank	References
6	0.40	14
Authors
2

Authors (2 rows)


Name	Order	Citations	PageRank
Hassan A. Sleiman	1	103	8.33
Rafael Corchuelo	2	389	49.87
