Title
An unsupervised technique to extract information from semi-structured web pages
Abstract
We propose a technique that takes two or more web pages generated by the same server-side template and tries to learn a regular expression that represents it and helps extract relevant information from similar pages. Our experimental results on real-world web sites demonstrate that our technique outperforms others in terms of both effectiveness and efficiency and is not affected by HTML errors.
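The idea in the abstract of generalizing a regular expression from pages that share a template can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a simple token alignment with Python's `difflib`, and the names `tokenize`, `learn_pattern`, and the sample pages are hypothetical. Tokens shared by both pages are treated as template text, and each stretch that differs becomes a capture group.

```python
import re
from difflib import SequenceMatcher

# A token is either a whole tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into stripped tag and text tokens, dropping blank ones."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def learn_pattern(page_a, page_b):
    """Build a regex from two pages assumed to share one template.

    Illustrative assumption: aligned equal tokens are template text,
    and every differing stretch is a data slot (one capture group).
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            parts.extend(re.escape(t) for t in a[i1:i2])
        else:
            parts.append("(.*?)")
    # Allow arbitrary whitespace between tokens and at both ends.
    return re.compile(r"\s*" + r"\s*".join(parts) + r"\s*", re.DOTALL)


# Hypothetical sample pages generated by the same template.
page1 = "<html><body><h1>Dune</h1><p>Price: <b>9.99</b></p></body></html>"
page2 = "<html><body><h1>Emma</h1><p>Price: <b>4.50</b></p></body></html>"
page3 = "<html><body><h1>Ulysses</h1><p>Price: <b>12.00</b></p></body></html>"

pattern = learn_pattern(page1, page2)
extracted = pattern.fullmatch(page3).groups()
print(extracted)
```

Because the sketch works on a flat token stream rather than a parsed DOM tree, it does not require the input HTML to be well formed. It handles only two input pages and has no notion of repeated or optional template parts.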
Year
2012
DOI
10.1007/978-3-642-35063-4_46
Venue
WISE
Keywords
semi-structured web page, web page, real-world web site, server-side template, similar page, html error, relevant information, unsupervised technique, regular expression, unsupervised learning
Field
Data mining, Regular expression, Information retrieval, Web page, Computer science, Website Parse Template, Unsupervised learning
DocType
Conference
Citations
6
PageRank
0.40
References
14
Authors
2
Name                Order  Citations  PageRank
Hassan A. Sleiman   1      103        8.33
Rafael Corchuelo    2      389        49.87