Abstract |
---|
We propose a technique that takes two or more web pages generated by the same server-side template and tries to learn a regular expression that represents it and helps extract relevant information from similar pages. Our experimental results on real-world web sites demonstrate that our technique outperforms others in terms of both effectiveness and efficiency and is not affected by HTML errors. |
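
The idea summarized in the abstract, inferring a regular expression from pages that share a server-side template, can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes that token-level alignment with Python's `difflib` is an acceptable stand-in for their learning procedure, and the names `tokenize`, `learn_pattern`, and the sample pages are hypothetical. Tokens shared by both pages are treated as template text, and regions that differ become capture groups. Because the pages are split with a regular expression rather than parsed into a tree, the sketch never requires well-formed HTML.

```python
import difflib
import re

# A token is either a whole tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(page):
    """Split a page into tag and text tokens, dropping whitespace-only text."""
    return [t.strip() for t in TOKEN.findall(page) if t.strip()]


def learn_pattern(page_a, page_b):
    """Toy stand-in (assumption): align two pages and turn differences into groups."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            # Tokens present in both pages are assumed to come from the template.
            parts.extend(re.escape(tok) for tok in a[i1:i2])
        else:
            # Any differing region is assumed to hold page-specific data.
            parts.append("(.*?)")
    # Allow arbitrary whitespace between consecutive tokens.
    return re.compile(r"\s*".join(parts), re.DOTALL)


# Hypothetical pages generated by the same template.
page_a = "<html><h1>Dune</h1><span>9.99</span></html>"
page_b = "<html><h1>Emma</h1><span>4.50</span></html>"
page_c = "<html><h1>Ulysses</h1><span>12.00</span></html>"

pattern = learn_pattern(page_a, page_b)
print(pattern.search(page_c).groups())  # ('Ulysses', '12.00')
```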

Year | DOI | Venue |
---|---|---|
2012 | 10.1007/978-3-642-35063-4_46 | WISE |

Keywords | Field | DocType |
---|---|---|
semi-structured web page, web page, real-world web site, server-side template, similar page, html error, relevant information, unsupervised technique, regular expression, unsupervised learning | Data mining, Regular expression, Information retrieval, Web page, Computer science, Website Parse Template, Unsupervised learning | Conference |

Citations | PageRank | References |
---|---|---|
6 | 0.40 | 14 |


Authors |
---|
2 |

Name | Order | Citations | PageRank |
---|---|---|---|
Hassan A. Sleiman | 1 | 103 | 8.33 |
Rafael Corchuelo | 2 | 389 | 49.87 |