Abstract | ||
---|---|---|
Web information is often presented in the form of records, e.g., a product record on a shopping website or a personal profile on a social utility website. Given a host webpage and related information needs, identifying relevant records and their internal semantic structures is critical to many online information systems. Wrapper induction is one of the most effective methods for such tasks. However, most traditional wrapper techniques have difficulty with web records because they are designed to extract information from a page, not a record. We propose a record-level wrapper system. In our system, we use a novel "broom" structure to represent both records and generated wrappers. With this representation, our system is able to effectively extract records and identify their internal semantics at the same time. We test our system on 16 real-life websites from four different domains. Experimental results demonstrate 99% extraction accuracy in terms of F1-value. |
Year | DOI | Venue |
---|---|---|
2009 | 10.1145/1645953.1645962 | CIKM |
Keywords | Field | DocType |
traditional wrapper technique, efficient record-level wrapper induction, wrapper induction, relevant record, internal semantic structure, web record, record-level wrapper system, product record, web information, related information need, online information system, extraction, information extraction, information need, information system | Information system, Data mining, World Wide Web, Information needs, Information retrieval, Web page, Computer science, Information extraction, Web information, Semantics | Conference |
Citations | PageRank | References |
25 | 0.86 | 22 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Shuyi Zheng | 1 | 256 | 11.22 |
Ruihua Song | 2 | 1138 | 59.33 |
Ji-Rong Wen | 3 | 4431 | 265.98 |
C. Lee Giles | 4 | 11154 | 1549.48 |