Title
Efficient record-level wrapper induction
Abstract
Web information is often presented in the form of records, e.g., a product record on a shopping website or a personal profile on a social utility website. Given a host webpage and related information needs, identifying the relevant records and their internal semantic structures is critical to many online information systems. Wrapper induction is one of the most effective methods for such tasks. However, most traditional wrapper techniques have trouble with web records because they are designed to extract information from a page, not a record. We propose a record-level wrapper system. In our system, we use a novel "broom" structure to represent both records and generated wrappers. With this representation, our system is able to effectively extract records and identify their internal semantics at the same time. We test our system on 16 real-life websites from four different domains. Experimental results demonstrate 99% extraction accuracy in terms of F1 value.
Year
2009
DOI
10.1145/1645953.1645962
Venue
CIKM
Keywords
traditional wrapper technique, efficient record-level wrapper induction, wrapper induction, relevant record, internal semantic structure, web record, record-level wrapper system, product record, web information, related information need, online information system, extraction, information extraction, information need, information system
Field
Information system, Data mining, World Wide Web, Information needs, Information retrieval, Web page, Computer science, Information extraction, Web information, Semantics
DocType
Conference
Citations
25
PageRank
0.86
References
22
Authors
4
Name            Order  Citations/PageRank (digits fused in source)
Shuyi Zheng     1      25611.22
Ruihua Song     2      113859.33
Ji-Rong Wen     3      4431265.98
C. Lee Giles    4      111541549.48