Title
Site-Wide Wrapper Induction for Life Science Deep Web Databases
Abstract
We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques learn wrappers from examples drawn from a single class of Web pages, i.e., pages that are all similar in structure and content. Traditional wrapper induction therefore targets Web pages generated from a database with the same generation template as the one observed in the example set. However, Life Science Web sites typically contain structurally diverse Web pages from multiple classes, which makes the problem more challenging. Furthermore, we observed that such Life Science Web sites do not provide data alone; they also tend to provide schema information in the form of data labels, which gives further cues for solving the Web site wrapping task. Our solution to this novel challenge of Site-Wide wrapper induction consists of a sequence of steps: 1. classification of similar Web pages into classes, 2. discovery of these classes, and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval from across an entire Web site. We test our algorithm against three real-world biochemical Deep Web sources and report our preliminary results, which are very promising.
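The steps named in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes that pages can be grouped by the Jaccard similarity of their tag paths, and that text appearing unchanged on every page of a class is a data label while the text that follows it is the data value. All function names, the `VOID` tag set, the `threshold` value, and the sample pages are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a small set of void tags that never get a closing tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of tag paths and the (path, text) pairs of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop until the matching open tag has been removed.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))


def parse(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def group_pages(pages, threshold=0.7):
    """Greedy grouping: a page joins the first class whose representative
    (its first member) has a similar enough set of tag paths."""
    classes = []
    for name, html in pages.items():
        paths = parse(html).paths
        for cls in classes:
            if jaccard(paths, cls["paths"]) >= threshold:
                cls["members"].append(name)
                break
        else:
            classes.append({"paths": paths, "members": [name]})
    return [cls["members"] for cls in classes]


def induce_wrapper(class_pages):
    """Treat (path, text) pairs found on every page of the class as labels."""
    return set.intersection(*(set(parse(h).texts) for h in class_pages))


def apply_wrapper(labels, html):
    """Pair each label with the first non-label text that follows it."""
    record, current = {}, None
    for item in parse(html).texts:
        if item in labels:
            current = item[1]
        elif current is not None:
            record.setdefault(current, item[1])
    return record


# Hypothetical sample pages: two record pages and one search form.
pages = {
    "enzyme1": "<html><body><table><tr><th>Name</th><td>Hexokinase</td></tr>"
               "<tr><th>EC</th><td>2.7.1.1</td></tr></table></body></html>",
    "enzyme2": "<html><body><table><tr><th>Name</th><td>Catalase</td></tr>"
               "<tr><th>EC</th><td>1.11.1.6</td></tr></table></body></html>",
    "search": "<html><body><form><input><p>Search</p></form></body></html>",
}

for members in group_pages(pages):
    labels = induce_wrapper([pages[m] for m in members])
    for m in members:
        print(m, apply_wrapper(labels, pages[m]))
```

A class with a single page yields no data under this heuristic, because every text on that page counts as a label; a real system would need more examples per class or a different labeling cue.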
Year: 2009
DOI: 10.1007/978-3-642-02879-3_9
Venue: DILS
Keywords: site-wide wrapper induction, similar web page, real-world biochemical deep web, wrapper induction, life science web site, entire web site, life science deep web, traditional wrapper induction, deep web life science, web site, web page, deep web, information extraction, web pages, information retrieval, database
Field: Static web page, Data mining, Web intelligence, Web page, Semantic Web Stack, Computer science, Web modeling, World Wide Web, Information retrieval, Data Web, Web navigation, Database, Web server
DocType: Conference
Volume: 5647
ISSN: 0302-9743
Citations: 1
PageRank: 0.35
References: 32
Authors: 3
Name: Saqib Mir; Order: 1; Citations: 225; PageRank: 19.96
Name: Steffen Staab; Order: 2; Citations: 6658; PageRank: 593.89
Name: Isabel Rojas; Order: 3; Citations: 366; PageRank: 30.30