Extracting structured data from Web pages - Citegraph

Paper Info

Title
Extracting structured data from Web pages

Abstract
Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
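The input/output contract described in the abstract (a set of template-generated pages in, a deduced template and the encoded values out) can be illustrated with a toy sketch. This is not the paper's algorithm; it is a simplified stand-in that treats tokens shared, in order, by every page as the template and everything between them as data. The function names, the tokenizer, and the sample pages are all hypothetical, and the sketch assumes that values on different pages never share a token.

```python
import re
from difflib import SequenceMatcher


def tokenize(page):
    """Split a page into HTML tags and whitespace-delimited words."""
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def common_subsequence(a, b):
    """Return tokens that difflib matches, in order, between two token lists."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared


def infer_template(pages):
    """Toy template: the tokens every page has in common, in the same order."""
    token_lists = [tokenize(p) for p in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = common_subsequence(template, tokens)
    return template


def extract_values(page, template):
    """Collect the runs of non-template tokens that sit between template tokens."""
    values, current, i = [], [], 0
    for tok in tokenize(page):
        if i < len(template) and tok == template[i]:
            if current:
                values.append(" ".join(current))
                current = []
            i += 1
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical pages generated from one layout with different database values.
pages = [
    "<html><b>Title:</b> Database Systems <b>Author:</b> Jane Roe</html>",
    "<html><b>Title:</b> Web Mining <b>Author:</b> John Doe</html>",
    "<html><b>Title:</b> Data Streams <b>Author:</b> Ann Lee</html>",
]
template = infer_template(pages)
records = [extract_values(p, template) for p in pages]
```

On these sample pages the sketch recovers one title and one author per page without any labeled examples. It breaks as soon as a value token also appears on every page or coincides with the next template token, which is part of why a formal template model is needed for real page collections.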

Year	2003
DOI	10.1145/872757.872799
Venue	SIGMOD Conference
DocType	Conference
ISBN	1-58113-634-X
Citations	394
PageRank	16.28
References	21
Authors	3
Keywords	template-generated page, similar human input, large set, real input page collection, common template, unknown template, large number, template-generated web page, structured data, database value, web page, extracts data, coalescing, web pages, temporal databases, granularity, incomplete information
Field	Static web page, Data mining, HITS algorithm, Information retrieval, Web page, Computer science, Website Parse Template, Temporal database, Data model, Complete information, Database

Authors (3 rows)

Name	Order	Citations	PageRank
Arvind Arasu	1	2475	141.59
Héctor García-Molina	2	24359	5652.13
Stanford University	3	394	16.28
