Semi-Automated Extraction of Targeted Data from Web Pages - Citegraph

Paper Info

Title
Semi-Automated Extraction of Targeted Data from Web Pages

Abstract
The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet, if the main standard of publication on the Web (HTML) is quite suited to human reading, its poor semantics makes it difficult for computers to process and use embedded data in a smart and automated way. In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of recurring types of information in a cluster of similar Web documents and their location in those documents. Relying on these rules, HTML-embedded data can be extracted towards a more computable format. The definition of mapping rules is based on direct user input mainly for the interpretation part, and on automatic computing for the location of data in HTML tree structures. This approach is supported by a user-friendly tool called Retrozilla.
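
The idea of a mapping rule, as the abstract describes it, can be illustrated with a minimal sketch. This is not Retrozilla's actual rule format, which the abstract does not specify. The `MappingRule` class, the `extract` function, the sample page, and the path syntax (the limited XPath subset of Python's standard `xml.etree.ElementTree`) are all assumptions made for illustration. Each hypothetical rule pairs a user-supplied semantic label with a location in the HTML tree, and applying the rules to a page yields a plain record.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MappingRule:
    """Hypothetical rule: a semantic label plus a location in the HTML tree."""
    concept: str  # interpretation supplied by the user, e.g. "title"
    path: str     # ElementTree path locating the data in the page


def extract(page: str, rules: List[MappingRule]) -> Dict[str, Optional[str]]:
    """Apply each rule to one well-formed page and return concept -> text."""
    root = ET.fromstring(page)
    record: Dict[str, Optional[str]] = {}
    for rule in rules:
        node = root.find(rule.path)
        # A rule that matches nothing (or an empty node) yields None.
        if node is not None and node.text:
            record[rule.concept] = node.text.strip()
        else:
            record[rule.concept] = None
    return record


# Made-up sample page standing in for one document of a cluster of similar pages.
PAGE = """<html><body>
  <h1>Blade Runner</h1>
  <table><tr><td>Director</td><td>Ridley Scott</td></tr></table>
</body></html>"""

RULES = [
    MappingRule("title", "./body/h1"),
    MappingRule("director", "./body/table/tr/td[2]"),
]

record = extract(PAGE, RULES)
```

Because the rules are kept separate from the page, the same rule set could in principle be reused on every page of the cluster that shares this layout. Real-world HTML is rarely well-formed XML, so a practical version would need a tolerant HTML parser in place of `ET.fromstring`.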

Year: 2006
DOI: 10.1109/ICDEW.2006.135
Venue: Atlanta, GA, USA
Keywords: html tree structure, targeted data from web pages, semi-automated extraction, automatic computing, similar web document, embedded data, html document, mapping rule, so-called mapping rule, semantic interpretation, interpretation part, html-embedded data, information management, xml, software agents, tree structure, data mining, world wide web, computer science, html
Field: Data mining, Semantic Web Stack, Web page, Computer science, Web mapping, Data Web, Semantic Web, Web modeling, HTML, Client-side scripting, Database
DocType: Conference
ISBN: 0-7695-2571-7
Citations: 1
PageRank: 0.40
References: 19
Authors: 4

Authors (4 rows)

Name	Order	Citations	PageRank
Fabrice Estievenart	1	21	2.20
Jean-Roch Meurisse	2	11	1.26
Jean-Luc Hainaut	3	901	254.54
Philippe Thiran	4	575	46.19
