Data Extraction From Repositories On The Web: A Semi-Automatic Approach - Citegraph

Paper Info

Title
Data Extraction From Repositories On The Web: A Semi-Automatic Approach

Abstract
The World Wide Web (WWW) is becoming the most important source of information for business intelligence and information dissemination. Past information gathering techniques like surfing and sifting are proving insufficient for processing the vast volumes of data readily available from the Web. In addition, companies are being forced to integrate this vast data repository within specific cost, time, and reliability constraints. This paper presents the fundamentals of a system called "Browser Harness" (B2H) that extracts the requested data from Web sites in a supervised fashion. The algorithmic background of this system is based on the tag structure of web pages, as HTML is the predominant choice for rendering web page content on the WWW. B2H is an interactive tool for harnessing data from semi-structured and structured web pages by analyzing the tag structure of the input page and locating the data in the HTML code. The extracted data is then exported to XML, delimited text, or database tables.
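
The tag-structure idea in the abstract can be illustrated with a minimal sketch. This is not B2H's actual algorithm, which the abstract does not specify, and it omits the interactive, supervised step. It assumes the requested data sits in HTML table cells, uses only the Python standard library, and exports to delimited text and XML. The names TableRowExtractor, extract_rows, to_delimited, and to_xml are hypothetical, and to_xml assumes the header cells are valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Hypothetical helper: collect <td>/<th> cell text, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside an open cell counts as data; markup whitespace is skipped.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def extract_rows(html):
    """Locate the data by tag structure and return it as a list of rows."""
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_delimited(rows, delimiter="\t"):
    """Export rows as delimited text (tab-separated by default)."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Export rows as XML; the first row supplies the element names (assumption)."""
    header, *records = rows
    root = ET.Element("records")
    for record in records:
        item = ET.SubElement(root, "record")
        for name, value in zip(header, record):
            ET.SubElement(item, name).text = value
    return ET.tostring(root, encoding="unicode")


SAMPLE = """
<table>
  <tr><th>Name</th><th>Order</th></tr>
  <tr><td>Steve Sieloff</td><td>3</td></tr>
</table>
"""

if __name__ == "__main__":
    rows = extract_rows(SAMPLE)
    print(to_delimited(rows), end="")
    print(to_xml(rows))
```

A database export, the third target named in the abstract, would follow the same pattern: the extracted rows are the common intermediate form, and each exporter only decides how to serialize them.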

Year	Venue	Keywords
2003	Transactions of the SDPS	information dissemination,data extraction,web page,vast data repository,harnessing data,requested data,tag structure,structured web page,past information gathering technique,semi-automatic approach,web site,world wide web,web mining
Field	DocType	Volume
Static web page,Web design,World Wide Web,Web intelligence,Web page,Computer science,Web standards,Data Web,Web modeling,Web navigation	Journal	7
Issue	Citations	PageRank
4	1	0.36
References	Authors
10	3

Authors (3 rows)

Name	Order	Citations	PageRank
Coşkun Bayrak	1	197	26.47
Hayrettin Kolukısaoğlu	2	1	0.36
Steve Sieloff	3	1	0.36
