A conceptual framework for efficient web crawling in virtual integration contexts - Citegraph

Paper Info

Title
A conceptual framework for efficient web crawling in virtual integration contexts

Abstract
Virtual integration systems require a crawling tool that can navigate the Web and reach relevant pages efficiently. Existing crawling proposals are aware of the efficiency problem, but most of them still need to download pages in order to classify them as relevant or not. In this paper, we present a conceptual framework for designing crawlers supported by a web page classifier that relies solely on URLs to determine page relevance. Such a crawler can choose, at each step, only the URLs that lead to relevant pages; it therefore reduces the number of unnecessary pages downloaded, saves bandwidth, and is efficient enough for virtual integration systems. Our preliminary experiments show that such a classifier is able to distinguish between links leading to different kinds of pages without prior intervention from the user.
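
The idea in the abstract, deciding relevance from the URL alone before any download, can be illustrated with a minimal sketch. This is not the classifier or framework from the paper: the token rule in `is_relevant`, the `crawl` function, the `fetch_links` callback, and the simulated `SITE` link graph are all hypothetical names chosen for illustration, and no network access is performed.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical rule: a URL counts as relevant if one of its path segments
# matches a known token. The paper's classifier is not reproduced here.
RELEVANT_TOKENS = {"product", "item"}


def is_relevant(url: str) -> bool:
    """Toy URL-only classifier (an assumption made for this sketch)."""
    path = urlparse(url).path.lower()
    segments = [s for s in path.split("/") if s]
    return any(seg in RELEVANT_TOKENS for seg in segments)


def crawl(seed: str, fetch_links, classify=is_relevant):
    """Breadth-first crawl that downloads only URLs the classifier accepts.

    fetch_links(url) stands in for downloading a page and extracting its
    links. The seed is always downloaded. Returns downloaded URLs in order.
    """
    downloaded = []
    seen = {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        downloaded.append(url)  # the only point where a page is "downloaded"
        for link in fetch_links(url):
            if link in seen:
                continue
            seen.add(link)
            # Decide from the URL string alone, before any download.
            if classify(link):
                frontier.append(link)
    return downloaded


# Simulated link graph standing in for a real site (hypothetical URLs).
SITE = {
    "http://example.com/": [
        "http://example.com/product/1",
        "http://example.com/about",
        "http://example.com/product/2",
    ],
    "http://example.com/product/1": [
        "http://example.com/product/2",
        "http://example.com/login",
    ],
    "http://example.com/product/2": [],
}

pages = crawl("http://example.com/", lambda u: SITE.get(u, []))
```

In this simulated run, the `about` and `login` links are never queued, so the crawler spends no bandwidth on them. A learned classifier could be substituted by passing a different `classify` function.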

Year	2011
DOI	10.1007/978-3-642-23982-3_35
Venue	WISM (2)
Keywords	relevant page, web page classifier, crawling tool, crawling area, conceptual framework, page relevance, efficient web, efficiency problem, different kind, virtual integration context, virtual integration system, unnecessary page, web crawling
Field	Data mining, World Wide Web, Crawling, Information retrieval, Web page, Computer science, Download, Web navigation, Classifier (linguistics), Conceptual framework, Web crawler, Distributed web crawling
DocType	Conference
Volume	6988
ISSN	0302-9743
Citations	1
PageRank	0.36
References	19
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Inma Hernández	1	76	10.72
Hassan A. Sleiman	2	103	8.33
David Ruiz	3	152	20.62
Rafael Corchuelo	4	389	49.87
