Title
Finding seeds to bootstrap focused crawlers
Abstract
Focused crawlers are effective tools for applications that require a large number of pages on a specific topic. Several strategies for implementing these crawlers have been proposed in the literature, all aiming to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant ones. However, an important aspect of these crawlers has been largely overlooked: the selection of the seed pages that serve as the starting points for a crawl. In this paper, we show that the seeds can greatly influence the performance of crawlers, and we propose a new framework for automatically finding seeds. We describe a system that implements this framework and show, through a detailed experimental evaluation, that crawlers given a large and varied seed set achieve not only higher harvest rates but also improved topic coverage.
Year: 2016
DOI: 10.1007/s11280-015-0331-7
Venue: World Wide Web
Keywords: Web crawling, Focused crawling, Relevance feedback
DocType: Journal
Volume: 19
Issue: 3
ISSN: 1386-145X
Citations: 5
PageRank: 0.40
References: 24
Authors: 5
Name | Order | Citations | PageRank
Karane Vieira | 1 | 74 | 3.57
Luciano Barbosa | 2 | 714 | 43.86
Altigran Soares da Silva | 3 | 718 | 65.15
Juliana Freire | 4 | 3956 | 270.89
Edleno Silva de Moura | 5 | 988 | 75.44