Title
A genetic programming framework to schedule webpage updates.
Abstract
The quality of a Web search engine is influenced by several factors, including the coverage and the freshness of the content gathered by the web crawler. Focusing on freshness, one key challenge is to estimate the likelihood that a previously crawled webpage has been modified. Such estimates are used to define the order in which those pages should be revisited, and can thus be exploited to reduce the cost of monitoring crawled webpages to keep their stored versions up to date. We present a Genetic Programming framework, called —, to generate score functions that produce accurate rankings of pages according to their probabilities of having been modified. We compare our framework with state-of-the-art methods using a large dataset of webpages crawled from the Brazilian Web. Our evaluation includes multiple performance metrics and several variations of our framework, built by exploring different sets of terminals and fitness functions. In particular, we evaluate the framework using the ChangeRate and Normalized Discounted Cumulative Gain (NDCG) metrics as both objective function and evaluation metric. We show that, in comparison with ChangeRate, NDCG evaluates the effectiveness of scheduling strategies better, since it is able to take the ranking produced by the scheduling into account.
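The contrast between the two metrics can be illustrated with a minimal sketch. It assumes simplified definitions that may differ from the ones used in the paper: ChangeRate is taken as the fraction of the top-k scheduled pages that had actually changed, and NDCG uses binary relevance (1 = page changed) with a log2 position discount. The function names and the two example schedules are hypothetical.

```python
import math

def change_rate(ranked_changed, k):
    """Fraction of the top-k scheduled pages that had actually changed (assumed definition)."""
    top = ranked_changed[:k]
    return sum(top) / len(top) if top else 0.0

def dcg(ranked_changed, k):
    """Discounted cumulative gain with binary relevance and a log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ranked_changed[:k]))

def ndcg(ranked_changed, k):
    """DCG normalized by the DCG of the ideal ordering (changed pages first)."""
    ideal = dcg(sorted(ranked_changed, reverse=True), k)
    return dcg(ranked_changed, k) / ideal if ideal > 0 else 0.0

# Two hypothetical schedules over the same four pages (1 = page changed).
early = [1, 1, 0, 0]  # changed pages are visited first
late = [0, 0, 1, 1]   # changed pages are visited last

print(change_rate(early, 4), change_rate(late, 4))
print(ndcg(early, 4), ndcg(late, 4))
```

Under these assumptions both schedules get the same ChangeRate for a budget of four pages, while NDCG scores the schedule that visits changed pages first higher, which is the sense in which it accounts for the ranking.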
Year: 2015
DOI: 10.1007/s10791-014-9248-5
Venue: Inf. Retr. Journal
Keywords: Web crawling, Scheduling functions, Genetic Programming
Field: Web search engine, Data mining, Learning to rank, Web page, Scheduling (computing), Computer science, Genetic programming, Artificial intelligence, Crawling, Information retrieval, Ranking, Web crawler, Machine learning
DocType: Journal
Volume: 18
Issue: 1
ISSN: 1386-4564
Citations: 4
PageRank: 0.43
References: 18
Authors: 6