Abstract |
---|
A key challenge in designing a freshness-oriented scheduling policy is estimating the likelihood that a previously crawled webpage has been modified on the web. This estimate defines the order in which pages should be revisited, and it can be exploited to reduce the cost of monitoring crawled webpages to keep their stored versions up to date. We present a novel approach for generating score functions that accurately rank pages by their probability of having been modified since they were last crawled. We propose a flexible framework that uses genetic programming to evolve these score functions, and we present a thorough experimental evaluation of its benefits over five state-of-the-art baselines. |
Year | DOI | Venue |
---|---|---|
2013 | 10.1007/978-3-319-02432-5_30 | SPIRE |
Field | DocType | Volume |
---|---|---|
Information retrieval, Web page, Computer science, Scheduling (computing), Genetic programming, Score | Conference | 8214 |
ISSN | Citations | PageRank |
---|---|---|
0302-9743 | 4 | 0.42 |
References | Authors |
---|---|
11 | 6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Aécio S. R. Santos | 1 | 22 | 4.84 |
Nivio Ziviani | 2 | 1598 | 154.65 |
Jussara M. Almeida | 3 | 3044 | 310.86 |
Cristiano Carvalho | 4 | 9 | 1.55 |
Edleno Silva de Moura | 5 | 988 | 75.44 |
Altigran Soares da Silva | 6 | 718 | 65.15 |