Title
Optimal Freshness Crawl Under Politeness Constraints
Abstract
A Web crawler is an essential part of a search engine that procures information subsequently served by the search engine to its users. As the Web is becoming increasingly more dynamic, in addition to discovering new web pages a crawler needs to keep revisiting those already in the search engine's index, in order to keep the index fresh by picking up the pages' changed content. Determining how often to recrawl pages requires making tradeoffs based on the pages' relative importance and change rates, subject to multiple resource constraints - the limited daily budget of crawl requests on the search engine's end and politeness constraints restricting the rate at which pages can be requested from a given host. In this paper, we introduce PoliteBinaryLambdaCrawl, the first optimal algorithm for freshness crawl scheduling in the presence of politeness constraints as well as non-uniform page importance scores and the crawler's own crawl request limit. We also propose an approximation for it, stating its theoretical optimality conditions and in the process discovering a connection to an approach previously thought of as a mere heuristic for freshness crawl scheduling. We explore the relative performance of PoliteBinaryLambdaCrawl and other methods for handling politeness constraints on a dataset collected by crawling over 18.5M URLs daily over 14 weeks.
Year
DOI
Venue
2019
10.1145/3331184.3331241
Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
Keywords
Field
DocType
convex optimization, lagrange multiplier, planning under uncertainty, politeness constraint, search engine, web crawling
Search engine,Information retrieval,Computer science,Lagrange multiplier,Politeness,Web crawler,Convex optimization
Conference
ISBN
Citations 
PageRank 
978-1-4503-6172-9
0
0.34
References 
Authors
0
4
Name
Order
Citations
PageRank
Andrey Kolobov134523.07
Yuval Peres252353.68
Eyal Lubetzky335528.87
Eric Horvitz494021058.25