Title
Adversarial Bandits Policy for Crawling Commercial Web Content
Abstract
The rapid growth of commercial web content has driven the development of shopping search services that help users find product offers. Because commercial content is highly dynamic, an effective recrawl policy is a key component of a shopping search service; it ensures that users have access to up-to-date product details. Most existing strategies either rely on simple heuristics or overlook resource budgets. To address this, Azar et al. [5] recently proposed LambdaCrawl, an optimization strategy that aims to maximize content freshness within a given resource budget. In this paper, we demonstrate that the effectiveness of LambdaCrawl is governed in large part by how well the future content change rate can be estimated. By adopting state-of-the-art deep learning models for change rate prediction, we obtain a substantial increase in content freshness over the common LambdaCrawl implementation, which estimates the change rate from past history. Moreover, we demonstrate that while LambdaCrawl is a significant advance over existing recrawl strategies, it can be further improved by a unified multi-strategy recrawl policy. To this end, we adopt a K-armed adversarial bandits algorithm that can provably optimize overall freshness by combining multiple strategies. Empirical results on a large-scale production dataset confirm its superiority to LambdaCrawl, especially under tight resource budgets.
Year
2020
DOI
10.1145/3366423.3380125
Venue
WWW '20: The Web Conference 2020, Taipei, Taiwan, April 2020
Keywords
Predictive Crawling, Commercial Web Crawling, Adversarial Bandit
DocType
Conference
ISBN
978-1-4503-7023-3
Citations
0
PageRank
0.34
References
0
Authors
7
Name                 Order  Citations  PageRank
Shuguang Han         1      168        18.43
Michael Bendersky    2      986        48.69
Przemek Gajda        3      0          0.34
Sergey Novikov       4      0          0.68
Marc A. Najork       5      2538       278.16
Bernhard Brodowsky   6      0          0.34
Alexandrin Popescul  7      7          1.12