Title
Agnostic topology-based spam avoidance in large-scale web crawls
Abstract
With the proliferation of web spam and questionable content with virtually infinite auto-generated structure, large-scale web crawlers now require low-complexity ranking methods to effectively budget their limited resources and allocate the majority of bandwidth to reputable sites. To shed light on Internet-wide spam avoidance, we study the domain-level graph from a 6.3B-page web crawl and compare several agnostic topology-based ranking algorithms on this dataset. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method that can achieve much better crawl prioritization in practice, especially in applications with limited hardware resources.
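The idea of a BFS-based, in-degree style ranking on a domain-level graph can be illustrated with a small sketch. This is not the authors' algorithm or their proposed estimator: it is a minimal illustration that assumes a domain's score is the number of distinct domains that can reach it within a fixed number of hops, found by a breadth-first search over reversed edges. The function names `supporters_within` and `rank_domains`, the `depth` parameter, and the toy domain names are all hypothetical.

```python
from collections import deque


def supporters_within(graph_in, target, depth):
    """Count distinct domains that reach `target` in at most `depth` hops.

    `graph_in` maps each domain to the set of domains linking to it
    (reversed edges), so the BFS walks backwards from the target.
    """
    seen = {target}
    frontier = deque([(target, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand beyond the hop limit
        for src in graph_in.get(node, ()):
            if src not in seen:
                seen.add(src)
                frontier.append((src, dist + 1))
    return len(seen) - 1  # exclude the target itself


def rank_domains(edges, depth=2):
    """Rank domains by supporter count (assumed scoring rule, see lead-in)."""
    graph_in = {}
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        if src != dst:  # ignore self-links
            graph_in.setdefault(dst, set()).add(src)
    scores = {n: supporters_within(graph_in, n, depth) for n in nodes}
    order = sorted(nodes, key=lambda n: (-scores[n], n))
    return order, scores


# Toy graph: a hub with two-level support, and two domains linking to each other.
edges = [
    ("a", "hub"), ("b", "hub"), ("c", "a"), ("d", "b"),
    ("spam1", "spam2"), ("spam2", "spam1"),
]
order, scores = rank_domains(edges, depth=2)
print(order[0], scores["hub"], scores["spam1"])
```

In this toy graph the mutually linking pair gains only one supporter each, while the hub collects four within two hops. Running one BFS per domain is also what makes this exact approach expensive at scale, which is the cost the abstract points to when motivating a faster estimation method.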
Year: 2011
DOI: 10.1109/INFCOM.2011.5935303
Venue: Shanghai
Keywords: Internet, security of data, unsolicited e-mail, BFS-based technique, PageRank-style method, Web crawl, Web spam, agnostic topology-based ranking algorithm, spam avoidance
Field: Data mining, Learning to rank, Topology, Algorithm design, Ranking, Computer science, Bandwidth (signal processing), Web crawler, Spamdexing, The Internet, Scalability
DocType: Conference
ISSN: 0743-166X
ISBN: 978-1-4244-9919-9
Citations: 0
PageRank: 0.34
References: 24
Authors: 3
Name               Order   Citations   PageRank
Clint Sparkman     1       0           0.34
Hsin-Tsang Lee     2       67          4.87
Dmitri Loguinov    3       1298        91.08