Abstract |
---|
With the proliferation of web spam and questionable content with virtually infinite auto-generated structure, large-scale web crawlers now require low-complexity ranking methods to effectively budget their limited resources and allocate the majority of bandwidth to reputable sites. To shed light on Internet-wide spam avoidance, we study the domain-level graph from a 6.3B-page web crawl and compare several agnostic topology-based ranking algorithms on this dataset. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method that can achieve much better crawl prioritization in practice, especially in applications with limited hardware resources. |

Year | DOI | Venue |
---|---|---|
2011 | 10.1109/INFCOM.2011.5935303 | Shanghai |

Keywords | Field | DocType |
---|---|---|
Internet, security of data, unsolicited e-mail, BFS-based technique, PageRank-style method, Web crawl, Web spam, agnostic topology-based ranking algorithm, spam avoidance | Data mining, Learning to rank, Topology, Algorithm design, Ranking, Computer science, Bandwidth (signal processing), Web crawler, Spamdexing, The Internet, Scalability | Conference |

ISSN | ISBN | Citations |
---|---|---|
0743-166X | 978-1-4244-9919-9 | 0 |

PageRank | References | Authors |
---|---|---|
0.34 | 24 | 3 |

Name | Order | Citations | PageRank |
---|---|---|---|
Clint Sparkman | 1 | 0 | 0.34 |
Hsin-Tsang Lee | 2 | 67 | 4.87 |
Dmitri Loguinov | 3 | 1298 | 91.08 |