Abstract |
---|
Exponential growth of the web continues to present challenges to the design and scalability of web crawlers. Our previous work on a high-performance platform called IRLbot [28] led to the development of new algorithms for realtime URL manipulation, domain ranking, and budgeting, which were tested in a 6.3B-page crawl. Since very little is known about the crawl itself, our goal in this paper is to undertake an extensive measurement study of the collected dataset and document its crawl dynamics. We also propose a framework for modeling the scaling rate of various data structures as crawl size goes to infinity and offer a methodology for comparing crawl coverage to that of commercial search engines. |
Year | Venue | Field |
---|---|---|
2015 | 2015 IEEE Conference on Computer Communications (INFOCOM) | Data structure, Search engines information retrieval, World Wide Web, Ranking, Computer science, Focused crawler, Web crawler, Scalability |
DocType | ISSN | Citations |
---|---|---|
Conference | 0743-166X | 2 |
PageRank | References | Authors |
---|---|---|
0.37 | 26 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Sarker Tanzir Ahmed | 1 | 6 | 1.11 |
Clint Sparkman | 2 | 2 | 0.71 |
Hsin-Tsang Lee | 3 | 67 | 4.87 |
Dmitri Loguinov | 4 | 1298 | 91.08 |