Title
Geographically focused collaborative crawling
Abstract
A collaborative crawler is a group of crawling nodes, in which each crawling node is responsible for a specific portion of the web. We study the problem of collecting geographically-aware pages using collaborative crawling strategies. We first propose several collaborative crawling strategies for the geographically focused crawling, whose goal is to collect web pages about specified geographic locations, by considering features like URL address of page, content of page, extended anchor text of link, and others. Later, we propose various evaluation criteria to qualify the performance of such crawling strategies. Finally, we experimentally study our crawling strategies by crawling the real web data showing that some of our crawling strategies greatly outperform the simple URL-hash based partition collaborative crawling, in which the crawling assignments are determined according to the hash-value computation over URLs. More precisely, features like URL address of page and extended anchor text of link are shown to yield the best overall performance for the geographically focused crawling.
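The URL-hash based baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `assign_node` and `partition`, the choice of SHA-1, and hashing the full URL string (rather than, say, only the host name) are all assumptions made for illustration.

```python
import hashlib


def assign_node(url: str, num_nodes: int) -> int:
    """Pick a crawling node for a URL from a hash of the URL string.

    Hypothetical helper: hashing the whole URL with SHA-1 is an assumption,
    not a detail taken from the paper.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo node count.
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(urls, num_nodes: int):
    """Group URLs by the node index that assign_node gives them."""
    buckets = {i: [] for i in range(num_nodes)}
    for url in urls:
        buckets[assign_node(url, num_nodes)].append(url)
    return buckets


if __name__ == "__main__":
    # Example URLs are placeholders, not data from the paper.
    sample = [
        "http://example.com/",
        "http://example.org/toronto/restaurants",
        "http://example.net/nyc/hotels.html",
    ]
    for node, assigned in partition(sample, 4).items():
        print(node, assigned)
```

Because the assignment depends only on the hash value, a page about a given city can land on any node. The geographically focused strategies in the abstract instead use features such as the URL address and extended anchor text to decide the assignment.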
Year
2006
DOI
10.1145/1135777.1135822
Venue
WWW
Keywords
extended anchor text,partition collaborative crawling,collaborative crawler,geographically focused crawling,real web data,url address,geographically-aware page,crawling node,collaborative crawling,collaborative crawling strategy,crawling strategy,crawling assignment,geographic entities,anchor text,web pages
Field
World Wide Web,Crawling,Web page,Information retrieval,Computer science,Anchor text,Web crawler,Distributed web crawling
DocType
Conference
ISBN
1-59593-323-9
Citations
16
PageRank
1.49
References
21
Authors
3
Name             Order  Citations  PageRank
Weizheng Gao     1      24         2.70
Hyun Chul Lee    2      205        15.50
Yingbo Miao      3      33         2.69