Abstract |
---|
Web crawlers are highly automated and seldom regulated manually. The diversity of crawler activities often leads to ethical problems such as spam and service attacks. In this research, quantitative models are proposed to measure web crawler ethics based on crawler behavior on web servers. We investigate and define rules to measure crawler ethics, referring to the extent to which web crawlers respect the regulations set forth in robots.txt configuration files. We propose a vector space model to represent crawler behavior and measure the ethics of web crawlers based on these behavior vectors. The results show that ethicality scores vary significantly among crawlers. Most commercial web crawlers behave ethically; however, many still consistently violate or misinterpret certain robots.txt rules. We also measure the ethics of major search engine crawlers in terms of return on investment. The results show that Google scores higher than other search engines for a US website but lower than Baidu for Chinese websites. |
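The abstract does not spell out how a behavior vector is scored; as a minimal sketch of the underlying idea, one can represent each logged access as 1 (permitted) or 0 (a robots.txt violation) and score a crawler by its fraction of compliant accesses. The robots.txt content, crawler name, and URLs below are hypothetical, and this binary vector is a simplification of the paper's model:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt served by the site being crawled.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def ethicality_score(agent, accessed_urls, robots_txt=ROBOTS_TXT):
    """Return the fraction of accesses that respect robots.txt.

    Behavior vector: 1 if the access was allowed for this agent,
    0 if it hit a Disallow'ed path (a simplified stand-in for the
    paper's vector space model).
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    behavior = [1 if rp.can_fetch(agent, url) else 0 for url in accessed_urls]
    return sum(behavior) / len(behavior) if behavior else 1.0

score = ethicality_score(
    "ExampleBot",
    ["http://example.com/index.html",
     "http://example.com/private/data.html"],
)
# One of the two accesses violates "Disallow: /private/", so score = 0.5
```

Comparing such scores across crawlers (and across sites with differently written robots.txt files) is what allows the per-crawler ethicality ranking the abstract describes.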
Year | DOI | Venue |
---|---|---|
2010 | 10.1145/1772690.1772824 | WWW |
Keywords | Field | DocType
---|---|---|
big search engine crawler,crawler behavior,commercial crawler,commercial web crawler,web server,crawler ethic,web crawler,web crawler ethic,chinese web,crawler activity,return on investment,vector space model,robots txt,privacy,search engine | Web search engine,Data mining,Site map,World Wide Web,Computer science,Robots exclusion standard,Focused crawler,Vector space model,Spider trap,Web crawler,Web server | Conference
Citations | PageRank | References
---|---|---|
4 | 0.47 | 5
Authors |
---|
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
C. Lee Giles | 1 | 11154 | 1549.48 |
Yang Sun | 2 | 46 | 15.21 |
Isaac G. Councill | 3 | 469 | 27.27 |