Title
A technique for measuring the relative size and overlap of public Web search engines
Abstract
Search engines are among the most useful and popular services on the Web. Users are eager to know how they compare. Which one has the largest coverage? Have they indexed the same portion of the Web? How many pages are out there? Although these questions have been debated in the popular and technical press, no objective evaluation methodology has been proposed and few clear answers have emerged. In this paper we describe a standardized, statistical way of measuring search engine coverage and overlap through random queries. Our technique does not require privileged access to any database. It can be implemented by third-party evaluators using only public query interfaces. We present results from our experiments showing size and overlap estimates for HotBot, AltaVista, Excite, and Infoseek as percentages of their total joint coverage in mid 1997 and in November 1997. Our method does not provide absolute values. However using data from other sources we estimate that as of November 1997 the number of pages indexed by HotBot, AltaVista, Excite, and Infoseek were respectively roughly 77M, 100M, 32M, and 17M and the joint total coverage was 160 million pages. We further conjecture that the size of the static, public Web as of November was over 200 million pages. The most startling finding is that the overlap is very small: less than 1.4% of the total coverage, or about 2.2 million pages were indexed by all four engines.
Keywords: Search engines; Coverage; Web page sampling
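The abstract does not spell out the estimator, so the following is only a minimal sketch of how relative size and overlap could be derived from sampled pages, assuming each sample is drawn roughly uniformly from its engine's index. The function names and the toy indexes are hypothetical; in practice the membership test would be a query against an engine's public interface, not a lookup in a local set. The sketch relies on the identity |A ∩ B| = f_AB · |A| = f_BA · |B|, where f_AB is the fraction of pages sampled from A that B also indexes, which gives |A| / |B| = f_BA / f_AB.

```python
def overlap_fraction(sample, other_index):
    """Fraction of sampled pages that the other engine also indexes."""
    if not sample:
        raise ValueError("sample must not be empty")
    hits = sum(1 for url in sample if url in other_index)
    return hits / len(sample)


def relative_size(sample_a, index_a, sample_b, index_b):
    """Estimate |A| / |B| from one sample per engine.

    Assumes each sample is uniform over its engine's index
    (an assumption of this sketch, not a claim about the paper).
    """
    frac_a_in_b = overlap_fraction(sample_a, index_b)
    frac_b_in_a = overlap_fraction(sample_b, index_a)
    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; ratio is undefined")
    return frac_b_in_a / frac_a_in_b


# Toy stand-ins for two engine indexes (hypothetical data).
index_a = {f"page{i}" for i in range(0, 100)}    # 100 pages
index_b = {f"page{i}" for i in range(60, 110)}   # 50 pages, 40 shared with A

# Using the full indexes as "samples" makes the estimate exact here.
ratio = relative_size(sorted(index_a), index_a, sorted(index_b), index_b)
print(ratio)
```

With real engines the samples would be far smaller than the indexes, so the ratio is an estimate whose accuracy depends on the sample size and on how uniform the sampling is.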
Year
1998
DOI
10.1016/S0169-7552(98)00127-5
Venue
Computer Networks and ISDN Systems
Keywords
web page sampling, coverage, relative size, search engines, public web search engine, search engine, indexation, web pages, web search engine
Field
Data mining, World Wide Web, Search engine, Information retrieval, Know-how, Computer science, Privileged access, The Internet
DocType
Journal
Volume
30
Issue
1-7
ISSN
0169-7552
Citations
197
PageRank
49.24
References
3
Authors
2
Name
Order
Citations
PageRank
Krishna A. Bharat11211252.86
Andrei Broder27357920.20