Title
Estimating corpus size via queries
Abstract
We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance. Using this basic estimator, we propose two approaches to estimating corpus size. The first approach requires a uniform random sample of documents from the corpus. The second approach avoids this notoriously difficult sample generation problem, and instead uses two fairly uncorrelated sets of terms as query pools; the accuracy of the second approach depends on the degree of correlation between the two sets of terms. Experiments on a large TREC collection and on three major search engines demonstrate the effectiveness of our algorithms.
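The abstract does not spell out the basic estimator, so the following is only a minimal sketch of one way such an estimator could work, not necessarily the authors' exact formulation. It assumes that each matching document is weighted by the reciprocal of the number of pool queries it matches, which makes the expected value equal to the number of documents matching at least one pool query. The names `CORPUS`, `POOL`, `toy_search`, `weighted_match_count`, and `estimate_covered_size` are hypothetical, and the in-memory corpus stands in for a real query interface.

```python
import random

# Hypothetical toy corpus: each document is the set of terms it contains.
CORPUS = [
    frozenset({"a", "b"}),
    frozenset({"a"}),
    frozenset({"b", "c"}),
    frozenset({"d"}),
    frozenset({"c"}),
]

# Query pool of known size, fixed in advance (single-term queries, all distinct).
POOL = ["a", "b", "c"]


def toy_search(query):
    """Stand-in for a query interface: return every document containing the term."""
    return [doc for doc in CORPUS if query in doc]


def weighted_match_count(query, pool, search):
    """Sum 1/degree over the documents matching `query`.

    The degree of a document is the number of pool queries it matches.
    It is at least 1 here because the document matches `query` itself.
    """
    pool_set = set(pool)
    return sum(1.0 / len(pool_set & doc) for doc in search(query))


def estimate_covered_size(pool, search, num_samples, rng):
    """Estimate how many documents match at least one query in `pool`.

    Queries are drawn uniformly at random from the pool; the average weighted
    match count is scaled by the pool size.
    """
    total = sum(
        weighted_match_count(rng.choice(pool), pool, search)
        for _ in range(num_samples)
    )
    return len(pool) * total / num_samples


estimate = estimate_covered_size(POOL, toy_search, 2000, random.Random(0))
print(f"estimated number of covered documents: {estimate:.2f}")
```

In this toy corpus four documents match at least one pool query, and summing the weighted match counts over all three queries gives exactly that number, which is why the sampled average scaled by the pool size is unbiased under the stated assumption.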
Year
2006
DOI
10.1145/1183614.1183699
Venue
CIKM
Keywords
estimator, random sampling, search engine
Field
Data mining, Search engine, Information retrieval, Computer science, Uncorrelated, Ranking (information retrieval), Correlation, Sampling (statistics), Estimator
DocType
Conference
ISBN
1-59593-433-2
Citations
35
PageRank
1.96
References
14
Authors
9
Name               Order  Citations  PageRank
Andrei Broder      1      7357       920.20
Marcus Fontoura    2      1116       61.74
Vanja Josifovski   3      2265       148.84
Ravi Kumar         4      13932      1642.48
R Motwani          5      19085      1986.61
Shubha U. Nabar    6      382        20.79
Rina Panigrahy     7      3203       269.05
Andrew Tomkins     8      9388       1401.23
Ying Xu            9      242        16.10