Estimating corpus size via queries - Citegraph

Paper Info

Title
Estimating corpus size via queries

Abstract
We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance. Using this basic estimator, we propose two approaches to estimating corpus size. The first approach requires a uniform random sample of documents from the corpus. The second approach avoids this notoriously difficult sample generation problem, and instead uses two fairly uncorrelated sets of terms as query pools; the accuracy of the second approach depends on the degree of correlation between the two sets of terms. Experiments on a large TREC collection and on three major search engines demonstrate the effectiveness of our algorithms.
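
The basic estimator described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the standard importance-weighting construction: sample queries uniformly from the pool, and let each returned document contribute the reciprocal of the number of pool queries it matches. The toy corpus, the helper names (`toy_search`, `toy_pool_matches`, `estimate_matching_set_size`, `combine_two_pools`), and the independence-based combination of two pools are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Set

# Hypothetical toy corpus: document id -> terms it contains.
TOY_CORPUS: Dict[int, Set[str]] = {
    1: {"a"},
    2: {"b"},
    3: {"a", "b"},
    4: {"c"},  # matches no pool query, so it lies outside the estimated set
}
TOY_POOL: List[str] = ["a", "b"]


def toy_search(query: str) -> Set[int]:
    """Stand-in for a query interface: ids of documents containing the term."""
    return {doc for doc, terms in TOY_CORPUS.items() if query in terms}


def toy_pool_matches(doc: int) -> Set[str]:
    """Pool queries that a given document matches (found by inspecting the document)."""
    return TOY_CORPUS[doc] & set(TOY_POOL)


def estimate_matching_set_size(
    pool: List[str],
    search: Callable[[str], Set[int]],
    pool_matches: Callable[[int], Set[str]],
    num_samples: int,
    rng: random.Random,
) -> float:
    """Estimate how many documents match at least one query in `pool`."""
    total = 0.0
    for _ in range(num_samples):
        query = rng.choice(pool)  # uniform sample from the pool of known size
        # A document's weights sum to 1 over the whole pool, so the expected
        # value of this per-query sum is (size of matching set) / len(pool).
        total += sum(1.0 / len(pool_matches(doc)) for doc in search(query))
    return len(pool) * total / num_samples


def combine_two_pools(size_a: float, size_b: float, size_both: float) -> float:
    """Assumed combination step: if matching pool A is independent of matching
    pool B, then size_both / size_b is close to size_a / corpus_size."""
    return size_a * size_b / size_both


estimate = estimate_matching_set_size(
    TOY_POOL, toy_search, toy_pool_matches, num_samples=10, rng=random.Random(0)
)
print(estimate)  # 3.0 here: each of the two queries contributes exactly 1.5
```

In this toy pool every query contributes the same weight, so the estimate is exact. On a real corpus the per-query sums vary and the estimate is only unbiased in expectation. The sketch also assumes the interface returns every matching document, which real search engines generally do not.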

Year	DOI	Venue
2006	10.1145/1183614.1183699	CIKM
Keywords	Field	DocType
estimator,random sampling,search engine	Data mining,Search engine,Information retrieval,Computer science,Uncorrelated,Ranking (information retrieval),Correlation,Sampling (statistics),Estimator	Conference
ISBN	Citations	PageRank
1-59593-433-2	35	1.96
References	Authors
14	9

Authors (9 rows)

Name	Order	Citations	PageRank
Andrei Broder	1	7357	920.20
Marcus Fontoura	2	1116	61.74
Vanja Josifovski	3	2265	148.84
Ravi Kumar	4	13932	1642.48
Rajeev Motwani	5	19085	1986.61
Shubha U. Nabar	6	382	20.79
Rina Panigrahy	7	3203	269.05
Andrew Tomkins	8	9388	1401.23
Ying Xu	9	242	16.10
