Compressed web indexes - Citegraph

Paper Info

Title
Compressed web indexes

Abstract
Web search engines use indexes to eciently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how well com- pactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability dis- tribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experi- ments on several document collections show that the distri- bution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.

Year	DOI	Venue
2009	10.1145/1526709.1526770	WWW
Keywords	Field	DocType
similar power law,arbitrary distribution,double-pareto,fast retrieval,power law,boolean index,specified page,index size,document collection,double-pareto law,web search engine,compression,web index,probability distribution,indexation,rule of thumb	Zipf's law,Data mining,World Wide Web,Search engine,Information retrieval,Computer science,Search engine indexing,Probability distribution,Rule of thumb,Artificial intelligence,Power law,Machine learning	Conference
Citations	PageRank	References
8	0.66	13
Authors
3

Authors (3 rows)

Cited by (8 rows)

References (13 rows)

Name	Order	Citations	PageRank
Flavio Chierichetti	1	626	39.42
Ravi Kumar	2	13932	1642.48
Prabhakar Raghavan	3	13351	2776.61

1