Title
Indexing shared content in information retrieval systems
Abstract
Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
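The idea in the abstract can be illustrated with a toy example. This is a minimal sketch, not the encoding or the query algorithms from the paper. It assumes that a document is represented by a tree node, that a node's text is shared by all of its descendants, and that a document matches a query when every query term occurs somewhere on the path from the root to its node. The class `SharedContentIndex` and its methods are hypothetical names introduced only for this illustration.

```python
from collections import defaultdict


class SharedContentIndex:
    """Toy inverted index over a tree of documents (illustrative assumption)."""

    def __init__(self):
        self.parent = {}                  # node id -> parent id (None for a root)
        self.postings = defaultdict(set)  # term -> nodes whose own text has the term

    def add(self, node, text, parent=None):
        # Index only the text that is new at this node; text inherited from
        # ancestors is already indexed once, at the ancestor.
        self.parent[node] = parent
        for term in text.lower().split():
            self.postings[term].add(node)

    def _path(self, node):
        # The node itself plus all of its ancestors up to the root.
        path = set()
        while node is not None:
            path.add(node)
            node = self.parent[node]
        return path

    def search(self, query):
        # A node matches when every query term appears on its root-to-node path.
        terms = query.lower().split()
        if not terms:
            return []
        hits = []
        for node in self.parent:
            path = self._path(node)
            if all(self.postings.get(t, set()) & path for t in terms):
                hits.append(node)
        return sorted(hits)


# A small email thread: msg2 and msg3 both reply to (and quote) msg1.
index = SharedContentIndex()
index.add("msg1", "budget meeting on friday")
index.add("msg2", "friday works for me", parent="msg1")
index.add("msg3", "can we move it to monday", parent="msg1")

both_terms = index.search("budget monday")  # "budget" is shared, "monday" is private to msg3
shared_only = index.search("budget")        # matches every message that inherits msg1's text
```

In this sketch the term "budget" has a single posting even though three messages contain it. The linear scan in `search` is a simplification chosen for brevity; it does not reflect how the paper evaluates queries.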
Year
2006
DOI
10.1007/11687238_21
Venue
EDBT
Keywords
information retrieval system, modern document collection, new document representation model, newsgroup search, shared content, related document, representation model, multiple time, inverted index, indexation
Field
Inverted index, Information system, Content analysis, Indexation, Information retrieval, Computer science, Document Structure Description, Search engine indexing, Document representation, Database, Encoding (memory)
DocType
Conference
Volume
3896
ISSN
0302-9743
ISBN
3-540-32960-9
Citations
22
PageRank
1.48
References
19
Authors
8
Name                 Order   Citations PageRank
Andrei Broder        1       7357920.20
Nadav Eiron          2       80765.42
Marcus Fontoura      3       111661.74
Michael Herscovici   4       65148.52
Ronny Lempel         5       1273112.55
John McPherson       6       221.48
Runping Qi           7       5915.99
Eugene J. Shekita    8       3630574.21