Abstract |
---|
Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it. |
Year | DOI | Venue |
---|---|---|
2006 | 10.1007/11687238_21 | EDBT |
Keywords | Field | DocType |
information retrieval system, modern document collection, new document representation model, newsgroup search, shared content, related document, representation model, multiple time, inverted index, indexation | Inverted index, Information system, Content analysis, Indexation, Information retrieval, Computer science, Document Structure Description, Search engine indexing, Document representation, Database, Encoding (memory) | Conference |
Volume | ISSN | ISBN |
3896 | 0302-9743 | 3-540-32960-9 |
Citations | PageRank | References |
22 | 1.48 | 19 |
Authors |
---|
8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Andrei Broder | 1 | 7357 | 920.20 |
Nadav Eiron | 2 | 807 | 65.42 |
Marcus Fontoura | 3 | 1116 | 61.74 |
Michael Herscovici | 4 | 651 | 48.52 |
Ronny Lempel | 5 | 1273 | 112.55 |
John McPherson | 6 | 22 | 1.48 |
Runping Qi | 7 | 59 | 15.99 |
Eugene J. Shekita | 8 | 3630 | 574.21 |