Indexing shared content in information retrieval systems - Citegraph

Paper Info

Title
Indexing shared content in information retrieval systems

Abstract
Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
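As a rough illustration of the idea in the abstract, the sketch below indexes an email thread as a tree: each node stores only the text it adds, and a query matches a node when every term occurs somewhere on the path from that node up to the root. This is a minimal sketch under stated assumptions, not the paper's encoding or query algorithm. The class name SharedContentIndex, its methods, the sample messages, and the rule that a node inherits its ancestors' content are all hypothetical choices made for this example.

```python
from collections import defaultdict


class SharedContentIndex:
    """Toy inverted index over a document tree (illustrative assumption only)."""

    def __init__(self):
        self.parent = {}                  # node id -> parent id (None for a root)
        self.postings = defaultdict(set)  # term -> ids of nodes whose own text has it

    def add(self, node_id, text, parent=None):
        # Index only the text this node contributes; content shared with
        # ancestors is not re-indexed here.
        self.parent[node_id] = parent
        for term in text.lower().split():
            self.postings[term].add(node_id)

    def _path_to_root(self, node_id):
        # Yield the node itself, then each ancestor up to the root.
        while node_id is not None:
            yield node_id
            node_id = self.parent[node_id]

    def search(self, query):
        # A node matches if every query term appears in the node's own
        # text or in the text of one of its ancestors.
        terms = query.lower().split()
        hits = []
        for node_id in self.parent:
            path = set(self._path_to_root(node_id))
            if all(self.postings.get(t, set()) & path for t in terms):
                hits.append(node_id)
        return sorted(hits)

    def posting_count(self):
        # Total number of (term, node) entries stored in the index.
        return sum(len(ids) for ids in self.postings.values())


index = SharedContentIndex()
index.add("m1", "budget meeting on friday")
index.add("m2", "can we move it to monday", parent="m1")
index.add("m3", "friday works better for me", parent="m1")

print(index.search("budget monday"))  # ['m2']
print(index.posting_count())          # 15
```

If each reply were instead indexed together with its quoted parent text, the words of m1 would be stored three times. In this sketch they are stored once, which is the kind of saving the abstract attributes to indexing shared content a single time.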

Year	2006
DOI	10.1007/11687238_21
Venue	EDBT
Keywords	information retrieval system,modern document collection,new document representation model,newsgroup search,shared content,related document,representation model,multiple time,inverted index,indexation
Field	Inverted index,Information system,Content analysis,Indexation,Information retrieval,Computer science,Document Structure Description,Search engine indexing,Document representation,Database,Encoding (memory)
DocType	Conference
Volume	3896
ISSN	0302-9743
ISBN	3-540-32960-9
Citations	22
PageRank	1.48
References	19
Authors	8

Authors (8 rows)

Name	Order	Citations	PageRank
Andrei Broder	1	7357	920.20
Nadav Eiron	2	807	65.42
Marcus Fontoura	3	1116	61.74
Michael Herscovici	4	651	48.52
Ronny Lempel	5	1273	112.55
John McPherson	6	22	1.48
Runping Qi	7	59	15.99
Eugene J. Shekita	8	3630	574.21
