Title
Using the past to score the present: extending term weighting models through revision history analysis
Abstract
The generative process underlies many information retrieval models, notably statistical language models. Yet these models only examine one (current) version of the document, effectively ignoring the actual document generation process. We posit that a considerable amount of information is encoded in the document authoring process, and this information is complementary to the word occurrence statistics upon which most modern retrieval models are based. We propose a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document (e.g., the edit history of a page in Wikipedia) to redefine term frequency, a key indicator of document topic/relevance for many retrieval models and text processing tasks. We then apply RHA to document ranking by extending two state-of-the-art text retrieval models, namely, BM25 and the generative statistical language model (LM). To the best of our knowledge, our paper is the first attempt to directly incorporate document authoring history into retrieval models. Empirical results show that RHA provides consistent improvements for state-of-the-art retrieval models, using standard retrieval tasks and benchmarks.
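The abstract does not give the RHA formula, so the following is only a rough sketch of the general idea: term frequency is computed over all revisions of a document rather than the current version alone, and the result is fed into a standard BM25 term score. The function names `revision_tf` and `bm25_term_score`, the weighted-average aggregation, and the uniform default weights are all assumptions made for illustration, not the paper's method.

```python
import math


def revision_tf(term, revisions, weights=None):
    """Weighted average of a term's count across document revisions.

    ASSUMPTION: this aggregation is illustrative only; the abstract does
    not specify how RHA combines per-revision counts.
    `revisions` is a list of token lists, oldest first.
    """
    if not revisions:
        return 0.0
    if weights is None:
        # Hypothetical default: every revision counts equally.
        weights = [1.0] * len(revisions)
    if len(weights) != len(revisions):
        raise ValueError("need exactly one weight per revision")
    total = sum(weights)
    return sum(w * rev.count(term) for w, rev in zip(weights, revisions)) / total


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Standard BM25 contribution of one term; `tf` may be fractional."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


# Toy revision history (oldest first); the last list is the current version.
revisions = [
    ["search", "engine"],
    ["search", "engine", "ranking"],
    ["search", "search", "ranking", "spam"],
]

tf_search = revision_tf("search", revisions)  # present in every revision
tf_spam = revision_tf("spam", revisions)      # added only in the latest one

score_search = bm25_term_score(tf_search, doc_len=4, avg_doc_len=4,
                               n_docs=100, doc_freq=10)
score_spam = bm25_term_score(tf_spam, doc_len=4, avg_doc_len=4,
                             n_docs=100, doc_freq=10)
```

Under these assumptions, a term that persists through the edit history ("search") receives a higher effective term frequency than one that appears only in the latest revision ("spam"), even though both occur in the current version.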
Year
2010
DOI
10.1145/1871437.1871519
Venue
CIKM
Keywords
modern retrieval model, standard retrieval task, document topic, term weighting model, actual document generation process, revision history, generative process, information retrieval model, revision history analysis, state-of-the-art text retrieval model, state-of-the-art retrieval model, retrieval model, information retrieval, term frequency
DocType
Conference
Citations
17
PageRank
0.84
References
30
Authors
4
Name, Order, Citations, PageRank
Ablimit Aji, 1, 277, 14.26
Yu Wang, 2, 138, 6.99
Eugene Agichtein, 3, 4549, 269.70
Evgeniy Gabrilovich, 4, 4573, 224.48