Title
How Informative is a Term?: Dispersion as a measure of Term Specificity.
Abstract
Similarity functions assign scores to documents in response to queries. These functions take as input statistics about the terms in the queries and documents, with the intention that the statistics estimate the relative informativeness of the terms. Common measures of informativeness rely on the number of documents containing each term (the document frequency). We argue in this paper that the distribution of within-document frequencies across a collection, which has not been considered in prior work, is also pertinent to informativeness: the most informative words tend to be those whose frequency of occurrence has high variance. We propose relative standard deviation (RSD) as a measure of variability that incorporates within-document frequencies, and show that RSD compares favourably with inverse document frequency (IDF), both in in-principle analysis and in retrieval practice, with small but consistent gains.
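To make the contrast between the two statistics concrete, the following is a minimal sketch that computes IDF and RSD for each term in a toy collection. It is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes RSD is the population standard deviation of a term's within-document frequency divided by its mean, taken over all documents (including those where the frequency is zero), and that IDF is the natural log of N / df. The function name `term_statistics`, the whitespace tokenisation, and the sample documents are illustrative assumptions.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: {"idf": ..., "rsd": ...}} for a list of document strings.

    Assumptions (not taken from the paper): whitespace tokenisation,
    IDF = ln(N / df), and RSD = population std / mean of the term's
    within-document frequency over all N documents, zeros included.
    """
    n_docs = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = set().union(*counts)

    stats = {}
    for term in vocab:
        # Within-document frequency of the term in every document.
        freqs = [c[term] for c in counts]
        df = sum(1 for f in freqs if f > 0)
        mean = sum(freqs) / n_docs
        variance = sum((f - mean) ** 2 for f in freqs) / n_docs
        stats[term] = {
            "idf": math.log(n_docs / df),
            "rsd": math.sqrt(variance) / mean,
        }
    return stats


# Hypothetical toy collection, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the zebra zebra zebra ran",
    "the end of the story",
]
stats = term_statistics(docs)
for term in ("the", "cat", "zebra"):
    print(term, round(stats[term]["idf"], 3), round(stats[term]["rsd"], 3))
```

In this toy collection, "the" appears in every document at a nearly constant rate, so both its IDF and its RSD are low, while "zebra", concentrated in a single document, scores high on both. Under these assumptions the two measures agree on the extremes; the paper's claim concerns the cases where the within-document frequency distribution separates terms that document frequency alone does not.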
Year: 2016
DOI: 10.1145/2911451.2914687
Venue: SIGIR
Field: Dispersion (optics), Data mining, tf–idf, Computer science, Statistics, Relative standard deviation
DocType: Conference
Citations: 0
PageRank: 0.34
References: 6
Authors: 3
Name
Order
Citations
PageRank
Rodney McDonell100.34
Justin Zobel211.11
Bodo Billerbeck327214.24