Abstract | | |
---|---|---|
Similarity functions assign scores to documents in response to queries. These functions take as input statistics about the terms in the queries and documents, which are intended to estimate the relative informativeness of the terms. Common measures of informativeness rely on the number of documents containing each term (the document frequency). We argue in this paper that the distribution of within-document frequencies across a collection, which has not been considered in prior work, is also pertinent to informativeness: the most informative words tend to be those whose frequency of occurrence has high variance. We propose the use of relative standard deviation (RSD) as a measure of variability incorporating within-document frequencies, and show that RSD compares favourably with inverse document frequency (IDF), both in in-principle analysis and in retrieval practice, with small but consistent gains. |
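
The contrast between IDF and RSD described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulations, so the code below rests on assumptions: IDF is taken as the unsmoothed `log(N / df)`, and RSD as the population standard deviation of a term's within-document frequency divided by its mean, with documents that lack the term counted as zero. The function names `idf` and `rsd` and the toy collection `docs` are hypothetical.

```python
import math
import statistics
from collections import Counter


def idf(term, docs):
    """Unsmoothed inverse document frequency, log(N / df).

    Assumption: a common textbook form, not necessarily the paper's.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def rsd(term, docs):
    """Relative standard deviation (std / mean) of within-document frequency.

    Assumption: documents that do not contain the term contribute a count
    of zero, and the population standard deviation is used.
    """
    counts = [Counter(doc)[term] for doc in docs]
    mean = statistics.mean(counts)
    if mean == 0:
        return 0.0
    return statistics.pstdev(counts) / mean


# Hypothetical toy collection: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "zebra zebra zebra the zebra".split(),
    "the bird sang".split(),
]

for term in ("the", "cat", "zebra"):
    print(f"{term:6s} idf={idf(term, docs):.3f} rsd={rsd(term, docs):.3f}")
```

In this toy collection, "the" occurs in every document at a nearly constant rate, so both measures are low, while "zebra" is concentrated in one document and scores highest on both. Unlike IDF, RSD under these assumptions also responds to how often a term occurs within each document, not only to whether it occurs.
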
Year | DOI | Venue |
---|---|---|
2016 | 10.1145/2911451.2914687 | SIGIR |
Field | DocType | Citations |
---|---|---|
Dispersion (optics), Data mining, tf–idf, Computer science, Statistics, Relative standard deviation | Conference | 0 |
PageRank | References | Authors |
---|---|---|
0.34 | 6 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Rodney McDonell | 1 | 0 | 0.34 |
Justin Zobel | 2 | 1 | 1.11 |
Bodo Billerbeck | 3 | 272 | 14.24 |