Abstract | ||
---|---|---|
Many large-scale Web applications that require ranked top-k retrieval are implemented using inverted indices. An inverted index represents a sparse term-document matrix, where non-zero elements indicate the strength of term-document associations. In this work, we present an approach for lossless compression of inverted indices. Our approach maps terms in a document corpus to a new term space in order to reduce the number of non-zero elements in the term-document matrix, resulting in a more compact inverted index. We formulate the problem of selecting a new term space as a matrix factorization problem, and prove that finding the optimal solution is an NP-hard problem. We develop a greedy algorithm for finding an approximate solution. A side effect of our approach is increasing the number of terms in the index, which may negatively affect query evaluation performance. To eliminate such effect, we develop a methodology for modifying query evaluation algorithms by exploiting specific properties of our compression approach. |
Year | DOI | Venue |
---|---|---|
2011 | 10.1145/2063576.2063628 | conference on information and knowledge management |
Keywords | DocType | Volume |
compression approach,approach maps term,non-zero element,sparse term-document matrix,matrix factorization problem,compact inverted index,np-hard problem,factorization-based lossless compression,new term space,term-document association,inverted index,information retrieval,negative affect,greedy algorithm,lossless compression,matrix factorization,online advertising,compression ratio,indexation,np hard problem,side effect | Conference | abs/1108.1956 |
Citations | PageRank | References |
3 | 0.42 | 17 |
Authors | ||
5 |
Name | Order | Citations | PageRank |
---|---|---|---|
George Beskales | 1 | 700 | 19.33 |
Marcus Fontoura | 2 | 1116 | 61.74 |
Maxim Gurevich | 3 | 347 | 18.96 |
Sergei Vassilvitskii | 4 | 2750 | 139.31 |
Vanja Josifovski | 5 | 2265 | 148.84 |