Title
Factorization-based lossless compression of inverted indices
Abstract
Many large-scale Web applications that require ranked top-k retrieval are implemented using inverted indices. An inverted index represents a sparse term-document matrix, where non-zero elements indicate the strength of term-document associations. In this work, we present an approach for lossless compression of inverted indices. Our approach maps terms in a document corpus to a new term space in order to reduce the number of non-zero elements in the term-document matrix, resulting in a more compact inverted index. We formulate the problem of selecting a new term space as a matrix factorization problem, and prove that finding the optimal solution is an NP-hard problem. We develop a greedy algorithm for finding an approximate solution. A side effect of our approach is increasing the number of terms in the index, which may negatively affect query evaluation performance. To eliminate such effect, we develop a methodology for modifying query evaluation algorithms by exploiting specific properties of our compression approach.
Year
DOI
Venue
2011
10.1145/2063576.2063628
conference on information and knowledge management
Keywords
DocType
Volume
compression approach,approach maps term,non-zero element,sparse term-document matrix,matrix factorization problem,compact inverted index,np-hard problem,factorization-based lossless compression,new term space,term-document association,inverted index,information retrieval,negative affect,greedy algorithm,lossless compression,matrix factorization,online advertising,compression ratio,indexation,np hard problem,side effect
Conference
abs/1108.1956
Citations 
PageRank 
References 
3
0.42
17
Authors
5
Name
Order
Citations
PageRank
George Beskales170019.33
Marcus Fontoura2111661.74
Maxim Gurevich334718.96
Sergei Vassilvitskii42750139.31
Vanja Josifovski52265148.84