Title
CHICO: A Compressed Hybrid Index for Repetitive Collections.
Abstract
Indexing text collections to support pattern matching queries is a fundamental problem in computer science. New challenges keep arising as databases grow, and for repetitive collections, compressed indexes become relevant. To successfully exploit the regularities of repetitive collections different approaches have been proposed. Some of these are Compressed Suffix Array, Lempel-Ziv, and Grammar based indexes. In this paper, we present an implementation of an hybrid index that combines the effectiveness of Lempel-Ziv factorization with a modular design. This allows to easily substitute some components of the index, such as the Lempel-Ziv factorization algorithm, or the pattern matching machinery. Our implementation reduces the size up to a $$50\\,\\%$$50% over its predecessor, while improving query times up to a $$15\\,\\%$$15%. Also, it is able to successfully index thousands of genomes in a commodity desktop, and it scales up to multi-terabyte collections, provided there is enough secondary memory. As a byproduct, we developed a parallel version of Relative Lempel-Ziv compression algorithm.
Year
DOI
Venue
2016
10.1007/978-3-319-38851-9_22
SEA
Field
DocType
Volume
Data mining,Computer science,Search engine indexing,Suffix array,Theoretical computer science,Factorization,Modular design,Compressed suffix array,Data compression,Pattern matching,Auxiliary memory
Conference
9685
ISSN
Citations 
PageRank 
0302-9743
3
0.41
References 
Authors
22
1
Name
Order
Citations
PageRank
Daniel Valenzuela1356.65