Title | ||
---|---|---|
A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings. |
Abstract | ||
---|---|---|
Indexing of very large collections of strings such as those produced by the widespread sequencing technologies, heavily relies on multi-string generalizations of the Burrows-Wheeler Transform (BWT), and for this problem various in-memory algorithms have been proposed. The rapid growing of data that are processed routinely, such as in bioinformatics, requires a large amount of main memory, and this fact has motivated the development of algorithms, to compute the BWT, that work almost entirely in external memory. On the other hand, the related problem of computing the Longest Common Prefix (LCP) array is often instrumental in several algorithms on collection of strings, such as those that compute the suffix-prefix overlap among strings, which is an essential step for many genome assembly algorithms. The best current lightweight approach to compute BWT and LCP array on a set of $m$ strings, each one $k$ characters long, has I/O complexity that is $O(mk^2 log |Sigma|)$ (where $|Sigma|$ is the size of the alphabet), thus it is not optimal. In this paper we propose a novel approach to build BWT and LCP array (simultaneously) with $O(kmL(log k +log sigma))$ I/O complexity, where $L$ is the length of longest substring that appears at least twice in the input strings. |
Year | Venue | Field |
---|---|---|
2016 | arXiv: Data Structures and Algorithms | LCP array,Discrete mathematics,Substring,Combinatorics,Generalization,Search engine indexing,Algorithm,Sigma,Mathematics,Auxiliary memory,Alphabet |
DocType | Volume | Citations |
Journal | abs/1607.08342 | 0 |
PageRank | References | Authors |
0.34 | 13 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Paola Bonizzoni | 1 | 502 | 52.23 |
Gianluca Della Vedova | 2 | 342 | 36.39 |
Serena Nicosia | 3 | 0 | 0.34 |
Marco Previtali | 4 | 22 | 5.45 |
Raffaella Rizzi | 5 | 130 | 13.58 |