Abstract | ||
---|---|---|
Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose ith entry is the number of indices
$$j \ne i$$
such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of
$$k=1$$
. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for
$$k=O(1)$$
, works in
$$O(n)$$
space and, with high probability, in
$$O(n \cdot \min \{m^k,\log ^k n\})$$
time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop
$$O(n^2)$$
-time algorithms to compute all (k, m)-mappability tables for a fixed m and all
$$k\in \{0,\ldots ,m\}$$
or a fixed k and all
$$m\in \{k,\ldots ,n\}$$
. Finally, we show that, for
$$k,m = \Theta (\log n)$$
, the (k, m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018. |
Year | DOI | Venue |
---|---|---|
2022 | 10.1007/s00453-022-00934-y | Algorithmica |
Keywords | DocType | Volume |
Sequence mappability,
k-errata tree, Hamming distance | Journal | 84 |
Issue | ISSN | Citations |
5 | 0178-4617 | 0 |
PageRank | References | Authors |
0.34 | 20 | 6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Panagiotis Charalampopoulos | 1 | 3 | 4.12 |
Costas S. Iliopoulos | 2 | 1534 | 167.43 |
Tomasz Kociumaka | 3 | 0 | 0.34 |
Solon P. Pissis | 4 | 0 | 0.34 |
Jakub Radoszewski | 5 | 624 | 50.36 |
Juliusz Straszynski | 6 | 2 | 5.15 |