Title
Efficient Computation of Sequence Mappability
Abstract
Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose ith entry is the number of indices $$j \ne i$$ such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of $$k=1$$ . We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for $$k=O(1)$$ , works in $$O(n)$$ space and, with high probability, in $$O(n \cdot \min \{m^k,\log ^k n\})$$ time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop $$O(n^2)$$ -time algorithms to compute all (k, m)-mappability tables for a fixed m and all $$k\in \{0,\ldots ,m\}$$ or a fixed k and all $$m\in \{k,\ldots ,n\}$$ . Finally, we show that, for $$k,m = \Theta (\log n)$$ , the (k, m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018.
Year
DOI
Venue
2022
10.1007/s00453-022-00934-y
Algorithmica
Keywords
DocType
Volume
Sequence mappability, k-errata tree, Hamming distance
Journal
84
Issue
ISSN
Citations 
5
0178-4617
0
PageRank 
References 
Authors
0.34
20
6