On overabundant words and their application to biological sequence analysis

Paper Info

Title
On overabundant words and their application to biological sequence analysis

Abstract
The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (1986) [1]. We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (2017) [2]. In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n−4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.
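
The classification described in the abstract can be illustrated with a small brute-force sketch. The abstract does not state the formulas, so the ones below are an assumption based on the model usually attributed to Brendel et al.: the expectation is E(w) = f(w_p) · f(w_s) / f(w_i), the deviation is dev(w) = (f(w) − E(w)) / max(√E(w), 1), and w is called ρ-overabundant when dev(w) ≥ ρ for a chosen threshold ρ > 0. The function names, the minimum word length of 3, and the threshold parameter `rho` are hypothetical choices made for illustration. This is a naive enumeration over all factors, not the O(n)-time suffix-tree algorithm of the paper.

```python
from math import sqrt


def count_occurrences(x: str, w: str) -> int:
    """Count occurrences of w in x, including overlapping ones."""
    count = 0
    start = x.find(w)
    while start != -1:
        count += 1
        start = x.find(w, start + 1)
    return count


def deviation(x: str, w: str) -> float:
    """Deviation of w in x under the assumed model (see lead-in).

    Uses the longest proper prefix w[:-1], the longest proper suffix
    w[1:], and the longest infix w[1:-1]. Requires len(w) >= 3 so that
    the infix is non-empty (an assumption of this sketch).
    """
    if len(w) < 3:
        raise ValueError("w must have length at least 3")
    f_w = count_occurrences(x, w)
    f_p = count_occurrences(x, w[:-1])
    f_s = count_occurrences(x, w[1:])
    f_i = count_occurrences(x, w[1:-1])
    if f_i == 0:
        # The infix never occurs, so neither does w; treat as no deviation.
        return 0.0
    expected = f_p * f_s / f_i
    return (f_w - expected) / max(sqrt(expected), 1.0)


def overabundant_words(x: str, rho: float) -> list[str]:
    """Return, sorted, all distinct factors w of x with dev(w) >= rho.

    Brute force: enumerates every factor of length >= 3, so it is only
    suitable for short sequences.
    """
    n = len(x)
    factors = {x[i:j] for i in range(n) for j in range(i + 3, n + 1)}
    return sorted(w for w in factors if deviation(x, w) >= rho)
```

For example, `overabundant_words("abcxbyabc", 0.5)` lists the factors of that string whose deviation reaches 0.5. Because the sketch recounts occurrences for every factor, its running time grows much faster than linearly in n. It is meant only to make the definitions concrete.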

Year	2019
DOI	10.1016/j.tcs.2018.09.011
Venue	Theoretical Computer Science
Keywords	Overabundant words, Avoided words, Pattern matching, Suffix tree, DNA sequence analysis
Field	Integer, Discrete mathematics, Combinatorics, Suffix, Upper and lower bounds, Prefix, Infix, Statistical model, Suffix tree, Pattern matching, Mathematics
DocType	Journal
Volume	792
ISSN	0304-3975
Citations	0
PageRank	0.34
References	8
Authors	7

Authors

Name	Order	Citations	PageRank
Yannis Almirantis	1	78	6.84
Panagiotis Charalampopoulos	2	29	9.41
Jia Gao	3	11	2.26
C. S. Iliopoulos	4	52	6.67
Manal Mohamed	5	102	12.62
Solon P. Pissis	6	281	57.09
Dimitris Polychronopoulos	7	76	5.12
