Title
Efficient representation of DNA data for pattern recognition using failure factor oracles
Abstract
In indexing of and pattern matching on DNA sequences, representing all factors of a sequence is important. One efficient, compact representation is the factor oracle (FO). At the same time, any classical deterministic finite automata (DFA) can be transformed to a so-called failure one (FDFA), which may use failure transitions to replace multiple symbol transitions, potentially yielding a more compact representation. We combine the two ideas and directly construct a failure factor oracle (FFO) from a given sequence, in contrast to ex post facto transformation to an FDFA. The algorithm is suitable for long sequences. We empirically compared the resulting FFOs and FOs on number of transitions for many DNA sequences of lengths 4--512, showing gains of up to 10% in total number of transitions, with failure transitions also taking up less space than symbol transitions. Preliminary results on sequence processing runtimes when using FFOs originally showed these to be multiples of those when using FOs, but partial memoization already leads to drastic improvements. Altogether the results are promising, particularly for the use of FFOs for (repeated) factor detection, where recognition speed may be less important than memory use.
Year
DOI
Venue
2013
10.1145/2513456.2513488
SAICSIT Conf.
Keywords
Field
DocType
failure factor oracle,pattern recognition,factor detection,sequence processing runtimes,so-called failure,compact representation,failure transition,long sequence,memory use,factor oracle,dna sequence,dna data,efficient representation,pattern matching,algorithmics,dictionary,dna sequences
Algorithmics,Computer science,Deterministic finite automaton,Factor oracle,Algorithm,Search engine indexing,Sequence processing,Memoization,Pattern matching
Conference
Citations 
PageRank 
References 
1
0.37
8
Authors
3
Name
Order
Citations
PageRank
Loek Cleophas14113.89
Derrick G. Kourie222333.10
Bruce W. Watson333853.24