Title
DNA sequence compression using the normalized maximum likelihood model for discrete regression
Abstract
We discuss how to use the normalized maximum likelihood (NML) model for encoding sequences known to have regularities in the form of approximate repetitions. We present a particular version of the NML model for discrete regression, which is shown to provide a very powerful yet simple model for encoding the approximate repeats in DNA sequences. Combining the model of repeats with a simple first-order Markov model, we obtain a fast lossless compression method, which compares favorably with the existing DNA compression programs. It is remarkable that a simple model, which recursively updates a small number of parameters, is able to reach the state-of-the-art compression ratio for DNA sequences obtained with much more complex models. Being a minimum description length (MDL) model, the NML model may later prove to be useful in studying global and local features of DNA or possibly of other biological sequences.
Year
2003
DOI
10.1109/DCC.2003.1194016
Venue
DCC
Keywords
art compression ratio, complex model, order markov model, dna sequence, simple model, normalized maximum likelihood model, existing dna compression programs, nml model, approximate repetition, local features of dna, discrete regression, dna sequence compression, approximate repeat, history, maximum likelihood estimation, lossless compression, sequences, markov model, first order, data compression, entropy, compression ratio, minimum description length, dictionaries, dna, encoding, markov processes
Field
Markov process, Regression, Markov model, Minimum description length, Theoretical computer science, Compression ratio, Data compression, Mathematics, Lossless compression, Encoding (memory)
DocType
Conference
ISSN
1068-0314
ISBN
0-7695-1896-6
Citations
19
PageRank
1.10
References
10
Authors
3
Name              Order  Citations  PageRank
Ioan Tabus        1      276        38.23
Gergely Korodi    2      78         5.57
Jorma Rissanen    3      1665       798.14