Title
Multivariate entropy distance method for prokaryotic gene identification.
Abstract
A new, simple method is presented for the efficient and accurate identification of coding sequences in prokaryotic genomes. The method employs a Shannon description of an artificial language for DNA sequences: a DNA sequence is translated into a pseudo-amino acid sequence with 20 fundamental words according to the universal genetic code. With an entropy-density profile (EDP), the method maps a sequence of finite length to a vector and then analyzes its position in the 20-dimensional phase space, which depends on the nature of the sequence. It is found that the ratio of the relative distances to the coding and non-coding EDPs, each averaged over a small number (as few as one) of open reading frames (ORFs), can serve as a good coding potential. An iterative algorithm is designed for finding a set of "root" sequences using this coding potential. A multivariate entropy distance (MED) algorithm is then proposed for the identification of prokaryotic genes; it combines the use of a coding potential with an EDP-based sequence similarity analysis. The current version of MED is unsupervised, parameter-free and simple to implement. It is demonstrated to detect 95-99% of genes, while predicting 10-30% additional genes, when tested against the RefSeq database of NCBI, and to detect 97.5-99.8% of confirmed genes with known functions. It is also shown to find a set of (functionally known) genes that are missed by other well-known gene finding algorithms. All measurements show that the MED algorithm reaches a performance level similar to that of algorithms such as GeneMark and Glimmer for prokaryotic gene prediction.
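The mapping from a sequence to a 20-dimensional vector, and the distance ratio used as a coding potential, can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so several things here are assumptions: the EDP components are taken as s_i = -(1/H) p_i log p_i, where p_i is the frequency of the i-th amino acid word and H is the Shannon entropy of the sequence; the distance is taken to be Euclidean; and the function names (`translate`, `edp`, `coding_potential`) and the center vectors are hypothetical, not the authors' implementation.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG codon order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
ALPHABET = sorted(set(AMINO) - {"*"})  # the 20 amino-acid "words"


def translate(dna):
    """Translate DNA in frame 0 into a pseudo-amino acid string, skipping
    stop codons and codons with unknown bases."""
    dna = dna.upper()
    words = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3])
        if aa and aa != "*":
            words.append(aa)
    return "".join(words)


def edp(dna):
    """Entropy-density profile (assumed form): s_i = -(1/H) * p_i * log(p_i)."""
    counts = Counter(translate(dna))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no translatable codons")
    terms = []
    for a in ALPHABET:
        p = counts[a] / total
        terms.append(-p * math.log(p) if p > 0 else 0.0)
    h = sum(terms)  # Shannon entropy of the word distribution
    if h == 0:
        raise ValueError("only one distinct word; EDP is undefined")
    return [t / h for t in terms]


def distance(u, v):
    """Euclidean distance between two EDP vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def coding_potential(dna, coding_center, noncoding_center):
    """Ratio of the distance to the coding center over the distance to the
    non-coding center; values below 1 mean 'closer to coding'."""
    s = edp(dna)
    return distance(s, coding_center) / distance(s, noncoding_center)
```

In this sketch the two centers would be averaged EDP vectors of known or provisionally labeled coding and non-coding ORFs; a candidate ORF whose ratio falls below 1 lies closer to the coding center. How the centers are refined iteratively from "root" sequences is not specified in the abstract and is not modeled here.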
Year: 2004
DOI: 10.1142/S0219720004000624
Venue: J. Bioinformatics and Computational Biology
Keywords: entropy, gene finding algorithm, linguistic description of DNA
Field: Genome, Sequence alignment, Small number, Gene, Biology, Iterative method, Genetic code, Coding (social sciences), DNA sequencing, Bioinformatics
DocType: Journal
Volume: 2
Issue: 2
ISSN: 0219-7200
Citations: 7
PageRank: 1.11
References: 6
Authors: 4
Name | Order | Citations | PageRank
Zhengqing Ouyang | 1 | 8 | 1.47
Huaiqiu Zhu | 2 | 162 | 15.27
Jin Wang | 3 | 7 | 1.11
Zhen-Su She | 4 | 125 | 9.43