Abstract | ||
---|---|---|
Motivated by the need for fast and accurate classification of unlabeled nucleotide sequences on a large scale, we propose a new classification method that captures the probabilistic structure of a sequence family as a compact context-tree model and uses it efficiently to test proximity and membership of a query sequence. The proposed nucleic acid sequence classification by universal probability (NASCUP) method relies on the notion of universal probability from information theory in both model building and classification, delivering BLAST-like accuracy with runtime reduced by orders of magnitude for large-scale databases. A comprehensive experimental study involving seven public databases for functional non-coding RNA classification and microbial taxonomy classification demonstrates the advantages of NASCUP over widely used alternatives in efficiency, accuracy, and scalability across all datasets considered. [availability: http://data.snu.ac.kr/nascup] |
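
The abstract's idea of scoring a query sequence against a per-family probabilistic model can be illustrated with a much simpler stand-in. The sketch below is not the NASCUP algorithm: it assumes a fixed-order context model (rather than a context tree), sequences over the alphabet `ACGT` only, and Krichevsky-Trofimov (add-1/2) smoothing as one common universal-probability estimator. The function names `train_counts`, `code_length`, and `classify` are hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: no ambiguous symbols such as N


def train_counts(sequences, order=2):
    """Count next-symbol occurrences for each length-`order` context."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def code_length(query, counts, order=2):
    """Bits needed to encode `query` under KT-smoothed context probabilities."""
    bits = 0.0
    for i in range(order, len(query)):
        # Unseen contexts fall back to all-zero counts (uniform prediction).
        ctx = counts.get(query[i - order:i]) or dict.fromkeys(ALPHABET, 0)
        total = sum(ctx.values())
        # Krichevsky-Trofimov (add-1/2) predictive probability.
        p = (ctx[query[i]] + 0.5) / (total + len(ALPHABET) / 2)
        bits -= math.log2(p)
    return bits


def classify(query, models, order=2):
    """Return the family whose model encodes `query` in the fewest bits."""
    return min(models, key=lambda name: code_length(query, models[name], order))
```

A shorter code length means the query is more probable under that family's model, so picking the minimum plays the role of the proximity and membership test described above. Here `models` is assumed to be a dictionary mapping each family name to the counts returned by `train_counts`.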
Year | Venue | Field |
---|---|---|
2015 | CoRR | Information theory, Data mining, Anomaly detection, Computer science, Nucleic acid sequence, Bioinformatics |
DocType | Volume | Citations |
Journal | abs/1511.04944 | 0 |
PageRank | References | Authors |
0.34 | 1 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Sunyoung Kwon | 1 | 9 | 2.31 |
Gyuwan Kim | 2 | 1 | 2.04 |
Byunghan Lee | 3 | 110 | 7.98 |
Sungroh Yoon | 4 | 566 | 78.80 |
Young-Han Kim | 5 | 318 | 48.11 |