Title
A Novel Compression Algorithm For High-Throughput DNA Sequence Based On Huffman Coding Method
Abstract
Next-generation sequencing (NGS) technology can sequence DNA at large scale in a single run, producing a large number of short DNA reads. Transferring and processing these data is therefore difficult. Compression methods for high-throughput DNA data fall into two kinds: reference-based and reference-free. Reference-free methods can compress DNA data from different species without storing a large reference genome. In this paper, we propose a reference-free algorithm, named HDC, that compresses high-throughput DNA data using Huffman coding and a dictionary method. The algorithm builds multiple dictionaries through Huffman coding and uses them for both compression and decompression. In tests on the human, green monkey, and horse genomes, HDC's lowest compression rate is 0.192, obtained when compressing the human genome with the chromosome as the compression unit. We also compared HDC with a conventional compression algorithm, gzip, and two reference-free DNA compression algorithms, Leon and ORCOM. The results show that HDC performs significantly better than the other three algorithms.
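The abstract does not spell out how HDC builds its dictionaries, so the following is only a minimal sketch of the general idea of a Huffman-coded dictionary, not the authors' implementation. It assumes reads are split into fixed-length chunks of `k` bases (a hypothetical choice), that each distinct chunk becomes a dictionary entry with a Huffman code, and that the compression rate is compressed size divided by original size at one byte per base. The function names are illustrative, and the cost of storing the dictionary itself is ignored.

```python
import heapq
from collections import Counter
from itertools import count


def split_read(read, k):
    """Cut a read into chunks of k bases; the last chunk may be shorter."""
    return [read[i:i + k] for i in range(0, len(read), k)]


def build_codebook(tokens):
    """Return a dict mapping each token to a prefix-free Huffman bit string."""
    freq = Counter(tokens)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct token still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(n, next(tie), {tok: ""}) for tok, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Prefix the two least frequent subtrees with 0 and 1 and merge them.
        merged = {t: "0" + c for t, c in left.items()}
        merged.update({t: "1" + c for t, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode(read, codebook, k):
    """Replace every chunk of the read with its dictionary code."""
    return "".join(codebook[t] for t in split_read(read, k))


def decode(bits, codebook):
    """Walk the bit string and emit a chunk whenever a full code is matched."""
    inverse = {code: tok for tok, code in codebook.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


def compression_rate(reads, encoded):
    """Compressed bits over original bits (assumes 8 bits per base)."""
    original_bits = 8 * sum(len(r) for r in reads)
    return sum(len(e) for e in encoded) / original_bits


# Toy reads, made up for illustration only.
reads = ["ACGTACGTAC", "ACGTTTTTAC", "GGGGACGTAC"]
k = 4  # assumed chunk length
codebook = build_codebook([t for r in reads for t in split_read(r, k)])
encoded = [encode(r, codebook, k) for r in reads]
decoded = [decode(e, codebook) for e in encoded]
rate = compression_rate(reads, encoded)
```

Under this definition a lower rate means better compression, which matches the abstract reporting 0.192 as HDC's lowest rate. A real compressor would also have to store the dictionaries and pack the bits into bytes, both of which this sketch leaves out.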
Year
2018
DOI
10.1109/CISP-BMEI.2018.8633219
Venue
2018 11TH INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING, BIOMEDICAL ENGINEERING AND INFORMATICS (CISP-BMEI 2018)
Keywords
High-throughput sequencing, DNA data compression, Huffman coding
Field
Genome, Data compression ratio, Pattern recognition, Computer science, Huffman coding, Artificial intelligence, DNA sequencing, Throughput, Human genome, Data compression
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
2
Name, Order, Citations, PageRank
Chuan He, 1, 32, 5.43
Huaiqiu Zhu, 2, 162, 15.27