Title
ntPack: A Software Package for Big Data in Genomics
Abstract
Reconstructing the genome (DNA) and transcriptome (RNA) of a species contributes substantially to a fundamental understanding of its molecular biology. These efforts are enabled by modern high-throughput sequencing technologies. For over a decade, assembling the generated data into coherent information has been a primary focus of the bioinformatics field. However, expanding data volumes and the growing read lengths of evolving sequencing platforms require bioinformatics tools to be adapted so that they can properly leverage the potential of new genomics technologies. This study presents efficient and scalable algorithms for a set of unit operations in genomics studies that guide sequence assembly. We report on a software package, ntPack, with two components: ntHash, for nucleotide hashing, and ntCard, for cardinality estimation. We characterize the statistical properties of these algorithms and demonstrate their application to whole-genome shotgun sequencing datasets from the roundworm, human, and Canadian white spruce genomes. The software that implements these algorithms can be downloaded from our GitHub repository at https://github.com/bcgsc.
Year
2018
DOI
10.1109/BDCAT.2018.00014
Venue
2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT)
Keywords
high throughput bioinformatics, sequencing reads, hashing, cardinality estimation
Field
Data science, Genome, Data mining, Shotgun sequencing, Computer science, Genomics, Software, Hash function, DNA sequencing, Big data, Sequence assembly
DocType
Conference
ISBN
978-1-5386-5503-0
Citations
0
PageRank
0.34
References
0
Authors
3
Name, Order, Citations, PageRank
Inanc Birol, 1, 78, 9.34
Hamid Mohamadi, 2, 66, 5.37
Justin Chu, 3, 11, 4.70