Abstract
---
Establishing a fundamental understanding of the specific molecular biology of a species benefits substantially from reconstructing its genome (DNA) and transcriptome (RNA). These efforts are enabled by modern high-throughput sequencing technologies. For over a decade, assembling the generated data into coherent information has been a primary focus of the bioinformatics field. However, the expanding data volume in the field and the growing read lengths from evolving sequencing platforms require adapting bioinformatics tools to properly leverage the potential of new genomics technologies. This study presents efficient and scalable algorithms that perform a set of unit operations in genomics studies to guide sequence assembly. Here, we report on a software package, ntPack, with two components: ntHash, for nucleotide hashing, and ntCard, for cardinality estimation. We characterize the statistical properties of these algorithms, and demonstrate their application on whole genome shotgun sequencing datasets describing the roundworm, human, and Canadian white spruce genomes. The software that implements these algorithms can be downloaded from our GitHub repository at https://github.com/bcgsc.
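The abstract's hashing component, ntHash, is a recursive (rolling) hash over nucleotide k-mers: sliding the window by one base updates the hash in constant time instead of rehashing the whole k-mer. The sketch below illustrates that idea only; the per-base seed constants, 64-bit word size, and function names are illustrative assumptions, not the published ntHash scheme.

```python
# Illustrative rolling k-mer hash in the spirit of ntHash.
# The SEED constants are arbitrary placeholders, NOT the published
# ntHash per-base seeds.
MASK = (1 << 64) - 1
SEED = {"A": 0x3C8BFBB395C60474, "C": 0x3193C18562A02B4C,
        "G": 0x20323ED082572324, "T": 0x295549F54BE24456}

def rol(v, n):
    """Rotate a 64-bit value left by n bits."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK

def hash_kmer(kmer):
    """Hash a k-mer from scratch (used only for the first window)."""
    k = len(kmer)
    h = 0
    for i, base in enumerate(kmer):
        h ^= rol(SEED[base], k - 1 - i)
    return h

def roll(h, k, out_base, in_base):
    """Slide the window one base: drop out_base, add in_base, in O(1)."""
    return rol(h, 1) ^ rol(SEED[out_base], k) ^ SEED[in_base]

def all_hashes(seq, k):
    """Yield the hash of every k-mer of seq via rolling updates."""
    h = hash_kmer(seq[:k])
    yield h
    for i in range(k, len(seq)):
        h = roll(h, k, seq[i - k], seq[i])
        yield h
```

Because the update is an XOR of rotated seeds, each rolled value is algebraically identical to hashing that window from scratch, which is what makes a single pass over a read stream cheap; ntCard-style cardinality estimation can then operate on such hash values rather than on the k-mers themselves.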
Year | DOI | Venue
---|---|---
2018 | 10.1109/BDCAT.2018.00014 | 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT)

Keywords | Field | DocType
---|---|---
high throughput bioinformatics, sequencing reads, hashing, cardinality estimation | Data science, Genome, Data mining, Shotgun sequencing, Computer science, Genomics, Software, Hash function, DNA sequencing, Big data, Sequence assembly | Conference

ISBN | Citations | PageRank
---|---|---
978-1-5386-5503-0 | 0 | 0.34

References | Authors
---|---
0 | 3
Name | Order | Citations | PageRank
---|---|---|---
Inanc Birol | 1 | 78 | 9.34 |
Hamid Mohamadi | 2 | 66 | 5.37 |
Justin Chu | 3 | 11 | 4.70 |