Abstract
---
Establishing a fundamental understanding of the specific molecular biology of a species benefits substantially from reconstructing its genome (DNA) and transcriptome (RNA). These efforts are enabled by modern high-throughput sequencing technologies. For over a decade, assembling the generated data into coherent information has been a primary focus of the bioinformatics field. However, the expanding data volume in the field and the growing read lengths from evolving sequencing platforms require adapting bioinformatics tools to properly leverage the potential of new genomics technologies. This study presents efficient and scalable algorithms that perform a set of unit operations in genomics studies to guide sequence assembly. Here, we report on a software package, ntPack, with two components: ntHash, for nucleotide hashing, and ntCard, for cardinality estimation. We characterize the statistical properties of these algorithms, and demonstrate their application on whole genome shotgun sequencing datasets describing the roundworm, human, and Canadian white spruce genomes. The software that implements these algorithms can be downloaded from our GitHub repository at https://github.com/bcgsc.
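The abstract's hashing component, ntHash, is a recursive (rolling) hash over nucleotide k-mers: sliding the window by one base updates the hash in constant time instead of rehashing the whole k-mer. The sketch below illustrates that idea only; the per-base seed constants, 64-bit word size, and function names are illustrative assumptions, not the published ntHash scheme.

```python
# Illustrative rolling k-mer hash in the spirit of ntHash.
# The SEED constants are arbitrary placeholders, NOT the published
# ntHash per-base seeds.
MASK = (1 << 64) - 1
SEED = {"A": 0x3C8BFBB395C60474, "C": 0x3193C18562A02B4C,
        "G": 0x20323ED082572324, "T": 0x295549F54BE24456}

def rol(v, n):
    """Rotate a 64-bit value left by n bits."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK

def hash_kmer(kmer):
    """Hash a k-mer from scratch (used only for the first window)."""
    k = len(kmer)
    h = 0
    for i, base in enumerate(kmer):
        h ^= rol(SEED[base], k - 1 - i)
    return h

def roll(h, k, out_base, in_base):
    """Slide the window one base: drop out_base, add in_base, in O(1)."""
    return rol(h, 1) ^ rol(SEED[out_base], k) ^ SEED[in_base]

def all_hashes(seq, k):
    """Yield the hash of every k-mer of seq via rolling updates."""
    h = hash_kmer(seq[:k])
    yield h
    for i in range(k, len(seq)):
        h = roll(h, k, seq[i - k], seq[i])
        yield h
```

Because the update is an XOR of rotated seeds, each rolled value is algebraically identical to hashing that window from scratch, which is what makes a single pass over a read stream cheap; ntCard-style cardinality estimation can then operate on such hash values rather than on the k-mers themselves.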
Year | DOI | Venue
---|---|---
2018 | 10.1109/BDCAT.2018.00014 | 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT)

Keywords | Field | DocType
---|---|---
high throughput bioinformatics, sequencing reads, hashing, cardinality estimation | Data science, Genome, Data mining, Shotgun sequencing, Computer science, Genomics, Software, Hash function, DNA sequencing, Big data, Sequence assembly | Conference

ISBN | Citations | PageRank
---|---|---
978-1-5386-5503-0 | 0 | 0.34

References | Authors
---|---
0 | 3
Name | Order | Citations | PageRank
---|---|---|---
Inanc Birol | 1 | 78 | 9.34 |
Hamid Mohamadi | 2 | 66 | 5.37 |
Justin Chu | 3 | 11 | 4.70 |