Abstract | ||
---|---|---|
Motivation: k-mer-based algorithms have become increasingly popular in the processing of high-throughput sequencing data. These algorithms span the gamut of the analysis pipeline from k-mer counting (e.g. for estimating assembly parameters), to error correction, genome and transcriptome assembly, and even transcript quantification. Yet, these tasks often use very different k-mer representations and data structures. In this article, we show how to build a k-mer-counting and multiset-representation system using the counting quotient filter, a feature-rich approximate membership query data structure. We introduce the k-mer-counting/querying system Squeakr (Simple Quotient filter-based Exact and Approximate Kmer Representation), which is based on the counting quotient filter. This off-the-shelf data structure turns out to be an efficient (approximate or exact) representation for sets or multisets of k-mers. Results: Squeakr takes 2x-4.3x less time than the state-of-the-art to count and perform a random-point-query workload. Squeakr is memory-efficient, consuming 1.5x-4.3x less memory than the state-of-the-art. It offers competitive counting performance. In fact, it is faster for larger k-mers, and answers point queries (i.e. queries for the abundance of a particular k-mer) over an order-of-magnitude faster than other systems. The Squeakr representation of the k-mer multiset turns out to be immediately useful for downstream processing (e.g. de Bruijn graph traversal) because it supports fast queries and dynamic k-mer insertion, deletion, and modification. |
Year | DOI | Venue |
---|---|---|
2018 | 10.1093/bioinformatics/btx636 | BIOINFORMATICS |
Field | DocType | Volume |
Data structure,Gamut,Tree traversal,Multiset,Computer science,Algorithm,Quotient filter,Error detection and correction,Theoretical computer science,De Bruijn graph,Bioinformatics,k-mer | Journal | 34 |
Issue | ISSN | Citations |
4 | 1367-4803 | 4 |
PageRank | References | Authors |
0.43 | 17 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Prashant Pandey | 1 | 54 | 6.17 |
Michael A. Bender | 2 | 2144 | 138.24 |
Rob Johnson | 3 | 562 | 39.43 |
Rob Patro | 4 | 111 | 12.98 |