Title
Squeakr: an exact and approximate k-mer counting system.
Abstract
Motivation: k-mer-based algorithms have become increasingly popular in the processing of high-throughput sequencing data. These algorithms span the gamut of the analysis pipeline from k-mer counting (e.g. for estimating assembly parameters), to error correction, genome and transcriptome assembly, and even transcript quantification. Yet, these tasks often use very different k-mer representations and data structures. In this article, we show how to build a k-mer-counting and multiset-representation system using the counting quotient filter, a feature-rich approximate membership query data structure. We introduce the k-mer-counting/querying system Squeakr (Simple Quotient filter-based Exact and Approximate Kmer Representation), which is based on the counting quotient filter. This off-the-shelf data structure turns out to be an efficient (approximate or exact) representation for sets or multisets of k-mers. Results: Squeakr takes 2x-4.3x less time than the state-of-the-art to count and perform a random-point-query workload. Squeakr is memory-efficient, consuming 1.5x-4.3x less memory than the state-of-the-art. It offers competitive counting performance. In fact, it is faster for larger k-mers, and answers point queries (i.e. queries for the abundance of a particular k-mer) over an order-of-magnitude faster than other systems. The Squeakr representation of the k-mer multiset turns out to be immediately useful for downstream processing (e.g. de Bruijn graph traversal) because it supports fast queries and dynamic k-mer insertion, deletion, and modification.
Year
DOI
Venue
2018
10.1093/bioinformatics/btx636
BIOINFORMATICS
Field
DocType
Volume
Data structure,Gamut,Tree traversal,Multiset,Computer science,Algorithm,Quotient filter,Error detection and correction,Theoretical computer science,De Bruijn graph,Bioinformatics,k-mer
Journal
34
Issue
ISSN
Citations 
4
1367-4803
4
PageRank 
References 
Authors
0.43
17
4
Name
Order
Citations
PageRank
Prashant Pandey1546.17
Michael A. Bender22144138.24
Rob Johnson356239.43
Rob Patro411112.98