Title
Fast Metagenomic Binning via Hashing and Bayesian Clustering.
Abstract
We introduce GATTACA, a framework for fast unsupervised binning of metagenomic contigs. Like other recent approaches, GATTACA clusters contigs based on their coverage profiles across a large cohort of metagenomic samples; however, unlike previous methods, which rely on read mapping, GATTACA quickly estimates these profiles from k-mer counts stored in a compact index. This approach can yield a speedup of more than an order of magnitude while matching the accuracy of earlier methods on synthetic and real data benchmarks. It also makes it possible to index metagenomic samples (e.g., from public repositories such as the Human Microbiome Project) offline once and reuse the indices across experiments; furthermore, the small size of the sample indices allows them to be easily transferred and stored. Leveraging the MinHash technique, GATTACA also provides an efficient way to identify publicly available metagenomic data that can be incorporated into the set of reference metagenomes to further improve binning accuracy. Thus, by enabling easy indexing and reuse of publicly available metagenomic data sets, GATTACA makes accurate metagenomic analyses accessible to a much wider range of researchers.
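The two ideas named in the abstract, estimating a contig's coverage profile from k-mer counts and comparing samples with MinHash, can be illustrated with a minimal Python sketch. This is not GATTACA's implementation: the function names are hypothetical, a plain `Counter` stands in for the compact index, the median k-mer count is only one plausible coverage estimator, and the MinHash variant shown (one seeded hash per signature slot) is the textbook form rather than whatever the tool uses.

```python
import hashlib
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count k-mers over all reads of one sample.

    Assumption: a Counter replaces the compact index described in the abstract.
    """
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def contig_coverage(contig, sample_counts, k):
    """Estimate coverage in one sample as the median count of the contig's k-mers."""
    values = [sample_counts.get(km, 0) for km in kmers(contig, k)]
    return median(values) if values else 0


def coverage_profile(contig, samples, k):
    """Coverage of one contig across a list of per-sample k-mer count tables."""
    return [contig_coverage(contig, counts, k) for counts in samples]


def _seeded_hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer under a given seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(kmer_set, num_hashes=64):
    """Signature = minimum hash value under each of num_hashes seeded hashes.

    kmer_set must be non-empty.
    """
    return [min(_seeded_hash(km, seed) for km in kmer_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Toy usage: two "samples" made of a few short reads each.
sample_a = count_kmers(["ACGTACGTAC", "CGTACGTACG"], k=4)
sample_b = count_kmers(["TTTTGGGGCC"], k=4)
profile = coverage_profile("ACGTACGT", [sample_a, sample_b], k=4)

sig_a = minhash_signature(set(sample_a))
sig_b = minhash_signature(set(sample_b))
similarity = estimate_jaccard(sig_a, sig_b)
```

In this toy setting, `profile` holds one coverage estimate per sample, and contigs with similar profiles would be candidates for the same bin. The `similarity` score shows how MinHash signatures could be used to shortlist samples with related k-mer content without comparing full k-mer sets.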
Year
2018
DOI
10.1089/cmb.2017.0250
Venue
JOURNAL OF COMPUTATIONAL BIOLOGY
Keywords
Bayesian clustering, kmer counting, metagenomic binning, MinHash, minimal perfect hash functions
Field
Data mining, MinHash, Metagenomics, Hash function, Artificial intelligence, Cluster analysis, Machine learning, Mathematics, Speedup, Bayesian probability
DocType
Journal
Volume
25
Issue
7
ISSN
1066-5277
Citations
0
PageRank
0.34
References
11
Authors
4
Name | Order | Citations | PageRank
Victoria Popic | 1 | 82 | 4.28
Volodymyr Kuleshov | 2 | 104 | 10.82
Michael Snyder | 3 | 138 | 26.15
Serafim Batzoglou | 4 | 806 | 85.80