Title
Hadoop-BAM: directly manipulating next generation sequencing data in the cloud.
Abstract
Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can directly operate on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and one should avoid moving data in and out of Hadoop between analysis steps.
Year
2012
DOI
10.1093/bioinformatics/bts054
Venue
BIOINFORMATICS
Keywords
genome
Field
Genome browser, Computer science, MIT License, Software, Bioinformatics, Data access, Database, Scalability
DocType
Journal
Volume
28
Issue
6
ISSN
1367-4803
Citations
21
PageRank
1.68
References
7
Authors
6
Name | Order | Citations | PageRank
Matti Niemenmaa | 1 | 65 | 3.91
Aleksi Kallio | 2 | 85 | 5.75
André Schumacher | 3 | 71 | 7.26
Petri Klemelä | 4 | 21 | 1.68
Eija Korpelainen | 5 | 103 | 8.95
Keijo Heljanko | 6 | 751 | 47.90