Title
Lazer: Distributed Memory-Efficient Assembly Of Large-Scale Genomes
Abstract
Genome sequencing technology has witnessed tremendous progress in both throughput and cost per base pair, resulting in an explosion in data size. Consequently, typical sequence assembly tools demand a lot of processing power and memory and are unable to assemble big datasets unless run on hundreds of nodes. In this paper, we present a distributed assembler that achieves both scalability and memory efficiency by using partitioned de Bruijn graphs. By enhancing memory-to-disk swapping and reducing network communication in the cluster, we can assemble large sequences such as human genomes (452 GB) on just two nodes in 14.5 hours, and scale up to 128 nodes, where assembly takes 23 minutes. We also assemble a synthetic wheat genome with 1.1 TB of raw reads on 8 nodes in 18.5 hours and on 128 nodes in 1.25 hours.
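The abstract does not describe how the de Bruijn graph is partitioned, so the following is only a minimal sketch of the general idea, not Lazer's implementation. It assumes nodes are (k-1)-mers, that each edge is stored with its source node, and that a node is assigned to a partition by a CRC32 hash. The function names, the hash choice, and the toy reads are all hypothetical.

```python
from collections import defaultdict
import zlib


def kmers(read, k):
    """Yield every k-length substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def partition_of(node, num_partitions):
    """Map a (k-1)-mer node to a partition using a stable hash (an assumption)."""
    return zlib.crc32(node.encode("ascii")) % num_partitions


def build_partitioned_graph(reads, k, num_partitions):
    """Build one adjacency map per partition.

    Each k-mer contributes an edge from its (k-1)-mer prefix to its
    (k-1)-mer suffix; the edge is stored in the partition that owns
    the prefix, so every node lives in exactly one partition.
    """
    partitions = [defaultdict(set) for _ in range(num_partitions)]
    for read in reads:
        for kmer in kmers(read, k):
            src, dst = kmer[:-1], kmer[1:]
            partitions[partition_of(src, num_partitions)][src].add(dst)
    return partitions


# Toy input: two overlapping reads, split across two partitions.
reads = ["ACGTAC", "CGTACG"]
parts = build_partitioned_graph(reads, k=4, num_partitions=2)
for index, graph in enumerate(parts):
    print(index, dict(graph))
```

In a sketch like this, each partition could be held by a different node or spilled to disk independently, which is the kind of property that keeps per-node memory bounded; how Lazer actually does this is detailed in the paper itself.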
Year
2016
Venue
2016 IEEE International Conference on Big Data (Big Data)
Keywords
genome assembly, big data
Field
Data structure, Data mining, Instruction set, Computer science, Distributed memory, Memory management, Throughput, De Bruijn sequence, Sequence assembly, Scalability
DocType
Conference
Citations
0
PageRank
0.34
References
9
Authors
5
Name               Order  Citations  PageRank
Sayan Goswami      1      0          0.34
Arghya Kusum Das   2      6          2.18
Richard Platania   3      10         3.26
Kisung Lee         4      342        27.05
Seung-Jong Park    5      319        31.12