Abstract |
---|
Spurred by a widening gap between hardware accelerators and traditional processors, numerous bioinformatics applications have harnessed the computing power of GPUs and reported substantial performance improvements compared to their CPU-based counterparts. However, most of these GPU-based applications only focus on the read alignment problem, while the field of de novo assembly still relies mostly on CPU-based solutions. This is primarily due to the nature of the assembly workload, which is not only compute-intensive but also extremely data-intensive. Such workloads require large memories, making it difficult to adapt them to use GPUs with their limited memory capacities. To the best of our knowledge, no GPU-based assembler reported in the recent literature has attempted to assemble datasets larger than a few tens of gigabytes, whereas real sequence datasets are often several hundred gigabytes in size. In this paper, we present a new GPU-accelerated genome assembler called LaSAGNA, which can assemble large-scale sequence datasets using a single GPU by building string graphs from approximate all-pair overlaps. LaSAGNA can also run on multiple GPUs across multiple compute nodes connected by a high-speed network to expedite the assembly process. To utilize the limited memory on GPUs efficiently, LaSAGNA uses a semi-streaming approach that makes at most a logarithmic number of passes over the input data based on the available memory. Moreover, we propose a two-level streaming model, from disk to host memory and from host memory to device memory, to minimize disk I/O. Using LaSAGNA, we can assemble a 400 GB human genome dataset on a single NVIDIA K40 GPU in 17 hours, and in a little over 5 hours on an 8-node cluster of NVIDIA K20s. |
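
The two-level streaming idea mentioned in the abstract (disk to host memory, then host memory to device memory) can be illustrated with a minimal sketch. This is not LaSAGNA's implementation: the function names `host_chunks`, `device_batches`, and `two_level_stream` are hypothetical, the buffer sizes are arbitrary, and the host-to-device transfer is stood in for by simply yielding a byte batch. The sketch only shows that each byte is read from disk once while the device sees small batches.

```python
import io
from typing import BinaryIO, Iterator


def host_chunks(stream: BinaryIO, host_bytes: int) -> Iterator[bytes]:
    """Level 1 (assumed): read the input sequentially in host-sized chunks."""
    while True:
        chunk = stream.read(host_bytes)
        if not chunk:
            return
        yield chunk


def device_batches(chunk: bytes, device_bytes: int) -> Iterator[bytes]:
    """Level 2 (assumed): split one host-resident chunk into device-sized batches."""
    view = memoryview(chunk)
    for start in range(0, len(view), device_bytes):
        # In a real system this slice would be copied to GPU memory.
        yield bytes(view[start:start + device_bytes])


def two_level_stream(stream: BinaryIO, host_bytes: int,
                     device_bytes: int) -> Iterator[bytes]:
    """Yield device-sized batches while touching each input byte on disk once."""
    for chunk in host_chunks(stream, host_bytes):
        yield from device_batches(chunk, device_bytes)


def demo() -> list:
    """Stream a tiny in-memory 'file' with hypothetical buffer sizes."""
    data = io.BytesIO(b"ACGT" * 10)  # 40 bytes standing in for a read file
    return list(two_level_stream(data, host_bytes=16, device_bytes=6))
```

Calling `demo()` returns batches that never exceed the assumed device capacity and that concatenate back to the original input. Keeping the host buffer much larger than the device buffer means many device transfers are served per disk read.
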
Year | DOI | Venue |
---|---|---|
2018 | 10.1109/IPDPS.2018.00091 | 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS) |
Keywords | Field | DocType |
Genomics, Computational biology, Memory management, Big data, Parallel processing | Graph, Computer science, Gigabyte, Parallel computing, Memory management, Sequence assembly | Conference |
ISSN | ISBN | Citations |
1530-2075 | 978-1-5386-4369-3 | 0 |
PageRank | References | Authors |
0.34 | 18 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Sayan Goswami | 1 | 2 | 1.39 |
Kisung Lee | 2 | 342 | 27.05 |
Shayan Shams | 3 | 8 | 2.51 |
Seung-Jong Park | 4 | 319 | 31.12 |