Title
cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on a GPU
Abstract
BLAST, short for Basic Local Alignment Search Tool, is a fundamental algorithm in the life sciences that compares biological sequences. However, with the advent of next-generation sequencing (NGS) and increase in sequence read-lengths, whether at the outset or downstream from NGS, the exponential growth of sequence databases is arguably outstripping our ability to analyze the data. Though several recent studies have utilized the graphics processing unit (GPU) to speedup the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies used coarse-grained parallel approaches, where one sequence alignment is mapped to only one thread. Moreover, due to the irregular memory access patterns in BLASTP, there remain significant challenges to map the most time-consuming phases (i.e., hit detection and ungapped extension) to the GPU using a fine-grained multithreaded approach. To address the above issues, we propose cuBLASTP, an efficient fine-grained BLASTP implementation for the GPU using CUDA. Our cuBLASTP realization encompasses many research contributions, including (1) memory-access reordering to reorder hits from column-major order to diagonal-major order, (2) position-based indexing to map a hit with a packed data structure to a bin, (3) aggressive hit filtering to eliminate hits beyond the threshold distance along the diagonal, (4) diagonal-based parallelism and hit-based parallelism for ungapped extension to extend sequences with different lengths in databases, and (5) hierarchical buffering to reduce memory-access overhead for the core data structures. The experimental results show that on a NVIDIA Kepler GPU, cuBLASTP delivers up to a 5.0-fold speedup over sequential FSA-BLAST and a 3.7-fold speedup over multithreaded NCBI-BLAST for the overall program execution. In addition, compared with GPU-BLASTP (the fastest GPU implementation of BLASTP to date), cuBLASTP achieves up to a 2.8-fold speedup for the kernel execution on the GPU and a 1.8-fold speedup for the overall program execution.
Year
DOI
Venue
2014
10.1109/IPDPS.2014.36
IPDPS
Keywords
Field
DocType
life sciences,graphics processing unit,ungapped extension,sequence read-lengths,core data structures,gpu,next-generation sequencing,sequence databases,cuda,sequence alignment,diagonal-major order,hit detection,protein sequence search,storage management,memory-access overhead reduction,parallel architectures,data structures,graphics processing units,data analysis,proteins,multi-threading,program execution,blast algorithm,threshold distance,coarse-grained parallel approach,packed data structure,nvidia kepler gpu,memory-access reordering,ngs,molecular biophysics,ungapped sequence extension,fine-grained blastp implementation,hit-based parallelism,blastp, gpu, bioinformatics, life sciences, next-generation sequencing, hit detection, ungapped extension,aggressive hit filtering,position-based indexing,fine-grained parallelization,biological sequences,irregular memory access patterns,blastp,fine-grained multithreaded approach,bioinformatics,hierarchical buffering,column-major order,basic local alignment search tool,diagonal-based parallelism,cublastp,column major order,multi threading,instruction sets,algorithm design and analysis,acceleration,next generation sequencing,databases
Multithreading,Data structure,Row-major order,Computer science,Instruction set,CUDA,Parallel computing,Search engine indexing,Graphics processing unit,Speedup,Distributed computing
Conference
ISSN
Citations 
PageRank 
1530-2075
5
0.46
References 
Authors
11
4
Name
Order
Citations
PageRank
Jing Zhang1706.53
Hao Wang26010.44
Heshan Lin337523.13
Wu-chun Feng42812232.50