Title
SparkLeBLAST: Scalable Parallelization of BLAST Sequence Alignment Using Spark
Abstract
The exponential growth of genomic data presents challenges in analyzing and computing on such biological data at scale. While NCBI’s BLAST is a widely used pairwise sequence alignment tool, it does not scale to large datasets that are hundreds of gigabytes (GB) in size. To address this scalability problem, mpiBLAST emerged and became widely used, enabling scaling to 65,536 processes. However, mpiBLAST suffers from being tightly coupled with a specific implementation of BLAST, rendering it difficult to upgrade with the ever-evolving NCBI BLAST code. To address this shortcoming, recent parallel BLAST tools, such as SparkBLAST, consist of wrappers that are decoupled from the BLAST code but suffer from poor scalability with large sequence databases. Thus, there does not exist any parallel BLAST tool that can simultaneously address the issues of performance, scalability, programmability, and upgradability. To address this void, we propose SparkLeBLAST, a parallel BLAST tool that leverages our performance modeling and the Spark framework to deliver the performance and scalability of mpiBLAST and the ease of programming and upgradability of SparkBLAST, respectively. Ultimately, SparkLeBLAST delivers a 10x speedup relative to the state-of-the-art SparkBLAST and nearly a 2x speedup relative to the latest version of mpiBLAST.
Year
2020
DOI
10.1109/CCGrid49817.2020.00-39
Venue
2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
Keywords
scalable genome analysis, BLAST, Spark, distributed computing, parallel computing, bioinformatics, sequence alignment, mpiBLAST, SparkBLAST
DocType
Conference
ISBN
978-1-7281-6095-5
Citations
1
PageRank
0.36
References
0
Authors (2)
Name            Order  Citations  PageRank
Karim Youssef   1      1          0.36
Wu-chun Feng    2      2812       232.50