SparkGA: A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale - Citegraph

Paper Info

Title
SparkGA: A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale

Abstract
In recent years, the cost of NGS (Next Generation Sequencing) technology has fallen dramatically, making it a viable method for diagnosing genetic diseases. The large amount of data generated by NGS technology, usually on the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The GATK best practices pipeline from the Broad Institute is one of the most popular computational pipelines for DNA analysis. However, many components of the GATK pipeline are not easily parallelized. In this paper, we present a parallel implementation of a DNA analysis pipeline based on the Apache Spark big data framework. This implementation is highly scalable and parallelizes computation by exploiting data-level parallelism together with load balancing techniques. To reduce the analysis cost, the framework can run on nodes with as little as 16GB of memory. For whole genome sequencing experiments, we show that the runtime can be reduced to about 1.5 hours on a 20-node cluster with an accuracy of up to 99.9981%. Our solution is about 71% faster than other state-of-the-art solutions while also being more accurate. The source code of the software described in this paper is publicly available at https://github.com/HamidMushtaq/SparkGA1.git.
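
The abstract names data-level parallelism and load balancing but does not describe the mechanism. As a rough illustration only, the sketch below splits work into regions and assigns them to workers with a greedy "largest first" heuristic. This is a generic technique, not necessarily what SparkGA does; the function name `balance_regions` and the region sizes are hypothetical, and no Spark API is used.

```python
import heapq


def balance_regions(region_sizes, num_workers):
    """Assign regions to workers so that per-worker load stays roughly even.

    Generic greedy heuristic (largest region first, always to the least
    loaded worker). An assumption for illustration, not the paper's method.
    """
    # Min-heap of (current load, worker index); the least loaded worker pops first.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}

    # Place the largest regions first so the small ones can even out the loads.
    ordered = sorted(region_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + size, worker))
    return assignment


if __name__ == "__main__":
    # Hypothetical region sizes in arbitrary units (not real genome data).
    sizes = {"regionA": 8, "regionB": 7, "regionC": 6, "regionD": 5, "regionE": 4}
    for worker, regions in balance_regions(sizes, 2).items():
        print(worker, regions, sum(sizes[r] for r in regions))
```

In a Spark setting, each worker's list would correspond to one partition of the input data; how the paper actually sizes and distributes its chunks is not stated in this record.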

Year	DOI	Venue
2017	10.1145/3107411.3107438	BCB
Field	DocType	ISBN
Pipeline transport, Spark (mathematics), Source code, Computer science, Load balancing (computing), Gigabyte, Software, Bioinformatics, Big data, Scalability	Conference	978-1-4503-4722-8
Citations	PageRank	References
6	0.63	8
Authors
6

Authors (6 rows)

Name	Order	Citations	PageRank
Hamid Mushtaq	1	40	6.34
Frank Liu	2	526	45.14
Carlos H. A. Costa	3	20	3.26
Gang Liu	4	93	29.33
Peter Hofstee	5	14	2.80
Zaid Al-Ars	6	560	78.62
