Title
Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study.
Abstract
This paper presents a joint effort between a group of computer scientists and bioinformaticians to take an important step towards a general big data platform for genome analysis pipelines. The key goals of this study are to develop a thorough understanding of the strengths and limitations of big data technology for genomic data analysis, and to identify the key questions that the research community could address to realize the vision of personalized genomic medicine. Our platform, called Gesall, is based on the new "Wrapper Technology" that supports existing genomic data analysis programs in their native forms, without having to rewrite them. To do so, our system provides several layers of software, including a new Genome Data Parallel Toolkit (GDPT), which can be used to "wrap" existing data analysis programs. This platform offers a concrete context for evaluating big data technology for genomics: we report on super-linear speedup and sublinear speedup for various tasks, as well as the reasons why a parallel program could produce different results from those of a serial program. These results lead to key research questions that require a synergy between genomics scientists and computer scientists to find solutions.
Year
2017
DOI
10.1145/3035918.3064048
Venue
SIGMOD Conference
Field
Data science, Genome, Data mining, Computer science, Massively parallel, Genomics, Software, Data management, Big data, Benchmarking, Database, Speedup
DocType
Conference
Citations
5
PageRank
0.45
References
13
Authors
7
Name | Order | Citations | PageRank
Abhishek Roy | 1 | 451 | 32.21
Yanlei Diao | 2 | 2234 | 108.95
Uday S Evani | 3 | 34 | 2.42
Avinash Abhyankar | 4 | 5 | 0.45
Clinton Howarth | 5 | 5 | 0.45
Rémi Le Priol | 6 | 5 | 0.45
Toby Bloom | 7 | 99 | 24.74