Title
PAGE: A Framework for Easy PArallelization of GEnomic Applications
Abstract
With the availability of high-throughput, low-cost sequencing technologies, an increasing amount of genetic data is becoming available to researchers. Analysis of such data clearly has the potential to yield significant new scientific and medical advances; however, handling massive datasets requires exploiting parallelism and using computing resources effectively. Frameworks that help researchers develop parallel applications without dealing with the low-level details of parallel coding are therefore very important for advances in genetic research. In this study, we develop a middleware, PAGE, which supports 'map reduce-like' processing but differs significantly from a system like Hadoop in order to be useful and effective for parallelizing the analysis of genomic data. In particular, it can work with map functions written in any language, allowing existing serial tools (even those for which only an executable is available) to be used as map functions. It can thus greatly simplify parallel application development in scenarios that involve complex data formats and/or nuanced serial algorithms, as is often the case for genomic data. It allows parallelization through by-locus or by-chromosome partitioning, and provides different scheduling schemes and execution models to match the nature of algorithms common in genetic research. We have evaluated the middleware using four popular genomic applications (VarScan, Unified Genotyper, Realigner Target Creator, and Indel Realigner) and compared the achieved performance against two popular frameworks (Hadoop and GATK). We show that our middleware outperforms GATK and Hadoop and achieves high parallel efficiency and scalability.
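The idea of by-chromosome partitioning with an unmodified serial tool as the map function can be illustrated with a minimal Python sketch. This is not PAGE's API: the function names, the tab-separated record layout, and the `COUNT_LINES` stand-in tool are all assumptions made for illustration, and the reduce step here is a plain concatenation.

```python
import subprocess
import sys
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_chromosome(records):
    """Group (chromosome, position, payload) records by chromosome."""
    parts = defaultdict(list)
    for chrom, pos, payload in records:
        parts[chrom].append((pos, payload))
    for recs in parts.values():
        recs.sort()  # keep each partition ordered by position
    return dict(parts)


def run_external_map(command, chrom, recs):
    """Feed one partition to an external program on stdin; return its stdout."""
    text = "".join(f"{chrom}\t{pos}\t{payload}\n" for pos, payload in recs)
    done = subprocess.run(
        command, input=text, capture_output=True, text=True, check=True
    )
    return done.stdout


def map_reduce(records, command, workers=4):
    """Run `command` once per chromosome in parallel, then merge the outputs."""
    parts = partition_by_chromosome(records)
    chroms = sorted(parts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(
            pool.map(lambda c: run_external_map(command, c, parts[c]), chroms)
        )
    # Reduce step (assumed): concatenate per-chromosome outputs in order.
    return "".join(outputs)


# Hypothetical stand-in for a serial tool available only as an executable:
# it prints the chromosome name and the number of records it received.
COUNT_LINES = [
    sys.executable,
    "-c",
    "import sys; lines = sys.stdin.readlines(); "
    "print(lines[0].split()[0], len(lines))",
]

if __name__ == "__main__":
    demo = [("chr2", 5, "A"), ("chr1", 10, "C"), ("chr1", 3, "G")]
    print(map_reduce(demo, COUNT_LINES), end="")
```

Threads suffice in this sketch because each worker only waits on a child process; a by-locus variant would split each chromosome's records into position ranges before dispatching them.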
Year
2014
DOI
10.1109/IPDPS.2014.19
Venue
Phoenix, AZ
Keywords
biology computing, genomics, middleware, GATK, Hadoop, PAGE, VarScan, by-chromosome partitioning, by-locus partitioning, complex data formats, execution models, genetic data, genomic data, indel realigner, low-cost sequencing technologies, map functions, mapreduce-like processing, middleware system, nuanced serial algorithms, parallel efficiency, parallel scalability, parallelization of genomic applications, realigner target creator, scheduling schemes, serial tools, unified genotyper
Field
Middleware, Computer science, Scheduling (computing), Parallel computing, As is, Complex data type, Coding (social sciences), Exploit, Executable, Scalability, Distributed computing
DocType
Conference
ISSN
1530-2075
Citations
3
PageRank
0.39
References
14
Authors
2
Name             Order   Citations   PageRank
Mucahid Kutlu    1       38          14.16
Gagan Agrawal    2       2058        209.59