Title
RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Data
Abstract
As high-throughput, low-cost sequencing technologies produce massive volumes of genomic data, new solutions for handling data-intensive applications on parallel platforms are urgently required. In particular, the nature of the processing leads to both load-balancing and I/O-contention challenges. In this paper, we present a novel middleware system, RE-PAGE, which allows applications that process genomic data to be parallelized through a simple, high-level API. To address load balancing and I/O contention, the middleware features: 1) use of domain-specific information in forming data chunks (which can be of non-uniform size), 2) intelligent replication and placement of each chunk on a small number of nodes, and 3) scheduling schemes that achieve load balance when data movement costs outweigh processing costs and the chunks are of non-uniform size. We have evaluated our framework using three genomic applications: VarScan, Unified Genotyper, and Coverage Analyzer. We show that our approach performs better than conventional MapReduce scheduling approaches and systems that access data from a centralized store. We also compare against two popular frameworks, Hadoop and GATK, and show that our middleware outperforms both, achieving high parallel efficiency and scalability.
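The abstract does not give RE-PAGE's actual algorithms or API, so the following is only a minimal sketch of the general idea behind features 2) and 3): non-uniform chunks are replicated on a small number of nodes, and each chunk is then processed on whichever replica holder currently has the least work, so that no data needs to move. The names Chunk, place_replicas, and schedule, and the greedy largest-first heuristic, are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: int
    size: int  # e.g. bytes or number of reads; chunks may differ in size
    replicas: list = field(default_factory=list)  # nodes holding a copy


def place_replicas(chunks, nodes, copies=2):
    """Hypothetical placement: put each chunk on the `copies` nodes
    that currently store the least data (largest chunks first)."""
    stored = {n: 0 for n in nodes}
    for c in sorted(chunks, key=lambda c: c.size, reverse=True):
        c.replicas = sorted(nodes, key=lambda n: (stored[n], n))[:copies]
        for n in c.replicas:
            stored[n] += c.size


def schedule(chunks, nodes):
    """Hypothetical scheduler: process every chunk on one of its replica
    holders (no data movement), picking the least-loaded holder."""
    load = {n: 0 for n in nodes}
    assignment = {}
    for c in sorted(chunks, key=lambda c: c.size, reverse=True):
        target = min(c.replicas, key=lambda n: (load[n], n))
        assignment[c.chunk_id] = target
        load[target] += c.size
    return assignment, load


# Example: six chunks of non-uniform size on three nodes, two copies each.
nodes = ["n0", "n1", "n2"]
chunks = [Chunk(i, s) for i, s in enumerate([8, 5, 4, 3, 2, 2])]
place_replicas(chunks, nodes, copies=2)
assignment, load = schedule(chunks, nodes)
```

Keeping more than one copy of each chunk is what gives the scheduler a choice: with a single copy, the placement alone would fix the load on every node.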
Year: 2015
DOI: 10.1109/CLUSTER.2015.54
Venue: Cluster Computing
Keywords: Parallel Computing, Middleware Systems, Genomic Applications
Field: Middleware, Load management, Middleware (distributed applications), Load balancing (computing), Scheduling (computing), Computer science, Parallel processing, Parallel computing, Real-time computing, Processor scheduling, Distributed computing, Scalability
DocType: Conference
ISSN: 1552-5244
Citations: 0
PageRank: 0.34
References: 27
Authors: 2
Name            Order  Citations  PageRank
Mucahid Kutlu   1      38         14.16
Gagan Agrawal   2      2058       209.59