Title
Cloud technologies for bioinformatics applications
Abstract
Executing a large number of independent tasks, or tasks that perform minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience in applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. We also compare with traditional MPI in one case. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package to identify HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar (EST, Alu) or two different (PhyloD) databases. Two of the applications have a similar initial structure: applying the CAP3 program (for EST) to a collection of files containing gene reads, and performing gene sequence alignments using the Smith-Waterman (local) algorithm to find dissimilarities between genes and construct a dissimilarity matrix. In the first example, each parallel task executes the CAP3 program on an input data file independently of the others, and no "reduction" or "aggregation" is necessary at the end of the computation, whereas in the Alu case a global aggregation is necessary at the end of the independent computations to produce the resulting dissimilarity matrix, which is fed into a traditional high-performance MPI application. PhyloD has a similar initial stage, which is followed by an aggregation step. The simple structure of the data/compute flow and the minimal inter-task communication requirements of these "pleasingly parallel" applications enable them to be implemented using a wide variety of technologies. The support for handling large data sets, the concept of moving computation to data, and the better quality of service provided by the cloud technologies simplify the implementation of some problems over traditional systems.
We find that different programming constructs available in cloud technologies such as independent "maps" in MapReduce, "homomorphic Apply" in Dryad, and the "worker roles" in Azure are all suitable for implementing applications of the type we examine. In the Alu case, we show that Dryad can be programmed to prepare data for use in later parallel MPI/threaded applications used for further analysis.
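The two task structures contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the paper's Dryad, Azure, or MPI code. The names `map_only`, `pairwise_matrix`, and `toy_dissimilarity` are hypothetical, the distance is a simple mismatch fraction standing in for a Smith-Waterman-based measure, and a local thread pool stands in for a cluster or cloud scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_dissimilarity(a: str, b: str) -> float:
    """Hypothetical stand-in for an alignment-based distance (NOT Smith-Waterman)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    # Mismatched positions, with any length difference counted as mismatches.
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


def map_only(inputs, task, workers=4):
    """CAP3-like pattern: independent tasks, no reduction or aggregation."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, inputs))


def pairwise_matrix(seqs, workers=4):
    """Alu-like pattern: independent pair tasks followed by a global aggregation."""
    n = len(seqs)
    pairs = list(combinations(range(n), 2))

    def score(pair):
        i, j = pair
        return toy_dissimilarity(seqs[i], seqs[j])

    # Independent step: each pair is scored without reference to any other pair.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score, pairs))

    # Aggregation step: assemble the symmetric dissimilarity matrix.
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), d in zip(pairs, scores):
        matrix[i][j] = matrix[j][i] = d
    return matrix
```

In the first function each result can be written out as soon as its task finishes, while the second needs every pair score before the matrix is complete. That difference is what the abstract describes between the CAP3 and Alu cases.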
Year: 2009
DOI: 10.1145/1646468.1646474
Venue: Supercomputing Conference
Keywords: mpi, multicore, independent task, cloud technology, dryad, bioinformatics application, phylod statistical package, azure cloud, different databases, independent data, different structure, apache hadoop mapreduce implementation, cloud, pairwise alu gene alignment, bioinformatics, data parallel step, quality of service, expressed sequence tag, sequence alignment
Field: Dryad (programming), Pairwise comparison, Computer science, Parallel computing, Bioinformatics, Multi-core processor, Computer cluster, Operating system, Sequence assembly, Distributed computing, Cloud computing
DocType: Conference
Citations: 62
PageRank: 3.94
References: 14
Authors: 7
Name                 Order  Citations  PageRank
Xiaohong Qiu         1      151        16.30
Jaliya Ekanayake     2      1040       60.58
Scott Beason         3      78         6.04
Thilina Gunarathne   4      744        38.87
Geoffrey Fox         5      4070       575.38
Roger S. Barga       6      541        32.84
Dennis Gannon        7      2514       330.26