Cloud technologies for bioinformatics applications - Citegraph

Paper Info

Title
Cloud technologies for bioinformatics applications

Abstract
Executing a large number of independent tasks, or tasks that perform minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience in applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. We also compare with traditional MPI in one case. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package for identifying HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar (EST, Alu) or two different databases (PhyloD). Applying the CAP3 program (for EST) to a collection of files containing gene reads, and performing gene sequence alignments with the Smith-Waterman (local) algorithm to find dissimilarities between genes and construct a dissimilarity matrix, have a similar initial structure. In the first example, each parallel task executes the CAP3 program on an input data file independently of the others, and no "reduction" or "aggregation" is necessary at the end of the computation, whereas in the Alu case a global aggregation is necessary at the end of the independent computations to produce the resulting dissimilarity matrix, which is fed into traditional high-performance MPI. PhyloD has a similar initial stage, which is followed by an aggregation step. The simple structure of the data/compute flow and the minimal inter-task communication requirements of these "pleasingly parallel" applications enable them to be implemented using a wide variety of technologies. The support for handling large data sets, the concept of moving computation to data, and the better quality of service provided by the cloud technologies simplify the implementation of some problems over traditional systems.
We find that different programming constructs available in cloud technologies such as independent "maps" in MapReduce, "homomorphic Apply" in Dryad, and the "worker roles" in Azure are all suitable for implementing applications of the type we examine. In the Alu case, we show that Dryad can be programmed to prepare data for use in later parallel MPI/threaded applications used for further analysis.
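The Alu pattern described in the abstract (independent pairwise alignments followed by a global aggregation into a dissimilarity matrix) can be illustrated with a minimal Python sketch. This is not the authors' Dryad, Hadoop, or Azure code: a local thread pool stands in for cluster tasks, and the scoring parameters, the normalisation in `dissimilarity`, and all function names are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Assumed scoring scheme (not taken from the paper).
MATCH, MISMATCH, GAP = 2, -1, -1


def sw_score(a, b):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (MATCH if ca == cb else MISMATCH)
            score = max(0, diag, prev[j] + GAP, cur[j - 1] + GAP)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def dissimilarity(pair):
    """Independent "map" task: one pair of sequences in, one distance out."""
    (i, a), (j, b) = pair
    # Hypothetical normalisation: compare against a perfect match of the
    # shorter sequence, so identical sequences get distance 0.
    max_score = MATCH * min(len(a), len(b))
    d = 1.0 - sw_score(a, b) / max_score if max_score else 1.0
    return i, j, d


def dissimilarity_matrix(seqs, workers=4):
    """Run all pairwise tasks in parallel, then aggregate into one matrix."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    pairs = combinations(enumerate(seqs), 2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Aggregation step: collect every task's result into the matrix.
        for i, j, d in pool.map(dissimilarity, pairs):
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    for row in dissimilarity_matrix(["ACGT", "ACGT", "TTTT"]):
        print(row)
```

The CAP3 case corresponds to the same map step without the aggregation loop: each task would process one input file and write its own output, with nothing to combine afterwards.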

Year	DOI	Venue
2009	10.1145/1646468.1646474	Supercomputing Conference
Keywords	Field	DocType
mpi,multicore,independent task,cloud technology,dryad,bioinformatics application,phylod statistical package,azure cloud,different databases,independent data,different structure,apache hadoop mapreduce implementation,cloud,pairwise alu gene alignment,bioinformatics,data parallel step,quality of service,expressed sequence tag,sequence alignment	Dryad (programming),Pairwise comparison,Computer science,Parallel computing,Bioinformatics,Multi-core processor,Computer cluster,Operating system,Sequence assembly,Distributed computing,Cloud computing	Conference
Citations	PageRank	References	Authors
62	3.94	14	7

Authors (7 rows)

Name	Order	Citations	PageRank
Xiaohong Qiu	1	151	16.30
Jaliya Ekanayake	2	1040	60.58
Scott Beason	3	78	6.04
Thilina Gunarathne	4	744	38.87
Geoffrey Fox	5	4070	575.38
Roger S. Barga	6	541	32.84
Dennis Gannon	7	2514	330.26
