Title
A Scalable Parallel Approach for Peptide Identification from Large-Scale Mass Spectrometry Data
Abstract
Identifying peptides, which are short polymeric chains of amino acid residues in a protein sequence, is of fundamental importance in systems biology research. The most popular approach to identifying peptides is database search. In this approach, an experimental spectrum ("query") generated from fragments of a target peptide using mass spectrometry is computationally compared with a database of already known protein sequences. The goal is to detect database peptides that are most likely to have generated the target peptide. The exponential growth rates and overwhelming sizes of biomolecular databases make this an ideal application to benefit from parallel computing. However, the present generation of software tools is not expected to scale to the magnitudes and complexities of data that will be generated in the next few years. This is because they are all either serial algorithms or parallel strategies that have been designed over inherently serial methods, thereby incurring high space and time requirements. In this paper, we present an efficient parallel approach for peptide identification through database search. Three key factors distinguish our approach from existing solutions: (i) (space) Given p processors and a database with N residues, we provide the first space-optimal algorithm (O(N/p)) under the distributed-memory machine model; (ii) (time) Our algorithm uses a combination of parallel techniques, such as one-sided communication and masking of communication with computation, to ensure that the overhead introduced by parallelism is minimal; and (iii) (quality) The run-time savings achieved using parallel processing have allowed us to incorporate highly accurate statistical models that have previously been demonstrated to ensure high-quality prediction, albeit on smaller-scale data. We present the design and evaluation of two different algorithms to implement our approach. Experimental results using 2.65 million microbial proteins show linear scaling up to 128 processors of a Linux commodity cluster, with parallel efficiency of approximately 50%. We expect that this new approach will be critical to meeting the data-intensive and qualitative demands stemming from this important application domain.
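The space bound and the efficiency figure quoted in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `block_range` and `implied_speedup` and the database size used in the demo are hypothetical, chosen only to show what an O(N/p) block distribution of residues looks like and what speedup a 50% parallel efficiency on 128 processors would imply (using the standard definition, efficiency = speedup / p).

```python
from typing import Tuple


def block_range(n_residues: int, n_procs: int, rank: int) -> Tuple[int, int]:
    """Return the half-open [start, end) slice of residues owned by `rank`.

    Each rank holds at most ceil(N/p) residues, i.e. O(N/p) space.
    (Hypothetical helper; the paper's actual partitioning may differ.)
    """
    base, extra = divmod(n_residues, n_procs)
    # The first `extra` ranks each take one additional residue.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def implied_speedup(efficiency: float, n_procs: int) -> float:
    """Speedup implied by a parallel efficiency, since efficiency = speedup / p."""
    return efficiency * n_procs


if __name__ == "__main__":
    n = 1_000_000  # illustrative database size in residues (assumption)
    p = 128        # processor count reported in the abstract
    sizes = [e - s for s, e in (block_range(n, p, r) for r in range(p))]
    print("largest per-rank share:", max(sizes))
    print("implied speedup at 50% efficiency:", implied_speedup(0.5, p))
```

Under these assumptions, no rank stores more than ceil(N/p) residues, and an efficiency of about 50% on 128 processors corresponds to a speedup of roughly 64 over the single-processor run.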
Year
2009
DOI
10.1109/ICPPW.2009.41
Venue
ICPP Workshops
Keywords
biology computing, mass spectroscopy, parallel databases, parallel programming, biomolecular databases, large-scale mass spectrometry data, parallel computing, peptide database search, peptide identification, quality factor, scalable parallel approach, space factor, systems biology research, time factor, mass spectrometry, parallel peptide identification
Field
Linear scale, Computer science, Database search engine, Parallel computing, Distributed memory, Software, Statistical model, Application domain, Scalability, Computation
DocType
Conference
ISSN
1530-2016
E-ISBN
978-0-7695-3803-7
ISBN
978-0-7695-3803-7
Citations
2
PageRank
0.65
References
5
Authors
4
Name                     Order  Citations  PageRank
Gaurav Ramesh Kulkarni   1      2          0.65
Kalyanaraman, Ananth     2      221        31.95
William R. Cannon        3      69         10.68
Douglas J. Baxter        4      22         4.98