A Scalable Parallel Approach for Peptide Identification from Large-Scale Mass Spectrometry Data

Paper Info

Title
A Scalable Parallel Approach for Peptide Identification from Large-Scale Mass Spectrometry Data

Abstract
Identifying peptides, which are short polymeric chains of amino acid residues in a protein sequence, is of fundamental importance in systems biology research. The most popular approach to identifying peptides is database search. In this approach, an experimental spectrum ("query") generated from fragments of a target peptide using mass spectrometry is computationally compared with a database of already known protein sequences. The goal is to detect database peptides that are most likely to have generated the target peptide. The exponential growth rates and overwhelming sizes of biomolecular databases make this an ideal application to benefit from parallel computing. However, the present generation of software tools is not expected to scale to the magnitudes and complexities of data that will be generated in the next few years. This is because they are all either serial algorithms or parallel strategies that have been designed over inherently serial methods, and thus have high space and time requirements. In this paper, we present an efficient parallel approach for peptide identification through database search. Three key factors distinguish our approach from existing solutions: (i) (space) Given p processors and a database with N residues, we provide the first space-optimal algorithm (O(N/p)) under the distributed-memory machine model; (ii) (time) Our algorithm uses a combination of parallel techniques, such as one-sided communication and masking of communication with computation, to ensure that the overhead introduced by parallelism is minimal; and (iii) (quality) The run-time savings achieved using parallel processing have allowed us to incorporate highly accurate statistical models that have previously been demonstrated to ensure high-quality prediction, albeit on smaller-scale data. We present the design and evaluation of two different algorithms to implement our approach.
Experimental results using 2.65 million microbial proteins show linear scaling up to 128 processors of a Linux commodity cluster, with parallel efficiency at ~50%. We expect that this new approach will be critical to meet the data-intensive and qualitative demands stemming from this important application domain.
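
The two quantitative claims in the abstract can be made concrete with a minimal sketch. It assumes a simple contiguous block partition of the N database residues across p processors (the paper's actual distribution scheme is not described on this page) and the standard definition of parallel efficiency, E = S / p. The function names `block_bounds` and `implied_speedup` are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def block_bounds(n_residues: int, n_procs: int, rank: int) -> Tuple[int, int]:
    """Return the half-open range [start, end) of residues owned by `rank`.

    Assumption: a contiguous block partition, so each processor stores
    either floor(N/p) or ceil(N/p) residues, i.e. O(N/p) space.
    """
    base, extra = divmod(n_residues, n_procs)
    # The first `extra` ranks each take one additional residue.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def implied_speedup(efficiency: float, n_procs: int) -> float:
    """Speedup S implied by parallel efficiency E, using E = S / p."""
    return efficiency * n_procs


if __name__ == "__main__":
    # Toy example: 10 residues over 4 processors.
    for rank in range(4):
        print(rank, block_bounds(10, 4, rank))
    # Under E = S / p, ~50% efficiency on 128 processors means S of about 64.
    print(implied_speedup(0.5, 128))
```

Under these assumptions, an efficiency of about 50% on 128 processors corresponds to a speedup of roughly 64 over a single processor, and no processor holds more than ceil(N/p) residues.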

Year	DOI	Venue
2009	10.1109/ICPPW.2009.41	ICPP Workshops
Keywords	Field	DocType
biology computing,mass spectroscopy,parallel databases,parallel programming,biomolecular databases,large-scale mass spectrometry data,parallel computing,peptide database search,peptide identification,quality factor,scalable parallel approach,space factor,systems biology research,time factor,mass spectrometry,parallel peptide identification	Linear scale,Computer science,Database search engine,Parallel computing,Distributed memory,Software,Statistical model,Application domain,Scalability,Computation	Conference
ISSN	ISBN	Citations
1530-2016 (E-ISBN: 978-0-7695-3803-7)	978-0-7695-3803-7	2
PageRank	References	Authors
0.65	5	4

Authors (4 rows)

Name	Order	Citations	PageRank
Gaurav Ramesh Kulkarni	1	2	0.65
Ananth Kalyanaraman	2	221	31.95
William R. Cannon	3	69	10.68
Douglas J. Baxter	4	22	4.98
