Title
High performance computing workflow for protein functional annotation
Abstract
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the PSU (Protein Sequence Universe) is expanding exponentially. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas high compute cost limits the utility of existing automated approaches. In this study, we built an automated workflow to enable large-scale annotation of proteins into existing orthologous groups using HPC (High Performance Computing) architectures. We developed a low-complexity classification algorithm to assign proteins to bacterial COGs (Clusters of Orthologous Groups of proteins). The algorithm, based on PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool), was validated on simulated and archaeal data to ensure at least 80% specificity and sensitivity. The workflow, with highly scalable parallel applications for classification and sequence alignment, was developed on XSEDE (Extreme Science and Engineering Discovery Environment) supercomputers. Using the workflow, we have classified one million newly sequenced bacterial proteins. With the rapid expansion of the PSU, the proposed workflow will enable scientists to annotate big genome data.
Year
2013
DOI
10.1145/2484762.2484809
Venue
XSEDE
Keywords
bacterial protein, bacterial cogs, protein data, automated workflow, protein functional annotation, functional annotation, bacterial genomes, proposed workflow, big genome data, archaeal data, high performance computing workflow, data generation, cog, petascale, psu, blast
Field
Data mining, Annotation, Protein sequencing, Computer science, Protein Annotation, Petascale computing, Workflow, Test data generation, Bacterial genome size, Scalability
DocType
Conference
Citations
0
PageRank
0.34
References
15
Authors
6
Name | Order | Citations | PageRank
Larissa Stanberry | 1 | 29 | 5.14
Yuan Liu | 2 | 2 | 0.71
Bhanu Rekepalli | 3 | 13 | 3.23
Paul Giblock | 4 | 10 | 1.12
Roger Higdon | 5 | 43 | 6.96
William Broomall | 6 | 29 | 4.13