Title
3PFDB+: improved search protocol and update for the identification of representatives of protein sequence domain families.
Abstract
Protein domain families are usually classified on the basis of similarity of amino acid sequences. Selection of a single representative sequence for each family provides targets for structure determination or modeling and also enables fast sequence searches to associate new members with a family. Such a selection can be challenging because these domain families vary widely in the number of members, the average sequence length and the extent of sequence divergence within a family. We had earlier created the 3PFDB database as a repository of best representative sequences, selected from each PFAM domain family on the basis of high coverage. In this study, we have improved the database using more efficient strategies for the initial generation of sequence profiles and have implemented two independent methods, FASSM and HMMER, for identifying family members. HMMER employs a global sequence similarity search, while FASSM relies on motif identification and matching. The improved and updated database generated in this study, 3PFDB+, provides representative sequences and profiles for PFAM families, with 13 519 family representatives having more than 90% family coverage. The representative sequence is also highlighted in a two-dimensional plot that reflects the relative divergence between family members. Low coverage is mainly associated with representatives of small families with short sequences. The set of sequences not recognized by the family representative profiles highlights several potential false or weak family associations in PFAM. Partial domains and fragments dominate such cases, along with sequences that are highly diverged or different from other family members. Some of these outliers were also predicted to have different secondary structure content, which reflects different putative structures or functional roles for these domain sequences.
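The coverage criterion mentioned in the abstract can be illustrated with a minimal sketch. It assumes that, for each candidate sequence, the set of family members recognized by its profile is already known (in the study these hits would come from HMMER or FASSM searches; here they are supplied as plain sets). The function names, the tie-breaking rule and the toy identifiers are hypothetical and are not taken from 3PFDB+.

```python
from typing import Dict, Set, Tuple


def coverage(hits: Set[str], family: Set[str]) -> float:
    """Fraction of family members recognized by one candidate's profile."""
    if not family:
        return 0.0
    return len(hits & family) / len(family)


def pick_representative(hits_by_candidate: Dict[str, Set[str]],
                        family: Set[str]) -> Tuple[str, float]:
    """Return the candidate with the highest family coverage.

    Assumption: ties are broken by the smallest candidate ID so that
    the result is deterministic; the source does not specify a rule.
    """
    if not hits_by_candidate:
        raise ValueError("no candidates supplied")
    # max() keeps the first maximal element, so sorting gives the tie-break.
    best = max(sorted(hits_by_candidate),
               key=lambda c: coverage(hits_by_candidate[c], family))
    return best, coverage(hits_by_candidate[best], family)


# Toy data with hypothetical sequence IDs (not real PFAM entries).
family = {"seqA", "seqB", "seqC", "seqD"}
hits_by_candidate = {
    "seqA": {"seqA", "seqB"},
    "seqB": {"seqA", "seqB", "seqC", "seqD"},
    "seqC": {"seqC"},
}
rep, cov = pick_representative(hits_by_candidate, family)
print(rep, cov)
```

In this toy family the profile built from seqB recognizes every member, so it would be chosen as the representative; a threshold such as the 90% coverage quoted above could then be applied to the returned fraction.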
Year: 2014
DOI: 10.1093/database/bau026
Venue: DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
Field: Sequence logo, Protein domain, Protein sequencing, Computer science, Representative sequences, Motif (music), Bioinformatics, Nearest neighbor search, Peptide sequence, Sequence analysis
DocType: Journal
Volume: 2014
ISSN: 1758-0463
Citations: 0
PageRank: 0.34
References: 7
Authors: 4
Name                   Order  Citations  PageRank
Agnel Praveen Joseph   1      7          0.90
Prashant Shingate      2      10         1.52
Atul K. Upadhyay       3      0          0.34
Ramanathan Sowdhamini  4      215        21.20