Title
A unified sequence-structure classification of protein sequences: combining sequence and structure in a map of the protein space
Abstract
We analyze all known protein sequences in search of a global map of protein space that is consistent in terms of both sequence and structure. Our goal is to define clusters of homologous protein domains, beyond those detected by sequence-based methods alone, and then to build a three-dimensional (3D) model for each of the sequences that are homologous to sequences of known 3D structure. This analysis applies both sequence- and structure-based metrics to all protein sequences in a non-redundant (NR) database comprising all major sequence databases.
The analysis starts from the sequences of the SCOP database domains, which have known three-dimensional structures. These sequences are first clustered into families based on sequence similarity alone, without incorporating any information from the SCOP classification. Each sequence-based family is represented by a profile, and this profile is used to search the NR database with PSI-BLAST. Since PSI-BLAST can lead to false similarities, several different indices of validity are used to control the procedure. Each of the detected sequences is marked, and a profile is built for the whole cluster of similar sequences. A 3D model is then built for each sequence in the cluster from an alignment made with the profile and the known structures of the SCOP representatives in the cluster. Clusters based on SCOP domains are called type-I clusters. In all, we find 1421 type-I clusters with a total of 168,431 sequences (44.5% of our NR database).
After all members of type-I clusters have been marked, we analyze the remaining sequences. The PSI-BLAST procedure is applied repeatedly, each time with a different query, to search what is left over from the previous run. This gives type-II clusters, which may overlap.
Type-I and type-II clusters are then grouped using higher-level measures of similarity. Pairs of clusters that contain the same protein (significant overlap in membership) are marked first. The pairs of clusters are then compared using either a structure metric (when 3D structures are known) or a novel sequence profile metric, and clustered into superfamilies and “fold” families.
This analysis avoids the limitation of classifications that are based on sequence comparison alone, and allows us to construct a 3D model for a substantial portion of the sequences in the NR database.
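The step of marking cluster pairs with significant overlap in membership can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how overlap is measured or what counts as significant. The sketch assumes overlap is the number of shared members divided by the size of the smaller cluster, and the names `overlap_fraction`, `mark_overlapping_pairs`, the `min_fraction` threshold of 0.5, and the toy cluster data are all hypothetical.

```python
from itertools import combinations


def overlap_fraction(a, b):
    """Shared members as a fraction of the smaller cluster (an assumed measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def mark_overlapping_pairs(clusters, min_fraction=0.5):
    """Return pairs of cluster names whose membership overlap meets the threshold.

    `clusters` maps a cluster name to a set of sequence identifiers.
    `min_fraction` is a hypothetical cutoff, not a value from the paper.
    """
    marked = []
    # Sort by name so the output order is deterministic.
    for (name_a, a), (name_b, b) in combinations(sorted(clusters.items()), 2):
        if overlap_fraction(a, b) >= min_fraction:
            marked.append((name_a, name_b))
    return marked


# Toy data: one type-I cluster and two type-II clusters (made-up identifiers).
clusters = {
    "I-1": {"P1", "P2", "P3", "P4"},
    "II-1": {"P3", "P4", "P5"},
    "II-2": {"P6", "P7"},
}
print(mark_overlapping_pairs(clusters))
```

In this toy input only the pair ("I-1", "II-1") is marked, since the two clusters share two of the three members of the smaller one. In the procedure described above, the marked pairs would then be compared with the structure metric or the sequence profile metric.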
Year: 2000
DOI: 10.1145/332306.332569
Venue: RECOMB
Keywords: protein space, unified sequence-structure classification, protein sequence, fold recognition, energy minimization, protein domains, structural classification of proteins, protein threading, nmr, three dimensional
Field: Cluster (physics), Sequence logo, Global Map, Biology, Threading (protein sequence), Protein superfamily, Bioinformatics, Multiple sequence alignment, Structural Classification of Proteins database, Energy minimization
DocType: Conference
ISBN: 1-58113-186-0
Citations: 4
PageRank: 0.50
References: 12
Authors: 2
Name
Order
Citations
PageRank
G Yona165145.52
Michael Levitt258799.00