Title
Practical Graph Isomorphism for Graphlet Data Mining in Protein Structures
Abstract
The tertiary structure of proteins is composed of a-helices and beta-sheets being referred to as Secondary Structure Elements (SSE). SSE are evolutionary conserved and define the overall fold of a protein. Therefore they can be used to classify protein families. SSE form pairwise energetical interactions which can be described by graphs. Neighbourhood graphs employ edge conditions to filter relevant interactions out of a set of pairwise relations. Graphlet analysis then employs stochastic sampling of subgraphs to identify overrepresented interaction patterns. To distinguish graphlets while sampling requires an efficient algorithm for graph isomorphism. In this Chapter, we describe a graph isomorphism algorithm that is easy to implement. In a preprocessing phase, the presented algorithm combines marks assigned to vertices to more informative marks. Propagated over edges, marks collect information about the structure of the graph and hence allow to efficiently find isomorphisms in a subsequent backtracking step. Applying graphlet analysis to neighbourhood graphs for structures from the ICGEB Protein Benchmark database and the Super-Secondary Structure database (SSSDB), we identify 627 significant graphlets. Subsequently trained decision trees on these features predict the four SCOP levels and SSSDB classes with a mean Area Under Curve (mAUC) better than 0.89 (5-fold CV). Regularizing these decision trees to avoid overfitting reveals that for reliable prediction of structural features about 20 graphlets are sufficient. Especially, we find that graphlets composed of five secondary structure elements are most informative for classification. Conversely, using decision trees trained on the mere sequence of SSE obtained from the protein sequence we are also able to predict graphlets directly from secondary structure annotation. Optimal prediction performance thereby reaches up to a Matthews Correlation Coefficient (MCC) of 0.7. From our experiments in this Chapter, we conclude that SSE interactions form patterns significantly different from random. These patterns are both useful to predict structural protein features as well as they can be predicted from protein sequence. Therefore they can be used as constraints to facilitate the de novo prediction of unknown protein structures.
Year
DOI
Venue
2010
10.1007/978-3-642-27534-0_23
Studies in Computational Intelligence
Keywords
Field
DocType
Graphlets,Data Mining,Relative Neighbourhood Graph,Secondary Structure Elements,Decision Tree Model Selection
Pairwise comparison,Graph,Combinatorics,Protein tertiary structure,Graph isomorphism,Sampling (statistics),Protein secondary structure,Mathematics,Protein structure
Conference
Volume
ISSN
Citations 
399
1860-949X
0
PageRank 
References 
Authors
0.34
8
3
Name
Order
Citations
PageRank
Carsten Henneges1120.76
Christoph Behle2234.14
Andreas Zell31419137.58