Practical Graph Isomorphism for Graphlet Data Mining in Protein Structures - Citegraph

Paper Info

Title
Practical Graph Isomorphism for Graphlet Data Mining in Protein Structures

Abstract
The tertiary structure of proteins is composed of a-helices and beta-sheets being referred to as Secondary Structure Elements (SSE). SSE are evolutionary conserved and define the overall fold of a protein. Therefore they can be used to classify protein families. SSE form pairwise energetical interactions which can be described by graphs. Neighbourhood graphs employ edge conditions to filter relevant interactions out of a set of pairwise relations. Graphlet analysis then employs stochastic sampling of subgraphs to identify overrepresented interaction patterns. To distinguish graphlets while sampling requires an efficient algorithm for graph isomorphism. In this Chapter, we describe a graph isomorphism algorithm that is easy to implement. In a preprocessing phase, the presented algorithm combines marks assigned to vertices to more informative marks. Propagated over edges, marks collect information about the structure of the graph and hence allow to efficiently find isomorphisms in a subsequent backtracking step. Applying graphlet analysis to neighbourhood graphs for structures from the ICGEB Protein Benchmark database and the Super-Secondary Structure database (SSSDB), we identify 627 significant graphlets. Subsequently trained decision trees on these features predict the four SCOP levels and SSSDB classes with a mean Area Under Curve (mAUC) better than 0.89 (5-fold CV). Regularizing these decision trees to avoid overfitting reveals that for reliable prediction of structural features about 20 graphlets are sufficient. Especially, we find that graphlets composed of five secondary structure elements are most informative for classification. Conversely, using decision trees trained on the mere sequence of SSE obtained from the protein sequence we are also able to predict graphlets directly from secondary structure annotation. Optimal prediction performance thereby reaches up to a Matthews Correlation Coefficient (MCC) of 0.7. From our experiments in this Chapter, we conclude that SSE interactions form patterns significantly different from random. These patterns are both useful to predict structural protein features as well as they can be predicted from protein sequence. Therefore they can be used as constraints to facilitate the de novo prediction of unknown protein structures.

Year	DOI	Venue
2010	10.1007/978-3-642-27534-0_23	Studies in Computational Intelligence
Keywords	Field	DocType
Graphlets,Data Mining,Relative Neighbourhood Graph,Secondary Structure Elements,Decision Tree Model Selection	Pairwise comparison,Graph,Combinatorics,Protein tertiary structure,Graph isomorphism,Sampling (statistics),Protein secondary structure,Mathematics,Protein structure	Conference
Volume	ISSN	Citations
399	1860-949X	0
PageRank	References	Authors
0.34	8	3

Authors (3 rows)

Cited by (0 rows)

References (8 rows)

Name	Order	Citations	PageRank
Carsten Henneges	1	12	0.76
Christoph Behle	2	23	4.14
Andreas Zell	3	1419	137.58

1