Context Sampling Strategies for Generating Linked Data Graph Embeddings. - Citegraph

Paper Info

Title
Context Sampling Strategies for Generating Linked Data Graph Embeddings.

Abstract
Linked data is a data publishing method that can be used to connect any kind of globally available data into a single multigraph. This kind of graph provides enormous opportunities for machine learning and data mining techniques to train models with large heterogeneous types of data and find new relationships. Both are, however, strongly dependent on the engineering of high-quality features and therefore require knowledge of the domain. Recent advances in the field of representation learning have led to significant progress in automating the feature engineering process. Neural word embedding techniques from the natural language processing domain have been used to learn representations of graph nodes and subsequently applied to linked data nodes. In contrast to natural language, where sentences serve as a natural boundary for the context of a word, in a graph the boundaries are not clearly defined and multiple context sampling strategies exist. Applying different context sampling strategies to graph nodes results in different context sentences and subsequently different features. In this work, we explore two different context sampling strategies, predicate removal from random walks and breadth-first-search-based sampling, and compare them to the state of the art based on random walks. The quality of the generated features is evaluated indirectly by measuring the performance of machine learning models on a classification task across multiple data sets. Furthermore, we explore the effect of generating embeddings only for the entities that have to be classified and their neighbors, instead of generating embeddings for every node in a possibly large RDF graph. The results suggest that for classification of same-typed entities the inclusion of predicates in the sampled walks for generating embeddings is of little use and can be omitted without losing classification accuracy. Results also show that the in-degree and out-degree of the entities may be a useful hint for selecting the optimal sampling technique.
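
The three sampling strategies named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy triples, the function names (`build_adjacency`, `random_walk`, `bfs_context`), and the depth parameter are assumptions made for illustration, and the paper's exact definition of breadth-first-search-based sampling may differ from the simple level-order traversal shown here.

```python
import random
from collections import deque

# Toy RDF-like graph as (subject, predicate, object) triples.
# All identifiers are hypothetical and exist only for this sketch.
TRIPLES = [
    ("ex:Alice", "ex:worksAt", "ex:Acme"),
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:worksAt", "ex:Acme"),
    ("ex:Acme", "ex:locatedIn", "ex:Essen"),
]


def build_adjacency(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    adj = {}
    for s, p, o in triples:
        adj.setdefault(s, []).append((p, o))
    return adj


def random_walk(adj, start, depth, rng, keep_predicates=True):
    """Sample one walk of at most `depth` hops as a context sentence.

    With keep_predicates=True the sentence alternates entities and
    predicates; with False the predicates are dropped.
    """
    sentence = [start]
    node = start
    for _ in range(depth):
        edges = adj.get(node)
        if not edges:  # dead end: no outgoing edges
            break
        predicate, obj = rng.choice(edges)
        if keep_predicates:
            sentence.append(predicate)
        sentence.append(obj)
        node = obj
    return sentence


def bfs_context(adj, start, depth):
    """Collect entities reachable within `depth` hops in level order."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue
        for _, obj in adj.get(node, []):
            if obj not in seen:
                seen.add(obj)
                order.append(obj)
                queue.append((obj, dist + 1))
    return order


adj = build_adjacency(TRIPLES)
rng = random.Random(0)
with_predicates = random_walk(adj, "ex:Alice", 2, rng, keep_predicates=True)
without_predicates = random_walk(adj, "ex:Alice", 2, rng, keep_predicates=False)
bfs_sentence = bfs_context(adj, "ex:Alice", 2)
```

Each returned list plays the role of one context sentence; a corpus of such sentences could then be passed to a word embedding model to obtain entity vectors.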

Year	DOI	Venue
2018	10.3233/978-1-61499-900-3-559	Frontiers in Artificial Intelligence and Applications
Keywords	Field	DocType
Linked open data,Graph Mining,Data Mining,Context Sampling Strategies	Graph,Computer science,Linked data,Theoretical computer science,Sampling (statistics)	Conference
Volume	ISSN	Citations
303	0922-6389	0
PageRank	References	Authors
0.34	0	2

Authors (2 rows)

Name	Order	Citations	PageRank
Yordan Terziev	1	1	1.04
Volker Gruhn	2	1584	221.96
