Title
Context Sampling Strategies for Generating Linked Data Graph Embeddings.
Abstract
Linked data is a data publishing method that can be used to connect any kind of globally available data into a single multigraph. This kind of graph provides enormous opportunities for machine learning and data mining techniques to train models on large, heterogeneous data and to find new relationships. Both, however, depend strongly on the engineering of high-quality features and therefore require knowledge of the domain. Recent advances in the field of representation learning have led to significant progress in automating the feature engineering process. Neural word embedding techniques from the natural language processing domain have been used to learn representations of graph nodes and have subsequently been applied to linked data nodes. In contrast to natural language, where sentences serve as a natural boundary for the context of a word, the boundaries in a graph are not clearly defined, and multiple context sampling strategies exist. Applying different context sampling strategies to graph nodes results in different context sentences and subsequently in different features. In this work, we explore two context sampling strategies, predicate removal from random walks and breadth-first-search-based sampling, and compare them to the state of the art based on random walks. The quality of the generated features is evaluated indirectly by measuring the performance of machine learning models on a classification task across multiple data sets. Furthermore, we explore the effect of generating embeddings only for the entities that have to be classified and their neighbors, instead of generating embeddings for every node in a possibly large RDF graph. The results suggest that, for the classification of entities of the same type, including predicates in the sampled walks used for generating embeddings is of little use, and that predicates can be omitted without losing classification accuracy. The results also show that the in-degree and out-degree of the entities may be a useful hint for selecting the optimal sampling technique.
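The three context sampling strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy triples, the `ex:` identifiers, and the function names `build_adjacency`, `random_walk`, `remove_predicates`, and `bfs_context` are all assumptions made for illustration, and the walk depth and edge selection rule are arbitrary choices.

```python
import random
from collections import defaultdict, deque

# Toy RDF-like graph as (subject, predicate, object) triples.
# All identifiers are hypothetical and exist only for this example.
TRIPLES = [
    ("ex:Alice", "ex:worksAt", "ex:Acme"),
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:worksAt", "ex:Acme"),
    ("ex:Acme", "ex:locatedIn", "ex:Essen"),
]


def build_adjacency(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))
    return adj


def random_walk(adj, start, depth, rng):
    """Random walk that keeps predicates: [entity, predicate, entity, ...]."""
    walk = [start]
    node = start
    for _ in range(depth):
        edges = adj.get(node)
        if not edges:  # dead end: no outgoing edges
            break
        p, o = rng.choice(edges)
        walk.extend([p, o])
        node = o
    return walk


def remove_predicates(walk):
    """Predicate removal: entities sit at even positions of the walk."""
    return walk[::2]


def bfs_context(adj, start, depth):
    """Breadth-first sampling: entities reachable within `depth` hops,
    listed in the order in which they are first visited."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if d == depth:
            continue
        for _, o in adj.get(node, []):
            if o not in seen:
                seen.add(o)
                order.append(o)
                queue.append((o, d + 1))
    return order


adj = build_adjacency(TRIPLES)
rng = random.Random(0)
walk = random_walk(adj, "ex:Alice", depth=2, rng=rng)
print(walk)                      # walk with predicates
print(remove_predicates(walk))   # same walk, entities only
print(bfs_context(adj, "ex:Alice", depth=2))
```

Each returned list plays the role of a context sentence. In a full pipeline, many such sentences per entity would presumably be passed to a word embedding model, which is outside the scope of this sketch.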
Year: 2018
DOI: 10.3233/978-1-61499-900-3-559
Venue: Frontiers in Artificial Intelligence and Applications
Keywords: Linked open data, Graph Mining, Data Mining, Context Sampling Strategies
Field: Graph, Computer science, Linked data, Theoretical computer science, Sampling (statistics)
DocType: Conference
Volume: 303
ISSN: 0922-6389
Citations: 0
PageRank: 0.34
References: 0
Authors: 2
Name
Order
Citations
PageRank
Yordan Terziev111.04
Volker Gruhn21584221.96