Title
DistSim - Scalable Distributed in-Memory Semantic Similarity Estimation for RDF Knowledge Graphs
Abstract
In this paper, we present DistSim, a scalable, distributed, in-memory semantic similarity estimation framework for knowledge graphs. DistSim provides a multitude of state-of-the-art similarity estimators. Its Similarity Estimation Pipeline is built by combining generic software modules. For large-scale RDF data, DistSim uses MinHash with locality-sensitive hashing, which scales better than all-pair similarity estimation. The modules of DistSim can be configured through a multitude of (hyper-)parameters, which let users adjust the tradeoff between the amount of information taken into account and the processing time. Furthermore, the output of the Similarity Estimation Pipeline is native RDF. DistSim is integrated into the SANSA stack, documented in Scaladoc, and covered by unit tests. Additionally, its variables and methods follow the Apache Spark MLlib namespace conventions. The performance of DistSim was evaluated on a distributed cluster along two dimensions, data set size and processing power, each measured against processing time; the results show that DistSim scales with increasing data set size and processing power. DistSim is already in use for several RDF data analytics use cases and is available as part of the open-source GitHub project SANSA.
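To illustrate the idea named in the abstract, MinHash combined with locality-sensitive hashing as an alternative to comparing all pairs, here is a minimal single-machine sketch in plain Python. It is not DistSim's code and does not use the SANSA or Apache Spark MLlib APIs. The entity names, the feature strings (predicate/object pairs written as `predicate=object`), and all function names and parameters (`num_hashes`, `bands`, the seed) are assumptions made for illustration only.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

# Large Mersenne prime used as the modulus of the (hypothetical) hash family.
PRIME = (1 << 61) - 1


def jaccard(a, b):
    """Exact Jaccard similarity of two feature collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(num_hashes, seed=42):
    """Draw (a, b) coefficients for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(features, params):
    """MinHash signature of a non-empty collection of feature strings."""
    # crc32 gives a deterministic integer code per feature string.
    codes = [zlib.crc32(f.encode("utf-8")) for f in set(features)]
    return [min((a * c + b) % PRIME for c in codes) for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def lsh_candidate_pairs(signatures, bands):
    """Banding LSH: entities sharing any identical band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


if __name__ == "__main__":
    # Hypothetical entities, each described by predicate=object features.
    entities = {
        "ex:film_a": ["rdf:type=ex:Film", "ex:director=ex:p1", "ex:genre=ex:Drama"],
        "ex:film_b": ["rdf:type=ex:Film", "ex:director=ex:p1", "ex:genre=ex:Comedy"],
        "ex:city_c": ["rdf:type=ex:City", "ex:country=ex:DE"],
    }
    params = make_hash_params(num_hashes=128)
    sigs = {name: minhash_signature(f, params) for name, f in entities.items()}
    for left, right in sorted(lsh_candidate_pairs(sigs, bands=32)):
        print(left, right, estimate_similarity(sigs[left], sigs[right]))
```

In this sketch, only pairs that land in the same bucket for at least one band are compared, so the number of comparisons can stay well below the quadratic cost of the all-pair approach. More hash functions tighten the estimate, and the band count shifts the balance between missed and spurious candidates, which is one example of the kind of parameter tradeoff the abstract describes.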
Year
2021
DOI
10.1109/ICSC50631.2021.00062
Venue
2021 IEEE 15th International Conference on Semantic Computing (ICSC)
Keywords
Distributed RDF Analytics, Scalable Semantic Similarity Estimation, Knowledge Graph Data Analytics Pipeline, SANSA
DocType
Conference
ISSN
2325-6516
ISBN
978-1-7281-8900-0
Citations
0
PageRank
0.34
References
10
Authors
3
Name: Carsten Felix Draschner; Order: 1; Citations: 0; PageRank: 0.68
Name: Jens Lehmann; Order: 2; Citations: 5375; PageRank: 355.08
Name: Hajira Jabeen; Order: 3; Citations: 0; PageRank: 3.04