Title
SDLER: stacked dedupe learning for entity resolution in big data era
Abstract
In the Big Data Era, Entity Resolution (ER) faces many challenges, such as the need for high scalability, the coexistence of complex similarity metrics, tautonymy and synonymy, and the requirement of Data Quality Evaluation. Moreover, despite more than seventy years of development efforts, there is still a high demand for democratizing ER by reducing human participation in parameter tuning, data labeling, defining blocking functions, and feature engineering. This study explores a novel Stacked Dedupe Learning ER system that aims for high accuracy and efficiency. The study evaluated sophisticated composition methods, such as Bidirectional Recurrent Neural Networks (BiRNNs) with Long Short-Term Memory (LSTM) hidden units, to convert each tuple into a Word Representation Distribution (WRD) that captures similarities among tuples. For cases where pre-trained word embeddings were not available, ways to learn and tune WRDs customized for ER tasks under different scenarios were also considered. In addition, a Locality Sensitive Hashing (LSH) based blocking approach was assessed; it considers all attributes of a tuple and produces smaller blocks than traditional methods that use only a few attributes. The algorithm was tested on multiple datasets, namely benchmark and multilingual data. The experimental results showed that Stacked Dedupe Learning achieves high quality and good performance, and scales well compared to existing solutions.
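The LSH-based blocking idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which LSH family the authors use, so the random-hyperplane scheme below is an assumption, as are the function names (`make_hyperplanes`, `lsh_key`, `lsh_blocks`) and the toy three-dimensional tuple vectors, which stand in for learned tuple representations. It is not the paper's implementation.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector, hyperplanes):
    """Hash a vector to a bit tuple: one bit per hyperplane, set by the sign of the dot product."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else 0
        for h in hyperplanes
    )


def lsh_blocks(vectors, num_bits=4, seed=0):
    """Group record ids into blocks keyed by their LSH bit tuple."""
    dim = len(next(iter(vectors.values())))
    planes = make_hyperplanes(dim, num_bits, seed)
    blocks = defaultdict(list)
    for record_id, vec in vectors.items():
        blocks[lsh_key(vec, planes)].append(record_id)
    return dict(blocks)


# Hypothetical tuple vectors; in an ER pipeline each would summarise
# all attributes of one tuple.
t1 = [0.9, 0.1, 0.3]
tuple_vectors = {
    "t1": t1,
    "t2": [2 * x for x in t1],   # same direction as t1, so same hash key
    "t3": [-x for x in t1],      # opposite direction, so every bit differs
}

blocks = lsh_blocks(tuple_vectors)
for key, ids in blocks.items():
    print(key, ids)
```

Only tuples that fall into the same block need to be compared pairwise, which is what keeps the number of candidate pairs small. Increasing `num_bits` yields smaller blocks at the cost of a higher chance of separating true matches.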
Year
2021
DOI
10.1007/s11227-021-03710-x
Venue
The Journal of Supercomputing
Keywords
Bidirectional RNN, Big data, Data quality, Entity resolution, Stacked Dedupe Learning (SDL), Word Representation Distribution (WRD)
DocType
Journal
Volume
77
Issue
10
ISSN
0920-8542
Citations
1
PageRank
0.36
References
16
Authors
4
Name                     Order  Citations  PageRank
Alladoumbaye Ngueilbaye  1      2          1.10
Hongzhi Wang             2      421        73.72
Daouda Ahmat Mahamat     3      1          0.36
Elgendy Ibrahim          4      39         5.42