Abstract |
---|
Entity matching is the problem of identifying which entities in one data source refer to the same real-world entity as entities in other data sources. Identifying entities across heterogeneous data sources is essential to applications such as entity profiling and product recommendation. The matching process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but must also handle heterogeneous entity attributes. In this paper, we design an unsupervised approach, called EMAN, to match entities across two or more heterogeneous data sources. The algorithm uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the matching process. To handle the heterogeneous entity attributes, we employ the exponential family to model the similarities between the different attributes. EMAN is highly accurate and efficient even without any ground-truth tuples. We illustrate the performance of EMAN on re-identifying entities from the same data source, as well as on matching entities across three real data sources. Our experimental results show that our proposed approach outperforms the comparable baseline. |
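The abstract does not say which locality-sensitive hashing family EMAN uses, so the following is only a minimal sketch of the general idea of LSH-based candidate reduction, not the authors' algorithm. It assumes MinHash over character 3-grams with banding; the function names (`shingles`, `minhash_signature`, `candidate_pairs`) and the parameters `num_perm` and `bands` are hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed acts as one MinHash hash function (an assumption)."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_perm=64):
    """MinHash signature: the minimum hash value of the token set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_perm))


def candidate_pairs(records, num_perm=64, bands=16):
    """Return pairs of record ids that share at least one LSH band bucket.

    Only these pairs would be passed on to a (more expensive) similarity model,
    instead of comparing every tuple with every other tuple.
    """
    rows = num_perm // bands
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_perm)
        for b in range(bands):
            # Records whose signatures agree on all rows of a band share a bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy records from two sources, "a" and "b".
    records = {
        "a1": "Apple iPhone 6 16GB",
        "b1": "Apple iPhone 6 16GB",
        "b2": "zzzz qqqq",
    }
    print(candidate_pairs(records))
```

With more bands (fewer rows per band) the filter becomes more permissive and keeps more candidate pairs; with fewer bands it prunes more aggressively at the risk of missing true matches.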
Year | Venue | Field |
---|---|---|
2016 | DASFAA | Locality-sensitive hashing, Data source, Data mining, Computer science, Tuple, Profiling (computer programming), Exponential family, Schema (psychology), Speedup |
DocType | Citations | PageRank |
Conference | 4 | 0.42 |
References | Authors | |
14 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Chao Kong | 1 | 4 | 1.44 |
Ming Gao | 2 | 76 | 9.41 |
Chen Xu | 3 | 31 | 4.43 |
Weining Qian | 4 | 1064 | 81.09 |
Aoying Zhou | 5 | 2632 | 238.85 |