Abstract | ||
---|---|---|
Traditional methods for IR-style keyword search in relational databases assume clean data and perform no entity resolution (ER). As a result, on dirty datasets, where duplicate tuples with different identifiers refer to the same real-world entity, their answers to a query may contain duplicates. In this paper, we propose a method for processing top-N keyword queries with real-time ER. The method creates an index to obtain candidate tuples for a keyword query, defines a function to compute the similarity between the query and each candidate tuple, and designs a divide-and-conquer clustering algorithm to deduplicate the query results. Extensive experiments confirm the effectiveness and efficiency of the method on both dirty and (almost) clean datasets. |
Year | DOI | Venue |
---|---|---|
2018 | 10.1145/3195106.3195171 | Proceedings of the 2018 10th International Conference on Machine Learning and Computing (ICMLC 2018) |
Keywords | Field | DocType |
Entity resolution, relational database, similarity, top-N keyword query | Name resolution, Identifier, Information retrieval, Relational database, Computer science, Tuple, Keyword search, Artificial intelligence, Divide and conquer algorithms, Cluster analysis, Machine learning | Conference |
Citations | PageRank | References |
0 | 0.34 | 0 |
Authors | ||
5 |
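
The three steps named in the abstract (an index for candidate tuples, a query-tuple similarity function, and divide-and-conquer clustering for deduplication) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the index structure, the similarity function, or the clustering details, so the sketch assumes an inverted keyword index, Jaccard similarity over lowercase tokens, and a merge threshold of 0.6. All function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption, not the paper's)."""
    return set(text.lower().split())


def build_index(rows):
    """Inverted index: keyword -> set of tuple ids containing it."""
    index = defaultdict(set)
    for tid, text in rows.items():
        for tok in tokenize(text):
            index[tok].add(tid)
    return index


def jaccard(a, b):
    """Jaccard similarity of two token sets (stand-in similarity function)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(ids, rows, threshold):
    """Divide and conquer: cluster each half, then merge clusters whose
    first (highest-ranked) members are similar enough."""
    if len(ids) <= 1:
        return [[i] for i in ids]
    mid = len(ids) // 2
    merged = cluster(ids[:mid], rows, threshold)
    for right in cluster(ids[mid:], rows, threshold):
        for left in merged:
            sim = jaccard(tokenize(rows[left[0]]), tokenize(rows[right[0]]))
            if sim >= threshold:
                left.extend(right)  # treat as the same real-world entity
                break
        else:
            merged.append(right)
    return merged


def top_n(query, rows, index, n, threshold=0.6):
    """Return up to n tuple ids, one representative per duplicate cluster."""
    q = tokenize(query)
    candidates = set().union(*(index.get(tok, set()) for tok in q))
    # Rank candidates by similarity to the query; break ties by id.
    ranked = sorted(candidates,
                    key=lambda tid: (-jaccard(q, tokenize(rows[tid])), tid))
    clusters = cluster(ranked, rows, threshold)
    return [c[0] for c in clusters][:n]


# Hypothetical dirty dataset: ids 1 and 2 describe the same product.
rows = {
    1: "Apple iPhone 8 64GB",
    2: "apple iphone 8 64gb smartphone",
    3: "Samsung Galaxy S9",
    4: "Apple MacBook Pro",
}
index = build_index(rows)
print(top_n("apple iphone", rows, index, n=2))
```

Because candidates are ranked before clustering and a merge always keeps the left cluster's first member, each cluster's representative is its best-scoring tuple, so the duplicate with the lower score is dropped from the answer.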