Abstract | ||
---|---|---|
Similarity join plays an important role in many applications, such as data cleaning and integration, where it helps address poor data quality. Most existing studies perform similarity join on static datasets; few run it on dynamic data streams. With the development of network technology, the data access paradigm has shifted from disk-oriented storage to online data streams, making similarity join as a continuous query over data streams a new query processing paradigm. Unlike a static dataset, a data stream is unbounded, continuous, and unpredictable. These differences pose serious challenges, such as achieving real-time query performance. To this end, we study the problem of continuous similarity join on data streams, based on the edit distance metric and a filter-and-verify framework with sliding-window semantics. We study two subcases of this problem: self similarity join on a single data stream and similarity join between two streams. We introduce a sliding window model built on basic windows to facilitate updating the sliding window and its index. We then discuss the details of our method, including signature extraction schemes, filtering and verification algorithms, and re-evaluation strategies. Finally, extensive experimental results show that our method works efficiently on real data streams. |
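
As a rough illustration of the filter-and-verify idea named in the abstract, the Python sketch below runs a self similarity join over a single stream with a sliding window. It is not the paper's method. The length-difference filter stands in for the signature extraction schemes, a count-based `deque` stands in for the basic-window model, and the names `edit_distance`, `stream_self_join`, `window_size`, and `tau` are assumptions made for this example.

```python
from collections import deque


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def stream_self_join(stream, window_size: int, tau: int):
    """Yield (older, newer, distance) for every pair of strings that share
    the sliding window and lie within edit distance tau of each other.

    Assumption: the window is count-based and holds the last `window_size`
    strings; the oldest string expires when a new one arrives.
    """
    window = deque(maxlen=window_size)
    for s in stream:
        for t in window:
            # Filter: edit distance is never smaller than the length
            # difference, so such pairs can be pruned without verification.
            if abs(len(s) - len(t)) > tau:
                continue
            # Verify: compute the exact edit distance for surviving pairs.
            d = edit_distance(s, t)
            if d <= tau:
                yield (t, s, d)
        window.append(s)


if __name__ == "__main__":
    demo = ["kitten", "sitten", "sitting", "dog", "kitten"]
    for pair in stream_self_join(demo, window_size=3, tau=2):
        print(pair)
```

In the demo, the second `"kitten"` is not paired with the first one, because the first has already slid out of the three-element window by the time the second arrives. A two-stream join would keep one window per stream and probe each arriving string against the other stream's window.
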
Year | DOI | Venue |
---|---|---|
2014 | 10.1109/PADSW.2014.7097853 | ICPADS |
Keywords | Field | DocType |
sliding window,filter-and-verify framework,string matching,continuous similarity join,dynamic data stream,similarity join,signature extraction scheme,continuous query,online data stream,data accessing paradigm,real-time query performance,edit distance,static dataset,reevaluation strategy,verification algorithm,data quality problem,basic window based sliding window model,disk-oriented mode,network technology,sliding-window semantics,query processing paradigm,data integration,filtering algorithm,data stream,data cleaning,query processing,edit distance metric | Edit distance,Hash join,Data mining,Data stream mining,Sliding window protocol,Data quality,Data stream,Computer science,Theoretical computer science,Dynamic data,Self-similarity | Conference |
ISSN | Citations | PageRank |
1521-9097 | 0 | 0.34 |
References | Authors | |
25 | 4 | |
Name | Order | Citations | PageRank |
---|---|---|---|
Jia Cui | 1 | 7 | 3.54 |
Wang Weiping | 2 | 335 | 63.84 |
Dan Meng | 3 | 476 | 67.10 |
Zhenyan Liu | 4 | 0 | 0.34 |