Title
Continuous similarity join on data streams
Abstract
Similarity join plays an important role in many applications, such as data cleaning and integration, where it helps address poor data quality. Most existing studies focus on performing similarity join over static datasets, and few address dynamic data streams. With the development of network technology, the data access paradigm has shifted from a disk-oriented mode to online data streams, which makes continuous similarity join queries over data streams a new query processing paradigm. Unlike a static dataset, a data stream is unbounded, continuous, and unpredictable. These differences pose serious challenges, such as maintaining real-time query performance. To this end, we study the problem of continuous similarity join on data streams, based on the edit distance metric and a filter-and-verify framework with sliding-window semantics. We study two subcases of this problem: self similarity join on a single data stream and similarity join between two streams. We introduce a basic-window-based sliding window model to facilitate updating the sliding window and its index. We also discuss the details of our method, including signature extraction schemes, filtering and verification algorithms, and re-evaluation strategies. Finally, extensive experimental results show that our method works efficiently on real data streams.
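As a rough illustration of the filter-and-verify idea with sliding-window semantics described in the abstract, the following Python sketch runs a self similarity join on a single stream. It is not the paper's method: the count-based window, the simple length filter (used here in place of the paper's signature extraction schemes and index), and the names `edit_distance`, `SlidingWindowSelfJoin`, `window_size`, and `tau` are all assumptions made for illustration.

```python
from collections import deque


def edit_distance(a, b, tau):
    """Return the edit distance of a and b, or tau + 1 if it exceeds tau."""
    if abs(len(a) - len(b)) > tau:
        return tau + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        # Row minima never decrease, so the final distance must exceed tau.
        if min(curr) > tau:
            return tau + 1
        prev = curr
    return prev[-1] if prev[-1] <= tau else tau + 1


class SlidingWindowSelfJoin:
    """Hypothetical count-based sliding window over one string stream."""

    def __init__(self, window_size, tau):
        self.window = deque()
        self.window_size = window_size
        self.tau = tau

    def insert(self, s):
        """Add s to the window; return pairs (t, s) with distance <= tau."""
        # Expire the oldest string so the window never exceeds window_size.
        if len(self.window) == self.window_size:
            self.window.popleft()
        matches = []
        for t in self.window:
            # Filter step: a length gap above tau rules the pair out cheaply.
            if abs(len(s) - len(t)) > self.tau:
                continue
            # Verify step: compute the thresholded edit distance.
            if edit_distance(s, t, self.tau) <= self.tau:
                matches.append((t, s))
        self.window.append(s)
        return matches


join = SlidingWindowSelfJoin(window_size=3, tau=1)
for item in ["stream", "streams", "window", "steam"]:
    print(item, join.insert(item))
```

In this sketch each arriving string is joined only against strings still inside the window, so a pair such as "stream" and "steam" is not reported once "stream" has expired. A two-stream join could follow the same pattern with one window per stream, probing the opposite window on each arrival.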
Year: 2014
DOI: 10.1109/PADSW.2014.7097853
Venue: ICPADS
Keywords: sliding window, filter-and-verify framework, string matching, continuous similarity join, dynamic data stream, similarity join, signature extraction scheme, continuous query, online data stream, data accessing paradigm, real-time query performance, edit distance, static dataset, reevaluation strategy, verification algorithm, data quality problem, basic window based sliding window model, disk-oriented mode, network technology, sliding-window semantics, query processing paradigm, data integration, filtering algorithm, data stream, data cleaning, query processing, edit distance metric
Field: Edit distance, Hash join, Data mining, Data stream mining, Sliding window protocol, Data quality, Data stream, Computer science, Theoretical computer science, Dynamic data, Self-similarity
DocType: Conference
ISSN: 1521-9097
Citations: 0
PageRank: 0.34
References: 25
Authors: 4
Name | Order | Citations | PageRank
Jia Cui | 1 | 7 | 3.54
Wang Weiping | 2 | 335 | 63.84
Dan Meng | 3 | 476 | 67.10
Zhenyan Liu | 4 | 0 | 0.34