Title
CasJoin: a cascade chain for text similarity joins
Abstract
We are concerned with the problem of similarity joins of text data, where the task is to find all pairs of documents above an expected similarity. Such a problem often serves as an indispensable step in many web applications. A crucial issue is to preclude unnecessary candidate pairs as many as possible ahead of expensive similarity evaluation. In this paper, we initiate an idea of adopting a cascade structure in text joins for a large speedup, where a latter stage can exclude a considerable number of invalid pairs survived in former stages. The proposed algorithm is shortly referred to as CasJoin. We further adopt a prefix filter to build the stage of CasJoin by introducing a novel vision to the dynamic generation of document vector. Specifically, a vector is partitioned into a chain of multiple prefixes that are appended one by one for cascade joining. We evaluate our CasJoin on a typical web corpus, ODP. Experiments indicate that, comparing to the state-of-the-art prefix algorithms, CasJoin can achieve a drastic reduction of candidates by as much as 98.15% and a dramatic speedup of joining by up to 13.34x.
Year
DOI
Venue
2010
10.1145/1871437.1871714
CIKM
Keywords
Field
DocType
dramatic speedup,expensive similarity evaluation,cascade chain,latter stage,large speedup,document vector,cascade structure,multiple prefix,former stage,prefix filter,expected similarity,text similarity,similarity search
Data mining,Joins,Duplicate detection,Information retrieval,Computer science,Theoretical computer science,Prefix,Cascade,Web application,Nearest neighbor search,Speedup
Conference
Citations 
PageRank 
References 
1
0.41
7
Authors
5
Name
Order
Citations
PageRank
Xiaoxun Zhang133616.74
Zhili guo226412.46
Honglei Guo326514.36
Huijia Zhu41396.97
Zhong Su52282110.39