Abstract | ||
---|---|---|
We are concerned with the problem of similarity joins of text data, where the task is to find all pairs of documents above an expected similarity. Such a problem often serves as an indispensable step in many web applications. A crucial issue is to preclude unnecessary candidate pairs as many as possible ahead of expensive similarity evaluation. In this paper, we initiate an idea of adopting a cascade structure in text joins for a large speedup, where a latter stage can exclude a considerable number of invalid pairs survived in former stages. The proposed algorithm is shortly referred to as CasJoin. We further adopt a prefix filter to build the stage of CasJoin by introducing a novel vision to the dynamic generation of document vector. Specifically, a vector is partitioned into a chain of multiple prefixes that are appended one by one for cascade joining. We evaluate our CasJoin on a typical web corpus, ODP. Experiments indicate that, comparing to the state-of-the-art prefix algorithms, CasJoin can achieve a drastic reduction of candidates by as much as 98.15% and a dramatic speedup of joining by up to 13.34x. |
Year | DOI | Venue |
---|---|---|
2010 | 10.1145/1871437.1871714 | CIKM |
Keywords | Field | DocType |
dramatic speedup,expensive similarity evaluation,cascade chain,latter stage,large speedup,document vector,cascade structure,multiple prefix,former stage,prefix filter,expected similarity,text similarity,similarity search | Data mining,Joins,Duplicate detection,Information retrieval,Computer science,Theoretical computer science,Prefix,Cascade,Web application,Nearest neighbor search,Speedup | Conference |
Citations | PageRank | References |
1 | 0.41 | 7 |
Authors | ||
5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Xiaoxun Zhang | 1 | 336 | 16.74 |
Zhili guo | 2 | 264 | 12.46 |
Honglei Guo | 3 | 265 | 14.36 |
Huijia Zhu | 4 | 139 | 6.97 |
Zhong Su | 5 | 2282 | 110.39 |