CasJoin: a cascade chain for text similarity joins - Citegraph

Paper Info

Title
CasJoin: a cascade chain for text similarity joins

Abstract
We are concerned with the problem of similarity joins of text data, where the task is to find all pairs of documents above an expected similarity. Such a problem often serves as an indispensable step in many web applications. A crucial issue is to preclude unnecessary candidate pairs as many as possible ahead of expensive similarity evaluation. In this paper, we initiate an idea of adopting a cascade structure in text joins for a large speedup, where a latter stage can exclude a considerable number of invalid pairs survived in former stages. The proposed algorithm is shortly referred to as CasJoin. We further adopt a prefix filter to build the stage of CasJoin by introducing a novel vision to the dynamic generation of document vector. Specifically, a vector is partitioned into a chain of multiple prefixes that are appended one by one for cascade joining. We evaluate our CasJoin on a typical web corpus, ODP. Experiments indicate that, comparing to the state-of-the-art prefix algorithms, CasJoin can achieve a drastic reduction of candidates by as much as 98.15% and a dramatic speedup of joining by up to 13.34x.

Year	DOI	Venue
2010	10.1145/1871437.1871714	CIKM
Keywords	Field	DocType
dramatic speedup,expensive similarity evaluation,cascade chain,latter stage,large speedup,document vector,cascade structure,multiple prefix,former stage,prefix filter,expected similarity,text similarity,similarity search	Data mining,Joins,Duplicate detection,Information retrieval,Computer science,Theoretical computer science,Prefix,Cascade,Web application,Nearest neighbor search,Speedup	Conference
Citations	PageRank	References
1	0.41	7
Authors
5

Authors (5 rows)

Cited by (1 rows)

References (7 rows)

Name	Order	Citations	PageRank
Xiaoxun Zhang	1	336	16.74
Zhili guo	2	264	12.46
Honglei Guo	3	265	14.36
Huijia Zhu	4	139	6.97
Zhong Su	5	2282	110.39

1