Title
GFSF: A Novel Similarity Join Method Based on Frequency Vector
Abstract
String similarity join is widely used in many fields, e.g. data cleaning, web search, pattern recognition and DNA sequence matching. In recent years, many similarity join methods have been proposed, for example Pass-Join, Ed-Join and Trie-Join, among which the edit-distance-based Pass-Join algorithm achieves much better overall performance than the others. However, Pass-Join cannot effectively filter out candidate pairs that are only partially similar. Here a novel algorithm called GFSF is proposed, which introduces two additional filtering steps based on character frequency vectors. In this way, the number of candidate pairs that are only partially similar is greatly reduced, which in turn greatly reduces the total time of the string similarity join process. The experimental results show that the overall performance of the proposed method is better than that of Pass-Join.
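The abstract does not spell out the two GFSF filtering steps, so the following is only a minimal sketch of the general idea behind a character-frequency filter, not the paper's algorithm. It relies on a standard bound: a single edit operation can remove at most one surplus character and add at most one missing character, so the larger of the two totals is a lower bound on the edit distance. The function names (frequency_lower_bound, may_be_similar, edit_distance) and the threshold parameter tau are assumptions made for illustration.

```python
from collections import Counter


def frequency_lower_bound(s: str, t: str) -> int:
    """Lower bound on edit distance derived from character frequency vectors.

    Illustrative only: this is a generic frequency-distance bound, not the
    specific filters defined in the GFSF paper.
    """
    cs, ct = Counter(s), Counter(t)
    # Counter subtraction keeps only positive counts.
    surplus = sum((cs - ct).values())  # characters s has in excess of t
    deficit = sum((ct - cs).values())  # characters t has in excess of s
    return max(surplus, deficit)


def may_be_similar(s: str, t: str, tau: int) -> bool:
    """Cheap pre-filter: False means the pair cannot be within distance tau."""
    return frequency_lower_bound(s, t) <= tau


def edit_distance(s: str, t: str) -> int:
    """Standard dynamic-programming Levenshtein distance, used for verification."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    pairs = [("kitten", "sitting"), ("abc", "cba")]
    for s, t in pairs:
        print(s, t, frequency_lower_bound(s, t), edit_distance(s, t))
```

The filter is a necessary condition only: a pair such as "abc" and "cba" has identical frequency vectors yet a nonzero edit distance, so pairs that pass still need verification with the exact distance.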
Year: 2016
DOI: 10.1007/978-3-319-39958-4_40
Venue: WEB-AGE INFORMATION MANAGEMENT, PT II
Field: Edit distance, Inverted index, Data mining, Pattern recognition, Computer science, Filter (signal processing), Artificial intelligence, String metric
DocType: Conference
Volume: 9659
ISSN: 0302-9743
Citations: 0
PageRank: 0.34
References: 17
Authors: 3
Name          Order  Citations  PageRank
林子雨        1      129        10.80
Daowen Luo    2      0          0.34
Yongxuan Lai  3      112        20.24