A Dictionary Matching Algorithm Fast on the Average for Terms of Varying Length - Citegraph

Paper Info

Title
A Dictionary Matching Algorithm Fast on the Average for Terms of Varying Length

Abstract
We examine the exact dictionary matching problem with dynamic text and static terms and propose a simple but efficient algorithm with sublinear (in size of text) average performance for a wide range of practical problems. The algorithm is based on the Commentz-Walter-Horspool algorithm (CWH), presented by Baeza-Yates and Re'gnier [10]. Typically, our refinement will prune out more than 30% of characters scanned by CWH, when searching for all occurrences of tags, which are of varying lengths and members of a set of moderate size, in natural language text. This problem arises frequently in practice in scanning text downloaded from the internet, and accounts for a major portion of the preprocessing time associated with indexing such text for later retrieval. Our approach, which we refer to as layering, keeps track of an upper bound on the maximal length of potential term prefixes ending at each given position in the text. This information is then used to mask out some of the terms and filter out unnecessary character comparisons during the search. A practical implementation is described, which increases the size of the existing data structures as well as the preprocessing cost only by a factor of the size of the longest term in the set.

Year	DOI	Venue
1998	10.1007/BFb0030779	CPM
Keywords	Field	DocType
data structure,natural language,indexation,upper bound	String searching algorithm,Data structure,Combinatorics,Search algorithm,Upper and lower bounds,Computer science,Algorithm,Search engine indexing,Preprocessor,Pattern matching,Blossom algorithm	Conference
Volume	ISSN	ISBN
1448	0302-9743	3-540-64739-2
Citations	PageRank	References
0	0.34	24
Authors
2

Authors (2 rows)

Cited by (0 rows)

References (24 rows)

Name	Order	Citations	PageRank
Michal Ziv-Ukelson	1	514	36.49
Aaron Kershenbaum	2	818	96.80

1