Title
A Dictionary Matching Algorithm Fast on the Average for Terms of Varying Length
Abstract
We examine the exact dictionary matching problem with dynamic text and static terms and propose a simple but efficient algorithm with sublinear (in size of text) average performance for a wide range of practical problems. The algorithm is based on the Commentz-Walter-Horspool algorithm (CWH), presented by Baeza-Yates and Re'gnier [10]. Typically, our refinement will prune out more than 30% of characters scanned by CWH, when searching for all occurrences of tags, which are of varying lengths and members of a set of moderate size, in natural language text. This problem arises frequently in practice in scanning text downloaded from the internet, and accounts for a major portion of the preprocessing time associated with indexing such text for later retrieval. Our approach, which we refer to as layering, keeps track of an upper bound on the maximal length of potential term prefixes ending at each given position in the text. This information is then used to mask out some of the terms and filter out unnecessary character comparisons during the search. A practical implementation is described, which increases the size of the existing data structures as well as the preprocessing cost only by a factor of the size of the longest term in the set.
Year
DOI
Venue
1998
10.1007/BFb0030779
CPM
Keywords
Field
DocType
data structure,natural language,indexation,upper bound
String searching algorithm,Data structure,Combinatorics,Search algorithm,Upper and lower bounds,Computer science,Algorithm,Search engine indexing,Preprocessor,Pattern matching,Blossom algorithm
Conference
Volume
ISSN
ISBN
1448
0302-9743
3-540-64739-2
Citations 
PageRank 
References 
0
0.34
24
Authors
2
Name
Order
Citations
PageRank
Michal Ziv-Ukelson151436.49
Aaron Kershenbaum281896.80