Abstract | ||
---|---|---|
We examine the exact dictionary matching problem with dynamic text and static terms and propose a simple but efficient algorithm with sublinear (in size of text) average performance for a wide range of practical problems. The algorithm is based on the Commentz-Walter-Horspool algorithm (CWH), presented by Baeza-Yates and Re'gnier [10]. Typically, our refinement will prune out more than 30% of characters scanned by CWH, when searching for all occurrences of tags, which are of varying lengths and members of a set of moderate size, in natural language text. This problem arises frequently in practice in scanning text downloaded from the internet, and accounts for a major portion of the preprocessing time associated with indexing such text for later retrieval. Our approach, which we refer to as layering, keeps track of an upper bound on the maximal length of potential term prefixes ending at each given position in the text. This information is then used to mask out some of the terms and filter out unnecessary character comparisons during the search. A practical implementation is described, which increases the size of the existing data structures as well as the preprocessing cost only by a factor of the size of the longest term in the set. |
Year | DOI | Venue |
---|---|---|
1998 | 10.1007/BFb0030779 | CPM |
Keywords | Field | DocType |
data structure,natural language,indexation,upper bound | String searching algorithm,Data structure,Combinatorics,Search algorithm,Upper and lower bounds,Computer science,Algorithm,Search engine indexing,Preprocessor,Pattern matching,Blossom algorithm | Conference |
Volume | ISSN | ISBN |
1448 | 0302-9743 | 3-540-64739-2 |
Citations | PageRank | References |
0 | 0.34 | 24 |
Authors | ||
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Michal Ziv-Ukelson | 1 | 514 | 36.49 |
Aaron Kershenbaum | 2 | 818 | 96.80 |