Statistical substring reduction in linear time - Citegraph

Paper Info

Title
Statistical substring reduction in linear time

Abstract
We study the problem of efficiently removing equal frequency n-gram substrings from an n-gram set, formally called Statistical Substring Reduction (SSR). SSR is a useful operation in corpus based multi-word unit research and new word identification task of oriental language processing. We present a new SSR algorithm that has linear time (O(n)) complexity, and prove its equivalence with the traditional O(n2) algorithm. In particular, using experimental results from several corpora with different sizes, we show that it is possible to achieve performance close to that theoretically predicated for this task. Even in a small corpus the new algorithm is several orders of magnitude faster than the O(n2) one. These results show that our algorithm is reliable and efficient, and is therefore an appropriate choice for large scale corpus processing.

Year	DOI	Venue
2004	null	Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)
Field	DocType	Volume
Substring,Computer science,Radix sort,Algorithm,Equivalence (measure theory),Time complexity,Statistical unit	Conference	3248
Issue	ISSN	ISBN
null	null	3-540-24475-1
Citations	PageRank	References
17	1.63	4
Authors
2

Authors (2 rows)

Cited by (17 rows)

References (4 rows)

Name	Order	Citations	PageRank
Xueqiang Zhang	1	26	3.87
Junfeng Hu	2	18	4.36

1