Gazetteer Compression Technique Based on Substructure Recognition - Citegraph

Paper Info

Title
Gazetteer Compression Technique Based on Substructure Recognition

Abstract
Finite-state automata are state-of-the-art representation of dictionaries in natural language processing. We present a novel compression technique that is especially useful for gazetteers - a particular sort of dictionaries. We replace common substructures in the automaton by unique copies. To find them, we treat a transition vector as a string, and we apply a Ziv-Lempel-style text compression technique that uses suffix tree to find repetitions in lineaqr time. Empirical evaluation on real-world data reveals space savings of up to 18,6%, which makes this method highly attractive.

Year	DOI	Venue
2006	10.1007/3-540-33521-8_9	INTELLIGENT INFORMATION PROCESSING AND WEB MINING, PROCEEDINGS
Field	DocType	ISSN
Compression (physics),Text compression,Computer science,Automaton,sort,Theoretical computer science,Artificial intelligence,Suffix tree,Machine learning,Substructure	Conference	1615-3871
Citations	PageRank	References
3	0.41	6
Authors
2

Authors (2 rows)

Cited by (3 rows)

References (6 rows)

Name	Order	Citations	PageRank
Jan Daciuk	1	156	14.83
Jakub Piskorski	2	435	50.04

1