Title
Gazetteer Compression Technique Based on Substructure Recognition
Abstract
Finite-state automata are state-of-the-art representation of dictionaries in natural language processing. We present a novel compression technique that is especially useful for gazetteers - a particular sort of dictionaries. We replace common substructures in the automaton by unique copies. To find them, we treat a transition vector as a string, and we apply a Ziv-Lempel-style text compression technique that uses suffix tree to find repetitions in lineaqr time. Empirical evaluation on real-world data reveals space savings of up to 18,6%, which makes this method highly attractive.
Year
DOI
Venue
2006
10.1007/3-540-33521-8_9
INTELLIGENT INFORMATION PROCESSING AND WEB MINING, PROCEEDINGS
Field
DocType
ISSN
Compression (physics),Text compression,Computer science,Automaton,sort,Theoretical computer science,Artificial intelligence,Suffix tree,Machine learning,Substructure
Conference
1615-3871
Citations 
PageRank 
References 
3
0.41
6
Authors
2
Name
Order
Citations
PageRank
Jan Daciuk115614.83
Jakub Piskorski243550.04