Abstract | ||
---|---|---|
Finite-state automata are state-of-the-art representation of dictionaries in natural language processing. We present a novel compression technique that is especially useful for gazetteers - a particular sort of dictionaries. We replace common substructures in the automaton by unique copies. To find them, we treat a transition vector as a string, and we apply a Ziv-Lempel-style text compression technique that uses suffix tree to find repetitions in lineaqr time. Empirical evaluation on real-world data reveals space savings of up to 18,6%, which makes this method highly attractive. |
Year | DOI | Venue |
---|---|---|
2006 | 10.1007/3-540-33521-8_9 | INTELLIGENT INFORMATION PROCESSING AND WEB MINING, PROCEEDINGS |
Field | DocType | ISSN |
Compression (physics),Text compression,Computer science,Automaton,sort,Theoretical computer science,Artificial intelligence,Suffix tree,Machine learning,Substructure | Conference | 1615-3871 |
Citations | PageRank | References |
3 | 0.41 | 6 |
Authors | ||
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Jan Daciuk | 1 | 156 | 14.83 |
Jakub Piskorski | 2 | 435 | 50.04 |