Automatically constructing a normalisation dictionary for microblogs - Citegraph

Paper Info

Title
Automatically constructing a normalisation dictionary for microblogs

Abstract
Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing.
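
The pipeline the abstract describes (candidate pairs, ranking by string similarity, then plain substitution) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the candidate pairs are hand-written stand-ins for the context-derived pairs, `difflib.SequenceMatcher` is swapped in as a generic string-similarity measure, and the names `build_dictionary`, `normalise` and the `threshold` value are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical (variant, candidate normalisation) pairs. In the paper these
# come from context information; here they are written by hand.
CANDIDATE_PAIRS = [
    ("tmrw", "tomorrow"),
    ("tmrw", "today"),
    ("2morrow", "tomorrow"),
]


def similarity(a, b):
    """Generic string similarity in [0, 1]; a stand-in, not the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def build_dictionary(pairs, threshold=0.5):
    """For each variant, keep the most similar candidate scoring at least threshold."""
    best = {}
    for variant, candidate in pairs:
        score = similarity(variant, candidate)
        if score >= threshold and score > best.get(variant, ("", 0.0))[1]:
            best[variant] = (candidate, score)
    return {variant: candidate for variant, (candidate, _) in best.items()}


# Split into word tokens, punctuation runs and whitespace runs, so that
# joining the tokens back together preserves the original spacing.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")


def normalise(text, lexicon):
    """Replace each token found in the lexicon; leave all other tokens untouched."""
    return "".join(lexicon.get(tok.lower(), tok) for tok in TOKEN_RE.findall(text))


lexicon = build_dictionary(CANDIDATE_PAIRS)
print(normalise("back tmrw!", lexicon))
```

Once the dictionary exists, normalisation is a single lookup per token, which is consistent with the abstract's claim that the approach is fast and lightweight.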

Year	2012
Venue	EMNLP-CoNLL
Keywords	microblog normalisation method, context information, simple string substitution, lexical variant, normalisation pair, normalisation dictionary, dictionary-based approach, string similarity, lexical normalisation, known word, highly-ranked pair
Field	Social media, Computer science, Word error rate, Microblogging, Speech recognition, Natural language processing, Artificial intelligence, String metric, Machine learning
DocType	Conference
Volume	D12-1
Citations	73
PageRank	3.13
References	27
Authors	3

Authors (3 rows)

Name	Order	Citations	PageRank
Bo Han	1	593	29.85
Paul Cook	2	345	14.35
Timothy Baldwin	3	426	20.64
