Title
Lexical normalisation of short text messages: makn sens a #twitter
Abstract
Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn't require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.
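The pipeline described in the abstract (detect an out-of-vocabulary word, generate candidates by character and phonetic similarity, then pick the best one) can be illustrated with a rough sketch. This is not the authors' implementation: the toy lexicon, the crude `phonetic_key` function, the distance thresholds, and the use of `difflib.SequenceMatcher` for ranking are all assumptions made for illustration. The classifier-based detection step and the context features mentioned in the abstract are omitted here.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon; a real system would use a large dictionary.
LEXICON = {"making", "sense", "of", "tomorrow", "together",
           "see", "you", "to", "too"}


def phonetic_key(word):
    """Crude phonetic key (assumption): keep the first letter, drop later
    vowels, and collapse repeated consonants."""
    word = word.lower()
    key = word[:1]
    for ch in word[1:]:
        if ch in "aeiouy":
            continue
        if ch != key[-1]:
            key += ch
    return key


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def candidates(token, lexicon=LEXICON, max_char=2, max_phon=1):
    """Lexicon words close to the token in spelling or in phonetic key."""
    token = token.lower()
    key = phonetic_key(token)
    return sorted(w for w in lexicon
                  if edit_distance(token, w) <= max_char
                  or edit_distance(key, phonetic_key(w)) <= max_phon)


def normalise(token, lexicon=LEXICON):
    """Return the token unchanged if it is in the lexicon or has no
    candidates; otherwise return the most string-similar candidate."""
    if token.lower() in lexicon:
        return token
    cands = candidates(token, lexicon)
    if not cands:
        return token
    return max(cands,
               key=lambda w: SequenceMatcher(None, token.lower(), w).ratio())


print([normalise(t) for t in ["makn", "sens", "tmrw", "you"]])
```

In this sketch, ranking uses string similarity only. The method in the abstract also exploits context, which would be needed to handle a token such as "a" standing for "of" in the paper's title.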
Year: 2011
Venue: ACL
Keywords: sms corpus, large volume, probable correction candidate, word similarity, novel dataset, correction candidate, morphophonemic similarity, makn sens, ill-formed word, short text message, out-of-vocabulary word, lexical normalisation
Field: Information retrieval, Computer science, Noisy text, Morphophonology, Natural language processing, Artificial intelligence, Classifier (linguistics), Text normalization
DocType: Conference
Volume: P11-1
Citations: 190
PageRank: 9.07
References: 20
Authors: 2
Name
Order
Citations
PageRank
Bo Han 1 59329.85
Timothy Baldwin 2 45222.18