Abstract | ||
---|---|---|
Automatically processing microblogging messages can be problematic, even for very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(ò_ó)", "(=^-^=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. ".... ..", "!??!!!?", ",,,"). Additionally, spelling errors are abundant (e.g. "I;m"), and a single short message frequently contains more than one language, each with different tokenization requirements. To be efficient in such an environment, manually developed rule-based tokenizers have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages that addresses complex cases successfully and is relatively simple to set up and maintain. To that end, we created a corpus of 2500 manually tokenized Twitter messages (a task that is simple for human annotators) and trained an SVM classifier to separate tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically to deal with typical problematic situations. Results show that the classification-based approach achieves F-measures of 96%, well above the 85% obtained by the baseline rule-based tokenizer. Subsequent analysis also allowed us to identify typical tokenization errors, which we show can be partially solved by adding descriptive examples to the training corpus and re-training the classifier. |
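
The idea of treating tokenization as a classification problem at discontinuity characters can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the definition of a decision point (a gap between two non-space characters where at least one is non-alphanumeric), the character-window features, and all function names (`is_decision_point`, `features`, `label_decision_points`, `tokenize`) are hypothetical. It also assumes annotators only insert spaces and never alter characters. Any binary classifier, such as the SVM mentioned in the abstract, could be trained on the resulting (features, label) pairs.

```python
from typing import Callable, Dict, List, Tuple


def is_decision_point(raw: str, i: int) -> bool:
    """True if the gap between raw[i-1] and raw[i] needs a split/join decision.

    Assumption: a gap qualifies when neither neighbor is whitespace and
    at least one of them is non-alphanumeric.
    """
    left, right = raw[i - 1], raw[i]
    if left.isspace() or right.isspace():
        return False
    return not (left.isalnum() and right.isalnum())


def char_class(ch: str) -> str:
    """Coarse character class used as a feature value."""
    if not ch:
        return "pad"
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "punct"


def features(raw: str, i: int, window: int = 2) -> Dict[str, str]:
    """Describe the gap before raw[i] by the characters around it."""
    feats = {}
    for off in range(-window, window):
        k = i + off
        ch = raw[k] if 0 <= k < len(raw) else ""
        feats[f"char[{off}]"] = ch
        feats[f"class[{off}]"] = char_class(ch)
    return feats


def label_decision_points(raw: str, tokenized: str) -> List[Tuple[int, bool]]:
    """Align a raw message with its manually tokenized form.

    Returns (index, split) pairs, where split is True when the annotator
    inserted a space before raw[index].
    """
    examples = []
    j = 0  # cursor into the tokenized string
    for i, ch in enumerate(raw):
        inserted = tokenized[j] == " " and ch != " "
        if inserted:
            j += 1  # skip the space added by the annotator
        if tokenized[j] != ch:
            raise ValueError("tokenized text does not align with raw text")
        if i > 0 and is_decision_point(raw, i):
            examples.append((i, inserted))
        j += 1
    return examples


def tokenize(raw: str, should_split: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Tokenize by asking a classifier-like callable at each decision point."""
    out = []
    for i, ch in enumerate(raw):
        if i > 0 and is_decision_point(raw, i) and should_split(features(raw, i)):
            out.append(" ")
        out.append(ch)
    return "".join(out).split()
```

In this sketch, the hand-tokenized corpus supplies the labels through `label_decision_points`, and a trained model would take the place of the `should_split` callable passed to `tokenize`. Adding new annotated messages and re-training then corresponds to the error-correction step described in the abstract.
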
Year | DOI | Venue |
---|---|---|
2010 | 10.1145/1871840.1871853 | AND |
Keywords | Field | DocType |
svm classifier,baseline rule-based tokenizer,twitter message,different tokenization requirement,classification-based approach,typical tokenization error,baseline rule-based system,non-standard language,manually-developed rule-based tokenizer system,text classification approach,non-standard letter casing,micro-blogging message,tokenization,rule based system,user generated content,micro blogging,corpus,rule based | Computer science,Artificial intelligence,Natural language processing,Lexical analysis,Classifier (linguistics),User-generated content,Tokenization (data security),Social media,Information retrieval,Microblogging,Topic model,Punctuation,Semantic role labeling | Conference |
Citations | PageRank | References |
22 | 1.36 | 10 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Gustavo Laboreiro | 1 | 58 | 4.51 |
Luís Sarmento | 2 | 377 | 31.16 |
Jorge Teixeira | 3 | 70 | 8.24 |
Eugénio Oliveira | 4 | 974 | 111.00 |