Tokenizing micro-blogging messages using a text classification approach - Citegraph

Paper Info

Title
Tokenizing micro-blogging messages using a text classification approach

Abstract
The automatic processing of microblogging messages may be problematic, even in the case of very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(ò_ó)", "(=^-^=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. ".... ..", "!??!!!?", ",,,"). Additionally, spelling errors are abundant (e.g. "I;m"), and a single short message frequently mixes more than one language, each with different tokenization requirements. To be effective in such an environment, manually developed rule-based tokenizers have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which addresses complex cases successfully and is relatively simple to set up and maintain. To that end, we created a corpus of 2500 manually tokenized Twitter messages (a task that is simple for human annotators) and trained an SVM classifier to separate tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically to deal with typical problematic situations. Results show that the classification-based approach achieves F-measures of 96%, well above the performance of the baseline rule-based tokenizer (85%). Subsequent analysis also allowed us to identify typical tokenization errors, which we show can be partially solved by adding further descriptive examples to the training corpus and re-training the classifier.
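The idea of deciding, at each discontinuity character, whether to split a token can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify the feature set or the exact definition of a discontinuity character, so `candidate_points`, `window_features`, `tokenize`, and the hand-written `toy_rule` (standing in for the trained SVM) are all hypothetical names and assumptions.

```python
from typing import Callable, Dict, List


def candidate_points(text: str) -> List[int]:
    """Return indices i where a split between text[i-1] and text[i] is possible.

    Assumption: a boundary is a candidate when at least one neighbour is a
    non-alphanumeric, non-space character. Whitespace always separates tokens,
    so it needs no decision.
    """
    points = []
    for i in range(1, len(text)):
        left, right = text[i - 1], text[i]
        if left.isspace() or right.isspace():
            continue
        if not left.isalnum() or not right.isalnum():
            points.append(i)
    return points


def window_features(text: str, i: int, size: int = 3) -> Dict[str, str]:
    """Describe boundary i by the `size` characters on each side (padded)."""
    pad = "\0" * size
    padded = pad + text + pad
    j = i + size  # position of text[i] inside the padded string
    feats = {}
    for k in range(size):
        feats[f"L{k + 1}"] = padded[j - 1 - k]  # k-th character to the left
        feats[f"R{k + 1}"] = padded[j + k]      # k-th character to the right
    return feats


def tokenize(text: str, should_split: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Split on whitespace and on every candidate point the classifier accepts."""
    cuts = {i for i in candidate_points(text)
            if should_split(window_features(text, i))}
    tokens, current = [], ""
    for i, ch in enumerate(text):
        if ch.isspace():
            if current:
                tokens.append(current)
                current = ""
            continue
        if i in cuts and current:
            tokens.append(current)
            current = ""
        current += ch
    if current:
        tokens.append(current)
    return tokens


def toy_rule(feats: Dict[str, str]) -> bool:
    """Hand-written stand-in for a trained classifier (illustration only)."""
    left, right = feats["L1"], feats["R1"]
    if left == "'" or right == "'":
        return False  # keep contractions such as "I'm" together
    return left.isalnum() != right.isalnum()  # split where word meets punctuation


print(tokenize("gr8 day!!! I'm here", toy_rule))
```

In a setup closer to the one the abstract describes, `toy_rule` would be replaced by a classifier trained on feature dictionaries extracted from the manually tokenized corpus, with the annotators' token boundaries supplying the split/no-split labels.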

Year	DOI	Venue
2010	10.1145/1871840.1871853	AND
Keywords	Field	DocType
svm classifier,baseline rule-based tokenizer,twitter message,different tokenization requirement,classification-based approach,typical tokenization error,baseline rule-based system,non-standard language,manually-developed rule-based tokenizer system,text classification approach,non-standard letter casing,micro-blogging message,tokenization,rule based system,user generated content,micro blogging,corpus,rule based	Computer science,Artificial intelligence,Natural language processing,Lexical analysis,Classifier (linguistics),User-generated content,Tokenization (data security),Social media,Information retrieval,Microblogging,Topic model,Punctuation,Semantic role labeling	Conference
Citations	PageRank	References
22	1.36	10
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Gustavo Laboreiro	1	58	4.51
Luís Sarmento	2	377	31.16
Jorge Teixeira	3	70	8.24
Eugénio Oliveira	4	974	111.00
