Title
Multimodular Text Normalization of Dutch User-Generated Content
Abstract
As social media constitutes a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the nonstandard language used on social media poses problems for natural language processing (NLP) tools, as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multimodular approach to account for the diversity of normalization issues encountered in user-generated content (UGC). We consider three different types of UGC written in Dutch (SNS, SMS, and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer, and named-entity recognizer before and after normalization.
Year
2016
DOI
10.1145/2850422
Venue
ACM TIST
Keywords
Social media, text normalization, user-generated content
Field
User-generated content, Lemmatisation, Normalization (statistics), Social media, Standard language, Computer science, Machine translation, Natural language processing, Artificial intelligence, Text normalization, Machine learning
DocType
Journal
Volume
7
Issue
4
ISSN
2157-6904
Citations
4
PageRank
0.38
References
43
Authors
7
Name, Order, Citations, PageRank
Sarah Schulz, 1, 31, 3.72
Guy Pauw, 2, 75, 12.47
Orphée De Clercq, 3, 117, 9.61
Bart Desmet, 4, 75, 7.92
Véronique Hoste, 5, 319, 35.92
Walter Daelemans, 6, 2019, 269.73
Lieve Macken, 7, 51, 10.81