Title
A customizable pipeline for social media text normalization.
Abstract
Social networks are persistently generating text-based data that encapsulate vast amounts of knowledge. However, the presence of non-standard terms and misspellings in texts originating from social networks poses a crucial challenge for natural language processing and machine learning systems that attempt to mine this knowledge. To address this problem, we propose a sequential, modular, and hybrid pipeline for social media text normalization. In the first phase, text preprocessing techniques and social media-specific vocabularies gathered from publicly available sources are used to transform, with high precision, out-of-vocabulary terms into in-vocabulary terms. A sequential language model, generated using the partially normalized texts from the first phase, is then utilized to normalize short, high-frequency, ambiguous terms. A supervised learning module is employed to normalize terms based on a manually annotated training corpus. Finally, a tunable, distributed language model-based backoff module at the end of the pipeline enables further customization of the system to specific domains of text. We performed intrinsic evaluations of the system on a publicly available domain-independent dataset from Twitter, and our system obtained an F-score of 0.836, outperforming other benchmark systems for the task. We further performed brief, task-oriented evaluations to illustrate the customizability of the system to domain-specific tasks and the effects of normalization on downstream applications. The modular design enables easy customization of the system to distinct types of domain-specific social media text, in addition to its off-the-shelf application to generic social media text.
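The sequential, modular design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon entries, the vocabulary, the function names, and the repeated-character rule used as a backoff are all hypothetical stand-ins, and the language model and supervised learning stages are omitted. It only shows how independent stages might be chained and how a final backoff stage could be switched on or off.

```python
import re
from typing import Callable, Dict, List, Set

Tokens = List[str]
Stage = Callable[[Tokens], Tokens]

# Hypothetical toy lexicon and vocabulary (assumptions for illustration only).
LEXICON: Dict[str, str] = {"u": "you", "gr8": "great", "2moro": "tomorrow"}
VOCABULARY: Set[str] = {"see", "you", "tomorrow", "its", "so", "great"}


def preprocess(tokens: Tokens) -> Tokens:
    """Stage 1a: simple preprocessing (lowercasing only in this sketch)."""
    return [t.lower() for t in tokens]


def lexicon_lookup(tokens: Tokens) -> Tokens:
    """Stage 1b: replace tokens found in the lexicon, leave others unchanged."""
    return [LEXICON.get(t, t) for t in tokens]


def make_backoff(vocabulary: Set[str], enabled: bool = True) -> Stage:
    """Final stage: a stand-in backoff that collapses runs of 3+ identical
    characters, accepting the result only if it is in the vocabulary."""
    def backoff(tokens: Tokens) -> Tokens:
        if not enabled:
            return tokens
        out = []
        for t in tokens:
            if t in vocabulary:
                out.append(t)
                continue
            candidate = re.sub(r"(.)\1{2,}", r"\1", t)
            out.append(candidate if candidate in vocabulary else t)
        return out
    return backoff


def run_pipeline(text: str, stages: List[Stage]) -> str:
    """Apply each stage in order to whitespace-separated tokens."""
    tokens = text.split()
    for stage in stages:
        tokens = stage(tokens)
    return " ".join(tokens)


DEFAULT_STAGES: List[Stage] = [preprocess, lexicon_lookup, make_backoff(VOCABULARY)]

if __name__ == "__main__":
    print(run_pipeline("See u 2moro its sooo gr8", DEFAULT_STAGES))
```

Because each stage is just a function from tokens to tokens, a domain-specific variant could swap in a different lexicon or disable the backoff without touching the other stages.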
Year: 2017
DOI: 10.1007/s13278-017-0464-z
Venue: Social Netw. Analys. Mining
Keywords: Social media text normalization, Lexical normalization, Social media data preparation, Social network mining, Text mining, Natural language processing
Field: Normalization (statistics), Social media, Information retrieval, Computer science, Computational linguistics, Supervised learning, Natural language processing, Artificial intelligence, Modular design, Text normalization, Language model, Personalization
DocType: Journal
Volume: 7
Issue: 1
ISSN: 1869-5450
Citations: 1
PageRank: 0.37
References: 22
Authors: 1
Name: Abeed Sarker
Order: 1
Citations: 3
PageRank: 1.56