Abstract | ||
---|---|---|
Social networks are persistently generating text-based data that encapsulate vast amounts of knowledge. However, the presence of non-standard terms and misspellings in texts originating from social networks poses a crucial challenge for natural language processing and machine learning systems that attempt to mine this knowledge. To address this problem, we propose a sequential, modular, and hybrid pipeline for social media text normalization. In the first phase, text preprocessing techniques and social media-specific vocabularies gathered from publicly available sources are used to transform, with high precision, out-of-vocabulary terms into in-vocabulary terms. A sequential language model, generated using the partially normalized texts from the first phase, is then utilized to normalize short, high-frequency, ambiguous terms. A supervised learning module is employed to normalize terms based on a manually annotated training corpus. Finally, a tunable, distributed language model-based backoff module at the end of the pipeline enables further customization of the system to specific domains of text. We performed intrinsic evaluations of the system on a publicly available domain-independent dataset from Twitter, and our system obtained an F-score of 0.836, outperforming other benchmark systems for the task. We further performed brief, task-oriented evaluations of the system to illustrate the customizability of the system to domain-specific tasks and the effects of normalization on downstream applications. The modular design enables the easy customization of the system to distinct types of domain-specific social media text, in addition to its off-the-shelf application to generic social media text. |
Year | DOI | Venue |
---|---|---|
2017 | 10.1007/s13278-017-0464-z | Social Netw. Analys. Mining |
Keywords | Field | DocType |
Social media text normalization, Lexical normalization, Social media data preparation, Social network mining, Text mining, Natural language processing | Normalization (statistics), Social media, Information retrieval, Computer science, Computational linguistics, Supervised learning, Natural language processing, Artificial intelligence, Modular design, Text normalization, Language model, Personalization | Journal |
Volume | Issue | ISSN |
7 | 1 | 1869-5450 |
Citations | PageRank | References |
1 | 0.37 | 22 |
Authors | ||
1 |
Name | Order | Citations | PageRank |
---|---|---|---|
Abeed Sarker | 1 | 3 | 1.56 |
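
The abstract describes a sequential, modular pipeline in which each stage receives the partially normalized output of the previous one. The following is a minimal Python sketch of that general shape only, not the authors' implementation. Every name in it (`LEXICON`, `VOCAB`, `lexicon_stage`, `squeeze_stage`, `normalize`) is hypothetical, and the toy word lists are invented for illustration. The language-model and supervised stages are omitted, and the final stage substitutes a simple repeated-letter heuristic for the paper's language-model-based backoff.

```python
import re
from typing import Callable, Dict, List, Set

# A stage maps a token list to a (partially) normalized token list.
Stage = Callable[[List[str]], List[str]]

# Hypothetical toy lexicon of non-standard -> standard forms (illustrative only).
LEXICON: Dict[str, str] = {"u": "you", "gr8": "great", "pls": "please"}

# Hypothetical in-vocabulary word list (illustrative only).
VOCAB: Set[str] = {"you", "are", "great", "please", "so", "cool"}


def tokenize(text: str) -> List[str]:
    """Lowercase and split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def lexicon_stage(tokens: List[str]) -> List[str]:
    """First stage: direct dictionary lookup of known non-standard terms."""
    return [LEXICON.get(tok, tok) for tok in tokens]


def squeeze_stage(tokens: List[str]) -> List[str]:
    """Stand-in backoff stage (an assumption, not the paper's method):
    collapse elongated letters, accepting a candidate only if it is in VOCAB."""
    out = []
    for tok in tokens:
        if tok in VOCAB:
            out.append(tok)
            continue
        two = re.sub(r"(.)\1{2,}", r"\1\1", tok)  # runs of 3+ letters -> 2
        one = re.sub(r"(.)\1+", r"\1", tok)       # runs of 2+ letters -> 1
        out.append(two if two in VOCAB else one if one in VOCAB else tok)
    return out


def normalize(text: str, stages: List[Stage]) -> List[str]:
    """Run the stages in order; each one sees the previous stage's output."""
    tokens = tokenize(text)
    for stage in stages:
        tokens = stage(tokens)
    return tokens


if __name__ == "__main__":
    print(normalize("u are sooo coool pls", [lexicon_stage, squeeze_stage]))
```

Because each stage shares the same signature, a domain-specific stage could be added to or removed from the list passed to `normalize` without touching the others, which is the kind of customization the abstract attributes to the modular design.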