Language Identification for Social Media: Short Messages and Transliteration

Paper Info

Title
Language Identification for Social Media: Short Messages and Transliteration

Abstract
Conversations on social media and microblogging websites such as Twitter typically consist of short and noisy texts. Due to the presence of slang, misspellings, and special elements such as hashtags, user mentions, and URLs, such texts present a challenging case for the task of language identification. Furthermore, the extensive use of transliteration for languages such as Arabic and Russian, which do not use the Latin script, raises yet another problem. This work studies the performance of language identification algorithms applied to tweets, i.e. short messages on Twitter. It uses a previously trained general-purpose language identification model to semi-automatically label a large corpus of tweets in order to train a tweet-specific language identification model. It gives special attention to text written in transliterated Arabic and Russian.
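
The labeling step described in the abstract can be illustrated with a minimal sketch. This record does not say which models or features the authors used, so everything here is an assumption made for illustration: the character-trigram naive Bayes classifier (`NgramNB`), the `normalize` and `self_label` helpers, and the 0.9 confidence threshold are all hypothetical. The idea shown is only the general one: strip tweet-specific tokens, let a general-purpose model label raw tweets, and keep the confident labels as training data for a tweet-specific model.

```python
import math
import re
from collections import Counter, defaultdict

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def normalize(tweet):
    """Strip URLs and user mentions, keep hashtag words, lowercase the rest."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")
    return " ".join(text.lower().split())


def char_ngrams(text, n=3):
    """Character n-grams over the text padded with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNB:
    """Character-trigram naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(normalize(text))
            self.counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        """Return (best_label, posterior), assuming a uniform prior over labels."""
        grams = char_ngrams(normalize(text))
        scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values()) + len(self.vocab)
            scores[label] = sum(
                math.log((counter[g] + 1) / total) for g in grams
            )
        best = max(scores, key=scores.get)
        top = scores[best]
        # Softmax over log-scores; the best label's share is 1 / z.
        z = sum(math.exp(s - top) for s in scores.values())
        return best, 1.0 / z


def self_label(general_model, tweets, threshold=0.9):
    """Keep only tweets the general model labels with high confidence.

    The threshold is an assumed value, not one taken from the paper.
    """
    labeled = []
    for tweet in tweets:
        label, confidence = general_model.predict(tweet)
        if confidence >= threshold:
            labeled.append((tweet, label))
    return labeled
```

Under these assumptions, a tweet-specific model would be obtained by fitting a second `NgramNB` on the pairs returned by `self_label`. Transliterated text (for example Arabic or Russian written in Latin characters) would need labeled examples of its own, since a model trained on native-script text has no overlapping character n-grams to match against.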

Year
2016

DOI
10.1145/2872518.2890560

Venue
WWW '16: 25th International World Wide Web Conference, Montréal, Québec, Canada, April 2016

Field
Document classification, World Wide Web, Social media, Computer science, Microblogging, Latin script, Natural language processing, Artificial intelligence, Language identification, Slang, Transliteration, General-purpose language

DocType
Conference

ISBN
978-1-4503-4144-8

Citations
1

PageRank
0.41

References
6

Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Pedro Miguel Dias Cardoso	1	1	0.75
Anindya Roy	2	119	12.62