Title
Language Identification for Social Media: Short Messages and Transliteration
Abstract
Conversations on social media and microblogging websites such as Twitter typically consist of short, noisy texts. Because they contain slang, misspellings, and special elements such as hashtags, user mentions, and URLs, such texts are a challenging case for language identification. The extensive use of transliteration for languages that do not use Latin script, such as Arabic and Russian, raises a further problem. This work studies the performance of language identification algorithms applied to tweets, i.e. short messages on Twitter. It uses a previously trained general-purpose language identification model to semi-automatically label a large corpus of tweets, which is then used to train a tweet-specific language identification model. It gives special attention to text written in transliterated Arabic and Russian.
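The semi-automatic labelling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no details of the cleaning rules, the features, or the model, so the regular expressions, the `general_model` callable (assumed to return a `(language, confidence)` pair), and the `threshold` value are all hypothetical. Character n-grams are shown only as one common feature choice for short-text language identification.

```python
import re
from collections import Counter

# Hypothetical patterns for the tweet-specific elements named in the abstract.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def normalize_tweet(text):
    """Drop URLs and user mentions, keep the word of a hashtag, lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)  # "#NLP" -> "NLP"
    return " ".join(text.lower().split())


def char_ngrams(text, n=3):
    """Count character n-grams of the text padded with one space on each side."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def pseudo_label(tweets, general_model, threshold=0.9):
    """Keep only the tweets that the general-purpose model labels confidently.

    general_model is an assumed callable: str -> (language, confidence).
    The returned (text, language) pairs could serve as training data for a
    tweet-specific model.
    """
    labeled = []
    for tweet in tweets:
        clean = normalize_tweet(tweet)
        if not clean:
            continue  # nothing left after removing URLs and mentions
        lang, confidence = general_model(clean)
        if confidence >= threshold:
            labeled.append((clean, lang))
    return labeled
```

Filtering by confidence is one simple way to make the labelling "semi-automatic"; the low-confidence remainder could be reviewed by hand or discarded, a choice the abstract does not specify.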
Year
2016
DOI
10.1145/2872518.2890560
Venue
WWW '16: 25th International World Wide Web Conference, Montréal, Québec, Canada, April 2016
Field
Document classification, World Wide Web, Social media, Computer science, Microblogging, Latin script, Natural language processing, Artificial intelligence, Language identification, Slang, Transliteration, General-purpose language
DocType
Conference
ISBN
978-1-4503-4144-8
Citations
1
PageRank
0.41
References
6
Authors
2
Name, Order, Citations, PageRank
Pedro Miguel Dias Cardoso, 1, 1, 0.75
Anindya Roy, 2, 119, 12.62