Abstract | ||
---|---|---|
Conversations on social media and microblogging websites such as Twitter typically consist of short and noisy texts. Due to the presence of slang, misspellings, and special elements such as hashtags, user mentions and URLs, such texts present a challenging case for the task of language identification. Furthermore, the extensive use of transliteration for languages such as Arabic and Russian, which do not use the Latin script, raises yet another problem.
This work studies the performance of language identification algorithms applied to tweets, i.e. short messages on Twitter. It uses a previously trained general-purpose language identification model to semi-automatically label a large corpus of tweets, which is then used to train a tweet-specific language identification model. It gives special attention to text written in transliterated Arabic and Russian.
|
Year | DOI | Venue |
---|---|---|
2016 | 10.1145/2872518.2890560 | WWW '16: 25th International World Wide Web Conference, Montréal, Québec, Canada, April 2016 |
Field | DocType | ISBN |
Document classification, World Wide Web, Social media, Computer science, Microblogging, Latin script, Natural language processing, Artificial intelligence, Language identification, Slang, Transliteration, General-purpose language | Conference | 978-1-4503-4144-8 |
Citations | PageRank | References |
1 | 0.41 | 6 |
Authors | ||
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Pedro Miguel Dias Cardoso | 1 | 1 | 0.75 |
Anindya Roy | 2 | 119 | 12.62 |