Finding Romanized Arabic Dialect in Code-Mixed Tweets.

Paper Info

Title
Finding Romanized Arabic Dialect in Code-Mixed Tweets.

Abstract
Recent computational work on Arabic dialect identification has focused primarily on building and annotating corpora written in Arabic script. Arabic dialects, however, also appear in Roman script, especially in social media. This paper describes our recent work developing tweet corpora and a token-level classifier that identifies a romanized Arabic dialect and distinguishes it from French and English in tweets. We focus on Moroccan Darija, one of several spoken vernaculars in the family of Maghrebi Arabic dialects. Even given noisy, code-mixed tweets, the classifier achieved token-level recall of 93.2% on romanized Arabic dialect, 83.2% on English, and 90.1% on French. The classifier, now integrated into our tweet conversation annotation tool (Tratz et al. 2013), has semi-automated the construction of a romanized Arabic-dialect lexicon. Two datasets extracted from our tweet corpus collection will be made available in the LRE Map: a full list of Moroccan Darija surface token forms, and a table of lexical entries, with spelling variants, derived from this list.

Year	2014
Venue	LREC 2014 - Ninth International Conference on Language Resources and Evaluation
Keywords	language identification, code mixing, Arabic dialect, social media
Field	Romanization, Maghrebi Arabic, Conversation, Computer science, Lexicon, Latin script, Artificial intelligence, Spelling, Natural language processing, Classifier (linguistics), Arabic script
DocType	Conference
Citations	7
PageRank	0.59
References	2
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Clare R. Voss	1	344	29.51
Stephen Tratz	2	195	15.29
Jamal Laoudi	3	29	4.32
Douglas M. Briesch	4	30	2.95
