Abstract |
---|
This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers of convolutional and recurrent networks and a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. We also develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging from 90.2% to 95.4%. We obtained these results using a train/dev/test split of 70/10/20 on a data set of 350 tweets per dialect. |
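
The 70/10/20 split mentioned in the abstract can be checked with a small sketch. This is not the authors' code: the function names `split_counts` and `split_dataset`, the fixed seed, and the decision to shuffle before cutting are all assumptions made for illustration, since the abstract does not say how tweets were assigned to each portion.

```python
import random

def split_counts(n_items, ratios=(70, 10, 20)):
    """Return (train, dev, test) sizes for integer percentage ratios."""
    if sum(ratios) != 100:
        raise ValueError("ratios must sum to 100")
    n_train = n_items * ratios[0] // 100
    n_dev = n_items * ratios[1] // 100
    n_test = n_items - n_train - n_dev  # remainder goes to the test portion
    return n_train, n_dev, n_test

def split_dataset(items, ratios=(70, 10, 20), seed=0):
    """Shuffle a copy of items (assumed strategy) and cut it into three parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, n_dev, _ = split_counts(len(shuffled), ratios)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )

# Placeholder IDs standing in for one dialect's 350 annotated tweets.
tweets = [f"tweet_{i}" for i in range(350)]
train, dev, test = split_dataset(tweets)
print(len(train), len(dev), len(test))
```

For 350 tweets per dialect, these proportions correspond to 245 training, 35 development, and 70 test tweets.
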
Field | Value |
---|---|
Year | 2020 |
DOI | 10.1017/S1351324920000078 |
Venue | NATURAL LANGUAGE ENGINEERING |
Keywords | Part-of-speech tagging, Arabic, Dialects, Deep neural network, Brown clusters |
DocType | Journal |
Volume | 26 |
Issue | 6 |
ISSN | 1351-3249 |
Citations | 0 |
PageRank | 0.34 |
References | 0 |
Authors | 8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Kareem Darwish | 1 | 615 | 52.39 |
Mohammed Attia | 2 | 146 | 16.51 |
Hamdy Mubarak | 3 | 140 | 19.60 |
Younes Samih | 4 | 38 | 11.26 |
Ahmed Abdelali | 5 | 152 | 25.84 |
Lluís Màrquez | 6 | 0 | 0.34 |
Mohamed Eldesouki | 7 | 2 | 2.05 |
Laura Kallmeyer | 8 | 165 | 38.11 |