Abstract |
---|
It has become common, especially among urban youth, for people to use more than one language in their everyday conversations - a phenomenon referred to by linguists as "code-switching". With the rise in globalization and the widespread use of code-switching in multilingual societies, a great demand has been placed on Natural Language Processing (NLP) applications to handle such mixed data. In this paper, we present our efforts in language modeling for code-switched Arabic-English. Training a language model (LM) requires large amounts of text data in the respective language; the main challenge in language modeling for code-switched languages is the lack of available data. We propose an approach to artificially generate code-switched Arabic-English n-grams and thus improve the language model. This is done by expanding the relatively small available corpus and its corresponding n-grams using translation-based approaches. The final LM achieved relative improvements of 1.97% in perplexity and 16.36% in OOV rate. |
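The translation-based expansion the abstract describes can be illustrated, in a much simplified form, as word-level substitution from a bilingual dictionary: monolingual sentences are expanded into code-switched variants, which then supply additional n-grams for LM training. The function, sentence, and dictionary below are hypothetical illustrations under that assumption, not the authors' actual method:

```python
from itertools import combinations

def generate_code_switched(sentence, translations, max_switches=2):
    """Generate code-switched variants of a monolingual sentence by
    substituting up to max_switches words with their dictionary
    translations (a toy stand-in for translation-based corpus expansion)."""
    words = sentence.split()
    # positions whose word has an entry in the bilingual dictionary
    switchable = [i for i, w in enumerate(words) if w in translations]
    variants = []
    for k in range(1, max_switches + 1):
        for idxs in combinations(switchable, k):
            variant = list(words)
            for i in idxs:
                variant[i] = translations[variant[i]]
            variants.append(" ".join(variant))
    return variants

# Hypothetical example: a transliterated Arabic sentence with two
# dictionary entries yields three code-switched variants.
variants = generate_code_switched(
    "ana baheb el kora",
    {"baheb": "love", "kora": "football"},
)
```

Each variant can then be fed through the usual n-gram counting pipeline, so the augmented counts cover Arabic-English transition patterns absent from the small original corpus.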
Year | DOI | Venue |
---|---|---|
2018 | 10.1007/978-3-319-99010-1_20 | PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ADVANCED INTELLIGENT SYSTEMS AND INFORMATICS 2018 |
Keywords | DocType | Volume
---|---|---
Code-switching, Code-mixing, Arabic-English, Language modeling, Natural language generation | Conference | 845
ISSN | Citations | PageRank
---|---|---
2194-5357 | 0 | 0.34
References | Authors
---|---
0 | 3
Name | Order | Citations | PageRank |
---|---|---|---|
Injy Hamed | 1 | 1 | 3.08 |
Mohamed Elmahdy | 2 | 13 | 4.57 |
Slim Abdennadher | 3 | 394 | 60.95 |