Title
Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus.
Abstract
Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.
Year: 2018
DOI: 10.1007/s10579-019-09460-w
Venue: Language Resources and Evaluation
Keywords: Arabic, Corpus, Periodization, Text reuse, Historical linguistics
Field: Arabic NLP, Classical Arabic, Arabic, Periodization, Computer science, Natural language processing, Artificial intelligence, Language technology
DocType: Journal
Volume: abs/1809.03891
Issue: 4
ISSN: 1574-020X
Citations: 0
PageRank: 0.34
References: 8
Authors: 5
Name | Order | Citations | PageRank
Yonatan Belinkov | 1 | 227 | 25.75
Alexander Magidow | 2 | 0 | 0.68
Alberto Barrón-Cedeño | 3 | 346 | 29.35
Avi Shmidman | 4 | 0 | 2.03
Maxim Romanov | 5 | 0 | 0.34