Title
OSIAN: Open Source International Arabic News Corpus - Preparation and Integration into the CLARIN-infrastructure
Abstract
The World Wide Web has become a fundamental resource for building large text corpora. Broadcasting platforms such as news websites are rich sources of data on diverse topics and form a valuable foundation for research. The Arabic language is extensively used on the Web. Still, Arabic remains relatively under-resourced in terms of freely available annotated corpora. This paper presents the first version of the Open Source International Arabic News (OSIAN) corpus. The corpus data was collected from international Arabic news websites, all of which are freely available on the Web. The corpus consists of about 3.5 million articles comprising more than 37 million sentences and roughly 1 billion tokens. It is encoded in XML; each article is annotated with metadata, and each word is annotated with its lemma and part-of-speech. The corpus has been processed, archived and published in the CLARIN infrastructure. This publication includes descriptive metadata via OAI-PMH, direct access to the plain text material (available under the Creative Commons Attribution-Non-Commercial 4.0 International License, CC BY-NC 4.0), and integration into the WebLicht annotation platform and CLARIN's Federated Content Search (FCS).
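The encoding described in the abstract (XML articles carrying metadata, with each word annotated with a lemma and a part-of-speech tag) could be read roughly as in the following Python sketch. The element and attribute names (`article`, `s`, `w`, `lemma`, `pos`, `source`, `date`) and the sample content are assumptions made for illustration only; the actual OSIAN schema is not specified here and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag and attribute names are illustrative assumptions,
# not the published OSIAN schema.
SAMPLE = """<article id="a1" source="example-news" date="2018-01-01">
  <s>
    <w lemma="كتب" pos="VERB">كتب</w>
    <w lemma="ولد" pos="NOUN">الولد</w>
  </s>
</article>"""


def read_article(xml_text):
    """Return (metadata, sentences) for one article.

    metadata is a dict of the article element's attributes; each sentence
    is a list of (surface form, lemma, part-of-speech) triples.
    """
    root = ET.fromstring(xml_text)
    metadata = dict(root.attrib)
    sentences = [
        [(w.text, w.get("lemma"), w.get("pos")) for w in s.iter("w")]
        for s in root.iter("s")
    ]
    return metadata, sentences


metadata, sentences = read_article(SAMPLE)
token_count = sum(len(sentence) for sentence in sentences)
```

Keeping the article metadata separate from the token triples mirrors the two annotation levels the abstract mentions (per article and per word); counting sentences and tokens over many such articles would yield corpus statistics of the kind reported above.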
Year
2019
DOI
10.18653/v1/w19-4619
Venue
Fourth Arabic Natural Language Processing Workshop (WANLP 2019)
Field
Arabic, Computer science, Natural language processing, Artificial intelligence
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
4
Name                Order  Citations  PageRank
Imad Zeroual        1      2          1.33
Dirk Goldhahn       2      11         5.22
Thomas Eckart       3      11         7.52
Abdelhak Lakhouaja  4      45         9.34