Title
CoRoLa ― The Reference Corpus of Contemporary Romanian Language.
Abstract
We present the project of creating CoRoLa, a reference corpus of contemporary Romanian (from 1945 onwards). In the international context, the project finds its place among the initiatives of gathering huge collections of texts, of pre-processing and annotating them at several levels, and also of documenting them with metadata (CMDI). Our project is a joined effort of two institutes of the Romanian Academy. We foresee a corpus of more than 500 million word forms, covering all functional styles of the language. Although the vast majority of texts will be in written form, we target about 300 hours of oral texts, too, obligatorily with associated transcripts. Most of the texts will be from books, while the rest will be harvested from newspapers, booklets, technical reports, etc. The pre-processing includes cleaning the data and harmonising the diacritics, sentence splitting and tokenization. Annotation will be done at a morphological level in a first stage, followed by lemmatization, with the possibility of adding syntactic, semantic and discourse annotation in a later stage. A core of CoRoLa is described in the article. The target users of our corpus will be researchers in linguistics and language processing, teachers of Romanian, students.
Year
Venue
Keywords
2014
LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION
reference corpus,Romanian,corpus annotation,corpus design,metadata
DocType
Citations 
PageRank 
Conference
0
0.34
References 
Authors
0
3
Name
Order
Citations
PageRank
Verginica Barbu Mititelu12511.35
Elena Irimia2246.76
Dan Tufis348558.39