ROMBAC: The Romanian Balanced Annotated Corpus.

Paper Info

Title
ROMBAC: The Romanian Balanced Annotated Corpus.

Abstract
This article describes the collection, processing and validation of a large balanced corpus for Romanian. The annotation types and structure of the corpus are briefly reviewed. The corpus was constructed at the Research Institute for Artificial Intelligence of the Romanian Academy in the context of an international project (METANET4U). The processing covers tokenization, POS-tagging, lemmatization and chunking. The corpus is in XML format generated by our in-house annotation tools; the corpus encoding schema is XCES compliant and the metadata specification conforms to the METANET recommendations. To the best of our knowledge, this is the first large and richly annotated corpus for Romanian. ROMBAC is intended to be the foundation of a linguistic environment containing a reference corpus for contemporary Romanian and a comprehensive collection of interoperable processing tools.

Year	Venue	Keywords
2012	LREC 2012 - Eighth International Conference on Language Resources and Evaluation	balanced corpus, corpus processing, annotation type, metadata, XCES, TTL, ROMBAC, Romanian
Field	DocType	Citations
Tokenization (data security), Lemmatisation, Metadata, Annotation, XML, Romanian, Interoperability, Computer science, Chunking (psychology), Artificial intelligence, Natural language processing	Conference	3
PageRank	References	Authors
0.55	2	4

Authors (4 rows)

Name	Order	Citations	PageRank
Radu Ion	1	163	22.33
Elena Irimia	2	24	6.76
Dan Ştefănescu	3	136	14.65
Dan Tufis	4	485	58.39
