Abstract |
---|
The paper overviews the SYN series of synchronic corpora of written Czech compiled within the framework of the Czech National Corpus project. It describes their design and processing with a focus on the annotation, i.e. lemmatization and morphological tagging. The paper also introduces SYN2013PUB, a new 935-million-word newspaper corpus of Czech published in 2013 as the most recent addition to the SYN series before the planned revision of its architecture. SYN2013PUB can be seen as a completion of the series in terms of titles and publication dates of major Czech newspapers that are now covered by complete volumes in comparable proportions. All SYN-series corpora can be characterized as traditional, with emphasis on cleared copyright issues, well-defined composition, reliable metadata and high-quality data processing; their overall size currently exceeds 2.2 billion running words. |

Year | Venue | Keywords |
---|---|---|
2014 | LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | written language,large corpus,Czech |

Field | DocType | Citations |
---|---|---|
Lemmatisation,Metadata,Architecture,Czech,Data processing,Annotation,Computer science,Newspaper,Artificial intelligence,Natural language processing,Clearance | Conference | 2 |

PageRank | References | Authors |
---|---|---|
0.40 | 0 | 4 |

Name | Order | Citations | PageRank |
---|---|---|---|
Milena Hnátková | 1 | 8 | 2.22 |
Michal Křen | 2 | 7 | 2.31 |
Pavel Procházka | 3 | 2 | 0.74 |
Hana Skoumalová | 4 | 37 | 7.82 |