Paper Info

Title
The Joy of Parallelism with CzEng 1.0.

Abstract
CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the amount of non-matching sentence pairs. CzEng 1.0 is automatically aligned at the level of sentences as well as words. We provide not only the plain text representation, but also automatic morphological tags, surface syntactic as well as deep syntactic dependency parse trees and automatic co-reference links in both English and Czech. This paper describes key properties of the released resource including the distribution of text domains, the corpus data formats, and a toolkit to handle the provided rich annotation. We also summarize the procedure of the rich annotation (incl. co-reference resolution) and of the automatic filtering. Finally, we provide some suggestions on exploiting such an automatically annotated sentence-parallel corpus.

Year	Venue	Keywords
2012	LREC 2012 - Eighth International Conference on Language Resources and Evaluation	Czech-English parallel corpus,automatic parallel treebank,training data for machine translation
Field	DocType	Citations
Czech,Annotation,Computer science,Filter (signal processing),Plain text,Artificial intelligence,Natural language processing,Parsing,Sentence,Syntax	Conference	24
PageRank	References	Authors
0.91	8	10

Authors (10 rows)

Name	Order	Citations	PageRank
Ondřej Bojar	1	1701	122.71
Zdeněk Žabokrtský	2	193	22.23
Ondřej Dušek	3	180	23.08
Petra Galuščáková	4	35	6.34
Martin Majliš	5	31	1.55
David Mareček	6	114	8.57
Jiří Maršík	7	27	1.36
Michal Novák	8	55	4.03
Martin Popel	9	269	21.27
Aleš Tamchyna	10	115	14.76