Title
How comparable are parallel corpora? Measuring the distribution of general vocabulary and connectives
Abstract
In this paper, we question the homogeneity of a large parallel corpus by measuring the similarity between its various sub-parts. We compare results obtained with a general measure of lexical similarity based on χ² to those obtained by counting discourse connectives. We argue that discourse connectives provide a more sensitive measure, revealing differences that are not visible with the general measure. We also provide evidence that translated texts have specific characteristics distinguishing them from non-translated ones, due to a universal tendency for explicitation.
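The two measures named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the crude tokenizer, the choice of a χ² statistic over the `top_n` most frequent words of the joint sample (a common frequency-list comparison, which may differ from the authors' exact procedure), the function names, and the tiny connective list.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the joint sample.

    Lower values mean the two samples use frequent words in more similar
    proportions. This formulation is an assumption, not the paper's stated method.
    """
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both samples must contain at least one token")
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total = size_a + size_b
    chi2 = 0.0
    for word, joint_count in (counts_a + counts_b).most_common(top_n):
        # Expected counts if the word were spread proportionally across both samples.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        chi2 += (counts_a[word] - expected_a) ** 2 / expected_a
        chi2 += (counts_b[word] - expected_b) ** 2 / expected_b
    return chi2


def connective_rate(tokens, connectives, per=1000):
    """Occurrences of single-word connectives per `per` tokens."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in connectives)
    return hits * per / len(tokens)


# Hypothetical connective list and toy samples, for illustration only.
CONNECTIVES = {"because", "however", "although", "therefore"}
sample_a = tokenize("It rained. However, we left because the train was late.")
sample_b = tokenize("It rained and we left when the train was late.")
distance = chi_square_distance(sample_a, sample_b)
rate_a = connective_rate(sample_a, CONNECTIVES)
rate_b = connective_rate(sample_b, CONNECTIVES)
```

On realistic data the two samples would be sub-parts of the corpus. The sketch only handles single-word connectives, so multi-word ones would need phrase matching.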
Year
2011
Venue
BUCC@ACL
Keywords
lexical similarity,sensitive measure,general measure,general vocabulary,specific characteristic,revealing difference,discourse connective,universal tendency,large parallel corpus,various sub-parts,homogeneity,measures,similarity,corpora
Field
Lexical similarity,Homogeneity (statistics),Computer science,Parallel corpora,Natural language processing,Artificial intelligence,Linguistics,Vocabulary
DocType
Conference
Citations
12
PageRank
0.98
References
3
Authors
4
Name                  Order  Citations  PageRank
Bruno Cartoni         1      62         6.93
Sandrine Zufferey     2      49         4.98
Thomas Meyer          3      110        9.04
Andrei Popescu-Belis  4      573        64.13