Title
Bifixer and Bicleaner - two open-source tools to clean your parallel data.
Abstract
This paper shows the utility of two open-source tools designed for parallel data cleaning: Bifixer and Bicleaner. Both tools have already been used to clean highly noisy parallel content crawled from multilingual websites; here we evaluate their performance in a different scenario, namely cleaning publicly available corpora commonly used to train machine translation systems. We choose four English–Portuguese corpora, which we plan to use internally to compute paraphrases at a later stage. We describe both tools in detail, clean the four corpora with them, and analyse the effect of some of the cleaning steps. We then compare machine translation training times and quality before and after cleaning, showing a positive impact, particularly for the noisiest corpora.
Year: 2020
Venue: EAMT
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name                      Order  Citations  PageRank
Gema Ramírez-Sánchez      1      0          3.38
Jaume Zaragoza-Bernabeu   2      0          0.68
Marta Bañón               3      0          1.01
Sergio Ortiz-Rojas        4      149        17.22