Abstract |
---|
With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the generated translations, comfortably outperforming the state of the art for all language pairs studied. In particular, for Java $\to$ Python and Python $\to$ C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%. |

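As a rough illustration of the pipeline the abstract describes — generate candidate translations with an unsupervised model, keep only the candidates that pass automatically generated unit tests, and fine-tune on the surviving pairs — here is a minimal Python sketch. The `translate` and `make_tests` callables are hypothetical stand-ins for the translation model and the automated test generator; this is a sketch of the filtering idea, not the paper's actual implementation.

```python
import subprocess
import tempfile
from pathlib import Path


def passes_unit_tests(candidate_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run generated unit tests against one candidate translation.

    A candidate is kept only if the combined script exits cleanly, i.e. it
    parses, runs, and produces the expected output on every test input.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_with_tests.py"
        # Append the tests so they call the translated function directly.
        script.write_text(candidate_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                ["python", str(script)], capture_output=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return False  # non-terminating translations count as failures
        return result.returncode == 0


def build_tested_parallel_corpus(source_functions, translate, make_tests, n_candidates=20):
    """Build a unit-test-filtered parallel corpus for fine-tuning.

    `translate(src, n)` should yield n candidate translations (e.g. beam
    hypotheses) and `make_tests(src)` should return a runnable test script;
    both are hypothetical interfaces, not the paper's actual code.
    """
    corpus = []
    for src in source_functions:
        tests = make_tests(src)
        for candidate in translate(src, n_candidates):
            if passes_unit_tests(candidate, tests):
                # Keep the first candidate that passes every test.
                corpus.append((src, candidate))
                break
    return corpus
```

Running each candidate in a subprocess with a timeout treats crashing and non-terminating translations as failures, which is what lets unit tests act as a noise filter over otherwise unchecked back-translation outputs.
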
Year | Venue | Keywords |
---|---|---|
2022 | International Conference on Learning Representations (ICLR) | unsupervised, translation, code, self-training, pseudo-labelling, unit tests, programming languages, deep learning, transformer

DocType | Citations | PageRank
---|---|---
Conference | 0 | 0.34

References | Authors
---|---
0 | 6

Name | Order | Citations | PageRank |
---|---|---|---|
Baptiste Rozière | 1 | 9 | 3.15 |
Jie M. Zhang | 2 | 0 | 0.34 |
François Charton | 3 | 1 | 1.36 |
Mark Harman | 4 | 10264 | 389.82 |
Gabriel Synnaeve | 5 | 21 | 5.12
Guillaume Lample | 6 | 651 | 22.75 |