Title
Controlling byte pair encoding for neural machine translation
Abstract
Byte pair encoding (BPE) is an approach that segments a corpus so that frequent sequences of characters are combined; as a result, word surface forms are divided into their root words and affixes. BPE alone handles out-of-vocabulary words, but it tends not to segment inflected words consistently. Controlled byte pair encoding (CBPE) allowed our word-level neural machine translation (NMT) model to easily recognize inflected words, which are prevalent in morphologically rich languages. It prevented BPE from merging the affixes in a word with other characters in the word. Our resulting NMT models from CBPE consistently evaluate affixes that BPE could have segmented in varying ways. In our experiments, we considered 119,969 English-Filipino parallel language pairs from an existing dataset, with Filipino as a morphologically rich language. The results show that BPE and CBPE both improved the BLEU scores, from 38.31 to 44.82 and 44.07 for English→Filipino, and from 32.17 to 35.25 and 35.98 for Filipino→English, respectively. The lower scores for Filipino→English can be attributed to other language characteristics of Filipino, such as free word order, the one-to-many relationship in translating from English to Filipino, and some transliterations in the parallel corpus. CBPE also performed slightly better for English→Filipino than for Filipino→English.
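The difference between plain BPE and a controlled variant can be illustrated with a minimal sketch. The abstract does not describe how CBPE blocks merges, so the mechanism here is an assumption: affix boundaries are pre-annotated with a hypothetical marker `|`, and any merge that touches the marker is skipped. The function names (`pair_counts`, `merge_pair`, `learn_merges`) and the toy word list are illustrative only and are not taken from the paper or its dataset.

```python
from collections import Counter

BOUNDARY = "|"  # hypothetical marker between an affix and the rest of the word


def pair_counts(vocab, blocked=frozenset()):
    """Count adjacent symbol pairs, skipping pairs that touch a blocked symbol."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            if left in blocked or right in blocked:
                continue
            counts[(left, right)] += freq
    return counts


def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(marked_word_freqs, num_merges, controlled=False):
    """Learn BPE merges; in controlled mode, never merge across BOUNDARY."""
    if controlled:
        # Keep the marker as its own symbol and forbid merges that touch it.
        vocab = {tuple(w): f for w, f in marked_word_freqs.items()}
        blocked = frozenset({BOUNDARY})
    else:
        # Plain BPE sees the unannotated surface form.
        vocab = {tuple(w.replace(BOUNDARY, "")): f
                 for w, f in marked_word_freqs.items()}
        blocked = frozenset()
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab, blocked)
        if not counts:
            break
        # Break frequency ties deterministically by comparing the pairs.
        best = max(counts, key=lambda p: (counts[p], p))
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges, vocab


# Toy, hand-annotated Filipino verbs with the prefixes nag-/mag- (illustrative).
words = {"nag|luto": 5, "nag|laro": 5, "mag|luto": 3, "nag|lakad": 4}
plain_merges, plain_vocab = learn_merges(words, 3, controlled=False)
ctrl_merges, ctrl_vocab = learn_merges(words, 3, controlled=True)
print("plain:     ", plain_merges)
print("controlled:", ctrl_merges)
```

In this toy run, plain BPE is free to fuse the last character of the prefix with the first character of the root, while the controlled run can only merge within the prefix or within the root, so the prefix is segmented the same way in every word that carries it.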
Year
2017
DOI
10.1109/IALP.2017.8300571
Venue
2017 International Conference on Asian Language Processing (IALP)
Keywords
Recurrent Neural Networks, Neural Machine Translation, Byte Pair Encoding, Morphologically Rich Languages, Natural Language Processing
Field
Affix, Parallel language, Word order, Root (linguistics), Computer science, Machine translation, Recurrent neural network, Byte pair encoding, Natural language processing, Artificial intelligence, Merge (version control)
DocType
Conference
ISSN
2159-1962
ISBN
978-1-5386-1982-7
Citations
0
PageRank
0.34
References
0
Authors
4
Name | Order | Citations | PageRank
Alfred John Tacorda | 1 | 0 | 0.34
Marvin John Ignacio | 2 | 0 | 0.34
Nathaniel Oco | 3 | 5 | 5.24
Rachel Edita Roxas | 4 | 11 | 6.34