Title
Seal: Efficient Training Large Scale Statistical Machine Translation Models on Spark
Abstract
Statistical machine translation (SMT) is an important research branch of natural language processing (NLP). As in many other NLP applications, large-scale training data can potentially yield higher translation accuracy for SMT models. However, traditional single-node SMT model training systems can hardly cope with the fast-growing volume of training corpora in the big data era, which creates an urgent need for efficient large-scale machine translation model training systems. In this paper, we propose Seal, an efficient, scalable, end-to-end offline SMT model training toolkit built on Apache Spark, a widely used distributed data-parallel platform. Seal parallelizes the training of all three key SMT models: the word alignment model, the translation model, and the N-gram language model. To further improve training performance in Seal, we also propose a number of system optimizations. In word alignment model training, tuning the block size greatly reduces the I/O and communication overhead. In translation model training, carefully encoding the training corpus significantly reduces the amount of data transferred over the network, thus improving overall training efficiency. We also optimize the maximum likelihood estimation (MLE) algorithm to resolve the data skew issue in the join operation used in both translation model and language model training. The experimental results show that Seal outperforms the well-known SMT training system Chaski with about a 5x speedup for word alignment model training. For syntactic translation model and language model training, Seal outperforms existing cutting-edge tools with average speedups of about 9~18x and 8~9x, respectively. Overall, Seal outperforms the existing distributed system with a 4~6x average speedup and single-node systems with a 9~60x average speedup. In addition, Seal achieves near-linear data and node scalability.
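The abstract's point about optimizing MLE to avoid data skew on the join used in translation model and language model training can be illustrated with a rough sketch. The Scala code below is only a minimal illustration of the general technique under stated assumptions, not Seal's implementation: it estimates phrase translation probabilities by relative frequency on Spark and replaces the skew-prone shuffle join with a broadcast of the source-phrase marginal counts. The input path and the phrasePairs RDD are hypothetical.

// Hedged sketch: relative-frequency (MLE) estimation of phrase translation
// probabilities p(e | f) = count(f, e) / count(f) on Spark. Broadcasting the
// source-phrase marginals avoids a shuffle join on hot (skewed) source keys.
// This is an illustration of the general idea only, not Seal's actual code.
import org.apache.spark.sql.SparkSession

object MlePhraseTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mle-phrase-table-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one tab-separated (sourcePhrase, targetPhrase) pair per line.
    val phrasePairs = sc.textFile("hdfs:///corpus/phrase-pairs.tsv")
      .map(_.split("\t"))
      .filter(_.length == 2)
      .map(a => (a(0), a(1)))

    // count(f, e): joint counts of extracted phrase pairs.
    val pairCounts = phrasePairs.map(p => (p, 1L)).reduceByKey(_ + _)

    // count(f): marginal counts of source phrases.
    val sourceCounts = phrasePairs.map { case (f, _) => (f, 1L) }.reduceByKey(_ + _)

    // Broadcasting the marginals sidesteps shuffling on skewed source-phrase keys,
    // assuming the distinct source-phrase table fits in executor memory.
    val bcSourceCounts = sc.broadcast(sourceCounts.collectAsMap())

    // MLE by relative frequency: p(e | f) = count(f, e) / count(f).
    val phraseTable = pairCounts.map { case ((f, e), c) =>
      (f, e, c.toDouble / bcSourceCounts.value(f))
    }

    phraseTable
      .map { case (f, e, p) => s"$f\t$e\t$p" }
      .saveAsTextFile("hdfs:///models/phrase-table")

    spark.stop()
  }
}

When the marginal table is too large to broadcast, a common alternative is to salt the hot source-phrase keys before the join and merge the partial results afterwards.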
Year: 2018
DOI: 10.1109/PADSW.2018.8644562
Venue: 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS)
Keywords: Training, Data models, Seals, Sparks, Computational modeling, Maximum likelihood estimation, Load modeling
Field: Data modeling, Spark (mathematics), Computer science, Training system, Machine translation, Computer engineering, Language model, Speedup, Encoding (memory), Scalability, Distributed computing
DocType: Conference
ISSN: 1521-9097
ISBN: 978-1-5386-7308-9
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name, Order, Citations, PageRank
Rong Gu, 1, 110, 17.77
Min Chen, 2, 5, 8.48
Wenjia Yang, 3, 0, 0.34
Chunfeng Yuan, 4, 5, 6.90
Yihua Huang, 5, 8, 6.61