Title
Seal: Efficient Training Large Scale Statistical Machine Translation Models on Spark
Abstract
Statistical machine translation (SMT) is an important research branch of natural language processing (NLP). As in many other NLP applications, large-scale training data can potentially yield higher translation accuracy for SMT models. However, traditional single-node SMT model training systems can hardly cope with the fast-growing volume of training corpora in the big data era, which creates an urgent need for efficient large-scale machine translation model training systems. In this paper, we propose Seal, an efficient, scalable, end-to-end offline SMT model training toolkit built on Apache Spark, a widely used distributed data-parallel platform. Seal parallelizes the training of all three key SMT models: the word alignment model, the translation model, and the N-gram language model. To further improve training performance in Seal, we also propose a number of system optimizations. In word alignment model training, tuning the block size greatly reduces the I/O and communication overhead. In translation model training, carefully encoding the training corpus significantly reduces the amount of data transferred over the network, thus improving overall training efficiency. We also optimize the maximum likelihood estimation (MLE) algorithm to resolve the data skew issue in the join operation used in both translation model and language model training. The experimental results show that Seal outperforms the well-known SMT training system Chaski with about a 5x speedup for word alignment model training. For syntactic translation model and language model training, Seal outperforms existing cutting-edge tools with average speedups of about 9~18x and 8~9x, respectively. Overall, Seal outperforms the existing distributed system with a 4~6x average speedup and single-node systems with a 9~60x average speedup. In addition, Seal achieves near-linear data and node scalability.
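The abstract's point about optimizing MLE to avoid data skew on the join used in translation model and language model training can be illustrated with a rough sketch. The Scala code below is only a minimal illustration of the general technique under stated assumptions, not Seal's implementation: it estimates phrase translation probabilities by relative frequency on Spark and replaces the skew-prone shuffle join with a broadcast of the source-phrase marginal counts. The input path and the phrasePairs RDD are hypothetical.

// Hedged sketch: relative-frequency (MLE) estimation of phrase translation
// probabilities p(e | f) = count(f, e) / count(f) on Spark. Broadcasting the
// source-phrase marginals avoids a shuffle join on hot (skewed) source keys.
// This is an illustration of the general idea only, not Seal's actual code.
import org.apache.spark.sql.SparkSession

object MlePhraseTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mle-phrase-table-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one tab-separated (sourcePhrase, targetPhrase) pair per line.
    val phrasePairs = sc.textFile("hdfs:///corpus/phrase-pairs.tsv")
      .map(_.split("\t"))
      .filter(_.length == 2)
      .map(a => (a(0), a(1)))

    // count(f, e): joint counts of extracted phrase pairs.
    val pairCounts = phrasePairs.map(p => (p, 1L)).reduceByKey(_ + _)

    // count(f): marginal counts of source phrases.
    val sourceCounts = phrasePairs.map { case (f, _) => (f, 1L) }.reduceByKey(_ + _)

    // Broadcasting the marginals sidesteps shuffling on skewed source-phrase keys,
    // assuming the distinct source-phrase table fits in executor memory.
    val bcSourceCounts = sc.broadcast(sourceCounts.collectAsMap())

    // MLE by relative frequency: p(e | f) = count(f, e) / count(f).
    val phraseTable = pairCounts.map { case ((f, e), c) =>
      (f, e, c.toDouble / bcSourceCounts.value(f))
    }

    phraseTable
      .map { case (f, e, p) => s"$f\t$e\t$p" }
      .saveAsTextFile("hdfs:///models/phrase-table")

    spark.stop()
  }
}

When the marginal table is too large to broadcast, a common alternative is to salt the hot source-phrase keys before the join and merge the partial results afterwards.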
Year: 2018
DOI: 10.1109/PADSW.2018.8644562
Venue: 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS)
Keywords: Training, Data models, Seals, Sparks, Computational modeling, Maximum likelihood estimation, Load modeling
Field: Data modeling, Spark (mathematics), Computer science, Training system, Machine translation, Computer engineering, Language model, Speedup, Encoding (memory), Scalability, Distributed computing
DocType: Conference
ISSN: 1521-9097
ISBN: 978-1-5386-7308-9
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name, Order, Citations, PageRank
Rong Gu, 1, 110, 17.77
Min Chen, 2, 5, 8.48
Wenjia Yang, 3, 0, 0.34
Chunfeng Yuan, 4, 5, 6.90
Yihua Huang, 5, 8, 6.61