Title
End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System
Abstract
In this paper, we present an end-to-end training framework for building state-of-the-art end-to-end speech recognition systems. Our training system utilizes a cluster of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Data reading, large-scale data augmentation, and neural network parameter updates are all performed "on-the-fly". We use vocal tract length perturbation [1] and an acoustic simulator [2] for data augmentation. The processed features and labels are sent to the GPU cluster. The Horovod allreduce approach is employed to train the neural network parameters across the GPU cluster. We evaluated the effectiveness of our system on the standard LibriSpeech corpus [3] and the 10,000-hr anonymized Bixby English dataset. Our end-to-end speech recognition system built using this training infrastructure showed a 2.44% WER on test-clean of the LibriSpeech test set after applying shallow fusion with a Transformer language model (LM). For the proprietary English Bixby open domain test set, we obtained a WER of 7.92% using a Bidirectional Full Attention (BFA) end-to-end model after applying shallow fusion with an RNN-LM. When the Monotonic Chunkwise Attention (MoChA) based approach is employed for streaming speech recognition, we obtained a WER of 9.95% on the same Bixby open domain test set.
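To illustrate the distributed training step described in the abstract, below is a minimal sketch of Horovod allreduce-based data-parallel training. The paper does not specify the deep learning toolkit or model code; the PyTorch backend, the toy model, and the random feature batches here are assumptions that stand in for the actual ASR encoder-decoder and the augmented feature stream produced by the CPU cluster.

```python
# Minimal sketch of Horovod allreduce data-parallel training (assumed PyTorch
# backend; the model and batches are placeholders, not the paper's code).
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in for the end-to-end ASR model.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 100)).to(device)

# Scale the learning rate by the number of workers, a common Horovod practice.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start all workers from identical parameters and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

criterion = nn.CrossEntropyLoss()
for step in range(100):
    # In the real pipeline these batches would arrive "on-the-fly" from the
    # CPU cluster after augmentation; here they are random placeholders.
    features = torch.randn(32, 80, device=device)
    labels = torch.randint(0, 100, (32,), device=device)

    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()                          # gradients are allreduced here
    optimizer.step()
```

Such a script would typically be launched with one process per GPU, e.g. `horovodrun -np 8 python train.py`, with the gradient averaging handled transparently by the DistributedOptimizer wrapper.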
Year: 2019
DOI: 10.1109/ASRU46091.2019.9003976
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Keywords: end-to-end speech recognition, distributed training, example server, data augmentation, acoustic simulation
DocType: Conference
ISBN: 978-1-7281-0307-5
Citations: 0
PageRank: 0.34
References: 0
Authors (13)
Name             | Order | Citations | PageRank
Chanwoo Kim      | 1     | 253       | 28.44
Minkyoo Shin     | 2     | 0         | 0.34
Shatrughan Singh | 3     | 1         | 2.71
Larry Heck       | 4     | 0         | 0.34
Dhananjaya Gowda | 5     | 3         | 5.47
Sungsoo Kim      | 6     | 0         | 0.68
Kwangyoun Kim    | 7     | 2         | 4.11
Mehul Kumar      | 8     | 1         | 2.73
Jiyeon Kim       | 9     | 0         | 2.37
Kyungmin Lee     | 10    | 2         | 3.09
Changwoo Han     | 11    | 0         | 1.01
Abhinav Garg     | 12    | 6         | 6.61
Eunhyang Kim     | 13    | 0         | 0.34