Abstract
---
In this work, we explore end-to-end speech recognition models (CTC, RNN-Transducer, and attention-based models) with different model units (character, wordpiece, and word) and various training strategies. We show that wordpiece units outperform character units for all end-to-end systems on the Switchboard Hub5'00 benchmark. To improve the performance of end-to-end systems, we propose a multi-stage pretraining strategy, which gives 25.0% and 18.0% relative improvement over training from scratch for attention-based and RNN-T models with wordpiece units, respectively. We achieve state-of-the-art performance on the Switchboard+Fisher-2000h task, outperforming all prior work. Together with other training strategies, such as label smoothing and data augmentation, we achieve 5.9%/12.1% WER on the Switchboard/CallHome test sets without using any external language models. This is a new performance milestone for a single end-to-end system, and it is much better than the previously published best hybrid system, which achieves 6.7%/12.5% on the same two sets.
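The abstract credits part of the final WER to label smoothing. As a rough, hypothetical sketch (not the authors' implementation), the snippet below shows uniform label smoothing combined with cross-entropy in PyTorch; the function name `label_smoothing_loss` and the smoothing weight `epsilon=0.1` are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross-entropy with uniform label smoothing (illustrative sketch).

    Smoothed target distribution: (1 - epsilon) on the true label,
    epsilon spread uniformly over all classes. epsilon=0.1 is a common
    default, not a value reported in the paper.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Standard negative log-likelihood of the true labels.
    nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
    # Uniform term: mean of -log p over all classes, i.e. (1/K) * sum_k -log p_k.
    uniform = -log_probs.mean(dim=-1)
    return ((1.0 - epsilon) * nll + epsilon * uniform).mean()

# Toy usage: a batch of 4 frames over a hypothetical 1000-entry wordpiece vocabulary.
logits = torch.randn(4, 1000)
targets = torch.randint(0, 1000, (4,))
print(label_smoothing_loss(logits, targets).item())
```

The blended loss equals the expected cross-entropy under the smoothed target distribution, which discourages over-confident posteriors, a common motivation for label smoothing in sequence-to-sequence ASR training.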
Year | DOI | Venue |
---|---|---
2019 | 10.1109/ASRU46091.2019.9003834 | 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) |
Keywords | DocType | ISBN
---|---|---
end-to-end, sequence-to-sequence models, speech recognition, word piece | Conference | 978-1-7281-0307-5
Citations | PageRank | References
---|---|---
0 | 0.34 | 0
Authors (5)
---
Name | Order | Citations | PageRank |
---|---|---|---
Mingkun Huang | 1 | 0 | 0.34 |
Yizhou Lu | 2 | 1 | 3.72 |
Lan Wang | 3 | 0 | 0.68 |
Yanmin Qian | 4 | 295 | 44.44 |
Kai Yu | 5 | 1082 | 90.58 |