Abstract
---
The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models [1]. However, this model still lags behind a large state-of-the-art conventional model in quality [2]. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown quality comparable to large conventional models [3]. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone and increases latency by only a small fraction over RNN-T.
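The second-pass idea described above can be illustrated with a minimal sketch: a streaming first pass (RNN-T) emits an n-best list with log-probability scores, and once the utterance ends a non-streaming second pass (LAS) rescores each hypothesis; the final output maximizes a weighted combination of the two scores. The function name, the interpolation weight, and the toy second-pass scorer below are all illustrative assumptions, not the paper's actual models or interface.

```python
def rescore_two_pass(nbest, second_pass_score, weight=0.5):
    """Return the hypothesis maximizing a weighted sum of the first-pass
    (streaming) score and the second-pass (rescoring) score.

    nbest: list of (hypothesis, first_pass_log_prob) pairs.
    second_pass_score: callable mapping a hypothesis to a log-score.
    weight: interpolation weight between the two passes (assumed form).
    """
    best_hyp, best_score = None, float("-inf")
    for hyp, first_score in nbest:
        combined = weight * first_score + (1 - weight) * second_pass_score(hyp)
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp

# Toy stand-in for the LAS pass: prefers hypotheses near three words long.
toy_las = lambda hyp: -abs(len(hyp.split()) - 3)

nbest = [("the cat sat", -1.2), ("the cat sat down", -1.5), ("a cat", -1.0)]
print(rescore_two_pass(nbest, toy_las))  # prints: the cat sat
```

Because the second pass only runs once per utterance over a short n-best list, it adds little latency on top of the streaming first pass, which matches the abstract's claim of a small latency increase.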
Year | DOI | Venue
---|---|---
2019 | 10.21437/Interspeech.2019-1341 | INTERSPEECH

DocType | Citations | PageRank
---|---|---
Conference | 1 | 0.35

References | Authors
---|---
0 | 12
Name | Order | Citations | PageRank |
---|---|---|---
Tara N. Sainath | 1 | 3497 | 232.43 |
Ruoming Pang | 2 | 1092 | 92.99 |
David Rybach | 3 | 188 | 20.31 |
Yanzhang He | 4 | 64 | 16.36 |
Rohit Prabhavalkar | 5 | 163 | 22.56 |
Wei Li | 6 | 436 | 140.67 |
Mirkó Visontai | 7 | 321 | 23.62 |
Qiao Liang | 8 | 77 | 19.86 |
Trevor Strohman | 9 | 462 | 25.17 |
Yonghui Wu | 10 | 1065 | 72.78 |
Ian McGraw | 11 | 253 | 24.41 |
Chung-Cheng Chiu | 12 | 248 | 28.00 |