Abstract |
---|
End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the E2E model still tends to delay its predictions towards the end and thus has much higher partial latency than a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving latency results in a quality degradation. To address this, we first explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Second, we explore running a 2nd-pass beam search to improve quality. To ensure the 2nd pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR. |
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/ICASSP39728.2021.9413899 | 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) |
Keywords | DocType | Citations |
---|---|---|
RNN-T, Conformer, cascaded encoders, latency | Conference | 0 |
PageRank | References | Authors |
---|---|---|
0.34 | 0 | 15 |
Name | Order | Citations | PageRank |
---|---|---|---|
Bo Li | 1 | 206 | 42.46 |
Anmol Gulati | 2 | 0 | 1.01 |
Jiahui Yu | 3 | 260 | 25.83 |
Tara N. Sainath | 4 | 3497 | 232.43 |
Chung-Cheng Chiu | 5 | 248 | 28.00 |
Arun Narayanan | 6 | 425 | 32.99 |
Shuo-Yiin Chang | 7 | 27 | 4.71 |
Ruoming Pang | 8 | 1092 | 92.99 |
Yanzhang He | 9 | 64 | 16.36 |
James Qin | 10 | 13 | 3.68 |
Wei Han | 11 | 75 | 13.10 |
Qiao Liang | 12 | 77 | 19.86 |
Yu Zhang | 13 | 442 | 41.79 |
Trevor Strohman | 14 | 462 | 25.17 |
Yonghui Wu | 15 | 1065 | 72.78 |