Title
A BETTER AND FASTER END-TO-END MODEL FOR STREAMING ASR
Abstract
End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the E2E model still tends to delay its predictions towards the end and thus has much higher partial latency than a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving latency results in a quality degradation. To address this, we first explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Second, we explore running a 2nd-pass beam search to improve quality. To ensure that the 2nd pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR.
Year
2021
DOI
10.1109/ICASSP39728.2021.9413899
Venue
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords
RNN-T, Conformer, cascaded encoders, latency
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
15
Name               Order  Citations  PageRank
Bo Li              1      206        42.46
Anmol Gulati       2      0          1.01
Jiahui Yu          3      260        25.83
Tara N. Sainath    4      3497       232.43
Chung-Cheng Chiu   5      248        28.00
Arun Narayanan     6      425        32.99
Shuo-Yiin Chang    7      27         4.71
Ruoming Pang       8      1092       92.99
Yanzhang He        9      64         16.36
James Qin          10     13         3.68
Wei Han            11     75         13.10
Qiao Liang         12     77         19.86
Yu Zhang           13     442        41.79
Trevor Strohman    14     462        25.17
Yonghui Wu         15     1065       72.78