A BETTER AND FASTER END-TO-END MODEL FOR STREAMING ASR - Citegraph

Paper Info

Title
A BETTER AND FASTER END-TO-END MODEL FOR STREAMING ASR

Abstract
End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the E2E model still tends to delay its predictions towards the end and thus has much higher partial latency than a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving latency results in a quality degradation. To address this, we first explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Second, we explore running a 2nd-pass beam search to improve quality. To ensure the 2nd pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR.

Year	DOI	Venue
2021	10.1109/ICASSP39728.2021.9413899	2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords	DocType	Citations
RNN-T, Conformer, cascaded encoders, latency	Conference	0
PageRank	References	Authors
0.34	0	15

Authors (15 rows)

Name	Order	Citations	PageRank
Bo Li	1	206	42.46
Anmol Gulati	2	0	1.01
Jiahui Yu	3	260	25.83
Tara N. Sainath	4	3497	232.43
Chung-Cheng Chiu	5	248	28.00
Arun Narayanan	6	425	32.99
Shuo-Yiin Chang	7	27	4.71
Ruoming Pang	8	1092	92.99
Yanzhang He	9	64	16.36
James Qin	10	13	3.68
Wei Han	11	75	13.10
Qiao Liang	12	77	19.86
Yu Zhang	13	442	41.79
Trevor Strohman	14	462	25.17
Yonghui Wu	15	1065	72.78
