Abstract
---
The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models [1]. However, this model still lags behind a large state-of-the-art conventional model in quality [2]. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown quality comparable to large conventional models [3]. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone and increases latency by only a small fraction over RNN-T.
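The second-pass idea described above can be illustrated with a minimal sketch: a streaming first pass (RNN-T) emits an n-best list with log-probability scores, and once the utterance ends a non-streaming second pass (LAS) rescores each hypothesis; the final output maximizes a weighted combination of the two scores. The function name, the interpolation weight, and the toy second-pass scorer below are all illustrative assumptions, not the paper's actual models or interface.

```python
def rescore_two_pass(nbest, second_pass_score, weight=0.5):
    """Return the hypothesis maximizing a weighted sum of the first-pass
    (streaming) score and the second-pass (rescoring) score.

    nbest: list of (hypothesis, first_pass_log_prob) pairs.
    second_pass_score: callable mapping a hypothesis to a log-score.
    weight: interpolation weight between the two passes (assumed form).
    """
    best_hyp, best_score = None, float("-inf")
    for hyp, first_score in nbest:
        combined = weight * first_score + (1 - weight) * second_pass_score(hyp)
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp

# Toy stand-in for the LAS pass: prefers hypotheses near three words long.
toy_las = lambda hyp: -abs(len(hyp.split()) - 3)

nbest = [("the cat sat", -1.2), ("the cat sat down", -1.5), ("a cat", -1.0)]
print(rescore_two_pass(nbest, toy_las))  # prints: the cat sat
```

Because the second pass only runs once per utterance over a short n-best list, it adds little latency on top of the streaming first pass, which matches the abstract's claim of a small latency increase.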
Year | DOI | Venue
---|---|---
2019 | 10.21437/Interspeech.2019-1341 | INTERSPEECH

DocType | Citations | PageRank
---|---|---
Conference | 1 | 0.35

References | Authors
---|---
0 | 12
Name | Order | Citations | PageRank |
---|---|---|---
Tara N. Sainath | 1 | 3497 | 232.43 |
Ruoming Pang | 2 | 1092 | 92.99 |
David Rybach | 3 | 188 | 20.31 |
Yanzhang He | 4 | 64 | 16.36 |
Rohit Prabhavalkar | 5 | 163 | 22.56 |
Wei Li | 6 | 436 | 140.67 |
Mirkó Visontai | 7 | 321 | 23.62 |
Qiao Liang | 8 | 77 | 19.86 |
Trevor Strohman | 9 | 462 | 25.17 |
Yonghui Wu | 10 | 1065 | 72.78 |
Ian McGraw | 11 | 253 | 24.41 |
Chung-Cheng Chiu | 12 | 248 | 28.00 |