Abstract
---

Text-only training and semi-supervised training on audio-only data have gained popularity recently due to the wide availability of unlabeled text and speech. In this work, we propose incorporating text-only and semi-supervised training into an attention-based deliberation model. By using text-only data to train a Bidirectional Encoder Representations from Transformers (BERT) model for the deliberation text encoder, and by leveraging large-scale text-to-speech and audio-only utterances through the joint acoustic and text decoder (JATD) and semi-supervised training, we achieve a 4%-12% relative WER reduction across various tasks compared to the baseline deliberation model. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces Google Voice Search WER by 11% relative. We show that the deliberation model also achieves a positive human side-by-side evaluation against the state-of-the-art LM rescorer while maintaining reasonable endpointer latencies.
Year | DOI | Venue
---|---|---
2022 | 10.21437/INTERSPEECH.2022-243 | Conference of the International Speech Communication Association (INTERSPEECH)
DocType | Citations | PageRank
---|---|---
Conference | 0 | 0.34

References | Authors
---|---
0 | 7
Name | Order | Citations | PageRank
---|---|---|---
Ke Hu | 1 | 1 | 1.73 |
Tara N. Sainath | 2 | 3497 | 232.43 |
Yanzhang He | 3 | 64 | 16.36 |
Rohit Prabhavalkar | 4 | 163 | 22.56 |
Trevor Strohman | 5 | 0 | 2.70 |
Sepand Mavandadi | 6 | 0 | 1.35 |
Weiran Wang | 7 | 114 | 9.99 |