SNRi Target Training for Joint Speech Enhancement and Recognition - Citegraph

Paper Info

Title
SNRi Target Training for Joint Speech Enhancement and Recognition

Abstract
This study aims to improve the performance of automatic speech recognition (ASR) under noisy conditions. The use of a speech enhancement (SE) frontend has been widely studied for noise robust ASR. However, most single-channel SE models introduce processing artifacts in the enhanced speech resulting in degraded ASR performance. To overcome this problem, we propose Signal-to-Noise Ratio improvement (SNRi) target training; the SE frontend automatically controls its noise reduction level to avoid degrading the ASR performance due to artifacts. The SE frontend uses an auxiliary scalar input which represents the target SNRi of the output signal. The target SNRi value is estimated by the SNRi prediction network, which is trained to minimize the ASR loss. Experiments using 55,027 hours of noisy speech training data show that SNRi target training enables control of the SNRi of the output signal, and the joint training reduces word error rate by 12% compared to a state-of-the-art Conformer-based ASR model.
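The abstract does not define how SNRi is computed. A minimal sketch, assuming the common definition (SNR of the enhanced output minus SNR of the noisy input, both measured against the clean reference, in dB), might look like the following. The function names `snr_db` and `snr_improvement_db` and the toy "enhancer" are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def snr_db(clean, estimate, eps=1e-12):
    """SNR in dB of `estimate`, treating (estimate - clean) as the noise."""
    residual = estimate - clean
    signal_power = np.sum(clean ** 2) + eps
    noise_power = np.sum(residual ** 2) + eps
    return 10.0 * np.log10(signal_power / noise_power)


def snr_improvement_db(clean, noisy, enhanced):
    """Assumed SNRi definition: output SNR minus input SNR, in dB."""
    return snr_db(clean, enhanced) - snr_db(clean, noisy)


# Toy signals: a 440 Hz tone plus white noise, one second at 16 kHz.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 440.0 * t)
noise = 0.5 * rng.standard_normal(t.shape)
noisy = clean + noise

# Hypothetical enhancer that halves the noise amplitude and leaves the
# speech untouched; noise power drops by 4x, so SNRi = 10*log10(4) dB.
enhanced = clean + 0.5 * noise

snri = snr_improvement_db(clean, noisy, enhanced)
print(f"SNRi: {snri:.2f} dB")
```

In SNRi target training as the abstract describes it, a scalar like this would be the quantity the SE frontend is asked to achieve, rather than something measured after the fact.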

Field	Value
Year	2022
DOI	10.21437/INTERSPEECH.2022-302
Venue	Conference of the International Speech Communication Association (INTERSPEECH)
DocType	Conference
Citations	0
PageRank	0.34
References	0
Authors	5

Authors (5 rows)

Name	Order	Citations	PageRank
Yuma Koizumi	1	41	11.75
Shigeki Karita	2	0	1.01
Arun Narayanan	3	425	32.99
Sankaran Panchapagesan	4	0	0.68
Michiel Bacchiani	5	621	55.46
