Title
AUDIO-VISUAL SPEECH ENHANCEMENT METHOD CONDITIONED ON THE LIP MOTION AND SPEAKER-DISCRIMINATIVE EMBEDDINGS
Abstract
We propose an audio-visual speech enhancement (AVSE) method conditioned on both the speaker's lip motion and speaker-discriminative embeddings. In particular, we explore extracting the embeddings directly from noisy audio in the AVSE setting, without an enrollment procedure, and aim to improve enhancement performance by conditioning the model on them. To this end, we devise an audio-visual voice activity detection (AV-VAD) module and a speaker identification module for the AVSE model. The AV-VAD module selects reliable frames from which the identification module can extract a robust embedding, enabling enhancement together with the lip motion. To train these modules effectively, we propose multi-task learning across AVSE, speaker identification, and VAD. Experimental results show that our method (1) directly extracts robust speaker embeddings from noisy audio without an enrollment procedure and (2) improves enhancement performance compared with conventional AVSE methods.
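The pipeline described in the abstract can be sketched as follows. This is a minimal illustrative toy, not the paper's architecture: the function names, the sigmoid mask estimator, and all tensor shapes are assumptions made for the example. It shows the two conditioning steps the abstract names: pooling a speaker embedding over frames the AV-VAD marks reliable, then conditioning a mask-based enhancer on the noisy spectrogram, lip features, and that embedding.

```python
import numpy as np

def pool_speaker_embedding(frame_embeddings, vad_scores, threshold=0.5):
    """Average per-frame speaker embeddings over frames the AV-VAD
    scores as reliable (hypothetical gating rule)."""
    mask = vad_scores >= threshold
    if not mask.any():
        # Fall back to all frames if no frame passes the threshold.
        mask = np.ones_like(vad_scores, dtype=bool)
    return frame_embeddings[mask].mean(axis=0)

def enhance(noisy_spec, lip_feats, speaker_emb, W):
    """Toy conditioned enhancer: a sigmoid ratio mask predicted from
    [spectrogram frame; lip feature; speaker embedding] per frame."""
    T = noisy_spec.shape[0]
    cond = np.concatenate(
        [noisy_spec, lip_feats, np.tile(speaker_emb, (T, 1))], axis=1)
    mask = 1.0 / (1.0 + np.exp(-cond @ W))  # (T, F) values in (0, 1)
    return mask * noisy_spec                # masked magnitude spectrogram
```

In the actual system these components are learned jointly (the multi-task objective over AVSE, speaker identification, and VAD); here the mask estimator is just a fixed linear map to keep the sketch self-contained.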
Year
2021
DOI
10.1109/ICASSP39728.2021.9414133
Venue
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords
Speech enhancement, Audio-Visual, multi-task learning, Voice activity detection
DocType
Conference
Citations
1
PageRank
0.35
References
0
Authors
3
Name              Order  Citations  PageRank
Koichiro Ito      1      1          1.70
Masaaki Yamamoto  2      2          1.19
Kenji Nagamatsu   3      24         10.00