Title
An Investigation Into The Multi-Channel Time Domain Speaker Extraction Network
Abstract
This paper presents an investigation into the effectiveness of spatial features for improving time-domain speaker extraction systems. A two-dimensional Convolutional Neural Network (CNN) based encoder is proposed to capture the spatial information within the multichannel input, which is then combined with the spectral features of a single-channel extraction network. Two variants of the target speaker extraction method were tested: one that employs a pre-trained i-vector system to compute a speaker embedding (System A), and one that employs a jointly trained neural network to extract the embeddings directly from time-domain enrolment signals (System B). The evaluation was performed on the spatialized WSJ0-2mix dataset using the Signal-to-Distortion Ratio (SDR) metric and ASR accuracy. In the anechoic condition, absolute SDR gains of more than 10 dB and 7 dB were achieved when the 2-D CNN spatial encoder features were included with Systems A and B, respectively. The performance gains under reverberation were lower; however, we demonstrated that retraining the systems with dereverberation preprocessing can significantly boost both target speaker extraction and ASR performance.
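The following is a minimal sketch (in PyTorch) of how a 2-D CNN spatial encoder over the multichannel waveform could be fused with a single-channel time-domain spectral encoder, as described at a high level in the abstract. The layer sizes, kernel settings, ReLU activations, and the concatenation-based fusion are illustrative assumptions, not details taken from the paper.

    # Hypothetical sketch: fusing a 2-D CNN spatial encoder with a 1-D
    # single-channel spectral encoder for time-domain speaker extraction.
    import torch
    import torch.nn as nn

    class SpatialSpectralFusion(nn.Module):
        def __init__(self, n_mics=2, n_filters=256, kernel=16, stride=8):
            super().__init__()
            # Spectral encoder: 1-D convolution over the reference-channel
            # waveform, in the style of time-domain extraction front ends.
            self.spectral = nn.Conv1d(1, n_filters, kernel, stride=stride)
            # Spatial encoder: 2-D convolution over (mics x time) so its output
            # frames align with the spectral encoder frames.
            self.spatial = nn.Conv2d(1, n_filters, (n_mics, kernel), stride=(1, stride))

        def forward(self, multichannel_wave):
            # multichannel_wave: (batch, n_mics, samples)
            ref = multichannel_wave[:, :1, :]                    # reference channel
            spec = torch.relu(self.spectral(ref))                # (B, F, T)
            spat = torch.relu(self.spatial(multichannel_wave.unsqueeze(1)))  # (B, F, 1, T)
            spat = spat.squeeze(2)                               # (B, F, T)
            return torch.cat([spec, spat], dim=1)                # fused (B, 2F, T)

    x = torch.randn(4, 2, 32000)       # batch of 2-channel, 2-second 16 kHz mixtures
    feats = SpatialSpectralFusion()(x)
    print(feats.shape)                 # torch.Size([4, 512, 3999])

Concatenating the spatial and spectral features along the feature dimension is one simple fusion choice; the fused representation would then be consumed by the extraction network conditioned on the speaker embedding (i-vector or jointly learned).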
Year
2021
DOI
10.1109/SLT48900.2021.9383582
Venue
2021 IEEE Spoken Language Technology Workshop (SLT)
Keywords
target speaker extraction, spatial features, dereverberation, automatic speech recognition
DocType
Conference
ISSN
2639-5479
Citations
0
PageRank
0.34
References
0
Authors
3
Name               Order   Citations   PageRank
Catalin Zorila     1       2           2.74
Mohan Li           2       3           2.43
Rama Doddipatla    3       2           4.09