Title
An Investigation Into The Multi-Channel Time Domain Speaker Extraction Network
Abstract
This paper presents an investigation into the effectiveness of spatial features for improving time-domain speaker extraction systems. A two-dimensional Convolutional Neural Network (CNN) based encoder is proposed to capture the spatial information within the multichannel input, which is then combined with the spectral features of a single-channel extraction network. Two variants of the target speaker extraction method were tested: one that employs a pre-trained i-vector system to compute a speaker embedding (System A), and one that employs a jointly trained neural network to extract the embeddings directly from time-domain enrolment signals (System B). The evaluation was performed on the spatialized WSJ0-2mix dataset using the Signal-to-Distortion Ratio (SDR) metric and ASR accuracy. In the anechoic condition, absolute SDR gains of more than 10 dB and 7 dB were achieved when the 2-D CNN spatial encoder features were included with Systems A and B, respectively. The performance gains under reverberation were lower; however, we demonstrated that retraining the systems with dereverberation preprocessing can significantly boost both target speaker extraction and ASR performance.
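The following is a minimal sketch (in PyTorch) of how a 2-D CNN spatial encoder over the multichannel waveform could be fused with a single-channel time-domain spectral encoder, as described at a high level in the abstract. The layer sizes, kernel settings, ReLU activations, and the concatenation-based fusion are illustrative assumptions, not details taken from the paper.

    # Hypothetical sketch: fusing a 2-D CNN spatial encoder with a 1-D
    # single-channel spectral encoder for time-domain speaker extraction.
    import torch
    import torch.nn as nn

    class SpatialSpectralFusion(nn.Module):
        def __init__(self, n_mics=2, n_filters=256, kernel=16, stride=8):
            super().__init__()
            # Spectral encoder: 1-D convolution over the reference-channel
            # waveform, in the style of time-domain extraction front ends.
            self.spectral = nn.Conv1d(1, n_filters, kernel, stride=stride)
            # Spatial encoder: 2-D convolution over (mics x time) so its output
            # frames align with the spectral encoder frames.
            self.spatial = nn.Conv2d(1, n_filters, (n_mics, kernel), stride=(1, stride))

        def forward(self, multichannel_wave):
            # multichannel_wave: (batch, n_mics, samples)
            ref = multichannel_wave[:, :1, :]                    # reference channel
            spec = torch.relu(self.spectral(ref))                # (B, F, T)
            spat = torch.relu(self.spatial(multichannel_wave.unsqueeze(1)))  # (B, F, 1, T)
            spat = spat.squeeze(2)                               # (B, F, T)
            return torch.cat([spec, spat], dim=1)                # fused (B, 2F, T)

    x = torch.randn(4, 2, 32000)       # batch of 2-channel, 2-second 16 kHz mixtures
    feats = SpatialSpectralFusion()(x)
    print(feats.shape)                 # torch.Size([4, 512, 3999])

Concatenating the spatial and spectral features along the feature dimension is one simple fusion choice; the fused representation would then be consumed by the extraction network conditioned on the speaker embedding (i-vector or jointly learned).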
Year
2021
DOI
10.1109/SLT48900.2021.9383582
Venue
2021 IEEE Spoken Language Technology Workshop (SLT)
Keywords
target speaker extraction, spatial features, dereverberation, automatic speech recognition
DocType
Conference
ISSN
2639-5479
Citations
0
PageRank
0.34
References
0
Authors
3
Name               Order   Citations   PageRank
Catalin Zorila     1       2           2.74
Mohan Li           2       3           2.43
Rama Doddipatla    3       2           4.09