Title
Cross-Channel Attention-Based Target Speaker Voice Activity Detection: Experimental Results for the M2MeT Challenge.
Abstract
In this paper, we present the speaker diarization system developed by team DKU_DukeECE for the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT). Since the dataset contains highly overlapped speech, we employ x-vector-based target-speaker voice activity detection (TS-VAD) to detect overlapping speech between speakers. For the single-channel scenario, we train a separate model for each of the 8 channels and fuse the results. We also employ cross-channel self-attention to further improve performance, learning and fusing the non-linear spatial correlations between different channels. Experimental results on the evaluation set show that the single-channel TS-VAD reduces the DER by over 75%, from 12.68% to 3.14%. The multi-channel TS-VAD further reduces the DER by 28% and achieves a DER of 2.26%. Our final submitted system achieves a DER of 2.98% on the AliMeeting test set, ranking 1st in the M2MeT challenge.
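The following is a minimal, illustrative sketch (not the authors' code) of the cross-channel self-attention idea described in the abstract: per-channel TS-VAD frame embeddings are stacked along a channel axis, a self-attention layer attends across the 8 channels at each frame to learn non-linear spatial correlations, and the channels are then fused. All module names, dimensions, and the mean-pooling fusion are assumptions for illustration.

import torch
import torch.nn as nn

class CrossChannelAttentionFusion(nn.Module):
    # Hypothetical fusion module: self-attention over the channel dimension,
    # applied independently at every frame, followed by average fusion.
    def __init__(self, embed_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # x: (batch, time, channels, embed_dim) -- embeddings from each of the 8 channels
        b, t, c, d = x.shape
        x = x.reshape(b * t, c, d)        # treat each frame as a sequence of channels
        attn_out, _ = self.attn(x, x, x)  # cross-channel self-attention
        x = self.norm(x + attn_out)       # residual connection + layer norm
        fused = x.mean(dim=1)             # fuse across channels (assumed mean pooling)
        return fused.reshape(b, t, d)

# Example: 2 utterances, 100 frames, 8 microphone channels, 256-dim embeddings
fused = CrossChannelAttentionFusion()(torch.randn(2, 100, 8, 256))
print(fused.shape)  # torch.Size([2, 100, 256])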
Year: 2022
DOI: 10.1109/ICASSP43922.2022.9747019
Venue: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 3
Name            Order   Citations   PageRank
Weiqing Wang    1       0           3.04
Xiaoyi Qin      2       0           0.34
Ming Li         3       331         17.67