Deep Speaker Embedding with Multi-Part Information Aggregation in Frequency-Time Domain for ASV - Citegraph

Paper Info

Title
Deep Speaker Embedding with Multi-Part Information Aggregation in Frequency-Time Domain for ASV

Abstract
Automatic speaker verification (ASV) verifies the identity of a speaker from a given speech utterance without direct supervision from outside entities. The majority of recent ASV systems with deep speaker embedding apply temporal pooling or similar techniques for frame-level feature aggregation in the time domain. In this paper, we propose a deep speaker embedding network for adaptively modelling and fusing multi-part information in the frequency-time domain. It uses a modified ResNet-50 to encode acoustic features into global information, and a proposed multi-part information aggregator to distinguish global information from different part features and aggregate them with adaptive weight pooling into unified utterance-level embedding descriptors. Moreover, we design a privacy-preserving scheme and preliminarily implement it in a prototype system. Experiments are conducted on datasets of three scales. We demonstrate that the presented multi-part information aggregator with adaptive weight pooling is superior for producing discriminative and robust utterance-level embedding descriptors. We also show that our network achieves state-of-the-art performance by a significant margin on the popular VoxCeleb1 while requiring fewer parameters than previous approaches.

Year	DOI	Venue
2022	10.1109/COMPSAC54236.2022.00011	2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC)
Keywords	DocType	ISSN
automatic speaker verification,CNN,speaker embedding,information aggregation,adaptive weight pooling	Conference	0730-3157
ISBN	Citations	PageRank
978-1-6654-8811-2	0	0.34
References	Authors
7	5

Authors (5 rows)

Name	Order	Citations	PageRank
Xiao Li	1	0	0.34
Xi Chen	2	333	70.76
Dongfei Wang	3	0	0.34
Zhijun Guo	4	0	0.34
Kun Niu	5	0	0.68
