Title
Deep Speaker Embedding with Multi-Part Information Aggregation in Frequency-Time Domain for ASV
Abstract
Automatic speaker verification (ASV) verifies the identity of a speaker from a given speech utterance without direct supervision from outside entities. Most recent ASV systems with deep speaker embeddings apply temporal pooling or similar techniques to aggregate frame-level features in the time domain. In this paper, we propose a deep speaker embedding network that adaptively models and fuses multi-part information in the frequency-time domain: a modified ResNet-SO encodes acoustic features into global information, and a proposed multi-part information aggregator distinguishes the global information from different part features and aggregates them with adaptive weight pooling into unified utterance-level embedding descriptors. Moreover, we design a privacy-preserving scheme and preliminarily implement it in a prototype system. Experiments are conducted on three datasets of different scales. We demonstrate that the presented multi-part information aggregator with adaptive weight pooling is superior for producing discriminative and robust utterance-level embedding descriptors. We also show that our network achieves state-of-the-art performance by a significant margin on the popular VoxCeleb1 while requiring fewer parameters than previous approaches.
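The abstract describes aggregating frame-level features into a fixed-length utterance-level descriptor via adaptive weight pooling. The paper's exact formulation is not reproduced in this record; the following is a minimal PyTorch sketch of attention-style adaptive weight pooling under assumed tensor shapes, where the class name `AdaptiveWeightPooling` and its scoring network are illustrative placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class AdaptiveWeightPooling(nn.Module):
    """Minimal sketch of attention-style adaptive weight pooling (illustrative only).

    Frame-level features of shape (batch, frames, dim) are scored by a small
    network; the softmax-normalised scores weight a mean over frames, yielding
    a fixed-size utterance-level descriptor.
    """

    def __init__(self, feat_dim: int, bottleneck: int = 128):
        super().__init__()
        # Hypothetical scoring network; the paper's actual design is not specified here.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim)
        w = torch.softmax(self.score(x), dim=1)  # per-frame weights, (batch, frames, 1)
        return (w * x).sum(dim=1)                # weighted mean, (batch, feat_dim)


if __name__ == "__main__":
    pooling = AdaptiveWeightPooling(feat_dim=256)
    frames = torch.randn(4, 200, 256)            # 4 utterances, 200 frames each
    embedding = pooling(frames)
    print(embedding.shape)                        # torch.Size([4, 256])
```

In a full system of the kind the abstract outlines, such a pooling layer would sit between a frame-level encoder (e.g., a ResNet variant) and the utterance-level embedding layers.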
Year: 2022
DOI: 10.1109/COMPSAC54236.2022.00011
Venue: 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC)
Keywords: automatic speaker verification, CNN, speaker embedding, information aggregation, adaptive weight pooling
DocType: Conference
ISSN: 0730-3157
ISBN: 978-1-6654-8811-2
Citations: 0
PageRank: 0.34
References: 7
Authors: 5
Name           Order   Citations   PageRank
Xiao Li        1       0           0.34
Xi Chen        2       3337        0.76
Dongfei Wang   3       0           0.34
Zhijun Guo     4       0           0.34
Kun Niu        5       0           0.68