Title | ||
---|---|---|
Deep Speaker Embedding with Multi-Part Information Aggregation in Frequency-Time Domain for ASV |
Abstract | ||
---|---|---|
Automatic speaker verification (ASV) verifies the identity of a speaker from a given speech utterance without direct supervision from outside entities. The majority of recent ASV systems based on deep speaker embeddings apply temporal pooling or similar techniques to aggregate frame-level features in the time domain. In this paper, we propose a deep speaker embedding network that adaptively models and fuses multi-part information in the frequency-time domain. It uses a modified ResNet-50 to encode acoustic features into global information, and a proposed multi-part information aggregator to separate the global information from the features of different parts and aggregate them, via adaptive weight pooling, into unified utterance-level embedding descriptors. Moreover, we design a privacy-preserving scheme and implement a preliminary version of it in a prototype system. Experiments are conducted on datasets of three different scales. We demonstrate that the presented multi-part information aggregator with adaptive weight pooling is superior for producing discriminative and robust utterance-level embedding descriptors. We also show that our network achieves state-of-the-art performance by a significant margin on the popular VoxCeleb1 while requiring fewer parameters than previous approaches. |
Year | DOI | Venue |
---|---|---|
2022 | 10.1109/COMPSAC54236.2022.00011 | 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC) |
Keywords | DocType | ISSN |
automatic speaker verification, CNN, speaker embedding, information aggregation, adaptive weight pooling | Conference | 0730-3157 |
ISBN | Citations | PageRank |
978-1-6654-8811-2 | 0 | 0.34 |
References | Authors | |
7 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Xiao Li | 1 | 0 | 0.34 |
Xi Chen | 2 | 333 | 70.76 |
Dongfei Wang | 3 | 0 | 0.34 |
Zhijun Guo | 4 | 0 | 0.34 |
Kun Niu | 5 | 0 | 0.68 |