Title
A method of multi-models fusion for speaker recognition
Abstract
As a new type of biometric recognition technology, speaker recognition is gaining more and more attention because of its advantages in remote authentication. In this paper, we construct an end-to-end speaker recognition model named GAPCNN, in which a convolutional neural network is used to extract speaker embeddings from spectrograms, and speaker recognition is performed via the cosine similarity of embeddings. In addition, we use global average pooling instead of the traditional temporal average pooling to adapt to different voice lengths. We train on the 'dev' set of VoxCeleb2, evaluate the model on the test set of VoxCeleb1, and obtain an equal error rate (EER) of 4.04%. Furthermore, we fuse our GAPCNN with the x-vector model and the thin-ResNet model with GhostVLAD, and obtain an EER of 3.01%, which is better than any of the three individual models. This indicates that GAPCNN is an important complement to the x-vector model and the thin-ResNet model with GhostVLAD.
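The two mechanisms the abstract names can be illustrated with a minimal, stdlib-only sketch: global average pooling collapses a variable-length feature map into a fixed-size embedding, and verification scores come from cosine similarity (optionally fused across models by a weighted sum). The function names and the equal-weight fusion are illustrative assumptions, not the authors' exact implementation.

```python
import math

def global_average_pool(feature_map):
    # Average a [time][channels] feature map over the time axis, so
    # utterances of any length yield an embedding of the same size.
    n = len(feature_map)
    dim = len(feature_map[0])
    return [sum(frame[c] for frame in feature_map) / n for c in range(dim)]

def cosine_similarity(a, b):
    # Verification score between two speaker embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuse_scores(scores, weights):
    # Score-level fusion of several models; weights are hypothetical here.
    return sum(w * s for w, s in zip(weights, scores))

# Two "utterances" of different lengths map to same-size embeddings.
short = [[1.0, 0.0], [3.0, 2.0]]                           # 2 frames
long_ = [[1.0, 0.0], [3.0, 2.0], [2.0, 1.0], [2.0, 1.0]]   # 4 frames
e1 = global_average_pool(short)   # [2.0, 1.0]
e2 = global_average_pool(long_)   # [2.0, 1.0]
score = cosine_similarity(e1, e2)  # 1.0 for identical embeddings
fused = fuse_scores([score, 0.8, 0.9], [1 / 3, 1 / 3, 1 / 3])
```

The sketch shows why global average pooling suits variable-length speech: unlike a fixed-window temporal pool, the output dimension depends only on the channel count, never on the number of frames.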
Year
2022
DOI
10.1007/s10772-022-09973-w
Venue
International Journal of Speech Technology
Keywords
Speaker recognition, Speaker verification, Model fusion, CNN
DocType
Journal
Volume
25
Issue
2
ISSN
1381-2416
Citations
0
PageRank
0.34
References
2
Authors
4
Name | Order | Citations | PageRank
Wu Hao | 1 | 0 | 0.34
Linkai Luo | 2 | 163 | 14.00
Hong Peng | 3 | 14 | 10.33
Wen Wei | 4 | 0 | 0.34