Abstract
---
As an emerging biometric recognition technology, speaker recognition is gaining increasing attention because of its advantages in remote authentication. In this paper, we construct an end-to-end speaker recognition model named GAPCNN, in which a convolutional neural network extracts speaker embeddings from spectrograms and verification is performed by computing the cosine similarity between embeddings. In addition, we use global average pooling instead of the traditional temporal average pooling to handle utterances of varying length. We train on the ‘dev’ set of VoxCeleb2, evaluate the model on the test set of VoxCeleb1, and obtain an equal error rate (EER) of 4.04%. Furthermore, we fuse GAPCNN with the x-vector model and the thin-ResNet model with GhostVLAD, obtaining an EER of 3.01%, which is better than any of the three models alone. This indicates that GAPCNN is an important complement to the x-vector model and the thin-ResNet model with GhostVLAD.
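The two mechanisms the abstract names can be illustrated briefly: global average pooling collapses the time axis of CNN feature maps into a fixed-size embedding whatever the utterance length, and the cosine similarity between two embeddings serves as the verification score. The sketch below is a minimal illustration of these two operations only, not the paper's GAPCNN architecture; the feature maps and dimensions are hypothetical.

```python
import numpy as np

def global_average_pool(features: np.ndarray) -> np.ndarray:
    """Average CNN feature maps (channels x time) over the time axis,
    yielding a fixed-size embedding regardless of utterance length."""
    return features.mean(axis=-1)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Verification score between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature maps for two utterances of different lengths
# (64 channels; 120 vs. 200 time frames): pooling yields same-size embeddings.
rng = np.random.default_rng(0)
f1 = rng.standard_normal((64, 120))
f2 = rng.standard_normal((64, 200))
e1, e2 = global_average_pool(f1), global_average_pool(f2)
score = cosine_similarity(e1, e2)  # accept same-speaker if above a threshold
```

A temporal-average-pooling layer computes the same statistic over a fixed window; the point of global averaging is that the embedding dimension depends only on the channel count, so no input padding or cropping is needed at test time.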
Year | DOI | Venue |
---|---|---|
2022 | 10.1007/s10772-022-09973-w | International Journal of Speech Technology |
Keywords | DocType | Volume
---|---|---
Speaker recognition, Speaker verification, Model fusion, CNN | Journal | 25

Issue | ISSN | Citations
---|---|---
2 | 1381-2416 | 0

PageRank | References | Authors
---|---|---
0.34 | 2 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Wu Hao | 1 | 0 | 0.34 |
Linkai Luo | 2 | 163 | 14.00 |
Hong Peng | 3 | 14 | 10.33 |
Wen Wei | 4 | 0 | 0.34 |