Title
End-to-End attention based text-dependent speaker verification
Abstract
A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using a phone-discriminant or speaker-discriminant DNN as a feature extractor for speaker verification has shown promising results. In those approaches, the extracted frame-level features (bottleneck, posterior, or d-vector) are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use a speaker-discriminant CNN to extract noise-robust frame-level features. These features are then combined into an utterance-level speaker vector through an attention mechanism. The proposed attention model exploits both speaker-discriminant information and phonetic information to learn the frame weights. The whole system, including the CNN and the attention model, is jointly optimized using an end-to-end criterion. The training algorithm imitates the evaluation process exactly: it directly maps a test utterance and a few target-speaker utterances into a single verification score. The algorithm can also automatically select the most similar impostor for each target speaker to train the network. We demonstrate the effectiveness of the proposed end-to-end system on the Windows 10 “Hey Cortana” speaker verification task.
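The core mechanism the abstract describes is attention-based pooling: per-frame features are scored, softmax-normalized, and summed into a single utterance-level vector. Below is a minimal numpy sketch of such pooling under an assumed additive-attention scorer; the shapes, parameter names (`W`, `v`), and scoring function are illustrative assumptions, not the paper's exact architecture (which also feeds phonetic information into the scorer).

```python
import numpy as np

# Hypothetical sketch of attention pooling over frame-level features.
# H stands in for CNN frame-level features; W and v are assumed
# attention parameters (additive attention), not the paper's exact model.
rng = np.random.default_rng(0)

T, D, A = 50, 64, 16                    # frames, feature dim, attention dim
H = rng.standard_normal((T, D))         # frame-level features (placeholder)

W = rng.standard_normal((A, D)) * 0.1   # attention projection (assumed)
v = rng.standard_normal(A) * 0.1        # attention context vector (assumed)

scores = np.tanh(H @ W.T) @ v           # one scalar score per frame, shape (T,)
weights = np.exp(scores - scores.max()) # numerically stable softmax
weights /= weights.sum()                # weights over frames, sum to 1

utt_vec = weights @ H                   # weighted sum -> utterance vector (D,)
```

Equal weighting (the prior d-vector/i-vector aggregation the abstract contrasts with) is the special case `weights = np.full(T, 1.0 / T)`; the attention model instead learns which frames matter.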
Year
2016
DOI
10.1109/SLT.2016.7846261
Venue
2016 IEEE Spoken Language Technology Workshop (SLT)
Keywords
speaker verification, end-to-end training, attention model, deep learning, CNN
DocType
Conference
Volume
abs/1701.00562
ISSN
2639-5479
ISBN
978-1-5090-4904-2
Citations
12
PageRank
0.70
References
20
Authors
5
Name | Order | Citations | PageRank
Shixiong Zhang | 1 | 107 | 9.34
Zhuo Chen | 2 | 153 | 24.33
Yong Zhao | 3 | 127 | 13.62
Jinyu Li | 4 | 915 | 72.84
Yifan Gong | 5 | 1332 | 135.58