Title
End-to-End attention based text-dependent speaker verification
Abstract
A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using a phone-discriminant or speaker-discriminant DNN as a feature extractor for speaker verification has shown promising results. In those approaches, the extracted frame-level features (bottleneck, posterior, or d-vector) are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use a speaker-discriminant CNN to extract noise-robust frame-level features. These features are then combined into an utterance-level speaker vector through an attention mechanism. The proposed attention model exploits both speaker-discriminant information and phonetic information to learn the frame weights. The whole system, including the CNN and the attention model, is jointly optimized using an end-to-end criterion. The training algorithm imitates the evaluation process exactly: it directly maps a test utterance and a few target-speaker utterances into a single verification score. The algorithm can also automatically select the most similar impostor for each target speaker to train the network. We demonstrate the effectiveness of the proposed end-to-end system on the Windows 10 “Hey Cortana” speaker verification task.
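The core mechanism the abstract describes is attention-based pooling: per-frame features are scored, softmax-normalized, and summed into a single utterance-level vector. Below is a minimal numpy sketch of such pooling under an assumed additive-attention scorer; the shapes, parameter names (`W`, `v`), and scoring function are illustrative assumptions, not the paper's exact architecture (which also feeds phonetic information into the scorer).

```python
import numpy as np

# Hypothetical sketch of attention pooling over frame-level features.
# H stands in for CNN frame-level features; W and v are assumed
# attention parameters (additive attention), not the paper's exact model.
rng = np.random.default_rng(0)

T, D, A = 50, 64, 16                    # frames, feature dim, attention dim
H = rng.standard_normal((T, D))         # frame-level features (placeholder)

W = rng.standard_normal((A, D)) * 0.1   # attention projection (assumed)
v = rng.standard_normal(A) * 0.1        # attention context vector (assumed)

scores = np.tanh(H @ W.T) @ v           # one scalar score per frame, shape (T,)
weights = np.exp(scores - scores.max()) # numerically stable softmax
weights /= weights.sum()                # weights over frames, sum to 1

utt_vec = weights @ H                   # weighted sum -> utterance vector (D,)
```

Equal weighting (the prior d-vector/i-vector aggregation the abstract contrasts with) is the special case `weights = np.full(T, 1.0 / T)`; the attention model instead learns which frames matter.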
Year
2016
DOI
10.1109/SLT.2016.7846261
Venue
2016 IEEE Spoken Language Technology Workshop (SLT)
Keywords
speaker verification, end-to-end training, attention model, deep learning, CNN
DocType
Conference
Volume
abs/1701.00562
ISSN
2639-5479
ISBN
978-1-5090-4904-2
Citations
12
PageRank
0.70
References
20
Authors
5
Name | Order | Citations | PageRank
Shixiong Zhang | 1 | 107 | 9.34
Zhuo Chen | 2 | 153 | 24.33
Yong Zhao | 3 | 127 | 13.62
Jinyu Li | 4 | 915 | 72.84
Yifan Gong | 5 | 1332 | 135.58