Title
Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification
Abstract
Short-duration text-independent speaker verification has remained an active research topic in recent years, and deep neural network based embeddings have shown impressive results in such conditions. Good speaker embeddings should exhibit both small intra-class variation and large inter-class difference, which is critical for discrimination and generalization. Current embedding learning strategies can be grouped into two frameworks: "cascade embedding learning" with multiple stages, and "direct embedding learning" directly from spectral features. We propose new approaches to achieve more discriminative speaker embeddings. Within the cascade framework, a neural network based deep discriminant analysis (DDA) is proposed to project the i-vector to more discriminative embeddings. Within the direct embedding framework, a deep model is trained with the more advanced center loss and A-softmax loss, and the focal loss is also investigated. Moreover, the traditional i-vector and neural embeddings are finally combined with neural network based DDA to achieve further gain. The main experiments are carried out on a short-duration text-independent speaker verification dataset generated from the SRE corpus. The results show that the newly proposed method is promising for short-duration text-independent speaker verification, and that it is consistently better than the traditional i-vector and neural embedding baselines. The best embeddings achieve roughly 30% relative EER reduction compared to the i-vector baseline, which could be further enhanced when combined with the i-vector system.
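The loss terms and the relative EER figure named in the abstract can be illustrated with a minimal sketch. It uses the textbook forms of the center loss (squared distance to the class center) and the focal loss (cross-entropy down-weighted by (1 - p_t)^gamma), which are assumptions here: the paper's "more advanced" variants and its A-softmax loss are not reproduced. The function names, the batch averaging, and the EER values in the usage note are hypothetical and are not taken from the paper.

```python
import math


def center_loss(embeddings, labels, centers):
    """Average of 0.5 * squared distance between each embedding and its class center.

    Small values mean small intra-class variation. Averaging over the batch
    is an assumption of this sketch.
    """
    total = 0.0
    for x, y in zip(embeddings, labels):
        c = centers[y]
        total += 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total / len(embeddings)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def focal_loss(logits, label, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    gamma = 0 recovers ordinary cross-entropy; larger gamma down-weights
    samples that are already classified with high confidence.
    """
    p_t = softmax(logits)[label]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction, e.g. 0.3 for a 30% relative improvement."""
    return (baseline_eer - new_eer) / baseline_eer
```

As a usage example with made-up numbers, `relative_reduction(10.0, 7.0)` corresponds to a 30% relative reduction: an EER that drops from a hypothetical 10% to 7%. A relative reduction is measured against the baseline EER, so it is not the same as a drop of 30 absolute percentage points.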
Year: 2019
DOI: 10.1109/TASLP.2019.2928128
Venue: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Keywords: Neural networks, Speech processing, Training, Feature extraction, Optimization, Analytical models, Linear discriminant analysis
Field: Speech processing, Embedding, Pattern recognition, Discriminant, Computer science, Feature extraction, Speech recognition, Artificial intelligence, Cascade, Linear discriminant analysis, Artificial neural network, Discriminative model
DocType: Journal
Volume: 27
Issue: 11
ISSN: 2329-9290
Citations: 2
PageRank: 0.39
References: 11
Authors: 4
Name          Order  Citations  PageRank
Shuai Wang    1      252        48.81
Zili Huang    2      17         5.47
Yanmin Qian   3      295        44.44
Kai Yu        4      1082       90.58