Title | ||
---|---|---|
Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification |
Abstract | ||
---|---|---|
Short-duration text-independent speaker verification has remained an active research topic in recent years, and deep neural network based embeddings have shown impressive results in such conditions. Good speaker embeddings need both small intra-class variation and large inter-class difference, which is critical for discrimination and generalization. Current embedding learning strategies can be grouped into two frameworks: "cascade embedding learning" with multiple stages, and "direct embedding learning" directly from spectral features. We propose new approaches to achieve more discriminative speaker embeddings. Within the cascade framework, a neural network based deep discriminant analysis (DDA) is proposed to project i-vectors to more discriminative embeddings. Within the direct embedding framework, a deep model with more advanced center loss and A-softmax loss is used, and the focal loss is also investigated. Moreover, the traditional i-vector and neural embeddings are finally combined with neural network based DDA to achieve further gain. The main experiments are carried out on a short-duration text-independent speaker verification dataset generated from the SRE corpus. The results show that the newly proposed method is promising for short-duration text-independent speaker verification, and that it is consistently better than the traditional i-vector and neural embedding baselines. The best embeddings achieve roughly 30% relative EER reduction compared to the i-vector baseline, which could be further enhanced when combined with the i-vector system. |
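
The abstract's requirement of small intra-class variation is what a center loss penalizes directly. The following is a minimal sketch of the basic center loss (half the squared Euclidean distance between each embedding and its class center), not the "more advanced" variant the paper uses, whose details the abstract does not give. The function name `center_loss`, the averaging over the batch, and the toy embeddings, labels, and centers are all assumptions made for illustration.

```python
def center_loss(embeddings, labels, centers):
    """Basic center loss, averaged over the batch (averaging is an assumption).

    embeddings: list of equal-length float vectors, one per utterance
    labels:     list of speaker ids, aligned with embeddings
    centers:    dict mapping speaker id -> center vector
    """
    total = 0.0
    for x, y in zip(embeddings, labels):
        c = centers[y]
        # 0.5 * squared Euclidean distance to the speaker's center
        total += 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total / len(embeddings)


# Hypothetical toy data: two utterances of speaker 0, one of speaker 1.
embeddings = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
labels = [0, 0, 1]
centers = {0: [0.5, 0.5], 1: [3.0, 3.0]}

loss = center_loss(embeddings, labels, centers)

# Embeddings that sit exactly on their centers give zero loss.
tight = center_loss([[0.5, 0.5], [3.0, 3.0]], [0, 1], centers)
```

In training, a term like this is typically added to a classification loss with a weighting factor, so that embeddings are pulled toward their speaker's center while still being separated across speakers.
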
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/TASLP.2019.2928128 | IEEE/ACM Transactions on Audio, Speech, and Language Processing |
Keywords | Field | DocType |
Neural networks,Speech processing,Training,Feature extraction,Optimization,Analytical models,Linear discriminant analysis | Speech processing,Embedding,Pattern recognition,Discriminant,Computer science,Feature extraction,Speech recognition,Artificial intelligence,Cascade,Linear discriminant analysis,Artificial neural network,Discriminative model | Journal |
Volume | Issue | ISSN |
27 | 11 | 2329-9290 |
Citations | PageRank | References |
2 | 0.39 | 11 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Shuai Wang | 1 | 252 | 48.81 |
Zili Huang | 2 | 17 | 5.47 |
Yanmin Qian | 3 | 295 | 44.44 |
Kai Yu | 4 | 1082 | 90.58 |