Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification - Citegraph

Paper Info

Title
Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification

Abstract
Short-duration text-independent speaker verification has remained an active research topic in recent years, and deep neural network based embeddings have shown impressive results in such conditions. Good speaker embeddings need both small intra-class variation and large inter-class difference, which are critical for discrimination and generalization. Current embedding learning strategies can be grouped into two frameworks: “cascade embedding learning” with multiple stages and “direct embedding learning” directly from spectral features. We propose new approaches to obtain more discriminative speaker embeddings. Within the cascade framework, a neural network based deep discriminant analysis (DDA) is proposed to project i-vectors to more discriminative embeddings. Within the direct embedding framework, a deep model with the more advanced center loss and A-softmax loss is used, and the focal loss is also investigated. Moreover, the traditional i-vector and neural embeddings are finally combined with the neural network based DDA to achieve further gain. The main experiments are carried out on a short-duration text-independent speaker verification dataset generated from the SRE corpus. The results show that the newly proposed method is promising for short-duration text-independent speaker verification, and it is consistently better than the traditional i-vector and neural embedding baselines. The best embeddings achieve roughly 30% relative EER reduction compared to the i-vector baseline, and this gain can be further improved by combining them with the i-vector system.
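
For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of two of the losses it names and of the "relative EER reduction" arithmetic. The abstract gives no formulas, so the commonly cited textbook forms of focal loss and center loss are assumed here; they may differ from the variants used in the paper. All function names and the `gamma` default are illustrative, and the A-softmax loss is omitted.

```python
import math


def softmax(logits):
    # Subtract the maximum logit for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, label, gamma=2.0):
    # Assumed standard form: -(1 - p)^gamma * log(p), where p is the
    # softmax probability of the true speaker. gamma = 0 recovers
    # ordinary cross-entropy.
    p = max(softmax(logits)[label], 1e-12)
    return -((1.0 - p) ** gamma) * math.log(p)


def center_loss(embedding, center):
    # Assumed standard form: half the squared Euclidean distance between
    # an embedding and the center of its speaker class. Minimizing it
    # shrinks intra-class variation.
    return 0.5 * sum((e - c) ** 2 for e, c in zip(embedding, center))


def relative_eer_reduction(baseline_eer, new_eer):
    # Relative reduction of the equal error rate versus a baseline.
    return (baseline_eer - new_eer) / baseline_eer
```

As a purely hypothetical illustration of the last function (these are not figures from the paper), a baseline EER of 10% and a new EER of 7% would correspond to a relative reduction of about 30%.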

Year	DOI	Venue
2019	10.1109/TASLP.2019.2928128	IEEE/ACM Transactions on Audio, Speech, and Language Processing
Keywords	Field	DocType
Neural networks,Speech processing,Training,Feature extraction,Optimization,Analytical models,Linear discriminant analysis	Speech processing,Embedding,Pattern recognition,Discriminant,Computer science,Feature extraction,Speech recognition,Artificial intelligence,Cascade,Linear discriminant analysis,Artificial neural network,Discriminative model	Journal
Volume	Issue	ISSN
27	11	2329-9290
Citations	PageRank	References
2	0.39	11
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Shuai Wang	1	252	48.81
Zili Huang	2	17	5.47
Yanmin Qian	3	295	44.44
Kai Yu	4	1082	90.58
