Title
Revisiting the Statistics Pooling Layer in Deep Speaker Embedding Learning
Abstract
The pooling function plays a vital role in the segment-level deep speaker embedding learning framework. One common approach is to compute statistics over the temporal frame-level features; temporal average pooling (TAP), which uses only the mean, and temporal statistics pooling (TSTP), which concatenates the mean and standard deviation, are two typical examples. Empirically, a large performance degradation is observed in the x-vector system when the standard deviation is removed. Motivated by this observation, in this paper we design a set of experiments to quantitatively analyze the effectiveness of different statistics, investigating and comparing pooling functions based on the standard deviation, covariance and ℓp-norm. Experiments are carried out on VoxCeleb and SRE16, and the results show that second-order statistics based pooling functions outperform TAP, while only the simple standard deviation achieves the best performance under all evaluation conditions.
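As an illustration of the two pooling functions contrasted in the abstract, the sketch below shows minimal TAP and TSTP layers, assuming PyTorch and frame-level features of shape (batch, channels, frames); the function names and example dimensions are illustrative assumptions, not taken from the paper.

import torch

def temporal_average_pooling(frames):
    # TAP: keep only the first-order statistic (mean over the time axis)
    # frames: (batch, channels, time) frame-level features
    return frames.mean(dim=2)

def temporal_statistics_pooling(frames, eps=1e-8):
    # TSTP: concatenate mean and standard deviation over the time axis
    mean = frames.mean(dim=2)
    # clamp the variance before the square root for numerical stability
    std = frames.var(dim=2, unbiased=False).clamp(min=eps).sqrt()
    return torch.cat([mean, std], dim=1)

# example: pool 200 frames of 1500-dim features into utterance-level vectors
x = torch.randn(8, 1500, 200)
print(temporal_average_pooling(x).shape)     # torch.Size([8, 1500])
print(temporal_statistics_pooling(x).shape)  # torch.Size([8, 3000])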
Year
2021
DOI
10.1109/ISCSLP49672.2021.9362097
Venue
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)
Keywords
speaker embedding, statistics pooling, speaker recognition
DocType
Conference
ISBN
978-1-7281-6995-8
Citations
0
PageRank
0.34
References
0
Authors
4
Name | Order | Citations | PageRank
Shuai Wang | 1 | 0 | 0.34
Yexin Yang | 2 | 1 | 2.04
Yanmin Qian | 3 | 295 | 44.44
Kai Yu | 4 | 1082 | 90.58