Title
Revisiting the Statistics Pooling Layer in Deep Speaker Embedding Learning
Abstract
The pooling function plays a vital role in the segment-level deep speaker embedding learning framework. One common approach is to compute statistics over the temporal frame-level features; temporal average pooling (TAP), which uses only the mean, and temporal statistics pooling (TSTP), which concatenates the mean and standard deviation, are two typical examples. Empirically, a large performance degradation is observed in the x-vector system when the standard deviation is removed. Motivated by this observation, in this paper we design a set of experiments to quantitatively analyze the effectiveness of different statistics, investigating and comparing pooling functions based on the standard deviation, covariance and ℓp-norm. Experiments are carried out on VoxCeleb and SRE16, and the results show that second-order statistics based pooling functions outperform TAP, while only the simple standard deviation achieves the best performance under all evaluation conditions.
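As an illustration of the two pooling functions contrasted in the abstract, the sketch below shows minimal TAP and TSTP layers, assuming PyTorch and frame-level features of shape (batch, channels, frames); the function names and example dimensions are illustrative assumptions, not taken from the paper.

import torch

def temporal_average_pooling(frames):
    # TAP: keep only the first-order statistic (mean over the time axis)
    # frames: (batch, channels, time) frame-level features
    return frames.mean(dim=2)

def temporal_statistics_pooling(frames, eps=1e-8):
    # TSTP: concatenate mean and standard deviation over the time axis
    mean = frames.mean(dim=2)
    # clamp the variance before the square root for numerical stability
    std = frames.var(dim=2, unbiased=False).clamp(min=eps).sqrt()
    return torch.cat([mean, std], dim=1)

# example: pool 200 frames of 1500-dim features into utterance-level vectors
x = torch.randn(8, 1500, 200)
print(temporal_average_pooling(x).shape)     # torch.Size([8, 1500])
print(temporal_statistics_pooling(x).shape)  # torch.Size([8, 3000])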
Year
2021
DOI
10.1109/ISCSLP49672.2021.9362097
Venue
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)
Keywords
speaker embedding, statistics pooling, speaker recognition
DocType
Conference
ISBN
978-1-7281-6995-8
Citations
0
PageRank
0.34
References
0
Authors
4
Name | Order | Citations | PageRank
Shuai Wang | 1 | 0 | 0.34
Yexin Yang | 2 | 1 | 2.04
Yanmin Qian | 3 | 295 | 44.44
Kai Yu | 4 | 1082 | 90.58