Title
Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition
Abstract
The current success of deep learning largely benefits from the availability of large amounts of labeled data. However, collecting a large-scale dataset with human annotations can be expensive and sometimes difficult. Self-supervised learning has therefore attracted considerable research interest as a way to train models without labels. In this paper, we propose a self-supervised learning framework for speaker recognition. Combining clustering with deep representation learning, the proposed framework generates pseudo labels for an unlabeled dataset and learns speaker representations without human annotation. Our method starts by training a speaker representation encoder with contrastive self-supervised learning. Clustering the learned representations generates pseudo labels, which are used as the supervisory signal for subsequent training of the representation encoder. The clustering and representation learning steps are performed iteratively to bootstrap the discriminative power of the deep neural network. We apply this self-supervised learning framework to both single-modal audio data and multi-modal audio-visual data. For audio-visual data, audio and visual representation encoders are employed to learn representations of the corresponding modalities. A cluster ensemble algorithm is then used to fuse the clustering results of the two modalities. The complementary information across modalities ensures a robust and fault-tolerant supervisory signal for audio and visual representation learning. Experimental results show that our proposed iterative self-supervised learning framework outperforms previous self-supervised approaches by large margins. Training with single-modal audio data on the development set of VoxCeleb 2, the proposed framework achieves an equal error rate (EER) of 2.8% on the original test trials of VoxCeleb 1. When training with the additional visual modality, the EER further reduces to 1.8%, which is only 20% higher than the fully supervised audio-based system with an EER of 1.5%. Experimental analysis also shows that the proposed framework generates pseudo labels that are highly correlated with the ground truth labels.
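The sketch below illustrates the iterative clustering and pseudo-labeling loop described in the abstract, assuming a toy MLP encoder, synthetic input features, k-means clustering via scikit-learn, and a cross-entropy objective on the pseudo labels. The contrastive pre-training stage and the audio-visual cluster ensemble are omitted, and all names (Encoder, iterative_self_labeling, NUM_CLUSTERS, etc.) and hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the iterative self-labeling loop: cluster current embeddings to get
# pseudo speaker labels, retrain the encoder on those labels, and repeat.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

FEAT_DIM, EMB_DIM, NUM_CLUSTERS, ROUNDS, EPOCHS = 40, 128, 32, 3, 5  # illustrative values

class Encoder(nn.Module):
    """Stand-in for the speaker representation encoder."""
    def __init__(self, feat_dim=FEAT_DIM, emb_dim=EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim))

    def forward(self, x):
        return self.net(x)

def cluster_pseudo_labels(embeddings, k=NUM_CLUSTERS):
    """Run k-means on the current embeddings to obtain pseudo speaker labels."""
    return KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)

def train_round(encoder, feats, pseudo_labels, epochs=EPOCHS):
    """Train the encoder plus a classification head on the pseudo labels."""
    head = nn.Linear(EMB_DIM, NUM_CLUSTERS)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
    x = torch.tensor(feats, dtype=torch.float32)
    y = torch.tensor(pseudo_labels, dtype=torch.long)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(encoder(x)), y)
        loss.backward()
        opt.step()
    return loss.item()

def iterative_self_labeling(feats, rounds=ROUNDS):
    """Alternate clustering and representation learning to bootstrap the encoder."""
    encoder = Encoder()
    for r in range(rounds):
        with torch.no_grad():
            emb = encoder(torch.tensor(feats, dtype=torch.float32)).numpy()
        labels = cluster_pseudo_labels(emb)           # pseudo labels from clustering
        loss = train_round(encoder, feats, labels)    # supervised training on pseudo labels
        print(f"round {r}: final loss {loss:.3f}")
    return encoder

if __name__ == "__main__":
    fake_feats = np.random.randn(512, FEAT_DIM).astype(np.float32)  # placeholder features
    iterative_self_labeling(fake_feats)
```

In the multi-modal setting described in the abstract, the single k-means step would be replaced by clustering each modality's embeddings separately and fusing the partitions with a cluster ensemble before the retraining step.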
Year
2022
DOI
10.1109/TASLP.2022.3162078
Venue
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING
Keywords
Training, Visualization, Representation learning, Speaker recognition, Feature extraction, Clustering algorithms, Encoding, Self-supervised learning, self-labeling, clustering, speaker recognition, audio-visual data
DocType
Journal
Volume
30
Issue
1
ISSN
2329-9290
Citations
0
PageRank
0.34
References
0
Authors
3
Name            Order    Citations    PageRank
Danwei Cai      1        16           6.71
Weiqing Wang    2        0            3.04
Ming Li         3        5595         829.00