Title
Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition
Abstract
The current success of deep learning largely benefits from the availability of large amounts of labeled data. However, collecting a large-scale dataset with human annotations can be expensive and sometimes difficult. Self-supervised learning has therefore attracted considerable research interest as a way to train models without labels. In this paper, we propose a self-supervised learning framework for speaker recognition. Combining clustering with deep representation learning, the proposed framework generates pseudo labels for an unlabeled dataset and learns speaker representations without human annotation. Our method starts by training a speaker representation encoder with contrastive self-supervised learning. Clustering the learned representations generates pseudo labels, which are used as the supervisory signal for subsequent training of the representation encoder. The clustering and representation learning steps are performed iteratively to bootstrap the discriminative power of the deep neural network. We apply this self-supervised learning framework to both single-modal audio data and multi-modal audio-visual data. For audio-visual data, audio and visual representation encoders are employed to learn representations of the corresponding modalities. A cluster ensemble algorithm is then used to fuse the clustering results of the two modalities. The complementary information across modalities ensures a robust and fault-tolerant supervisory signal for audio and visual representation learning. Experimental results show that our proposed iterative self-supervised learning framework outperforms previous self-supervised approaches by large margins. Training with single-modal audio data on the development set of VoxCeleb 2, the proposed framework achieves an equal error rate (EER) of 2.8% on the original test trials of VoxCeleb 1. When training with the additional visual modality, the EER further reduces to 1.8%, which is only 20% higher than the fully supervised audio-based system with an EER of 1.5%. Experimental analysis also shows that the proposed framework generates pseudo labels that are highly correlated with the ground truth labels.
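The sketch below illustrates the iterative clustering and pseudo-labeling loop described in the abstract, assuming a toy MLP encoder, synthetic input features, k-means clustering via scikit-learn, and a cross-entropy objective on the pseudo labels. The contrastive pre-training stage and the audio-visual cluster ensemble are omitted, and all names (Encoder, iterative_self_labeling, NUM_CLUSTERS, etc.) and hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the iterative self-labeling loop: cluster current embeddings to get
# pseudo speaker labels, retrain the encoder on those labels, and repeat.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

FEAT_DIM, EMB_DIM, NUM_CLUSTERS, ROUNDS, EPOCHS = 40, 128, 32, 3, 5  # illustrative values

class Encoder(nn.Module):
    """Stand-in for the speaker representation encoder."""
    def __init__(self, feat_dim=FEAT_DIM, emb_dim=EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim))

    def forward(self, x):
        return self.net(x)

def cluster_pseudo_labels(embeddings, k=NUM_CLUSTERS):
    """Run k-means on the current embeddings to obtain pseudo speaker labels."""
    return KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)

def train_round(encoder, feats, pseudo_labels, epochs=EPOCHS):
    """Train the encoder plus a classification head on the pseudo labels."""
    head = nn.Linear(EMB_DIM, NUM_CLUSTERS)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
    x = torch.tensor(feats, dtype=torch.float32)
    y = torch.tensor(pseudo_labels, dtype=torch.long)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(encoder(x)), y)
        loss.backward()
        opt.step()
    return loss.item()

def iterative_self_labeling(feats, rounds=ROUNDS):
    """Alternate clustering and representation learning to bootstrap the encoder."""
    encoder = Encoder()
    for r in range(rounds):
        with torch.no_grad():
            emb = encoder(torch.tensor(feats, dtype=torch.float32)).numpy()
        labels = cluster_pseudo_labels(emb)           # pseudo labels from clustering
        loss = train_round(encoder, feats, labels)    # supervised training on pseudo labels
        print(f"round {r}: final loss {loss:.3f}")
    return encoder

if __name__ == "__main__":
    fake_feats = np.random.randn(512, FEAT_DIM).astype(np.float32)  # placeholder features
    iterative_self_labeling(fake_feats)
```

In the multi-modal setting described in the abstract, the single k-means step would be replaced by clustering each modality's embeddings separately and fusing the partitions with a cluster ensemble before the retraining step.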
Year
2022
DOI
10.1109/TASLP.2022.3162078
Venue
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING
Keywords
Training, Visualization, Representation learning, Speaker recognition, Feature extraction, Clustering algorithms, Encoding, Self-supervised learning, self-labeling, clustering, speaker recognition, audio-visual data
DocType
Journal
Volume
30
Issue
1
ISSN
2329-9290
Citations
0
PageRank
0.34
References
0
Authors
3
Name            Order    Citations    PageRank
Danwei Cai      1        16           6.71
Weiqing Wang    2        0            3.04
Ming Li         3        5595         829.00