Title
When Hearing the Voice, Who Will Come to Your Mind
Abstract
Speech carries rich biological information about the speaker, including identity attributes such as age, gender, and race. In this paper, we explore a self-supervised method that extracts speaker identity information from high-dimensional speech representations in order to generate face images. Because the biological information contained in a piece of speech can also be expressed in other forms (such as images), we design a cross-modal knowledge distillation method that transfers feature information from the visual domain to the speech domain. The feature vectors obtained through self-supervised learning and knowledge distillation are fed into a GAN-based generative model to produce facial images that contain speaker information. Subjective experiments show that our model performs well on the task of speaker identification. Experiments also show that the proposed method effectively establishes a connection between the two modalities and generates faces with rich biological information.
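The abstract does not say which objective is used for cross-modal distillation. As a rough illustration only, the sketch below assumes one common choice: the speech embedding (student) is pulled toward the face embedding (teacher) by minimizing the squared distance between their unit-normalized vectors. The function names `l2_normalize` and `distillation_loss` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def distillation_loss(speech_emb: Sequence[float],
                      face_emb: Sequence[float]) -> float:
    """Squared L2 distance between unit-normalized embeddings.

    Assumed objective (not confirmed by the source): the face embedding
    acts as a fixed teacher target and the speech embedding as the
    student output. The value lies in [0, 4] and is 0 when the two
    vectors point in the same direction.
    """
    if len(speech_emb) != len(face_emb):
        raise ValueError("embeddings must have the same dimension")
    student = l2_normalize(speech_emb)
    teacher = l2_normalize(face_emb)
    return sum((s - t) ** 2 for s, t in zip(student, teacher))


if __name__ == "__main__":
    # Toy 3-dimensional embeddings, purely illustrative values.
    print(distillation_loss([0.2, 0.9, 0.1], [0.1, 1.0, 0.0]))
```

In a trained system this scalar would be minimized with respect to the speech encoder's parameters while the visual encoder stays fixed; that training setup is likewise an assumption here, since the abstract only names the technique.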
Year: 2021
DOI: 10.1109/IJCNN52387.2021.9534208
Venue: 2021 International Joint Conference on Neural Networks (IJCNN)
Keywords: speech representation, self-supervised learning, cross-modal distillation, visual reconstruction, facial synthesis
DocType: Conference
ISSN: 2161-4393
Citations: 0
PageRank: 0.34
References: 0
Authors: 8
Name            Order  Citations  PageRank
Zhenhou Hong    1      0          0.34
Jianzong Wang   2      61         34.65
Wenqi Wei       3      48         10.69
Jie Liu         4      0          0.68
Xiaoyang Qu     5      0          1.35
Bo Chen         6      0          0.68
Zihang Wei      7      0          0.68
Jing Xiao       8      7          5.78