Title
Multi-Speaker Emotional Acoustic Modeling for CNN-Based Speech Synthesis
Abstract
In this paper, we investigate multi-speaker emotional acoustic modeling methods for a convolutional neural network (CNN) based speech synthesis system. For emotion modeling, we extend the speech synthesis system so that it learns a latent embedding space of emotion derived from a desired emotional identity, using an emotion code and a mel-frequency spectrogram as emotion identities. To model speaker variation in a text-to-speech (TTS) system, we use speaker representations such as a trainable speaker embedding and a speaker code. We have implemented speech synthesis systems combining speaker and emotion representations and compared them experimentally. Experimental results demonstrate that the multi-speaker emotional speech synthesis approach using a trainable speaker embedding together with an emotion representation derived from the mel spectrogram achieves higher performance than the other approaches in terms of naturalness, speaker similarity, and emotion similarity.
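The abstract describes conditioning a CNN-based acoustic model on two representations: a trainable speaker embedding looked up by speaker ID, and an emotion embedding derived either from an emotion code or from a reference mel spectrogram. Below is a minimal PyTorch-style sketch of the mel-spectrogram variant; it is not the authors' implementation, and every module name, layer size, and dimension is an illustrative assumption.

```python
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    """Hypothetical encoder: maps a reference mel spectrogram
    (B, T, n_mels) to a fixed-size emotion embedding via 1-D
    convolutions over time followed by mean pooling."""
    def __init__(self, n_mels=80, emo_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, emo_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(emo_dim, emo_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, mel):                  # mel: (B, T, n_mels)
        h = self.conv(mel.transpose(1, 2))   # (B, emo_dim, T)
        return h.mean(dim=2)                 # (B, emo_dim)

class ConditionedCNNDecoder(nn.Module):
    """Toy CNN acoustic decoder: text encodings are concatenated with
    the speaker and emotion embeddings, broadcast over all frames."""
    def __init__(self, text_dim=256, spk_dim=64, emo_dim=128,
                 n_mels=80, n_speakers=10):
        super().__init__()
        # Trainable speaker embedding, one vector per speaker ID.
        self.speaker_emb = nn.Embedding(n_speakers, spk_dim)
        in_dim = text_dim + spk_dim + emo_dim
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, text_enc, speaker_id, emo_emb):
        # text_enc: (B, T, text_dim); speaker_id: (B,); emo_emb: (B, emo_dim)
        B, T, _ = text_enc.shape
        spk = self.speaker_emb(speaker_id).unsqueeze(1).expand(B, T, -1)
        emo = emo_emb.unsqueeze(1).expand(B, T, -1)
        x = torch.cat([text_enc, spk, emo], dim=-1)       # (B, T, in_dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)  # (B, T, n_mels)

# Usage sketch with random tensors in place of real features:
enc, dec = EmotionEncoder(), ConditionedCNNDecoder()
ref_mel = torch.randn(2, 120, 80)       # reference utterance mels
text_enc = torch.randn(2, 120, 256)     # frame-aligned text encodings
spk_id = torch.tensor([0, 3])
out = dec(text_enc, spk_id, enc(ref_mel))  # (2, 120, 80)
```

Broadcasting the pooled emotion embedding over every frame is one simple way to realize the "emotion representation from mel spectrogram" conditioning the abstract compares against emotion codes; the paper itself may combine the representations differently.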
Year: 2019
DOI: 10.1109/icassp.2019.8683682
Venue: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Keywords: Text-to-speech, expressive speech synthesis, multi-speaker acoustic modeling, convolutional neural network
Field: Speech synthesis, Embedding, Pattern recognition, Convolutional neural network, Computer science, Spectrogram, Naturalness, Artificial intelligence
DocType: Conference
ISSN: 1520-6149
Citations: 1
PageRank: 0.35
References: 0
Authors: 4
Name          Order  Citations  PageRank
Heejin Choi   1      6          1.80
Sangjun Park  2      2          2.43
Jinuk Park    3      2          2.74
Minsoo Hahn   4      1          0.35