Abstract
---
In this paper, we investigate multi-speaker emotional acoustic modeling methods for a convolutional neural network (CNN) based speech synthesis system. For emotion modeling, we extend the speech synthesis system to learn a latent embedding space of emotion, derived from a desired emotional identity, using either an emotion code or a mel-frequency spectrogram as the emotion identity. To model speaker variation in a text-to-speech (TTS) system, we use speaker representations such as a trainable speaker embedding and a speaker code. We implement speech synthesis systems combining speaker and emotion representations and compare them experimentally. Experimental results demonstrate that the multi-speaker emotional speech synthesis approach using a trainable speaker embedding together with an emotion representation derived from the mel spectrogram outperforms the other approaches in terms of naturalness, speaker similarity, and emotion similarity.
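The conditioning scheme the abstract describes (a trainable per-speaker embedding combined with an emotion embedding pooled from a reference mel spectrogram) can be illustrated with a minimal sketch. All dimensions, names, and the single-projection "reference encoder" below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the paper.
N_SPEAKERS, SPK_DIM = 4, 16   # speakers, speaker-embedding size
N_MELS, EMO_DIM = 80, 16      # mel channels, emotion-embedding size

# Trainable speaker embedding: one learned vector per speaker ID
# (here initialized randomly in place of training).
speaker_table = rng.standard_normal((N_SPEAKERS, SPK_DIM)) * 0.1

# Stand-in for a reference encoder: a single linear projection of the
# time-averaged mel spectrogram into the latent emotion space.
emo_proj = rng.standard_normal((N_MELS, EMO_DIM)) * 0.1

def condition_vector(speaker_id, mel):
    """Build a decoder conditioning vector from a speaker ID and a
    reference mel spectrogram of shape [n_mels, n_frames]."""
    spk = speaker_table[speaker_id]             # embedding-table lookup
    emo = np.tanh(mel.mean(axis=1) @ emo_proj)  # pooled mel -> emotion space
    return np.concatenate([spk, emo])           # fed to the CNN decoder

mel = rng.standard_normal((N_MELS, 120))        # dummy reference utterance
cond = condition_vector(speaker_id=2, mel=mel)
print(cond.shape)  # (32,)
```

In a real system both the speaker table and the emotion encoder would be trained jointly with the acoustic model; this only shows how the two representations are combined into one conditioning vector.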
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/icassp.2019.8683682 | 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) |
Keywords | Field | DocType
---|---|---
Text-to-speech, expressive speech synthesis, multi-speaker acoustic modeling, convolutional neural network | Speech synthesis, Embedding, Pattern recognition, Convolutional neural network, Computer science, Spectrogram, Naturalness, Artificial intelligence | Conference
ISSN | Citations | PageRank
---|---|---
1520-6149 | 1 | 0.35
References | Authors
---|---
0 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Heejin Choi | 1 | 6 | 1.80 |
Sangjun Park | 2 | 2 | 2.43
Jinuk Park | 3 | 2 | 2.74 |
Minsoo Hahn | 4 | 1 | 0.35 |