Abstract
---
In this paper, we investigate multi-speaker emotional acoustic modeling methods for a convolutional neural network (CNN) based speech synthesis system. For emotion modeling, we extend the speech synthesis system to learn a latent embedding space of emotion, derived from a desired emotional identity, using either an emotion code or a mel-frequency spectrogram as the emotion identity. To model speaker variation in a text-to-speech (TTS) system, we use speaker representations such as a trainable speaker embedding and a speaker code. We implement speech synthesis systems combining speaker and emotion representations and compare them experimentally. Experimental results demonstrate that the multi-speaker emotional speech synthesis approach using a trainable speaker embedding together with an emotion representation derived from the mel spectrogram outperforms the other approaches in terms of naturalness, speaker similarity, and emotion similarity.
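The conditioning scheme the abstract describes (a trainable per-speaker embedding combined with an emotion embedding pooled from a reference mel spectrogram) can be illustrated with a minimal sketch. All dimensions, names, and the single-projection "reference encoder" below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the paper.
N_SPEAKERS, SPK_DIM = 4, 16   # speakers, speaker-embedding size
N_MELS, EMO_DIM = 80, 16      # mel channels, emotion-embedding size

# Trainable speaker embedding: one learned vector per speaker ID
# (here initialized randomly in place of training).
speaker_table = rng.standard_normal((N_SPEAKERS, SPK_DIM)) * 0.1

# Stand-in for a reference encoder: a single linear projection of the
# time-averaged mel spectrogram into the latent emotion space.
emo_proj = rng.standard_normal((N_MELS, EMO_DIM)) * 0.1

def condition_vector(speaker_id, mel):
    """Build a decoder conditioning vector from a speaker ID and a
    reference mel spectrogram of shape [n_mels, n_frames]."""
    spk = speaker_table[speaker_id]             # embedding-table lookup
    emo = np.tanh(mel.mean(axis=1) @ emo_proj)  # pooled mel -> emotion space
    return np.concatenate([spk, emo])           # fed to the CNN decoder

mel = rng.standard_normal((N_MELS, 120))        # dummy reference utterance
cond = condition_vector(speaker_id=2, mel=mel)
print(cond.shape)  # (32,)
```

In a real system both the speaker table and the emotion encoder would be trained jointly with the acoustic model; this only shows how the two representations are combined into one conditioning vector.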
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/icassp.2019.8683682 | 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) |
Keywords | Field | DocType
---|---|---
Text-to-speech, expressive speech synthesis, multi-speaker acoustic modeling, convolutional neural network | Speech synthesis, Embedding, Pattern recognition, Convolutional neural network, Computer science, Spectrogram, Naturalness, Artificial intelligence | Conference
ISSN | Citations | PageRank
---|---|---
1520-6149 | 1 | 0.35
References | Authors
---|---
0 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Heejin Choi | 1 | 6 | 1.80 |
Sangjun Park | 2 | 2 | 2.43
Jinuk Park | 3 | 2 | 2.74 |
Minsoo Hahn | 4 | 1 | 0.35 |