Title
Principles for Learning Controllable TTS from Annotated and Latent Variation
Abstract
For building flexible and appealing high-quality speech synthesisers, it is desirable to be able to accommodate and reproduce fine variations in vocal expression present in natural speech. Synthesisers can enable control over such output properties by adding adjustable control parameters in parallel to their text input. If not annotated in training data, the values of these control inputs can be optimised jointly with the model parameters. We describe how this established method can be seen as approximate maximum likelihood and MAP inference in a latent variable model. This puts previous ideas of (learned) synthesiser inputs such as sentence-level control vectors on a more solid theoretical footing. We furthermore extend the method by restricting the latent variables to orthogonal subspaces via a sparse prior. This enables us to learn dimensions of variation present also within classes in coarsely annotated speech. As an example, we train an LSTM-based TTS system to learn nuances in emotional expression from a speech database annotated with seven different acted emotions. Listening tests show that our proposal can successfully synthesise speech with discernible differences in expression within each emotion, without compromising the recognisability of synthesised emotions compared to an identical system without learned nuances.
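The core mechanism the abstract describes, optimising unannotated control inputs jointly with the model parameters, where a prior on the controls turns the joint optimisation into approximate MAP inference, can be illustrated with a toy sketch. This is not the paper's LSTM model; it is a minimal linear stand-in with one latent control value per "sentence" and a Gaussian (L2) penalty playing the role of the prior. All names and data below are illustrative assumptions.

```python
# Toy sketch (not the paper's actual system): jointly optimise a shared
# model parameter w and one latent control value c[i] per sentence by
# gradient descent. The quadratic penalty on c[i] corresponds to a
# Gaussian prior, so the c[i] updates perform approximate MAP inference
# while the w updates perform approximate maximum likelihood.

def fit(data, steps=2000, lr=0.05, prior_weight=0.1):
    """data: list of (x, y) pairs, one pair per 'sentence'."""
    w = 0.0                      # shared model parameter
    c = [0.0] * len(data)       # latent per-sentence control values
    for _ in range(steps):
        for i, (x, y) in enumerate(data):
            err = w * x + c[i] - y
            # squared-error gradients for w and c[i];
            # c[i] also feels the Gaussian-prior (L2) pull toward 0
            w -= lr * err * x
            c[i] -= lr * (err + prior_weight * c[i])
    return w, c

# Sentences share the same underlying slope but differ in an
# 'expressive' offset, which the latent controls absorb.
data = [(1.0, 2.5), (2.0, 3.5), (3.0, 6.5)]
w, c = fit(data)
```

After training, `w` settles near the shared slope while the learned controls `c` take sentence-specific values (positive for sentences above the shared trend, negative below), mirroring how learned sentence-level control vectors capture within-class variation. The paper's extension further restricts such latents to orthogonal subspaces via a sparse prior, which in this sketch would replace the L2 penalty with a structured sparsity term.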
Year: 2017
DOI: 10.21437/Interspeech.2017-171
Venue: 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), Vols 1-6: Situated Interaction
Keywords: text-to-speech, latent variables, paralinguistics
Field: Computer science, Speech recognition, Natural language processing, Artificial intelligence
DocType: Conference
ISSN: 2308-457X
Citations: 4
PageRank: 0.45
References: 11
Authors: 4
Name                   Order  Citations  PageRank
Gustav Eje Henter      1      37         11.40
Jaime Lorenzo-Trueba   2      46         9.26
Xin Wang               3      66         8.17
Junichi Yamagishi      4      1906       145.51