Title
Effect of Data Reduction on Sequence-to-Sequence Neural TTS
Abstract
Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech almost indistinguishable from human recordings. However, these models require large amounts of data. This paper shows that a lack of data from one speaker can be compensated for with data from other speakers. The naturalness of Tacotron2-like models trained on a blend of 5k utterances from 7 speakers is better than or equivalent to that of speaker-dependent models trained on 15k utterances; in addition, the multi-speaker models are consistently more stable. We also demonstrate that models mixing only 1250 utterances from a target speaker with 5k utterances from 6 other speakers can produce significantly better quality than state-of-the-art DNN-guided unit selection systems trained on more than 10 times as much data from the target speaker.
Year
2018
DOI
10.1109/icassp.2019.8682168
Venue
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)
Keywords
statistical parametric speech synthesis, autoregressive, neural vocoder, generative models, sequence-to-sequence
Field
Autoregressive model, Speech synthesis, Computer science, Naturalness, Sampling (statistics), Natural language processing, Artificial intelligence, Artificial neural network, Data reduction
DocType
Journal
Volume
abs/1811.06315
ISSN
1520-6149
Citations
0
PageRank
0.34
References
11
Authors
7