Abstract
---
Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech almost indistinguishable from human recordings. However, these models require large amounts of data. This paper shows that a lack of data from one speaker can be compensated for with data from other speakers. The naturalness of Tacotron2-like models trained on a blend of 5k utterances from 7 speakers is better than or equivalent to that of speaker-dependent models trained on 15k utterances, and the multispeaker models are also consistently more stable. We further demonstrate that models mixing only 1250 utterances from a target speaker with 5k utterances from another 6 speakers can produce significantly better quality than state-of-the-art DNN-guided unit selection systems trained on more than 10 times the data from the target speaker.
Year | DOI | Venue
---|---|---
2018 | 10.1109/icassp.2019.8682168 | 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Keywords | Field | DocType
---|---|---
statistical parametric speech synthesis, autoregressive, neural vocoder, generative models, sequence-to-sequence | Autoregressive model, Speech synthesis, Computer science, Naturalness, Sampling (statistics), Natural language processing, Artificial intelligence, Artificial neural network, Data reduction | Journal

Volume | ISSN | Citations
---|---|---
abs/1811.06315 | 1520-6149 | 0

PageRank | References | Authors
---|---|---
0.34 | 11 | 7
Name | Order | Citations | PageRank
---|---|---|---
Javier Latorre | 1 | 61 | 5.09 |
Jakub Lachowicz | 2 | 0 | 0.34 |
Jaime Lorenzo-Trueba | 3 | 46 | 9.26 |
Thomas Merritt | 4 | 18 | 5.81 |
Thomas Drugman | 5 | 526 | 41.79 |
Srikanth Ronanki | 6 | 0 | 0.68 |
Viacheslav Klimkov | 7 | 5 | 3.19 |