Abstract
---
We present a meta-learning approach for adaptive text-to-speech (TTS) with little data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights that is then deployed as a TTS system; instead, the aim is to produce a network that can rapidly adapt to new speakers from little data at deployment time. We introduce and benchmark three strategies: (i) learning the speaker embedding while keeping the WaveNet core fixed, (ii) fine-tuning the entire architecture with stochastic gradient descent, and (iii) predicting the speaker embedding with a trained neural network encoder. The experiments show that these approaches successfully adapt the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity from just a few minutes of audio per new speaker.
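Strategy (i) above — adapting only the speaker embedding while the shared core stays frozen — can be illustrated with a toy sketch. Everything here is a stand-in: the "core" is a fixed random linear map rather than a conditional WaveNet, the targets are synthetic numbers rather than audio, and the dimensions and learning rate are arbitrary; only the optimization pattern (gradient descent on the embedding alone) mirrors the paper's description.

```python
# Toy sketch of adaptation strategy (i): freeze the shared multi-speaker
# "core" and fit only the new speaker's embedding by gradient descent.
# All quantities are illustrative stand-ins, not the paper's WaveNet.
import random

random.seed(0)

EMB_DIM = 4

# Frozen "core": a fixed linear map from the speaker embedding to one
# scalar prediction per frame (never updated during adaptation).
core = [[random.gauss(0, 1) for _ in range(EMB_DIM)] for _ in range(3)]

# A few synthetic target values standing in for the new speaker's data.
targets = [0.5, -1.0, 2.0]

def predict(emb):
    return [sum(w * e for w, e in zip(row, emb)) for row in core]

def loss(emb):
    return sum((p - t) ** 2 for p, t in zip(predict(emb), targets)) / len(targets)

# Gradient descent on the embedding only; the core weights stay fixed.
emb = [0.0] * EMB_DIM
lr = 0.05
for _ in range(500):
    # Central-difference numerical gradient, for brevity.
    grad = []
    for i in range(EMB_DIM):
        e_plus, e_minus = emb[:], emb[:]
        e_plus[i] += 1e-5
        e_minus[i] -= 1e-5
        grad.append((loss(e_plus) - loss(e_minus)) / 2e-5)
    emb = [e - lr * g for e, g in zip(emb, grad)]

print("adapted-embedding loss:", round(loss(emb), 6))
```

Because only `EMB_DIM` numbers are optimized per new speaker, this strategy needs far fewer updates (and far less data) than fine-tuning the entire core, which is the trade-off the paper benchmarks against strategies (ii) and (iii).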
Year | Venue | Field
---|---|---
2018 | International Conference on Learning Representations | Stochastic gradient descent, Architecture, Speech synthesis, Software deployment, Embedding, Naturalness, Speech recognition, Encoder, Artificial intelligence, Artificial neural network, Mathematics, Machine learning

DocType | Volume | Citations
---|---|---
Journal | abs/1809.10460 | 2

PageRank | References | Authors
---|---|---
0.41 | 31 | 14
Name | Order | Citations | PageRank |
---|---|---|---
Yutian Chen | 1 | 680 | 36.28 |
Yannis M. Assael | 2 | 129 | 6.51 |
Brendan Shillingford | 3 | 14 | 2.73 |
David Budden | 4 | 167 | 18.45 |
Scott Reed | 5 | 1750 | 80.25
Heiga Zen | 6 | 1922 | 103.73 |
Quan Wang | 7 | 115 | 20.15 |
Luis C. Cobo | 8 | 2 | 0.74 |
Andrew Trask | 9 | 26 | 2.54
Ben Laurie | 10 | 10 | 2.89 |
Çağlar Gülçehre | 11 | 3010 | 133.22
Aäron van den Oord | 12 | 1585 | 64.43
Oriol Vinyals | 13 | 9419 | 418.45 |
Nando de Freitas | 14 | 3284 | 273.68