Abstract |
---|
Although a HiFi-GAN vocoder can synthesize high-fidelity speech waveforms in real time on CPUs, there is a tradeoff between synthesis quality and inference speed. To increase inference speed while maintaining synthesis quality, a multi-band structure is introduced into HiFi-GAN. However, it cannot be trained well because of the strong constraint imposed by the fixed multi-band structure. As an alternative approach, Multi-stream MelGAN and HiFi-GAN are proposed, in which the fixed synthesis filter in Multi-band MelGAN is replaced by a trainable convolutional layer with the same structure. In contrast to Multi-band MelGAN, the proposed methods use the trainable synthesis filter to decompose speech waveforms in a data-driven manner. To evaluate the proposed Multi-stream HiFi-GAN as an entire real-time neural text-to-speech system on CPUs, a fast acoustic model based on Parallel Tacotron 2 with forced alignment and accentual label input was implemented. The results of experiments using Japanese male, female, and multi-speaker corpora indicate that, compared with the original HiFi-GAN, Multi-stream HiFi-GAN can increase synthesis speed while improving or maintaining synthesis quality in analysis-synthesis and text-to-speech conditions for single-speaker models and in unseen-speaker synthesis for multi-speaker models. |
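The core idea in the abstract, replacing the fixed multi-band synthesis filter with a trainable convolutional layer of the same structure, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `combine_streams` and its arguments are hypothetical names, the filter taps stand in for learned convolution weights (in Multi-band MelGAN they would instead be fixed PQMF synthesis filters), and a real system would use a deep-learning framework's conv layer so the taps receive gradients.

```python
import numpy as np

def combine_streams(streams, kernels, upsample):
    """Combine multi-stream sub-signals into one full-rate waveform.

    streams  : (n_streams, T) array of stream signals at rate fs / upsample,
               as produced by a multi-band/multi-stream generator output.
    kernels  : (n_streams, K) array of synthesis-filter taps, one filter per
               stream; trainable in the Multi-stream variants, fixed (PQMF)
               in Multi-band MelGAN.
    upsample : integer upsampling factor (equal to n_streams in the
               critically sampled multi-band case).
    """
    n_streams, T = streams.shape
    out = np.zeros(T * upsample)
    for s in range(n_streams):
        up = np.zeros(T * upsample)
        up[::upsample] = streams[s]              # zero-insertion upsampling
        out += np.convolve(up, kernels[s], mode="same")  # per-stream filtering
    return out                                   # summed full-rate waveform
```

Because each stream runs at 1/`upsample` of the output rate, the generator's expensive layers operate on shorter sequences, which is where the inference speedup comes from; only this cheap combination step runs at the full rate.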
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/ASRU51503.2021.9688194 | 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) |
Keywords | DocType | ISBN |
---|---|---|
Speech synthesis, neural vocoder, HiFi-GAN, data-driven waveform decomposition, Parallel Tacotron 2 | Conference | 978-1-6654-3740-0 |
Citations | PageRank | References |
---|---|---|
1 | 0.36 | 0 |
Authors |
---|
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Takuma Okamoto | 1 | 1 | 0.70 |
Tomoki Toda | 2 | 1874 | 167.18 |
Hisashi Kawai | 3 | 2 | 0.71 |