Title
Multi-Stream HiFi-GAN with Data-Driven Waveform Decomposition
Abstract
Although a HiFi-GAN vocoder can synthesize high-fidelity speech waveforms in real time on CPUs, there is a tradeoff between synthesis quality and inference speed. To increase inference speed while maintaining synthesis quality, a multi-band structure is introduced into HiFi-GAN. However, it cannot be trained well because of the strong constraint imposed by the fixed multi-band structure. As an alternative approach, Multi-stream MelGAN and Multi-stream HiFi-GAN are proposed, in which the fixed synthesis filter in Multi-band MelGAN is replaced by a trainable convolutional layer with the same structure. In contrast to Multi-band MelGAN, the proposed methods use the trainable synthesis filter to decompose speech waveforms in a data-driven manner. To evaluate the proposed Multi-stream HiFi-GAN as an entire real-time neural text-to-speech system on CPUs, a fast acoustic model based on Parallel Tacotron 2 with forced alignment and accentual label input was implemented. The results of experiments using Japanese male, female, and multi-speaker corpora indicate that Multi-stream HiFi-GAN can increase synthesis speed while improving or maintaining synthesis quality, compared with the original HiFi-GAN, in analysis-synthesis and text-to-speech conditions for single-speaker models and in unseen-speaker synthesis for multi-speaker models.
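The core idea described in the abstract, replacing the fixed PQMF synthesis filter of Multi-band MelGAN with a trainable convolutional layer of the same structure so that the waveform decomposition is learned from data, can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation; the stream count, kernel size, padding, and class/variable names are assumptions for illustration only.

```python
# Minimal sketch (assumed, not the authors' code): the generator outputs several
# parallel waveform streams at a reduced sampling rate, and a trainable
# transposed 1D convolution, structured like a PQMF synthesis filter bank but
# with learnable taps, combines them into one full-band waveform.
import torch
import torch.nn as nn


class TrainableSynthesisFilter(nn.Module):
    """Combines multi-stream sub-waveforms into a single full-band waveform."""

    def __init__(self, num_streams: int = 4, kernel_size: int = 64):
        super().__init__()
        # Upsample each stream by the stream count and mix them, analogous to a
        # fixed PQMF synthesis filter, but with filter taps learned jointly with
        # the rest of the vocoder.
        self.conv = nn.ConvTranspose1d(
            in_channels=num_streams,
            out_channels=1,
            kernel_size=kernel_size,
            stride=num_streams,
            padding=(kernel_size - num_streams) // 2,
            bias=False,
        )

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (batch, num_streams, T / num_streams) -> (batch, 1, T)
        return self.conv(streams)


if __name__ == "__main__":
    # Hypothetical generator output: 4 streams of 100 samples each.
    generator_output = torch.randn(2, 4, 100)
    full_band = TrainableSynthesisFilter()(generator_output)
    print(full_band.shape)  # torch.Size([2, 1, 400])
```

In this sketch the layer is trained end-to-end with the generator, which is what allows the decomposition to be data-driven rather than fixed by a hand-designed filter bank.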
Year
2021
DOI
10.1109/ASRU51503.2021.9688194
Venue
2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Keywords
Speech synthesis, neural vocoder, HiFi-GAN, data-driven waveform decomposition, Parallel Tacotron 2
DocType
Conference
ISBN
978-1-6654-3740-0
Citations
1
PageRank
0.36
References
0
Authors
3
Name            Order  Citations  PageRank
Takuma Okamoto  1      1          0.70
Tomoki Toda     2      1874       167.18
Hisashi Kawai   3      2          0.71