Abstract |
---|
Although a HiFi-GAN vocoder can synthesize high-fidelity speech waveforms in real time on CPUs, there is a tradeoff between synthesis quality and inference speed. To increase inference speed while maintaining synthesis quality, a multi-band structure is introduced into HiFi-GAN. However, it cannot be trained well because of the strong constraint imposed by the fixed multi-band structure. As an alternative approach, Multi-stream MelGAN and HiFi-GAN are proposed, in which the fixed synthesis filter in Multi-band MelGAN is replaced by a trainable convolutional layer with the same structure. In contrast to Multi-band MelGAN, the proposed methods use the trainable synthesis filter to decompose speech waveforms in a data-driven manner. To evaluate the proposed Multi-stream HiFi-GAN as an entire real-time neural text-to-speech system on CPUs, a fast acoustic model based on Parallel Tacotron 2 with forced alignment and accentual label input was implemented. The results of experiments using Japanese male, female, and multi-speaker corpora indicate that, compared with the original HiFi-GAN, Multi-stream HiFi-GAN can increase synthesis speed while improving or maintaining synthesis quality in analysis-synthesis and text-to-speech conditions for single-speaker models and in unseen-speaker synthesis for multi-speaker models. |
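The core idea in the abstract, replacing the fixed multi-band synthesis filter with a trainable convolutional layer of the same structure, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `combine_streams` and its arguments are hypothetical names, the filter taps stand in for learned convolution weights (in Multi-band MelGAN they would instead be fixed PQMF synthesis filters), and a real system would use a deep-learning framework's conv layer so the taps receive gradients.

```python
import numpy as np

def combine_streams(streams, kernels, upsample):
    """Combine multi-stream sub-signals into one full-rate waveform.

    streams  : (n_streams, T) array of stream signals at rate fs / upsample,
               as produced by a multi-band/multi-stream generator output.
    kernels  : (n_streams, K) array of synthesis-filter taps, one filter per
               stream; trainable in the Multi-stream variants, fixed (PQMF)
               in Multi-band MelGAN.
    upsample : integer upsampling factor (equal to n_streams in the
               critically sampled multi-band case).
    """
    n_streams, T = streams.shape
    out = np.zeros(T * upsample)
    for s in range(n_streams):
        up = np.zeros(T * upsample)
        up[::upsample] = streams[s]              # zero-insertion upsampling
        out += np.convolve(up, kernels[s], mode="same")  # per-stream filtering
    return out                                   # summed full-rate waveform
```

Because each stream runs at 1/`upsample` of the output rate, the generator's expensive layers operate on shorter sequences, which is where the inference speedup comes from; only this cheap combination step runs at the full rate.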
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/ASRU51503.2021.9688194 | 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) |
Keywords | DocType | ISBN |
---|---|---|
Speech synthesis, neural vocoder, HiFi-GAN, data-driven waveform decomposition, Parallel Tacotron 2 | Conference | 978-1-6654-3740-0 |
Citations | PageRank | References |
---|---|---|
1 | 0.36 | 0 |
Authors |
---|
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Takuma Okamoto | 1 | 1 | 0.70 |
Tomoki Toda | 2 | 1874 | 167.18 |
Hisashi Kawai | 3 | 2 | 0.71 |