Title
SPTTS: Parallel Speech Synthesis without Extra Aligner Model
Abstract
In this work, we develop a novel non-autoregressive TTS model that predicts all mel-spectrogram frames in parallel. Unlike previous non-autoregressive TTS methods, which typically require an external aligner implemented as an attention-based autoregressive model, our model can be optimized jointly without a sophisticated external aligner. Motivated by CTC-based speech recognition, a simple and effective way to obtain frame-level forced alignment between speech and text, our main idea is to treat aligner learning for TTS as a CTC-based speech-recognition-like task. Specifically, our model trains the alignment generator with a CTC loss, which supervises the duration predictor on the fly. In this way, we obtain a one-stage TTS system by optimizing the aligner jointly with the feed-forward Transformer. At inference time, the aligner is removed and the duration predictor produces the duration sequence used to synthesize speech. To demonstrate our method, we conduct extensive experiments on an open-source Chinese standard Mandarin speech dataset(1). The results show that our method achieves competitive performance, in terms of synthesized speech quality and robustness, compared with counterpart models (e.g., FastSpeech, a well-known non-autoregressive model that relies on an external aligner).
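The abstract describes training an aligner as a CTC "recognizer" over mel frames so that phoneme durations can be derived from its alignment and used to supervise a duration predictor. Below is a minimal sketch of that idea, not the authors' implementation: the module sizes, names, and the greedy best-path decoding used in place of a proper forced alignment are all illustrative assumptions.

```python
# Sketch (assumed, not from the paper): a small CTC aligner that maps mel
# frames back to the phoneme sequence; its alignment can later be turned
# into duration targets for a duration predictor.
import torch
import torch.nn as nn

VOCAB_SIZE = 64   # assumed phoneme inventory size, index 0 reserved for blank
MEL_DIM = 80      # assumed mel-spectrogram dimension

class CTCAligner(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(MEL_DIM, 256, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(512, VOCAB_SIZE)

    def forward(self, mel):                      # mel: (B, T, MEL_DIM)
        h, _ = self.rnn(mel)
        return self.proj(h).log_softmax(-1)      # (B, T, VOCAB_SIZE)

aligner = CTCAligner()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Dummy batch: mel frames and their target phoneme sequences.
mel = torch.randn(2, 200, MEL_DIM)
phonemes = torch.randint(1, VOCAB_SIZE, (2, 30))   # no blanks in the targets
mel_lens = torch.tensor([200, 180])
phn_lens = torch.tensor([30, 25])

log_probs = aligner(mel)
loss = ctc_loss(log_probs.transpose(0, 1),          # CTCLoss expects (T, B, V)
                phonemes, mel_lens, phn_lens)

# Crude duration extraction from the greedy best path: count how many frames
# each non-blank label occupies.  A forced (Viterbi) alignment against the
# known phoneme sequence would be more faithful; this is only illustrative.
best_path = log_probs.argmax(-1)                     # (B, T)
```

In a joint setup of this kind, the extracted per-phoneme frame counts would serve as regression targets for the duration predictor, while the CTC loss and the TTS reconstruction loss are optimized together in one stage.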
Year
2021
Venue
2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC)
DocType
Conference
ISSN
2309-9402
Citations
0
PageRank
0.34
References
0
Authors
6
Name          Order  Citations  PageRank
Zeqing Zhao   1      0          0.34
Xi Chen       2      0          0.34
Hui Liu       3      0          0.34
Xuyang Wang   4      0          0.68
Lin Yang      5      0          0.68
Junjie Wang   6      0          0.68