Title
SPTTS: Parallel Speech Synthesis without Extra Aligner Model
Abstract
In this work, we develop a novel non-autoregressive TTS model that predicts all mel-spectrogram frames in parallel. Unlike previous non-autoregressive TTS methods, which typically require an external aligner implemented as an attention-based autoregressive model, our model can be optimized jointly without a sophisticated external aligner. Motivated by CTC-based speech recognition, a simple and effective way to obtain frame-level forced alignment between speech and text, our main idea is to treat aligner learning for TTS as a CTC-based speech-recognition-like task. Specifically, our model trains the alignment generator with a CTC loss, which supervises the duration predictor on the fly. In this way, we obtain a one-stage TTS system by optimizing the aligner jointly with the feed-forward Transformer. At inference time, the aligner is removed and the duration predictor produces the duration sequence used to synthesize speech. To demonstrate our method, we conduct extensive experiments on an open-source Chinese standard Mandarin speech dataset(1). The results show that our method achieves competitive performance, in terms of synthesized speech quality and robustness, compared with counterpart models (e.g., FastSpeech, a well-known non-autoregressive model that relies on an external aligner).
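The abstract describes training an aligner as a CTC "recognizer" over mel frames so that phoneme durations can be derived from its alignment and used to supervise a duration predictor. Below is a minimal sketch of that idea, not the authors' implementation: the module sizes, names, and the greedy best-path decoding used in place of a proper forced alignment are all illustrative assumptions.

```python
# Sketch (assumed, not from the paper): a small CTC aligner that maps mel
# frames back to the phoneme sequence; its alignment can later be turned
# into duration targets for a duration predictor.
import torch
import torch.nn as nn

VOCAB_SIZE = 64   # assumed phoneme inventory size, index 0 reserved for blank
MEL_DIM = 80      # assumed mel-spectrogram dimension

class CTCAligner(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(MEL_DIM, 256, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(512, VOCAB_SIZE)

    def forward(self, mel):                      # mel: (B, T, MEL_DIM)
        h, _ = self.rnn(mel)
        return self.proj(h).log_softmax(-1)      # (B, T, VOCAB_SIZE)

aligner = CTCAligner()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Dummy batch: mel frames and their target phoneme sequences.
mel = torch.randn(2, 200, MEL_DIM)
phonemes = torch.randint(1, VOCAB_SIZE, (2, 30))   # no blanks in the targets
mel_lens = torch.tensor([200, 180])
phn_lens = torch.tensor([30, 25])

log_probs = aligner(mel)
loss = ctc_loss(log_probs.transpose(0, 1),          # CTCLoss expects (T, B, V)
                phonemes, mel_lens, phn_lens)

# Crude duration extraction from the greedy best path: count how many frames
# each non-blank label occupies.  A forced (Viterbi) alignment against the
# known phoneme sequence would be more faithful; this is only illustrative.
best_path = log_probs.argmax(-1)                     # (B, T)
```

In a joint setup of this kind, the extracted per-phoneme frame counts would serve as regression targets for the duration predictor, while the CTC loss and the TTS reconstruction loss are optimized together in one stage.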
Year
2021
Venue
2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC)
DocType
Conference
ISSN
2309-9402
Citations
0
PageRank
0.34
References
0
Authors
6
Name          Order  Citations  PageRank
Zeqing Zhao   1      0          0.34
Xi Chen       2      0          0.34
Hui Liu       3      0          0.34
Xuyang Wang   4      0          0.68
Lin Yang      5      0          0.68
Junjie Wang   6      0          0.68