Title
FCL-TACO2: TOWARDS FAST, CONTROLLABLE AND LIGHTWEIGHT TEXT-TO-SPEECH SYNTHESIS
Abstract
Sequence-to-sequence (seq2seq) learning has greatly improved text-to-speech (TTS) synthesis performance, but effective implementation on resource-restricted devices remains challenging because seq2seq models are usually computationally expensive and memory intensive. To achieve fast inference and a small model size while maintaining high-quality speech, we propose FCL-taco2, a Fast, Controllable and Lightweight (FCL) TTS model based on Tacotron2. FCL-taco2 adopts a novel semi-autoregressive (SAR) mode for phoneme-level parallel mel-spectrogram generation conditioned on prosody features, leading to faster inference and higher prosody controllability than Tacotron2. In addition, knowledge distillation (KD) is leveraged to compress a relatively large FCL-taco2 model into a small version with minor loss of speech quality. Experimental results on English (EN) and Chinese (CN) datasets show that the small version of FCL-taco2 achieves speech quality comparable to Tacotron2, while having a 4.8× smaller footprint and 17.7× and 18.5× faster inference on average for the EN and CN experiments, respectively. Furthermore, execution on mobile devices shows that the proposed model can achieve faster-than-real-time speech synthesis. Our code and audio samples are released.
Year
2021
DOI
10.1109/ICASSP39728.2021.9414870
Venue
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords
Text-to-speech, controllable and efficient, semi-autoregressive, prosody modelling, knowledge distillation
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
8
Name             Order  Citations  PageRank
Disong Wang      1      0          2.70
Liqun Deng       2      0          2.37
Yang Zhang       3      0          0.34
Nianzu Zheng     4      0          0.68
Yu Ting Yeung    5      0          3.38
Xi Chen          6      33370.76
Xunying Liu      7      33052.46
Helen M. Meng    8      1078172.82