Title
LITESING: TOWARDS FAST, LIGHTWEIGHT AND EXPRESSIVE SINGING VOICE SYNTHESIS
Abstract
LiteSing proposed in this paper is a high-quality singing voice synthesis (SVS) system, which is fast, lightweight and expressive. This model mainly stacks several non-autoregressive WaveNet blocks in the encoder and decoder under a generative adversarial architecture, predicts full conditions from the musical score, and generates acoustic features from these conditions. The full conditions in this paper consist of dynamic spectrogram energy, voiced/unvoiced (V/UV) decision and dynamic pitch curve, which are proven related to the expressiveness. We predict the pitch and the timbre features separately, avoiding the interdependence between these two features. Instead of neural network vocoders, a parametric WORLD vocoder is employed for the pitch curve consistency. Experiment results show that LiteSing outperforms the baseline model using feed-forward Transformer by 1.386 times faster on inference speed, 15 times smaller on training parameters number, and achieves a similar MOS on sound quality. Through an A/B test, LiteSing achieves 67.3% preference rate over baseline in pitch curve and dynamic spectrogram energy prediction. which demonstrates the advantage of LiteSing over the other compared models.
Year
DOI
Venue
2021
10.1109/ICASSP39728.2021.9414043
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords
DocType
Citations 
singing voice synthesis, non-autoregressive model, generative adversarial network, lightweight, expressive
Conference
0
PageRank 
References 
Authors
0.34
0
6
Name
Order
Citations
PageRank
Xiaobin Zhuang100.34
Tao Jiang200.34
Szu-Yu Chou3496.82
Bin Wu400.34
Peng Hu500.34
Simon Lui601.01