Abstract | ||
---|---|---|
While neural text-to-speech (TTS) models can synthesize natural and intelligible speech, they usually require high-quality speech data, which is costly to collect. In many scenarios, only noisy speech from a target speaker is available, which makes it challenging to train a TTS model for that speaker. Previous work usually addresses this challenge in one of two ways: 1) training the TTS model on speech denoised with an enhancement model; 2) taking a single noise embedding as input when training on noisy speech. However, these methods usually cannot handle speech with complicated real-world noise, such as noise that varies strongly over time. In this paper, we develop DenoiSpeech, a TTS system that can synthesize clean speech for a speaker for whom only noisy speech data is available. In DenoiSpeech, we handle real-world noisy speech by modeling fine-grained, frame-level noise with a noise condition module that is jointly trained with the TTS model. Experimental results on real-world data show that DenoiSpeech outperforms the previous two methods by 0.31 and 0.66 MOS, respectively. |
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/ICASSP39728.2021.9413934 | 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021) |
Keywords | DocType | Citations |
---|---|---|
text to speech, speech synthesis, noisy speech, denoise, frame-level condition | Conference | 0 |
PageRank | References | Authors |
---|---|---|
0.34 | 7 | 8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Chen Zhang | 1 | 125 | 5.22 |
Yi Ren | 2 | 10 | 4.35 |
Xu Tan | 3 | 88 | 23.94 |
Jinglin Liu | 4 | 0 | 1.35 |
Kejun Zhang | 5 | 27 | 6.35 |
Tao Qin | 6 | 2384 | 147.25 |
Sheng Zhao | 7 | 5 | 1.42 |
Tie-Yan Liu | 8 | 4662 | 256.32 |