DENOISPEECH: DENOISING TEXT TO SPEECH WITH FRAME-LEVEL NOISE MODELING - Citegraph

Paper Info

Title
DENOISPEECH: DENOISING TEXT TO SPEECH WITH FRAME-LEVEL NOISE MODELING

Abstract
While neural-based text to speech (TTS) models can synthesize natural and intelligible voice, they usually require high-quality speech data, which is costly to collect. In many scenarios, only noisy speech of a target speaker is available, which presents challenges for TTS model training for this speaker. Previous works usually address the challenge using two methods: 1) training the TTS model using the speech denoised with an enhancement model; 2) taking a single noise embedding as input when training with noisy speech. However, they usually cannot handle speech with complicated real-world noise, such as noise that varies strongly over time. In this paper, we develop DenoiSpeech, a TTS system that can synthesize clean speech for a speaker with noisy speech data. In DenoiSpeech, we handle real-world noisy speech by modeling the fine-grained frame-level noise with a noise condition module, which is jointly trained with the TTS model. Experimental results on real-world data show that DenoiSpeech outperforms the previous two methods by 0.31 and 0.66 MOS respectively.
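
The difference between a single noise embedding and frame-level noise modeling can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, the condition is combined with the decoder input by simple addition, and "no noise" is represented by zero vectors. It only shows that a frame-level condition can follow noise that changes over time, while a single embedding applies the same offset to every frame.

```python
from typing import List

Vector = List[float]


def add_vectors(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vector sizes differ")
    return [x + y for x, y in zip(a, b)]


def condition_utterance_level(frames: List[Vector], noise_embedding: Vector) -> List[Vector]:
    """Single-embedding style: the same noise vector is added to every frame."""
    return [add_vectors(frame, noise_embedding) for frame in frames]


def condition_frame_level(frames: List[Vector], noise_conditions: List[Vector]) -> List[Vector]:
    """Frame-level style: each frame receives its own noise vector."""
    if len(frames) != len(noise_conditions):
        raise ValueError("need exactly one noise condition per frame")
    return [add_vectors(f, n) for f, n in zip(frames, noise_conditions)]


def clean_conditions(num_frames: int, dim: int) -> List[Vector]:
    """Hypothetical 'no noise' condition: zero vectors (an assumption of this sketch)."""
    return [[0.0] * dim for _ in range(num_frames)]


# Two frames of hypothetical decoder input, each with two dimensions.
frames = [[1.0, 2.0], [3.0, 4.0]]

# Noise that differs from frame to frame, which a single embedding cannot express.
per_frame_noise = [[0.5, 0.0], [0.0, -1.0]]

frame_level = condition_frame_level(frames, per_frame_noise)
utterance_level = condition_utterance_level(frames, [0.5, 0.5])
clean = condition_frame_level(frames, clean_conditions(len(frames), 2))
```

In this sketch, passing the zero-valued condition leaves the frames unchanged, which stands in for asking the model for clean output. How the actual system derives its condition for clean synthesis is not stated in the abstract.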

Year	DOI	Venue
2021	10.1109/ICASSP39728.2021.9413934	2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords	DocType	Citations
text to speech, speech synthesis, noisy speech, denoise, frame-level condition	Conference	0
PageRank	References	Authors
0.34	7	8

Authors (8 rows)

Name	Order	Citations	PageRank
Chen Zhang	1	125	5.22
Yi Ren	2	10	4.35
Xu Tan	3	88	23.94
Jinglin Liu	4	0	1.35
Kejun Zhang	5	27	6.35
Tao Qin	6	2384	147.25
Sheng Zhao	7	5	1.42
Tie-Yan Liu	8	4662	256.32
