Title
---
PERCEPTUAL LOSS BASED SPEECH DENOISING WITH AN ENSEMBLE OF AUDIO PATTERN RECOGNITION AND SELF-SUPERVISED MODELS |
Abstract
---
Deep learning based speech denoising still suffers from the challenge of improving the perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL) built on the idea of perceptual losses. A perceptual loss discourages distortion to certain speech properties, and we analyze it using six large-scale pre-trained models: acoustic event classification, acoustic model, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+, wav2vec 2.0). We first build a strong baseline (without PERL) using Conformer Transformer Networks on the popular enhancement benchmark VCTK-DEMAND. Using the auxiliary models one at a time, we find the acoustic event model and the self-supervised PASE+ model to be the most effective. Our best model (PERL-AE) uses only the acoustic event model (trained on AudioSet) and outperforms state-of-the-art methods on major perceptual metrics. To explore whether denoising can leverage the full framework, we use all networks, but find that our seven-loss formulation suffers from the challenges of Multi-Task Learning. Finally, we report a critical observation that state-of-the-art Multi-Task weight learning methods cannot outperform hand tuning, perhaps due to domain mismatch and the weak complementarity of the losses.
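The abstract describes PERL as a weighted combination of a reconstruction loss and feature-matching ("perceptual") losses computed with frozen pre-trained auxiliary networks. Below is a minimal, hypothetical PyTorch sketch of such an objective; it is not the authors' implementation, and the class name `PerceptualEnsembleLoss`, the `aux_models`/`weights` arguments, and the toy `nn.Linear` stand-ins for the six pre-trained models are all assumptions for illustration only.

```python
# Hypothetical sketch of a PERL-style objective: reconstruction loss plus
# feature-matching losses from frozen, pre-trained auxiliary networks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerceptualEnsembleLoss(nn.Module):
    def __init__(self, aux_models, weights, recon_weight=1.0):
        """
        aux_models: dict name -> frozen feature extractor (nn.Module) mapping
                    a batch of waveforms to an embedding tensor.
        weights:    dict name -> hand-tuned scalar weight for each perceptual term.
        """
        super().__init__()
        self.aux_models = nn.ModuleDict(aux_models)
        self.weights = weights
        self.recon_weight = recon_weight
        # The auxiliary networks act only as regularizers: freeze their parameters.
        for m in self.aux_models.values():
            for p in m.parameters():
                p.requires_grad = False

    def forward(self, enhanced, clean):
        # Reconstruction term on the signal itself (L1 used here as an example).
        total = self.recon_weight * F.l1_loss(enhanced, clean)
        # One feature-matching term per frozen auxiliary model.
        for name, model in self.aux_models.items():
            feat_enhanced = model(enhanced)
            feat_clean = model(clean).detach()
            total = total + self.weights[name] * F.l1_loss(feat_enhanced, feat_clean)
        return total


if __name__ == "__main__":
    # Toy stand-ins for two of the pre-trained models (hypothetical shapes).
    aux = {"acoustic_event": nn.Linear(16000, 128),
           "pase_plus": nn.Linear(16000, 256)}
    loss_fn = PerceptualEnsembleLoss(aux, weights={"acoustic_event": 0.5, "pase_plus": 0.5})
    enhanced = torch.randn(4, 16000, requires_grad=True)
    clean = torch.randn(4, 16000)
    print(loss_fn(enhanced, clean))
```

In this reading, the paper's "seven-loss formulation" corresponds to one reconstruction term plus six such feature-matching terms, with the per-term weights either hand tuned (as the authors report works best) or learned with Multi-Task weight learning methods.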
Year | DOI | Venue
---|---|---
2021 | 10.1109/ICASSP39728.2021.9413555 | 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)

Keywords | DocType | Citations
---|---|---
Speech Denoising, Perceptual Loss, Pre-trained Networks, Multi-Task Learning, Self-Supervised Features | Conference | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 3
Name | Order | Citations | PageRank |
---|---|---|---
Saurabh Kataria | 1 | 9 | 5.21 |
Jesús A. Villalba | 2 | 44 | 6.49 |
N. Dehak | 3 | 1269 | 92.64 |