Abstract | ||
---|---|---|
In recent years, deep learning-based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper helps fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of evaluating performance on real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates separation performance on real mixtures: its predictions correlate well with human opinions. Moreover, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow those achieved on synthetic benchmarks when evaluating popular speech separation models. |
Year | DOI | Venue |
---|---|---|
2022 | 10.1109/ICASSP43922.2022.9746662 | IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |

DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |

References | Authors |
---|---|
0 | 4 |

Name | Order | Citations | PageRank |
---|---|---|---|
Cem Subakan | 1 | 10 | 0.90 |
Mirco Ravanelli | 2 | 185 | 17.87 |
Samuele Cornell | 3 | 17 | 2.82 |
François Grondin | 4 | 7 | 3.92 |
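
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of how SI-SNR is conventionally computed when a ground-truth source is available. It assumes the standard zero-mean, projection-based definition; the function name `si_snr` and the `eps` parameter are illustrative choices, not taken from the paper. The paper's neural estimator predicts this quantity blindly, without the reference signal, which this sketch does not attempt to reproduce.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between a separated signal and its reference.

    Illustrative helper using the common zero-mean definition.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Remove any DC offset so the measure ignores constant shifts.
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Project the estimate onto the target to obtain the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target

    # Everything left over after the projection counts as error.
    e_noise = estimate - s_target

    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(16000)       # one second at 16 kHz (synthetic)
    noise = 0.1 * rng.standard_normal(16000)  # noise at roughly -20 dB
    estimate = target + noise

    print(f"SI-SNR: {si_snr(estimate, target):.2f} dB")
    # Rescaling the estimate leaves the score essentially unchanged.
    print(f"SI-SNR (estimate x 3): {si_snr(3.0 * estimate, target):.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the separated signal does not change the score. That property is what makes the metric "scale-invariant", and it also shows why the score cannot be computed directly on real recordings that lack a reference.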