Abstract | ||
---|---|---|
In this paper, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard permutation invariant training (PIT). The student model can be fine-tuned with supervised data, i.e., paired artificial mixtures and clean speech sources, and further improved via model distillation. Experiments with single- and multi-channel mixtures show that the teacher-student training resolves the over-separation problem observed in the original MixIT method. Further, the semi-supervised performance is comparable to that of a fully supervised separation system trained using ten times the amount of supervised data. |
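
The two training criteria named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the negative-SNR loss, the brute-force search over assignments, and the function names `neg_snr`, `pit_loss`, and `mixit_loss` are all assumptions made for illustration. PIT scores estimates against reference sources under the best permutation, while MixIT only needs two mixtures and scores the best way of remixing the estimates back into them.

```python
import itertools

import numpy as np


def neg_snr(ref, est, eps=1e-8):
    """Negative SNR in dB (lower is better). Assumed loss, not from the paper."""
    noise = ref - est
    return -10.0 * np.log10((np.sum(ref ** 2) + eps) / (np.sum(noise ** 2) + eps))


def pit_loss(refs, ests):
    """PIT: minimum mean loss over all permutations of estimates vs. references.

    refs, ests: arrays of shape (n_sources, n_samples).
    """
    n = refs.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(n)):
        loss = np.mean([neg_snr(refs[i], ests[p]) for i, p in enumerate(perm)])
        best = min(best, loss)
    return best


def mixit_loss(mix1, mix2, ests):
    """MixIT: assign every estimate to one of two mixtures, keep the best remix.

    mix1, mix2: arrays of shape (n_samples,); ests: (n_outputs, n_samples).
    """
    m = ests.shape[0]
    best = np.inf
    for assign in itertools.product([0, 1], repeat=m):
        a = np.array(assign)
        remix1 = ests[a == 0].sum(axis=0)  # estimates routed to mixture 1
        remix2 = ests[a == 1].sum(axis=0)  # estimates routed to mixture 2
        loss = 0.5 * (neg_snr(mix1, remix1) + neg_snr(mix2, remix2))
        best = min(best, loss)
    return best
```

In the teacher-student setup described above, the teacher would be trained with something like `mixit_loss` on mixtures of mixtures, and its output estimates would then stand in for `refs` when training the student with `pit_loss`.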
Year | DOI | Venue |
---|---|---|
2021 | 10.21437/Interspeech.2021-1243 | Interspeech |
DocType | Citations | PageRank |
---|---|---|
Conference | 1 | 0.35 |
References | Authors |
---|---|
0 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Jisi Zhang | 1 | 2 | 0.71 |
Catalin Zorila | 2 | 2 | 2.74 |
Rama Doddipatla | 3 | 2 | 4.09 |
Jon Barker | 4 | 676 | 64.08 |