Abstract |
---|
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; (2) a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals. |
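The two-stage inference pipeline described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: both networks are stubbed with hypothetical random linear layers (`speaker_encoder`, `masking_network` are illustrative names), and only the data flow — reference audio → speaker embedding → frame-wise conditioning → soft mask → masked spectrogram — mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_encoder(reference_spec):
    # Hypothetical stand-in for the speaker recognition network:
    # pool frames and project to a fixed-size, unit-norm embedding.
    W = rng.standard_normal((reference_spec.shape[1], 256))
    emb = reference_spec.mean(axis=0) @ W
    return emb / np.linalg.norm(emb)

def masking_network(noisy_spec, speaker_emb):
    # Hypothetical stand-in for the spectrogram masking network:
    # concatenate the embedding to every frame, then predict a
    # soft mask in [0, 1] via a sigmoid.
    frames, bins = noisy_spec.shape
    tiled = np.repeat(speaker_emb[None, :], frames, axis=0)
    x = np.concatenate([noisy_spec, tiled], axis=1)
    W = rng.standard_normal((x.shape[1], bins)) * 0.01
    return 1.0 / (1.0 + np.exp(-(x @ W)))

# Toy magnitude spectrograms, shape (frames, frequency bins):
reference = np.abs(rng.standard_normal((50, 257)))   # clean target-speaker audio
noisy = np.abs(rng.standard_normal((120, 257)))      # multi-speaker mixture

d_vector = speaker_encoder(reference)
mask = masking_network(noisy, d_vector)
enhanced = mask * noisy  # element-wise masking of the noisy spectrogram
```

The enhanced spectrogram has the same shape as the noisy input; in the real system it would be fed to the speech recognizer in place of the mixture.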
Year | DOI | Venue
---|---|---|
2019 | 10.21437/Interspeech.2019-1101 | arXiv: Audio and Speech Processing

DocType | Volume | Citations
---|---|---|
Conference | abs/1810.04826 | 13

PageRank | References | Authors
---|---|---|
0.60 | 8 | 10
Name | Order | Citations | PageRank |
---|---|---|---|
Quan Wang | 1 | 115 | 20.15 |
Hannah Muckenhirn | 2 | 29 | 3.08 |
Kevin W. Wilson | 3 | 348 | 28.35 |
Prashant Sridhar | 4 | 14 | 1.28 |
Zelin Wu | 5 | 15 | 2.00 |
John R. Hershey | 6 | 844 | 65.57 |
Rif Saurous | 7 | 148 | 10.49 |
Ron J. Weiss | 8 | 443 | 29.47 |
Ye Jia | 9 | 58 | 4.68 |
Ignacio Lopez-Moreno | 10 | 187 | 14.97 |