Title
Streaming End-To-End Multi-Talker Speech Recognition
Abstract
End-to-end multi-talker speech recognition is an emerging research trend in the speech community because of its potential in applications such as conversation and meeting transcription. To the best of our knowledge, all existing work is limited to the offline scenario. In this work, we propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition. Our model uses the Recurrent Neural Network Transducer (RNN-T) as its backbone, which can meet various latency constraints. We study two model architectures, based on a speaker-differentiator encoder and a mask encoder respectively. To train the model, we investigate the widely used Permutation Invariant Training (PIT) approach and the Heuristic Error Assignment Training (HEAT) approach. In experiments on the publicly available LibriSpeechMix dataset, HEAT achieves better accuracy than PIT, and the SURT model with a 150-millisecond algorithmic latency constraint compares favorably in accuracy with the offline sequence-to-sequence baseline.
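The difference between the two training criteria named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `pit_loss` and `heat_loss`, the toy loss values, and the specific heuristic (assigning references to output channels in order of utterance start time) are assumptions made for illustration, since the abstract does not state the heuristic. In a real system each entry of `pair_loss` would be a transducer loss between one output channel and one reference transcript.

```python
from itertools import permutations


def pit_loss(pair_loss):
    """PIT: evaluate every channel-to-reference assignment, keep the cheapest.

    pair_loss[ch][ref] is the loss of output channel `ch` scored against
    reference `ref`. Returns (total_loss, assignment), where assignment[ch]
    is the reference index given to channel `ch`.
    """
    n = len(pair_loss)
    best = None
    for perm in permutations(range(n)):
        total = sum(pair_loss[ch][ref] for ch, ref in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


def heat_loss(pair_loss, start_times):
    """HEAT-style sketch: one fixed assignment chosen by a heuristic.

    Assumption: the k-th earliest-starting reference goes to channel k,
    so only one assignment is scored instead of all n! permutations.
    """
    order = sorted(range(len(start_times)), key=lambda ref: start_times[ref])
    total = sum(pair_loss[ch][ref] for ch, ref in enumerate(order))
    return total, tuple(order)


# Hypothetical two-speaker example: rows are channels, columns are references.
pair_loss = [[0.9, 0.2],
             [0.3, 1.1]]
start_times = [1.5, 0.0]  # reference 1 starts before reference 0

pit_total, pit_assignment = pit_loss(pair_loss)
heat_total, heat_assignment = heat_loss(pair_loss, start_times)
```

With these toy numbers both criteria pick the same assignment, but PIT scores every permutation to find it, while the heuristic scores a single one. That cost difference grows factorially with the number of output channels.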
Year
2021
DOI
10.1109/LSP.2021.3070817
Venue
IEEE Signal Processing Letters
Keywords
Speech recognition, Training, Heating systems, Computational modeling, Transducers, Delays, Shape, streaming, unmixing transducer, heuristic error assignment training
DocType
Journal
Volume
28
ISSN
1070-9908
Citations
0
PageRank
0.34
References
8
Authors
4
Name            Order  Citations  PageRank
Liang Lu        1      894        165.81
Naoyuki Kanda   2      103        19.45
Jinyu Li        3      915        72.84
Yifan Gong      4      1332       135.58