Abstract |
---|
Speech recognition on smart devices is challenging owing to their limited memory footprint; hence, small ASR models are desirable. With popular transducer-based models, it has become practical to deploy streaming speech recognition models on small devices [1]. Recently, the two-pass model [2], which combines RNN-T and LAS modules, has shown exceptional performance for streaming on-device speech recognition. In this work, we propose a simple and effective approach to reduce the size of the two-pass model for memory-constrained devices. We employ a popular knowledge distillation approach in three stages using the teacher-student training technique. In the first stage, we use a trained RNN-T model as the teacher and perform knowledge distillation to train the student RNN-T model. The second stage uses the shared encoder and trains a LAS rescorer for the student model using the trained RNN-T+LAS teacher model. Finally, we perform deep finetuning of the student model with the shared RNN-T encoder, RNN-T decoder, and LAS rescorer. Our experimental results on the standard LibriSpeech dataset show that our system achieves a high compression rate of 55% without significant degradation in WER compared to the two-pass teacher model. |
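The abstract only sketches the three-stage teacher-student recipe, so the snippet below illustrates the ingredient common to such stages: a distillation term that pulls the student's output distribution toward the teacher's. This is a minimal PyTorch sketch under generic assumptions; the function `distillation_loss`, the temperature value, and the tensor shapes are illustrative and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher and student output distributions,
    softened by a temperature (a generic distillation term, not the
    paper's exact formulation)."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # 'batchmean' averages over the leading (batch) dimension; scaling by
    # T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Illustrative shapes: per-frame label posteriors of size (batch, frames, vocab)
# from a frozen teacher and a trainable student. In practice this term would be
# mixed with the usual transducer (or cross-entropy) loss via a weighting factor.
student_logits = torch.randn(4, 50, 1024, requires_grad=True)
with torch.no_grad():
    teacher_logits = torch.randn(4, 50, 1024)
kd_term = distillation_loss(student_logits, teacher_logits)
```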
Year | DOI | Venue
---|---|---
2021 | 10.1109/ASRU51503.2021.9688121 | 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Keywords | DocType | ISBN
---|---|---
speech recognition, on-device, online, streaming ASR, model compression, knowledge distillation | Conference | 978-1-6654-3740-0

Citations | PageRank | References
---|---|---
0 | 0.34 | 0
Authors |
---|
7 |

Name | Order | Citations | PageRank
---|---|---|---
Nauman Dawalatabad | 1 | 0 | 0.34 |
Tushar Vatsal | 2 | 0 | 0.34 |
Ashutosh Gupta | 3 | 1 | 2.38 |
Sungsoo Kim | 4 | 0 | 0.68 |
Shatrughan Singh | 5 | 1 | 2.71 |
Dhananjaya Gowda | 6 | 0 | 1.01 |
Chanwoo Kim | 7 | 253 | 28.44 |