Title
Hybrid Autoregressive and Non-Autoregressive Transformer Models for Speech Recognition
Abstract
Autoregressive (AR) models, such as attention-based encoder-decoder models and the RNN-Transducer, have achieved great success in speech recognition. They predict each output token conditioned on the previously emitted tokens and the encoded acoustic states, which makes inference sequential and therefore inefficient on GPUs. Non-autoregressive (NAR) models remove the temporal dependency between output tokens and predict the entire output sequence in a single inference step. However, NAR models still face two major problems. First, a large performance gap remains between NAR models and advanced AR models. Second, most NAR models are difficult to train and slow to converge. We propose a hybrid autoregressive and non-autoregressive transformer (HANAT) model, which integrates the AR and NAR models deeply by sharing parameters. We assume that the AR model helps the NAR model learn linguistic dependencies and accelerates convergence. Furthermore, a two-stage hybrid inference is applied to improve performance. All experiments are conducted on the Mandarin dataset AISHELL-1 and the English dataset LibriSpeech-960h. The results show that HANAT achieves performance competitive with the AR model and outperforms many more complex NAR models. Moreover, its real-time factor (RTF) is only 1/5 that of the AR model.
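The abstract's core idea, a single decoder whose parameters are shared between a causal AR branch and a fully parallel NAR branch, combined with a draft-then-refine two-stage inference, can be sketched as below. This is a minimal PyTorch sketch under stated assumptions: the class name SharedHybridDecoder, the mask-token scheme, the BOS id, and the single-pass refinement step are illustrative guesses at the design, not the paper's published HANAT implementation.

# Hypothetical sketch of the shared AR/NAR decoder idea in the abstract.
# Names, sizes, and the refinement step are assumptions for illustration.
import torch
import torch.nn as nn

class SharedHybridDecoder(nn.Module):
    """One transformer decoder whose weights serve both AR and NAR branches."""

    def __init__(self, vocab_size=4000, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.mask_id = vocab_size                       # assumed <mask> placeholder id
        self.embed = nn.Embedding(vocab_size + 1, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward_ar(self, tokens, enc):
        # AR branch: a causal mask makes each position depend only on its
        # predecessors, so this branch models left-to-right linguistic structure.
        causal = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        h = self.decoder(self.embed(tokens), enc, tgt_mask=causal)
        return self.out(h)

    def forward_nar(self, length, enc):
        # NAR branch: every position starts as <mask>; all tokens are
        # predicted in one parallel step, with no causal mask.
        masks = torch.full((enc.size(0), length), self.mask_id,
                           dtype=torch.long, device=enc.device)
        h = self.decoder(self.embed(masks), enc)
        return self.out(h)

@torch.no_grad()
def two_stage_inference(model, enc, length, bos_id=1):
    # Stage 1: the NAR branch emits a complete draft in a single step.
    draft = model.forward_nar(length, enc).argmax(dim=-1)          # (B, L)
    # Stage 2: the AR branch refines the draft with one teacher-forced pass,
    # which stays parallel on the GPU, unlike token-by-token decoding.
    bos = torch.full((draft.size(0), 1), bos_id,
                     dtype=torch.long, device=draft.device)
    shifted = torch.cat([bos, draft[:, :-1]], dim=1)
    return model.forward_ar(shifted, enc).argmax(dim=-1)

With shared weights, a cross-entropy loss on each branch can simply be summed during training, so the NAR branch inherits the linguistic dependencies learned by the AR branch, matching the convergence benefit the abstract claims.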
Year
2022
DOI
10.1109/LSP.2022.3152128
Venue
IEEE Signal Processing Letters
Keywords
Decoding, Transformers, Acoustics, Predictive models, Training, Speech recognition, Linguistics, Autoregressive, non-autoregressive, transformer, hybrid, speech recognition
DocType
Journal
Volume
29
ISSN
1070-9908
Citations
0
PageRank
0.34
References
0
Authors
7
Name            Order  Citations  PageRank
Zhengkun Tian   1      3          5.79
Jiangyan Yi     2      19         17.99
Jianhua Tao     3      848        138.00
Ye Bai          4      7          5.52
Shuai Zhang     5      52         14.00
Zhengqi Wen     6      4          1.44
Xuefei Liu      7      0          1.35