Abstract |
---|
Non-autoregressive (NAR) models generate multiple outputs of a sequence simultaneously, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive (AR) baselines. Owing to this great potential for real-time applications, an increasing number of NAR models have been explored in different fields to mitigate the performance gap against AR models. In this work, we conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR). Experiments are performed in a state-of-the-art setting using ESPnet. The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness to long-form utterances. We also show that the techniques can be combined for further improvement and applied to NAR end-to-end speech translation. All the implementations are publicly available to encourage further research in NAR speech processing. |
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/ASRU51503.2021.9688157 | 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) |
Keywords | DocType | ISBN
---|---|---|
Non-autoregressive sequence generation, end-to-end speech recognition, end-to-end speech translation | Conference | 978-1-6654-3740-0
Citations | PageRank | References
---|---|---|
1 | 0.35 | 0
Authors |
---|
9 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yosuke Higuchi | 1 | 3 | 3.75 |
Nanxin Chen | 2 | 64 | 7.55 |
Yuya Fujita | 3 | 2 | 1.04 |
Hirofumi Inaguma | 4 | 2 | 0.76 |
Tatsuya Komatsu | 5 | 1 | 1.70
Jaesong Lee | 6 | 1 | 0.69 |
Jumon Nozaki | 7 | 1 | 0.35 |
Tianzi Wang | 8 | 3 | 0.71 |
Shinji Watanabe | 9 | 1158 | 139.38 |