Abstract |
---|
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high-quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large-scale systematic comparison between two self-labeling methods on the one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian, using 27,000 and 58,000 hours of unlabeled audio, respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline. |
Year | DOI | Venue |
---|---|---|
2020 | 10.21437/Interspeech.2020-1917 | INTERSPEECH |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 10 |
Name | Order | Citations | PageRank |
---|---|---|---|
Kritika Singh | 1 | 0 | 0.34 |
Vimal Manohar | 2 | 0 | 0.34 |
Alex Xiao | 3 | 3 | 2.44 |
Sergey Edunov | 4 | 204 | 10.37 |
Ross B. Girshick | 5 | 21921 | 927.22 |
Vitaliy Liptchinsky | 6 | 8 | 3.16 |
Christian Fuegen | 7 | 9 | 6.58 |
Yatharth Saraf | 8 | 0 | 0.34 |
Geoffrey Zweig | 9 | 3406 | 320.25 |
Abdel-rahman Mohamed | 10 | 3772 | 266.13 |