Title
Few-shot learning for E2E speech recognition: architectural variants for support set generation
Abstract
In this paper, we propose two architectural variants of our recent adaptation of the few-shot learning (FSL) framework ‘Matching Networks’ (MN) to end-to-end (E2E) continuous speech recognition (CSR). That adaptation, termed ‘MN-CTC’, combines CTC-loss-based end-to-end episodic training of MN with CTC-based decoding of continuous speech. An important component of MN theory is the labelled support set used during training and inference. The two variants proposed and studied here for E2E CSR, namely the ‘Uncoupled MN-CTC’ and the ‘Coupled MN-CTC’, address the problem of generating supervised support sets from continuous speech. The ‘Uncoupled MN-CTC’ generates the support sets outside the MN architecture. The ‘Coupled MN-CTC’ is a derivative framework that generates the support set within the MN architecture through a multi-task formulation, which couples the support-set generation loss with the main MN-CTC loss to jointly optimize the support sets and the embedding functions of MN. On the TIMIT and Librispeech datasets, we establish the few-shot effectiveness of the proposed variants in terms of PER and LER. We also demonstrate the cross-domain applicability of the MN-CTC formulation: a Librispeech-trained ‘Coupled MN-CTC’ variant performing inference on the low-resource TIMIT target corpus achieves an 8% (absolute) LER advantage over a single-domain (TIMIT-only) scenario.
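As a rough illustration of the matching step the abstract relies on, the sketch below computes an attention-weighted label distribution for one embedded speech frame against a labelled support set, in the spirit of the original Matching Networks formulation (softmax over cosine similarities). It is not the authors' implementation: the toy embeddings, the class indices, the function names, and in particular the weighted-sum form of `coupled_loss` with its `weight` parameter are assumptions made only for illustration; the abstract states that the two losses are coupled but does not say how.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def matching_posterior(frame, support, num_classes):
    """Attention-weighted label distribution for one embedded frame.

    support is a list of (embedding, class_index) pairs, i.e. the
    labelled support set.
    """
    sims = [cosine(frame, emb) for emb, _ in support]
    m = max(sims)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    post = [0.0] * num_classes
    for w, (_, label) in zip(exps, support):
        post[label] += w / z           # accumulate attention mass per class
    return post

def coupled_loss(mn_ctc_loss, support_loss, weight=0.5):
    """Hypothetical multi-task objective: a weighted sum (assumed form)."""
    return mn_ctc_loss + weight * support_loss

# Toy support set with made-up 2-D embeddings; class 0 stands in for a blank.
support = [([1.0, 0.0], 1), ([0.0, 1.0], 2), ([-1.0, 0.0], 0)]
posterior = matching_posterior([0.9, 0.1], support, num_classes=3)
best = max(range(3), key=lambda k: posterior[k])
```

In a CTC setting, per-frame distributions of this kind would play the role of the frame-level posteriors that the CTC loss and decoder consume; how the paper wires this together is not specified in the abstract.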
Year
2022
Venue
2022 30th European Signal Processing Conference (EUSIPCO)
Keywords
Few-shot Learning, Matching Networks, Continuous Speech Recognition, Coupled and Uncoupled architectures, Support Set Generation
DocType
Conference
ISSN
2219-5491
ISBN
978-1-6654-6799-5
Citations
0
PageRank
0.34
References
6
Authors
5