Title | ||
---|---|---|
DECOUPLING PRONUNCIATION AND LANGUAGE FOR END-TO-END CODE-SWITCHING AUTOMATIC SPEECH RECOGNITION |
Abstract | ||
---|---|---|
Despite significant recent advances in end-to-end (E2E) ASR systems for code-switching, their dependence on large amounts of paired audio-text data limits further improvement in model performance. In this paper, we propose a decoupled transformer model that uses monolingual paired data and unpaired text data to alleviate the shortage of code-switching data. The model is decoupled into two parts: an audio-to-phoneme (A2P) network and a phoneme-to-text (P2T) network. The A2P network can learn acoustic patterns from large-scale monolingual paired data. Meanwhile, it generates multiple phoneme sequence candidates for each audio sample in real time during training. The generated phoneme-text paired data is then used to train the P2T network, which can be pre-trained with large amounts of external unpaired text data. By using monolingual data and unpaired text data, the decoupled transformer model reduces, to a certain extent, the E2E model's heavy dependence on code-switching paired training data. Finally, the two networks are optimized jointly through attention fusion. We evaluate the proposed method on a public Mandarin-English code-switching dataset. Compared with our transformer baseline, the proposed method achieves an 18.14% relative mix error rate reduction. |
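The abstract reports its result as a relative reduction in mix error rate. As a rough illustration of how such a figure can be computed, the sketch below assumes the common convention for Mandarin-English data in which Mandarin is scored per character and English per word; the record itself does not spell out the scoring rule. The function names, the example sentences, and the numbers passed to `relative_reduction` are hypothetical and are not taken from the paper.

```python
import re

# Assumption: Mandarin is scored per character and English per word.
# This tokenisation rule is illustrative, not confirmed by the record.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def mix_tokens(text):
    """Split text into Mandarin characters and lowercased English words."""
    return [t.lower() for t in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(
                row[j] + 1,             # delete a reference token
                row[j - 1] + 1,         # insert a hypothesis token
                prev_diag + (r != h),   # substitute, or match at no cost
            )
            prev_diag, row[j] = row[j], cur
    return row[-1]


def mix_error_rate(refs, hyps):
    """Total edit errors divided by total reference tokens."""
    errors = sum(
        edit_distance(mix_tokens(r), mix_tokens(h)) for r, h in zip(refs, hyps)
    )
    total = sum(len(mix_tokens(r)) for r in refs)
    return errors / total


def relative_reduction(baseline, proposed):
    """Relative error rate reduction of `proposed` with respect to `baseline`."""
    return (baseline - proposed) / baseline


# Hypothetical utterance pair, for illustration only.
mer = mix_error_rate(["我想要 play 音乐"], ["我想 play music"])

# Hypothetical error rates, not the paper's numbers.
gain = relative_reduction(0.40, 0.30)
```

A relative reduction is measured against the baseline's own error rate, so the 18.14% quoted above is a fraction of the baseline mix error rate rather than a drop of 18.14 absolute percentage points.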
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/ICASSP39728.2021.9414428 | 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) |
Keywords | DocType | Citations |
---|---|---|
Automatic Speech Recognition, Code-Switching, End-to-End, Decoupled Transformer | Conference | 0 |
PageRank | References | Authors |
---|---|---|
0.34 | 0 | 6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Shuai Zhang | 1 | 37 | 11.44 |
Jiangyan Yi | 2 | 19 | 17.99 |
Zhengkun Tian | 3 | 3 | 5.79 |
Ye Bai | 4 | 7 | 5.52 |
Jianhua Tao | 5 | 848 | 138.00 |
Zhengqi Wen | 6 | 86 | 24.41 |