Abstract | | |
---|---|---|
In this paper, we propose MixSpeech, a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR). MixSpeech trains an ASR model by taking a weighted combination of two different speech features (e.g., mel-spectrograms or MFCC) as the input and recognizing both text sequences, where the two recognition losses are combined with the same weight used for the input. We apply MixSpeech to two popular end-to-end speech recognition models, LAS (Listen, Attend and Spell) and Transformer, and conduct experiments on several low-resource datasets, including TIMIT, WSJ, and HKUST. Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation and outperforms a strong data augmentation method, SpecAugment, on these recognition tasks. Specifically, MixSpeech outperforms SpecAugment with a relative PER improvement of 10.6% on the TIMIT dataset and achieves a strong WER of 4.7% on the WSJ dataset. |
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/ICASSP39728.2021.9414483 | 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021) |
Keywords | DocType | Citations |
---|---|---|
Speech Recognition, Data Augmentation, Low-resource, Mixup | Conference | 0 |
PageRank | References | Authors |
---|---|---|
0.34 | 8 | 6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Linghui Meng | 1 | 0 | 1.69 |
Jin Xu | 2 | 6 | 3.22 |
Xu Tan | 3 | 88 | 23.94 |
Jindong Wang | 4 | 247 | 16.56 |
Tao Qin | 5 | 2384 | 147.25 |
Bo Xu | 6 | 130 | 9.43 |
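
The mixing scheme described in the abstract (a weighted combination of two speech feature sequences as input, with the two recognition losses combined using the same weight) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. The function names `mix_features`, `mixspeech_loss`, and `sample_lambda` are hypothetical, `loss_fn` stands in for whatever recognition loss the ASR model uses, and two details are assumptions that the abstract does not state: the shorter utterance is zero-padded to the length of the longer one, and the weight is drawn from a Beta distribution as in standard mixup.

```python
import random


def mix_features(feats_a, feats_b, lam):
    """Frame-wise weighted combination of two feature sequences.

    feats_a, feats_b: lists of frames, each frame a list of floats
    (e.g. mel-spectrogram bins or MFCC coefficients).
    Assumption: the shorter sequence is zero-padded to the longer one;
    the abstract does not say how length mismatches are handled.
    """
    n_frames = max(len(feats_a), len(feats_b))
    n_bins = len(feats_a[0])
    zero_frame = [0.0] * n_bins
    mixed = []
    for t in range(n_frames):
        frame_a = feats_a[t] if t < len(feats_a) else zero_frame
        frame_b = feats_b[t] if t < len(feats_b) else zero_frame
        mixed.append(
            [lam * a + (1.0 - lam) * b for a, b in zip(frame_a, frame_b)]
        )
    return mixed


def mixspeech_loss(loss_fn, feats_a, text_a, feats_b, text_b, lam):
    """Combine the two recognition losses with the same weight as the input.

    loss_fn(features, text) -> float is a placeholder for the model's
    recognition loss; it is evaluated on the mixed input against each
    of the two target text sequences.
    """
    mixed = mix_features(feats_a, feats_b, lam)
    return lam * loss_fn(mixed, text_a) + (1.0 - lam) * loss_fn(mixed, text_b)


def sample_lambda(alpha=0.5, rng=random):
    """Draw a mixing weight in [0, 1].

    Assumption: Beta(alpha, alpha) sampling, as in standard mixup;
    the value of alpha here is illustrative only.
    """
    return rng.betavariate(alpha, alpha)
```

In a training loop, one would draw `lam = sample_lambda()` for a pair of utterances and backpropagate through the value returned by `mixspeech_loss`. Using one weight for both the input mix and the loss mix keeps the training target proportional to how much each utterance contributes to the input.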