Abstract | ||
---|---|---|
Although great progress has been made on automatic speech recognition (ASR) systems, children's speech recognition remains a challenging task. General ASR systems for children's speech suffer from a lack of corpora and from the mismatch between children's and adults' speech. Efforts have been made to reduce this mismatch by applying normalization methods to generate modified adults' speech for ASR training. However, modified adults' data can reflect the characteristics of children's speech only to a very limited extent. In this work, we adopt text-to-speech (TTS) data augmentation to improve the performance of children's speech recognition systems. We find that the children's TTS model generates speech of inconsistent quality due to children's substandard pronunciations of phonemes, and that the ASR system suffers when trained with these additional synthesized data. To solve this problem, we propose data selection strategies for the TTS-augmented data, which substantially boost the effectiveness of the synthesized data for children's ASR modeling. We show that the speaker-embedding-similarity-based data selection strategy obtains the best performance: relative character error rate (CER) reductions of 14.0% and 14.7% on the child conversation and child reading test sets, respectively, compared to the baseline model trained on real data. |
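
The speaker-embedding-similarity selection mentioned in the abstract could be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how embeddings are extracted, what they are compared against, or how the cut-off is chosen. The function names, the use of cosine similarity against a single reference embedding, and the `threshold` value are all assumptions introduced here. The helper `relative_cer_reduction` simply spells out the usual definition of a relative error-rate reduction.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_by_speaker_similarity(
    synth_embeddings: Dict[str, Sequence[float]],
    reference_embedding: Sequence[float],
    threshold: float = 0.7,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep synthesized utterances whose speaker embedding is close to the reference.

    Assumption: each synthesized utterance already has a speaker embedding, and
    `reference_embedding` represents the intended (real) child speaker.
    """
    return [
        utt_id
        for utt_id, emb in synth_embeddings.items()
        if cosine_similarity(emb, reference_embedding) >= threshold
    ]


def relative_cer_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


# Toy two-dimensional vectors, purely for illustration (real embeddings are much larger).
toy_embeddings = {"utt1": [1.0, 0.0], "utt2": [0.0, 1.0], "utt3": [0.9, 0.1]}
kept = select_by_speaker_similarity(toy_embeddings, [1.0, 0.0])
print(kept)
```

With these toy vectors, `utt2` is orthogonal to the reference and is filtered out, while the other two utterances are kept for ASR training.
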
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/ICASSP39728.2021.9413930 | 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) |

Keywords | DocType | Citations |
---|---|---|
children's speech recognition, data augmentation, text-to-speech, data selection | Conference | 0 |

PageRank | References | Authors |
---|---|---|
0.34 | 0 | 6 |

Name | Order | Citations | PageRank |
---|---|---|---|
Wei Wang | 1 | 0 | 0.34 |
Zhikai Zhou | 2 | 0 | 0.68 |
Yizhou Lu | 3 | 1 | 3.72 |
Hongji Wang | 4 | 2 | 1.40 |
Chenpeng Du | 5 | 0 | 1.69 |
Yanmin Qian | 6 | 295 | 44.44 |