Abstract |
---|
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 source languages. We segment transcripts into sentences and align them to the source-language audio and target-language translations. The corpus is released along with open-sourced code enabling extension to new talks and languages as they become available. Our corpus creation methodology can be applied to more languages than previous work, and creates multi-way parallel evaluation sets. We provide baselines in multiple ASR and ST settings, including multilingual models to improve translation performance for low-resource language pairs. |
Year | DOI | Venue |
---|---|---|
2021 | 10.21437/Interspeech.2021-11 | Interspeech |
DocType | Citations | PageRank |
Conference | 2 | 0.38 |
References | Authors |
0 | 8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Elizabeth Salesky | 1 | 11 | 6.29 |
Matthew Wiesner | 2 | 2 | 1.39 |
Jacob Bremerman | 3 | 2 | 0.38 |
R. Cattoni | 4 | 26 | 5.38 |
Matteo Negri | 5 | 775 | 82.49 |
Marco Turchi | 6 | 560 | 57.79 |
Douglas W. Oard | 7 | 2 | 2.07 |
Matt Post | 8 | 414 | 35.72 |