MuST-C: A Multilingual Corpus For End-To-End Speech Translation - Citegraph

Paper Info

Title
MuST-C: A Multilingual Corpus For End-To-End Speech Translation

Abstract
End-to-end spoken language translation (SLT) has recently gained popularity thanks to advances in sequence-to-sequence learning in its two parent tasks: automatic speech recognition (ASR) and machine translation (MT). However, research in the field has to confront the scarcity of publicly available corpora for training data-hungry neural networks. Indeed, while traditional cascade solutions can build on sizable ASR and MT training data for a variety of languages, the available SLT corpora suitable for end-to-end training are few, typically small, and of limited language coverage. We contribute to filling this gap by presenting MuST-C, a large and freely available Multilingual Speech Translation Corpus built from English TED Talks. Its unique features include: i) language coverage and diversity (from English into 14 languages from different families), ii) size (at least 237 hours of transcribed recordings per language, 430 on average), iii) variety of topics and speakers, and iv) data quality. Besides describing the corpus creation methodology and discussing the outcomes of empirical and manual quality evaluations, we present baseline results computed with strong systems on each language direction covered by MuST-C. (C) 2020 Elsevier Ltd. All rights reserved.

Field	Value
Year	2021
DOI	10.1016/j.csl.2020.101155
Venue	Computer Speech and Language
Keywords	Spoken language translation, Multilingual corpus
DocType	Journal
Volume	66
ISSN	0885-2308
Citations	0
PageRank	0.34
References	0
Authors	5

Authors (5 rows)

Name	Order	Citations	PageRank
R. Cattoni	1	26	5.38
Mattia Antonino Di Gangi	2	8	7.27
Luisa Bentivogli	3	412	33.63
Matteo Negri	4	775	82.49
Marco Turchi	5	560	57.79
