Title
MLS: A Large-Scale Multilingual Dataset for Speech Research
Abstract
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for the other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available to anyone at http://www.openslr.org.
Year
2020
DOI
10.21437/Interspeech.2020-2826
Venue
INTERSPEECH
DocType
Conference
ISSN
Interspeech 2020
Citations
4
PageRank
0.39
References
0
Authors
5
Name | Order | Citations | PageRank
Vineel Pratap | 1 | 16 | 2.69
Qiantong Xu | 2 | 34 | 7.42
Anuroop Sriram | 3 | 5 | 0.76
Gabriel Synnaeve | 4 | 27 | 7.73
Ronan Collobert | 5 | 4002 | 308.61