GigaSpeech - An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio.

Paper Info

Title
GigaSpeech - An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio.

Abstract
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts, and YouTube, covering both read and spontaneous speaking styles and a variety of topics such as arts, science, and sports. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1000h, 2500h, and 10000h. For the 10,000-hour XL training subset, the word error rate is capped at 4% during the filtering/validation stage; for all the smaller training subsets, it is capped at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi, and Pika.
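
The word-error-rate cap described in the abstract can be illustrated with a minimal sketch. This is not the GigaSpeech pipeline itself: it assumes each segment carries its source transcript and a hypothesis from some recognizer, and the names `word_error_rate`, `filter_segments`, `reference`, and `hypothesis` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Assumption: an empty reference is a perfect match only if the
        # hypothesis is empty as well.
        return 0.0 if not hyp else 1.0

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def filter_segments(segments: List[Dict[str, str]], max_wer: float) -> List[Dict[str, str]]:
    """Keep only segments whose WER does not exceed max_wer."""
    return [
        seg
        for seg in segments
        if word_error_rate(seg["reference"], seg["hypothesis"]) <= max_wer
    ]
```

Under these assumptions, `filter_segments(segments, 0.04)` would mirror the 4% cap stated for the XL subset, and `filter_segments(segments, 0.0)` the 0% cap stated for the smaller subsets.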

Authors (21 rows)

Name	Order	Citations	PageRank
Guoguo Chen	1	428	19.89
Shuzhou Chai	2	4	1.06
Guan-Bo Wang	3	5	1.75
Jiayu Du	4	4	0.72
Wei-Qiang Zhang	5	6	3.39
Chao Weng	6	113	19.75
Dan Su	7	75	12.37
Daniel Povey	8	2442	231.75
Jan Trmal	9	235	20.91
Junbo Zhang	10	4	1.74
Mingjie Jin	11	4	0.38
Sanjeev Khudanpur	12	2155	202.00
Shinji Watanabe	13	1158	139.38
Shuaijiang Zhao	14	4	0.72
Wei Zou	15	29	3.89
Xiangang Li	16	34	3.65
Xuchen Yao	17	208	14.09
Yongqing Wang	18	4	2.07
Yujun Wang	19	48	10.48
Zhao You	20	67	9.39
Zhiyong Yan	21	4	1.74