Title
A Public Chinese Dataset for Language Model Adaptation
Abstract
A language model (LM) is an important component of a speech recognition system. The performance of an LM degrades when the domains of the training data and the test data differ. Language model adaptation compensates for this mismatch. However, there has been no public Chinese dataset for evaluating language model adaptation. In this paper, we present a public Chinese dataset called CLMAD for language model adaptation. The dataset covers four domains: sport, stock, fashion, and finance. The differences among these four domains are evaluated. We provide baselines for two commonly used adaptation techniques: interpolation for n-gram models, and fine-tuning for recurrent neural network language models (RNNLMs). For n-gram interpolation, when the source and target domains are relatively similar, the adapted model improves over the unadapted one, but interpolating LMs from very different domains yields no improvement. For RNNLMs, fine-tuning the whole network achieves a larger improvement than fine-tuning only the softmax layer or the embedding layer. When the domain difference is large, the improvement of the adapted RNNLM is significant. We also provide speech recognition results on AISHELL-1 with the LMs trained on CLMAD. CLMAD can be freely downloaded at http://www.openslr.org/55/.
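The n-gram baseline mentioned in the abstract relies on interpolation. As a point of reference, a minimal sketch of linear interpolation between a source-domain and a target-domain LM (the weight \lambda and the probability symbols below are illustrative; the abstract does not specify the exact combination scheme used in the paper) is:

\[ P_{\mathrm{adapt}}(w \mid h) = \lambda\, P_{\mathrm{tgt}}(w \mid h) + (1 - \lambda)\, P_{\mathrm{src}}(w \mid h), \qquad 0 \le \lambda \le 1 \]

where h is the word history and \lambda is typically chosen to minimize perplexity on held-out target-domain text.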
Year
2020
DOI
10.1007/s11265-019-01482-5
Venue
Journal of Signal Processing Systems for Signal, Image and Video Technology
Keywords
Chinese dataset, Language model adaptation, Speech recognition, N-gram, RNNLM
DocType
Journal
Volume
92
Issue
SP8
ISSN
1939-8018
Citations
0
PageRank
0.34
References
0
Authors
5
Name          Order  Citations  PageRank
Ye Bai        1      0          1.35
Jiangyan Yi   2      19         17.99
Jianhua Tao   3      848        138.00
Zhengqi Wen   4      86         24.41
Cunhang Fan   5      0          1.35