Title
A Public Chinese Dataset for Language Model Adaptation
Abstract
A language model (LM) is an important component of a speech recognition system. The performance of an LM degrades when the domains of the training data and the test data differ. Language model adaptation compensates for this mismatch. However, there has been no public Chinese dataset for evaluating language model adaptation. In this paper, we present a public Chinese dataset called CLMAD for language model adaptation. The dataset covers four domains: sport, stock, fashion, and finance. The differences among these four domains are evaluated. We provide baselines for two commonly used adaptation techniques: interpolation for n-gram models, and fine-tuning for recurrent neural network language models (RNNLMs). For n-gram interpolation, when the source and target domains are relatively similar, the adapted model improves over the unadapted one, but interpolating LMs from very different domains yields no improvement. For RNNLMs, fine-tuning the whole network achieves a larger improvement than fine-tuning only the softmax layer or the embedding layer. When the domain difference is large, the improvement of the adapted RNNLM is significant. We also provide speech recognition results on AISHELL-1 with the LMs trained on CLMAD. CLMAD can be freely downloaded at http://www.openslr.org/55/.
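The n-gram baseline mentioned in the abstract relies on interpolation. As a point of reference, a minimal sketch of linear interpolation between a source-domain and a target-domain LM (the weight \lambda and the probability symbols below are illustrative; the abstract does not specify the exact combination scheme used in the paper) is:

\[ P_{\mathrm{adapt}}(w \mid h) = \lambda\, P_{\mathrm{tgt}}(w \mid h) + (1 - \lambda)\, P_{\mathrm{src}}(w \mid h), \qquad 0 \le \lambda \le 1 \]

where h is the word history and \lambda is typically chosen to minimize perplexity on held-out target-domain text.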
Year
2020
DOI
10.1007/s11265-019-01482-5
Venue
Journal of Signal Processing Systems for Signal, Image and Video Technology
Keywords
Chinese dataset, Language model adaptation, Speech recognition, N-gram, RNNLM
DocType
Journal
Volume
92
Issue
SP8
ISSN
1939-8018
Citations
0
PageRank
0.34
References
0
Authors
5
Name          Order  Citations  PageRank
Ye Bai        1      0          1.35
Jiangyan Yi   2      19         17.99
Jianhua Tao   3      848        138.00
Zhengqi Wen   4      86         24.41
Cunhang Fan   5      0          1.35