Title
CLMAD: A Chinese Language Model Adaptation Dataset
Abstract
A language model (LM) is an important part of a speech recognition system. Language model adaptation techniques use a large amount of source-domain data and a limited amount of target-domain data to improve language model performance in the target domain. Although text data is easy to obtain, there is no public Chinese text dataset for language model adaptation tasks. This paper presents a language model adaptation dataset consisting of news data from four domains: sport, stock, fashion, and finance. The discrepancy between the domains is evaluated. Model-combination-based adaptation of n-gram models and three different fine-tuning adaptation methods for recurrent neural network language models (RNNLMs) are evaluated on the dataset. Word error rate (WER) results on AIShell speech data with language models trained on this dataset are also provided. Lattice rescoring with the adapted RNNLM yields an absolute WER reduction of 4.74%.
Year
2018
DOI
10.1109/ISCSLP.2018.8706600
Venue
2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP)
Keywords
Adaptation models, Sports, Data models, Training, Testing, Vocabulary, Artificial neural networks
Field
Data modeling, Recurrent neural network language models, Computer science, Speech recognition, Artificial neural network, Vocabulary, Language model
DocType
Conference
ISBN
978-1-5386-5627-3
Citations
0
PageRank
0.34
References
0
Authors
5
Name, Order, Citations, PageRank
Ye Bai, 1, 7, 5.52
Jianhua Tao, 2, 848, 138.00
Jiangyan Yi, 3, 19, 17.99
Zhengqi Wen, 4, 86, 24.41
Cunhang Fan, 5, 0, 1.35