CLMAD: A Chinese Language Model Adaptation Dataset - Citegraph

Paper Info

Title
CLMAD: A Chinese Language Model Adaptation Dataset

Abstract
A language model (LM) is an important part of a speech recognition system. Language model adaptation techniques use a large amount of source-domain data and a limited amount of target-domain data to improve the performance of language models in the target domain. Even though text datasets are easy to obtain, there is no public Chinese text dataset for language model adaptation tasks. This paper presents a language model adaptation dataset consisting of news data from four different domains: sport, stock, fashion, and finance. The discrepancy between the domains is evaluated. Model-combination-based adaptation of n-gram language models is evaluated on the dataset, as are three different fine-tuning adaptation methods for recurrent neural network language models (RNNLMs). Word error rate (WER) results on AIShell speech data with language models trained on this dataset are also provided. The absolute WER reduction from lattice rescoring with the adapted RNNLM is 4.74%.
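
As a rough illustration of what model-combination-based adaptation can look like, the sketch below linearly interpolates a source-domain and a target-domain model and picks the interpolation weight that minimizes perplexity on target-domain development text. The abstract does not say which combination scheme the paper uses, so linear interpolation is an assumption here; the unigram tables, the token list, and the function names are all invented for illustration and do not come from CLMAD.

```python
import math

def interpolate(p_source, p_target, lam):
    """Mix two probability tables: lam * target + (1 - lam) * source."""
    vocab = set(p_source) | set(p_target)
    return {w: lam * p_target.get(w, 0.0) + (1.0 - lam) * p_source.get(w, 0.0)
            for w in vocab}

def perplexity(probs, tokens):
    """Perplexity of a unigram table on a token sequence."""
    log_sum = sum(math.log(probs[t]) for t in tokens)
    return math.exp(-log_sum / len(tokens))

def best_weight(p_source, p_target, dev_tokens, grid):
    """Grid-search the interpolation weight on target-domain dev text."""
    return min(grid, key=lambda lam: perplexity(
        interpolate(p_source, p_target, lam), dev_tokens))

# Toy unigram distributions (hypothetical, not taken from the dataset).
p_source = {"game": 0.1, "stock": 0.6, "price": 0.3}
p_target = {"game": 0.7, "stock": 0.1, "price": 0.2}
dev_tokens = ["game", "game", "price", "stock"]

grid = [i / 10 for i in range(1, 10)]
lam = best_weight(p_source, p_target, dev_tokens, grid)
mixed = interpolate(p_source, p_target, lam)

print("chosen weight:", lam)
print("source ppl:", perplexity(p_source, dev_tokens))
print("target ppl:", perplexity(p_target, dev_tokens))
print("mixed ppl:", perplexity(mixed, dev_tokens))
```

A real n-gram setup would interpolate conditional probabilities per context and would normally rely on a language modeling toolkit rather than hand-built tables; the toy version only shows how the weight trades off the two domains.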

Year	DOI	Venue
2018	10.1109/ISCSLP.2018.8706600	2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP)
Keywords	Field	DocType
Adaptation models,Sports,Data models,Training,Testing,Vocabulary,Artificial neural networks	Data modeling,Recurrent neural network language models,Computer science,Speech recognition,Artificial neural network,Vocabulary,Language model	Conference
ISBN	Citations	PageRank
978-1-5386-5627-3	0	0.34
References	Authors
0	5

Authors (5 rows)

Name	Order	Citations	PageRank
Ye Bai	1	7	5.52
Jianhua Tao	2	848	138.00
Jiangyan Yi	3	19	17.99
Zhengqi Wen	4	86	24.41
Cunhang Fan	5	0	1.35
