Title
Construction of Chinese segmented and POS-tagged conversational corpora and their evaluations on spontaneous speech recognitions
Abstract
The performance of a corpus-based language and speech processing system depends heavily on the quantity and quality of its training corpora. Although several well-known Chinese corpora have been developed, most consist mainly of written text. Even where existing corpora contain spoken data, the quantity is insufficient and the domain is limited. In this paper, we describe the development of Chinese conversational annotated textual corpora currently being used in the NICT/ATR speech-to-speech translation system. A total of 510K manually checked utterances provide 3.5M words of Chinese corpora. As far as we know, these are the largest conversational textual corpora in the travel domain. Together with the corresponding Japanese and English words from which the Chinese words were translated, they form a set of three parallel corpora. Evaluation experiments on these corpora were conducted by comparing the parameters of the language models, the perplexities of test sets, and speech recognition performance with those for Japanese and English. The characteristics of the Chinese corpora, their limitations, and solutions to these limitations are analyzed and discussed.
Year
2009
Venue
ALR7@IJCNLP
Keywords
pos-tagged conversational corpus, chinese word, language model, spontaneous speech recognition, largest conversational textual corpus, chinese conversational, famous chinese corpus, speech processing system, corpus-based language, atr speech-to-speech translation system, chinese corpus, english word, speech recognition, speech processing
Field
Speech corpus, Speech processing, Computer science, Parallel corpora, Speech recognition, Natural language processing, Artificial intelligence, Speech recognition performance, Language model
DocType
Conference
Citations
2
PageRank
0.36
References
2
Authors
3
Name               Order   Citations   PageRank
Xinhui Hu          1       51          11.32
Ryosuke Isotani    2       38          10.60
Satoshi Nakamura   3       1099        194.59