Title
Topic Modeling of Chinese Language Using Character-Word Relations.
Abstract
Topic models are hierarchical Bayesian models for language modeling and document analysis. They have been widely used and have achieved considerable success in modeling English documents. However, unlike in English and the majority of alphabetic languages, the basic structural unit of the Chinese language is the character rather than the word, and Chinese words are written without spaces between them. Most previous research on topic models for Chinese documents did not take the Chinese character-word relationship into consideration and simply took the Chinese word as the basic term of documents. In this paper, we propose a novel model that incorporates the character-word relation into topic modeling by placing an asymmetric prior on the topic-word distribution of the standard Latent Dirichlet Allocation (LDA) model. Compared to LDA, this model can improve performance in document classification, especially when the test data contain a considerable number of Chinese words that did not appear in the training data.
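The idea of an asymmetric topic-word prior driven by shared characters can be illustrated with a minimal sketch. This is not the paper's construction, which the abstract does not spell out. The function names, the averaging rule, the `base` and `scale` parameters, and all numbers below are assumptions made only for illustration. The sketch shows how a word unseen in training can still receive non-uniform prior mass in a topic because it shares a character with a word that was seen.

```python
def char_based_prior(vocab, topic_char_weight, base=0.01, scale=1.0):
    """Return per-word Dirichlet pseudo-counts for one topic.

    Hypothetical rule (an assumption, not taken from the paper):
    a word's pseudo-count is a small symmetric base plus the average
    topic weight of the characters it is written with.
    Words are assumed to be non-empty strings.
    """
    prior = {}
    for word in vocab:
        # Characters missing from topic_char_weight contribute weight 0.
        mean_weight = sum(topic_char_weight.get(ch, 0.0) for ch in word) / len(word)
        prior[word] = base + scale * mean_weight
    return prior


def smoothed_topic_word(counts, prior):
    """Posterior-mean estimate of one topic's word distribution under the prior."""
    total = sum(counts.get(w, 0) + prior[w] for w in prior)
    return {w: (counts.get(w, 0) + prior[w]) / total for w in prior}


# Toy data (made up): "足球" (football) occurs in training, "篮球" (basketball)
# does not, but both contain the character "球" (ball).
vocab = ["足球", "篮球", "经济"]
char_weight = {"足": 1.0, "球": 1.0}  # made-up character weights for a sports-like topic
counts = {"足球": 5}                  # made-up topic-word counts from training

prior = char_based_prior(vocab, char_weight)
phi = smoothed_topic_word(counts, prior)
```

Under a symmetric prior, the two unseen words "篮球" and "经济" would get equal probability in this topic. With the character-based pseudo-counts, "篮球" is favored because of the shared character, which is the kind of effect the abstract attributes to using the character-word relation.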
Year
2011
DOI
10.1007/978-3-642-24965-5_16
Venue
Lecture Notes in Computer Science
Keywords
Topic Models, Latent Dirichlet Allocation, CWTM, Gibbs Sampler
Field
Document classification, Latent Dirichlet allocation, Document analysis, Computer science, Test data, Artificial intelligence, Natural language processing, Topic model, Machine learning, Language model, Gibbs sampling, Bayesian probability
DocType
Conference
Volume
7064
Issue
PART 3
ISSN
0302-9743
Citations
3
PageRank
0.40
References
4
Authors
3
Name            Order   Citations   PageRank
Qi Zhao         1       10          1.01
Zengchang Qin   2       439         45.46
Tao Wan         3       181         21.18