Topic Modeling of Chinese Language Using Character-Word Relations. - Citegraph

Paper Info

Title
Topic Modeling of Chinese Language Using Character-Word Relations.

Abstract
Topic models are hierarchical Bayesian models for language modeling and document analysis. They have been widely used and have achieved considerable success in modeling English documents. However, unlike in English and the majority of alphabetic languages, the basic structural unit of the Chinese language is the character rather than the word, and Chinese words are written without spaces between them. Most previous research on applying topic models to Chinese documents did not take the Chinese character-word relationship into consideration and simply took the Chinese word as the basic term of documents. In this paper, we propose a novel model that incorporates the character-word relation into topic modeling by placing an asymmetric prior on the topic-word distribution of the standard Latent Dirichlet Allocation (LDA) model. Compared to LDA, this model can improve performance in document classification, especially when the test data contain a considerable number of Chinese words that did not appear in the training data.
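
The abstract does not say how the asymmetric prior is built, so the following is only a minimal sketch of the general idea, not the paper's model. It assumes a hypothetical rule in which a word's prior weight in a topic is a small base value plus the average of per-character weights for that topic. The function names, the `char_weight` table, and the toy counts are all illustrative assumptions. The only standard ingredient is the posterior-mean estimate of a topic-word distribution under a Dirichlet prior, (count + prior) / (total count + total prior).

```python
def char_informed_prior(vocab, char_weight, base=0.01):
    """Hypothetical asymmetric prior for ONE topic (an assumption, not the paper's rule).

    Each word gets `base` plus the mean weight of its characters in this topic.
    """
    prior = {}
    for word in vocab:
        scores = [char_weight.get(ch, 0.0) for ch in word]
        prior[word] = base + sum(scores) / len(scores)
    return prior


def topic_word_estimate(counts, prior):
    """Posterior mean of a topic-word distribution under a Dirichlet prior."""
    total = sum(counts.get(w, 0) + prior[w] for w in prior)
    return {w: (counts.get(w, 0) + prior[w]) / total for w in prior}


# Toy "sports" topic: only one word was observed in training.
vocab = ["足球", "球队", "股票"]      # football, team, stock
counts = {"足球": 10}                 # "球队" and "股票" are unseen
char_weight = {"球": 0.5}             # assumed: "球" (ball) is typical of this topic

symmetric = topic_word_estimate(counts, {w: 0.01 for w in vocab})
asymmetric = topic_word_estimate(counts, char_informed_prior(vocab, char_weight))

# Under the symmetric prior both unseen words are equally likely;
# under the character-informed prior, "球队" outranks "股票".
print(symmetric["球队"], symmetric["股票"])
print(asymmetric["球队"], asymmetric["股票"])
```

Under a symmetric prior every unseen word receives the same probability in a topic, so it carries no topical signal. A prior that depends on shared characters lets an unseen word inherit evidence from seen words, which is consistent with the abstract's claim about test data containing many words absent from training.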

Year	DOI	Venue
2011	10.1007/978-3-642-24965-5_16	Lecture Notes in Computer Science
Keywords	Field	DocType
Topic Models,Latent Dirichlet Allocation,CWTM,Gibbs Sampler	Document classification,Latent Dirichlet allocation,Document analysis,Computer science,Test data,Artificial intelligence,Natural language processing,Topic model,Machine learning,Language model,Gibbs sampling,Bayesian probability	Conference
Volume	Issue	ISSN
7064	PART 3	0302-9743
Citations	PageRank	References
3	0.40	4
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Qi Zhao	1	10	1.01
Zengchang Qin	2	439	45.46
Tao Wan	3	181	21.18
