Abstract | ||
---|---|---|
Topic models are hierarchical Bayesian models for language modeling and document analysis. They have been widely used and have achieved considerable success in modeling English documents. However, unlike English and the majority of alphabetic languages, the basic structural unit of the Chinese language is the character rather than the word, and Chinese words are written without spaces between them. Most previous research on using topic models for Chinese documents did not take the Chinese character-word relationship into consideration and simply took the Chinese word as the basic term of documents. In this paper, we propose a novel model that incorporates the character-word relation into topic modeling by placing an asymmetric prior on the topic-word distribution of the standard Latent Dirichlet Allocation (LDA) model. Compared to LDA, this model can improve performance in document classification, especially when the test data contain a considerable number of Chinese words that did not appear in the training data. |
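
The abstract does not spell out how the asymmetric prior is built, so the following is only a minimal sketch of the general idea, not the authors' construction. It assumes a hypothetical per-topic character weight table (`topic_char_weight`) and gives each word a Dirichlet pseudo-count that grows with the average weight of its characters. The function name and the parameters `base` and `strength` are illustrative assumptions.

```python
def asymmetric_word_prior(vocab, topic_char_weight, base=0.01, strength=0.1):
    """Return per-word Dirichlet pseudo-counts for a single topic.

    Assumption (not from the paper): a word's prior mass is a symmetric
    base value plus a bonus proportional to the mean weight of its
    characters under this topic.
    """
    prior = {}
    for word in vocab:
        chars = list(word)  # a Chinese word is a sequence of characters
        mean_weight = sum(topic_char_weight.get(c, 0.0) for c in chars) / len(chars)
        prior[word] = base + strength * mean_weight
    return prior


# Hypothetical toy data: the topic favors the characters 电 and 脑.
vocab = ["电脑", "电视", "苹果"]
topic_char_weight = {"电": 0.8, "脑": 0.5}
prior = asymmetric_word_prior(vocab, topic_char_weight)
```

Under these assumptions, a word such as 电视 that shares a character with the topic receives more prior mass than 苹果, which shares none. This mirrors the motivation in the abstract: words unseen in training can still borrow information from the characters they contain.
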
Year | DOI | Venue |
---|---|---|
2011 | 10.1007/978-3-642-24965-5_16 | Lecture Notes in Computer Science |
Keywords | Field | DocType |
Topic Models,Latent Dirichlet Allocation,CWTM,Gibbs Sampler | Document classification,Latent Dirichlet allocation,Document analysis,Computer science,Test data,Artificial intelligence,Natural language processing,Topic model,Machine learning,Language model,Gibbs sampling,Bayesian probability | Conference |
Volume | Issue | ISSN |
7064 | PART 3 | 0302-9743 |
Citations | PageRank | References |
3 | 0.40 | 4 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Qi Zhao | 1 | 10 | 1.01 |
Zengchang Qin | 2 | 439 | 45.46 |
Tao Wan | 3 | 181 | 21.18 |