Modeling word burstiness using the Dirichlet distribution - Citegraph

Paper Info

Title
Modeling word burstiness using the Dirichlet distribution

Abstract
Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
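The contrast drawn in the abstract can be illustrated with a minimal sketch. It assumes the standard Dirichlet compound multinomial (Polya) probability mass function, p(x | alpha) = n! / prod_w x_w! * Gamma(A) / Gamma(n + A) * prod_w Gamma(x_w + alpha_w) / Gamma(alpha_w), with A = sum_w alpha_w and n = sum_w x_w. The function names, the four-word vocabulary, and the parameter values are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def multinomial_log_prob(counts, theta):
    """Log-probability of a word-count vector under a multinomial with parameters theta."""
    n = sum(counts)
    # Multinomial coefficient n! / prod(x_w!) computed in log space.
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    return log_coef + sum(x * math.log(t) for x, t in zip(counts, theta) if x > 0)

def dcm_log_prob(counts, alpha):
    """Log-probability of a word-count vector under a DCM (Polya) with parameters alpha."""
    n = sum(counts)
    a = sum(alpha)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    log_ratio = math.lgamma(a) - math.lgamma(n + a)
    log_terms = sum(math.lgamma(x + al) - math.lgamma(al) for x, al in zip(counts, alpha))
    return log_coef + log_ratio + log_terms

# Illustrative assumption: a 4-word vocabulary and two documents of length 4.
bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # every word appears once

theta = [0.25, 0.25, 0.25, 0.25]   # multinomial word probabilities (sum to 1)
alpha = [0.1, 0.1, 0.1, 0.1]       # DCM parameters (sum is unconstrained; assumed small here)

for name, doc in (("bursty", bursty), ("spread", spread)):
    print(name,
          "multinomial:", math.exp(multinomial_log_prob(doc, theta)),
          "DCM:", math.exp(dcm_log_prob(doc, alpha)))
```

In this toy setting the multinomial gives the evenly spread document more probability than the bursty one, while the DCM with small alpha values reverses that ordering. Because the alpha values need not sum to one, their overall scale acts as an extra parameter, which is one way to read the "additional degree of freedom" mentioned in the abstract.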

Year	2005
DOI	10.1145/1102351.1102420
Venue	ICML
DocType	Conference
ISBN	1-59593-180-5
Citations	113
PageRank	7.67
References	13
Authors	3
Keywords	dirichlet compound multinomial model, better classification, standard document collection, dirichlet distribution, model text document, dcm performance, text data, multinomial model, dcm model, multinomial distribution, additional degree, degree of freedom, multinomial, categorization, text mining
Field	Multinomial probit, Categorization, Perplexity, Degrees of freedom (statistics), Heuristic, Pattern recognition, Computer science, Multinomial distribution, Burstiness, Artificial intelligence, Dirichlet distribution, Machine learning

Authors (3 rows)

Name	Order	Citations	PageRank
Rasmus E. Madsen	1	113	7.67
David Kauchak	2	363	25.92
Charles Elkan	3	5118	572.94
