Topic-based document segmentation with probabilistic latent semantic analysis - Citegraph

Paper Info

Title
Topic-based document segmentation with probabilistic latent semantic analysis

Abstract
This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the use of the Probabilistic Latent Semantic Analysis (PLSA) model with the method of selecting segmentation points based on the similarity values between pairs of adjacent blocks. The use of PLSA allows for a better representation of sparse information in a text block, such as a sentence or a sequence of sentences. Furthermore, segmentation performance is improved by combining different instantiations of the same model, either using different random initializations or different numbers of latent classes. Results on commonly available data sets are significantly better than those of other state-of-the-art systems.
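
The boundary-selection idea in the abstract (comparing adjacent blocks and cutting where similarity dips) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each block has already been mapped to a topic-mixture vector by some PLSA-like model, and it swaps in cosine similarity, a simple depth threshold, and plain averaging of similarity curves across model instantiations as stand-ins. All function names and the `min_depth` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adjacent_similarities(blocks):
    """Similarity between each pair of neighbouring blocks."""
    return [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]


def average_curves(curves):
    """Assumed combination rule: average the similarity curves
    produced by several model instantiations, position by position."""
    return [sum(vals) / len(vals) for vals in zip(*curves)]


def pick_boundaries(sims, min_depth=0.3):
    """Return block indices that start a new segment.

    A gap is a candidate when it is a local minimum of the similarity
    curve; it is kept when its depth (drop relative to the highest
    similarity on each side) reaches min_depth. Both the depth
    definition and the threshold are illustrative assumptions.
    """
    boundaries = []
    for i, s in enumerate(sims):
        lower_than_left = i == 0 or sims[i - 1] > s
        lower_than_right = i == len(sims) - 1 or sims[i + 1] >= s
        depth = (max(sims[: i + 1]) - s) + (max(sims[i:]) - s)
        if lower_than_left and lower_than_right and depth >= min_depth:
            boundaries.append(i + 1)  # boundary falls before block i + 1
    return boundaries
```

Given five blocks whose topic mixtures shift sharply between the third and fourth block, `pick_boundaries(adjacent_similarities(blocks))` would place a single boundary before the fourth block. To combine several instantiations under the averaging assumption, compute one similarity curve per model and pass `average_curves(curves)` to `pick_boundaries`.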

Year: 2002
DOI: 10.1145/584792.584829
Venue: CIKM
Keywords: different random initialization, segmentation performance, probabilistic latent semantic analysis, different instantiations, better representation, different topic, different number, segmentation point, new method, topic-based document segmentation, plsa, text segmentation
Field: Data set, Latent Dirichlet allocation, Information retrieval, Pattern recognition, Computer science, Segmentation, Document segmentation, Text segmentation, Natural language processing, Probabilistic latent semantic analysis, Artificial intelligence, Sentence
DocType: Conference
ISBN: 1-58113-492-4
Citations: 69
PageRank: 4.40
References: 11
Authors: 3

Authors (3 rows)

Name	Order	Citations	PageRank
Thorsten Brants	1	1938	190.33
Francine Chen	2	1218	153.96
Ioannis Tsochantaridis	3	2861	155.43