Title
Spoken document retrieval using topic models
Abstract
In this paper, we propose a document topic model (DTM) based on non-negative matrix factorization (NMF) for spontaneous spoken document retrieval. The model uses latent semantic indexing to detect underlying semantic relationships within documents. Each document is interpreted as a generative topic model that belongs to many topics. The relevance of a document to a query is expressed as the probability of the query being generated by the model. The term-document matrix used for NMF is built stochastically from the speech recognition N-best results, so that multiple recognition hypotheses can be used to compensate for word recognition errors. Using this approach, experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ), with 39 queries over more than 600 hours of spontaneous Japanese speech. The retrieval performance of this model is shown to be superior to the conventional vector space model (VSM) when the dimension, or topic number, exceeds a certain threshold. Moreover, in terms of both retrieval performance and the ability to express topics, the NMF-based topic model is found to surpass another latent indexing method based on the singular value decomposition (SVD). The extent to which this topic model can resist speech recognition errors, a problem specific to spoken document retrieval, is also investigated.
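The abstract gives no equations, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It factorizes a small term-document matrix with NMF (here using the standard multiplicative updates for the KL divergence, an assumed choice), turns the reconstruction into per-document word distributions, and ranks documents by the log-probability of generating a query. The toy matrix, the fractional counts standing in for N-best-weighted counts, and all function names are hypothetical.

```python
import numpy as np

def nmf_kl(V, k, iters=300, seed=0, eps=1e-12):
    """Factor V (terms x docs) as W @ H using multiplicative KL updates.

    This update rule is an assumed choice; the abstract does not say
    which NMF objective or algorithm the paper uses.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = V.shape
    W = rng.random((n_terms, k)) + eps   # term-by-topic weights
    H = rng.random((k, n_docs)) + eps    # topic-by-document weights
    for _ in range(iters):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def term_given_doc(W, H, eps=1e-12):
    """Normalize each column of W @ H into a distribution P(w | d)."""
    WH = W @ H + eps
    return WH / WH.sum(axis=0, keepdims=True)

def query_log_likelihood(P_wd, query_term_ids):
    """log P(q | d) per document, assuming independent query terms."""
    return np.log(P_wd[query_term_ids, :]).sum(axis=0)

# Hypothetical 6-term x 4-document matrix. Fractional entries stand in
# for expected counts accumulated over weighted N-best hypotheses.
V = np.array([
    [2.0, 1.5, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.1],
    [1.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.1, 0.0, 1.0, 2.5],
    [0.0, 0.0, 1.5, 1.0],
])

W, H = nmf_kl(V, k=2)
P_wd = term_given_doc(W, H)
scores = query_log_likelihood(P_wd, [0, 1])   # query made of terms 0 and 1
ranking = np.argsort(-scores)                 # best-matching documents first
print(ranking, scores)
```

In this toy setup the query terms occur mainly in the first two documents, so those should receive the higher scores. A real system would also need smoothing for query terms unseen in the collection, which this sketch omits.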
Year
2009
DOI
10.1145/1667780.1667862
Venue
IUCS
Keywords
nmf-based topic model, topic expression, topic number, document topic model, retrieval performance, generative topic model, topic model, multiple recognition hypothesis, document retrieval, conventional vector space model, non negative matrix factorization, speech recognition, nmf, word recognition, vector space model, latent semantic indexing, singular value decomposition
Field
Document clustering, Computer science, Word recognition, Matrix decomposition, Search engine indexing, Speech recognition, Artificial intelligence, Natural language processing, Non-negative matrix factorization, Document retrieval, Topic model, Vector space model
DocType
Conference
Citations
2
PageRank
0.38
References
4
Authors
3
Name
Order
Citations
PageRank
Xinhui Hu15111.32
Ryosuke Isotani23810.60
Satoshi Nakamura31099194.59