Title
Decomposing background topics from keywords by principal component pursuit
Abstract
Low-dimensional topic models have proven very useful for modeling a large corpus of documents that share a relatively small number of topics. Dimensionality reduction tools such as Principal Component Analysis (PCA) or Latent Semantic Indexing (LSI) have been widely adopted for document modeling, analysis, and retrieval. In this paper, we contend that a more pertinent model treats a document corpus as the combination of an (approximately) low-dimensional topic model for the corpus and a sparse model for the keywords of individual documents. For such a joint topic-document model, LSI or PCA is no longer appropriate for analyzing the corpus data. We hence introduce a powerful new tool called Principal Component Pursuit that can effectively decompose the low-dimensional and the sparse components of such corpus data. We give empirical results on data synthesized with a Latent Dirichlet Allocation (LDA) model to validate the new model. We then show that for real document data analysis, the new tool significantly reduces the perplexity and improves retrieval performance compared to classical baselines.
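The decomposition the abstract describes, splitting a term-document matrix M into a low-rank background-topic component L plus a sparse per-document keyword component S, can be sketched with the standard alternating-thresholding (inexact augmented Lagrangian) iteration for Principal Component Pursuit. The function names and parameter defaults below are illustrative, not taken from the paper:

```python
import numpy as np

def shrink(X, tau):
    # Soft-thresholding: proximal operator of the l1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def pcp(M, lam=None, mu=None, n_iter=500):
    """Decompose M ~ L + S with L low-rank and S sparse.

    Minimizes ||L||_* + lam * ||S||_1 subject to L + S = M,
    using the common augmented-Lagrangian / ADMM iteration.
    Defaults for lam and mu follow the usual PCP heuristics.
    """
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))       # standard sparsity weight
    if mu is None:
        mu = m * n / (4.0 * np.abs(M).sum())  # common step-size heuristic
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                      # dual variable
    for _ in range(n_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)   # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)       # sparse update
        Y = Y + mu * (M - L - S)                   # dual ascent
    return L, S
```

On a synthetic matrix built as (random low-rank) + (sparse spikes), the iteration separates the two components; in the paper's setting, M would instead be a corpus term-document matrix, with L capturing shared topics and S the document-specific keywords.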
Year
2010
DOI
10.1145/1871437.1871475
Venue
CIKM
Keywords
sparse model, real document data analysis, low-dimensional topic model, new model, principal component pursuit, corpus data, joint topic-document model, document modeling, large corpus, document corpus, decomposing background topic, pertinent model, perplexity, principal component, principal component analysis, latent dirichlet allocation, latent semantic indexing, data analysis
Field
Data mining, Perplexity, Latent semantic indexing, Latent Dirichlet allocation, Dimensionality reduction, Information retrieval, Computer science, Principal component pursuit, Topic model, Document modeling, Principal component analysis
DocType
Conference
Citations
36
PageRank
1.44
References
13
Authors
4
Name, Order, Citations, PageRank
Kerui Min, 1, 103, 5.33
Zhengdong Zhang, 2, 261, 12.55
John Wright, 3, 10974, 361.48
Yi Ma, 4, 14931, 536.21