Title
Hierarchical classification with a topic taxonomy via LDA.
Abstract
Large-scale hierarchical classification studies how to classify documents into a predefined taxonomy with thousands of categories. Because the category distribution over documents is skewed, that is, most categories have very few labeled documents, data sparseness in the rare categories leads to low classification performance. In this paper, we study the problem of web-page classification over the topic taxonomy of the DMOZ directory. For this hard task, we propose a hierarchical classification model based on latent Dirichlet allocation (LDA). We use the LDA model as a feature extraction technique to extract latent topics, which reduces the effects of data sparseness, and construct topic feature vectors associated with the corpus to train more robust classification models for rare categories. Experiments were conducted on a dataset of web pages from the Chinese Simplified branch of the DMOZ directory. The results show that our method improves performance for rare categories over hierarchical classification methods based on full terms and on feature words, and further improves performance over the whole topic taxonomy.
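The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain collapsed Gibbs sampler for LDA, already-tokenized documents, and hypothetical names and hyperparameters (`lda_topic_features`, `alpha`, `beta`, `iters`). Each document is mapped to a dense vector of topic proportions, which could then replace sparse term vectors as classifier input at each node of the taxonomy.

```python
import random


def lda_topic_features(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Return one topic-proportion vector per document (illustrative sketch).

    docs is a list of token lists. Collapsed Gibbs sampling is an assumed
    inference method; the source does not say which one the paper uses.
    """
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]              # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word v counts in topic k
    topic_total = [0] * num_topics                            # total tokens in topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][vocab[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Resample each token's topic from its conditional distribution.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = vocab[w]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed topic proportions: each vector has num_topics entries summing to 1.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        features.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return features


# Toy usage with made-up tokens; real web pages would first need tokenization
# (and word segmentation for Chinese text).
toy_docs = [
    ["football", "match", "team", "goal"],
    ["stock", "market", "bank", "stock"],
    ["team", "goal", "coach"],
]
toy_features = lda_topic_features(toy_docs, num_topics=2, iters=50)
```

The vector length equals the number of topics rather than the vocabulary size, which is why a rare category with only a handful of labeled pages can still yield a usable training representation under this scheme.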
Year: 2014
DOI: 10.1007/s13042-013-0203-3
Venue: Int. J. Machine Learning & Cybernetics
Keywords: Text categorization, Hierarchical classification, Topic taxonomy, Latent Dirichlet allocation (LDA), Rare category
Field: Dynamic topic model, Latent Dirichlet allocation, Feature vector, Web page, Computer science, Directory, Feature extraction, Natural language processing, Artificial intelligence, Text categorization, Machine learning, Performance improvement
DocType: Journal
Volume: 5
Issue: 4
ISSN: 1868-808X
Citations: 1
PageRank: 0.35
References: 11
Authors: 4
Name            Order  Citations  PageRank
Li He           1      4          1.09
Yan Jia         2      56         10.52
Zhaoyun Ding    3      8          2.54
WeiHong Han     4      64         16.26