Title
Multilayer classification of web pages using random forest and semi-supervised latent Dirichlet allocation
Abstract
The classification of web page content is essential to many information retrieval tasks. In this paper, we propose a new methodology for multilayer soft classification. Our approach combines semi-supervised Latent Dirichlet Allocation (LDA) with the Random Forest classifier. We use LDA to compute the distribution of topics in each document and use these distributions to train the Random Forest classifier. The trained classifier can then categorize each web document at different layers of the category hierarchy. We applied our methodology to a data set collected from DMOZ and obtained satisfactory results.
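The pipeline described in the abstract (per-document topic distributions used as features for a Random Forest) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: scikit-learn's LatentDirichletAllocation is plain unsupervised LDA and stands in for the semi-supervised variant used in the paper; the toy corpus, the category labels, the number of topics, and the choice of training one forest per hierarchy layer are all hypothetical.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: each page has a top-layer and a second-layer label.
pages = [
    "python compiler code software programming",
    "java code software debugging programming",
    "linux kernel driver operating system",
    "windows operating system driver update",
    "football match goal league team",
    "football team league cup goal",
    "tennis racket serve match court",
    "tennis court serve tournament player",
]
layer_labels = {
    "layer1": ["Computers"] * 4 + ["Sports"] * 4,
    "layer2": ["Programming", "Programming", "OS", "OS",
               "Football", "Football", "Tennis", "Tennis"],
}

# Step 1: bag-of-words counts, then unsupervised LDA (assumption: a stand-in
# for the paper's semi-supervised LDA). Each row of theta is the topic
# distribution of one document.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(pages)
lda = LatentDirichletAllocation(n_components=4, random_state=0)
theta = lda.fit_transform(counts)

# Step 2: train one Random Forest per hierarchy layer on the topic
# distributions (assumption: the abstract does not say how layers are handled).
classifiers = {
    layer: RandomForestClassifier(n_estimators=50, random_state=0).fit(theta, labels)
    for layer, labels in layer_labels.items()
}

# Step 3: soft classification of an unseen page: class probabilities per layer.
new_theta = lda.transform(vectorizer.transform(["football league goal team"]))
soft = {
    layer: dict(zip(clf.classes_, clf.predict_proba(new_theta)[0]))
    for layer, clf in classifiers.items()
}

for layer, probabilities in soft.items():
    print(layer, probabilities)
```

The "soft" aspect here comes from predict_proba, which returns a probability for every category in a layer rather than a single hard label.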
Year: 2015
DOI: 10.1109/I4CS.2015.7294479
Venue: 2015 15th International Conference on Innovations for Community Services (I4CS)
Keywords: Semi-Supervised Latent Dirichlet Allocation (LDA), Topic modeling, Web Classification, Random Forest
Field: Resource management, Categorization, Data mining, Web document, Latent Dirichlet allocation, Web page, Computer science, Artificial intelligence, Classifier (linguistics), Hierarchy, Random forest, Machine learning
DocType: Conference
ISBN: 978-1-4673-7327-2
Citations: 1
PageRank: 0.37
References: 12
Authors: 3
Name, Order, Citations, PageRank
Karim Sayadi, 1, 1, 0.37
Quang Vu Bui, 2, 1, 3.41
Marc Bui, 3, 239.28