Title
Hierarchical classification of web documents by stratified discriminant analysis
Abstract
In this work we present and evaluate a methodology for classifying web documents into a predefined hierarchy using the textual content of the documents. Hierarchical classification over taxonomies with thousands of categories is a hard task because training data is scarce. It is one of the rare situations where the scarcity persists despite the large amount of available data, because as more documents become available, more classes are also added to the hierarchy. This leads to a lack of training data for most of the categories, which produces poor individual classification models and tends to bias the classification toward dense categories. Here we propose a novel feature extraction technique called Stratified Discriminant Analysis (sDA) that reduces the dimensions of the text-content features of the web documents along the different levels of the hierarchy. The sDA model is intended to reduce the effects of data scarcity by better grouping and identifying the categories with few training examples, leading to more robust classification models for those categories. The results of classifying web pages from the Kids&Teens branch of the DMOZ directory show that our model extracts features that are well suited for category grouping of web pages and for representing categories with few training examples.
Year
2012
DOI
10.1007/978-3-642-31274-8_8
Venue
IRFC
Keywords
hierarchical classification, available data, stratified discriminant analysis, poor individual classification model, predefined hierarchy, classifying web page, robust classification model, web document, training example, web page, training data
Field
Data mining, Scarcity, Web mining, Web page, Information retrieval, Directory, Computer science, Web query classification, Feature extraction, Linear discriminant analysis, Hierarchy
DocType
Conference
Citations
4
PageRank
0.39
References
28
Authors
2
Name | Order | Citations | PageRank
Juan Carlos Gomez | 1 | 84 | 12.89
Marie-Francine Moens | 2 | 1750 | 139.27