Title
Learning outliers to refine a corpus for Chinese webpage categorization
Abstract
Webpage categorization has become an important topic in recent years. Text is usually the main content of a webpage, so automatic text categorization (ATC) is the key technique for this task. For Chinese text categorization, and for Chinese webpage categorization in particular, one of the basic and urgent problems is the construction of a good benchmark corpus. This study presents a machine learning approach to refining a corpus for Chinese webpage categorization, in which the AdaBoost algorithm is used to identify outliers in the corpus. The standard k-nearest-neighbor (kNN) algorithm under a vector space model (VSM) is used to build the webpage categorization system. Simulation results, together with manual inspection of the identified outliers, indicate that the presented method works well.
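The abstract does not say how AdaBoost is used to flag outliers. The sketch below illustrates one common heuristic, offered here as an assumption rather than the authors' procedure: AdaBoost keeps raising the weight of samples it misclassifies, so samples that end up with the largest weights are candidates for mislabeled or atypical entries. The one-dimensional toy data, the decision-stump weak learner, and the name `adaboost_sample_weights` are all hypothetical; a real corpus would use VSM term vectors as features.

```python
import math


def stump_predict(x, threshold, polarity):
    """Decision stump on one feature: `polarity` if x > threshold, else -polarity."""
    return polarity if x > threshold else -polarity


def adaboost_sample_weights(xs, ys, n_rounds):
    """Run discrete AdaBoost with 1-D stumps and return the final sample weights.

    xs: list of numeric features; ys: list of labels in {-1, +1}.
    """
    n = len(xs)
    weights = [1.0 / n] * n
    values = sorted(set(xs))
    # Candidate thresholds: one below the minimum, then midpoints between neighbours.
    thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]

    for _ in range(n_rounds):
        # Pick the stump with the lowest weighted error under the current weights.
        best = None
        for t in thresholds:
            for p in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, t, p) != y)
                if best is None or err < best[0]:
                    best = (err, t, p)
        err, t, p = best
        if err <= 0.0 or err >= 0.5:
            break  # perfect or useless weak learner: stop boosting

        # Standard AdaBoost update: misclassified samples gain weight.
        alpha = 0.5 * math.log((1.0 - err) / err)
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, p))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


# Toy data (assumed): small x is class -1, large x is class +1,
# except the sample at index 1, whose label has been flipped on purpose.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [-1, 1, -1, -1, 1, 1, 1, 1]

weights = adaboost_sample_weights(xs, ys, n_rounds=3)
suspect = max(range(len(xs)), key=lambda i: weights[i])
print("most suspicious sample index:", suspect)
```

In practice the flagged samples would be reviewed by hand before removal, which matches the manual inspection step the abstract mentions.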
Year: 2005
DOI: 10.1007/11539087_19
Venue: ICNC (1)
Keywords: webpage categorization system, auto text categorization, Chinese webpage categorization, AdaBoost algorithm, Chinese text categorization, good benchmark corpus, key technique, important topic, main content, webpage categorization, k nearest neighbor, vector space model, machine learning
Field: k-nearest neighbors algorithm, Categorization, AdaBoost algorithm, Web page, Computer science, Outlier, Auto-text, Artificial intelligence, Vector space model, Text categorization, Machine learning
DocType: Conference
Volume: 3610
ISSN: 0302-9743
ISBN: 3-540-28323-4
Citations: 1
PageRank: 0.36
References: 14
Authors: 4
Name            Order   Citations   PageRank
Dingsheng Luo   1       46          11.61
Xinhao Wang     2       57          15.23
Xihong Wu       3       279         53.02
Huisheng Chi    4       211         22.81