Title
An n-Gram Based Approach to Multi-Labeled Web Page Genre Classification
Abstract
The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre, even when the Web page belongs to more than one genre. Experiments are run on a multi-labeled data set using both an SVM classifier and a distance function classification model. These n-gram based methods had very high precision results but somewhat lower recall results, indicating that the genre labels assigned by the classifiers are quite accurate, but that these machine learning classifiers are not assigning as many labels as did the human classifiers. The classification results compare favorably with those of other researchers on the same data set.
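As an illustration of the approach the abstract describes, the following is a minimal sketch and not the authors' implementation. It assumes character n-grams, a profile made of the most frequent n-grams, and a squared relative-difference dissimilarity; the abstract does not specify any of these. All function names, the `threshold` parameter, and the default values for `n` and `top` are hypothetical. A page receives every genre whose profile lies within the threshold, so it can carry more than one label, and precision and recall are computed over the assigned label sets.

```python
from collections import Counter


def ngram_profile(text, n=3, top=500):
    """Normalized frequencies of the `top` most common character n-grams.

    Character n-grams and the profile size are assumptions of this sketch.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).most_common(top)}


def profile_distance(p, q):
    """Sum of squared relative frequency differences over the union of n-grams.

    This particular dissimilarity is an assumed choice, not one confirmed
    by the abstract.
    """
    dist = 0.0
    for g in set(p) | set(q):
        a, b = p.get(g, 0.0), q.get(g, 0.0)
        # a + b > 0 here because g has a positive frequency in p or q.
        dist += ((a - b) / ((a + b) / 2.0)) ** 2
    return dist


def assign_genres(page_text, genre_profiles, threshold, n=3, top=500):
    """Return every genre whose profile is within `threshold` of the page."""
    page = ngram_profile(page_text, n, top)
    return {genre for genre, prof in genre_profiles.items()
            if profile_distance(page, prof) <= threshold}


def precision_recall(predicted, actual):
    """Micro-averaged precision and recall over lists of label sets."""
    tp = sum(len(p & a) for p, a in zip(predicted, actual))
    n_pred = sum(len(p) for p in predicted)
    n_act = sum(len(a) for a in actual)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_act if n_act else 0.0
    return precision, recall
```

For example, if a classifier assigns only `{"blog"}` to a page that human annotators labeled `{"blog", "shop"}`, `precision_recall([{"blog"}], [{"blog", "shop"}])` gives a precision of 1.0 and a recall of 0.5: the assigned label is correct, but fewer labels were assigned than the annotators gave, which is the pattern the abstract reports.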
Year: 2010
DOI: 10.1109/HICSS.2010.58
Venue: System Sciences
Keywords: svm classifier, genre label, distance function classification model, extraordinary growth, n-gram representation, web page genre, multi-labeled data, classification result, multi-labeled web page genre, web page, world wide web, classification algorithms, internet, support vector machines, web pages, html, distance function, machine learning
Field: Web page, Computer science, Popularity, Knowledge management, Metric (mathematics), Natural language processing, Artificial intelligence, n-gram, The Internet, Support vector machine, Svm classifier, Statistical classification, Machine learning
DocType: Conference
ISSN: 1530-1605
E-ISBN: 978-1-4244-5510-2
ISBN: 978-1-4244-5510-2
Citations: 2
PageRank: 0.37
References: 11
Authors: 5
Name                 Order  Citations  PageRank
Jane E. Mason        1      8          1.56
Michael Shepherd     2      159        10.56
Jack Duffy           3      101        7.57
Vlado Keselj         4      343        39.11
Carolyn R. Watters   5      970        107.76