Abstract |
---|
The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre, even when the Web page belongs to more than one genre. Experiments are run on a multi-labeled data set using both an SVM classifier and a distance function classification model. These n-gram based methods had very high precision results but somewhat lower recall results, indicating that the genre labels assigned by the classifiers are quite accurate, but that these machine learning classifiers are not assigning as many labels as did the human classifiers. The classification results compare favorably with those of other researchers on the same data set. |
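
The abstract names an n-gram representation and a distance function classification model but gives no details of either. The following is a minimal sketch of how such a pipeline could look, not the authors' implementation. Several things here are assumptions made for illustration: the use of character n-grams, the values of `n` and `top_k`, the squared relative-difference distance, and the threshold rule that lets one page receive several genre labels. The function names are hypothetical.

```python
from collections import Counter


def ngram_profile(text: str, n: int = 3, top_k: int = 500) -> dict:
    """Map the top_k most frequent character n-grams to normalized frequencies."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in counts.most_common(top_k)}


def profile_distance(p: dict, q: dict) -> float:
    """Sum of squared relative frequency differences over the union of n-grams.

    Assumed distance for illustration; the abstract does not state which
    distance function the paper uses.
    """
    dist = 0.0
    for g in p.keys() | q.keys():
        fp, fq = p.get(g, 0.0), q.get(g, 0.0)
        # fp + fq > 0 because g occurs in at least one of the two profiles.
        dist += (2.0 * (fp - fq) / (fp + fq)) ** 2
    return dist


def assign_genres(page_text: str, genre_profiles: dict, threshold: float,
                  n: int = 3, top_k: int = 500) -> set:
    """Return every genre whose profile lies within `threshold` of the page.

    Returning a set (possibly empty, possibly several genres) is one simple
    way to support multi-labeled pages; the threshold is a tunable assumption.
    """
    page = ngram_profile(page_text, n, top_k)
    return {
        genre
        for genre, profile in genre_profiles.items()
        if profile_distance(page, profile) <= threshold
    }


if __name__ == "__main__":
    # Toy genre profiles built from made-up text, for demonstration only.
    profiles = {
        "shop": ngram_profile("buy now add to cart buy now checkout"),
        "faq": ngram_profile("question answer question answer help"),
    }
    print(assign_genres("add to cart and buy now", profiles, threshold=200.0))
```

Under this sketch a tighter threshold assigns fewer labels per page, which is one way a classifier can end up with high precision and lower recall, as the abstract reports for its own methods.
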
Year | DOI | Venue |
---|---|---|
2010 | 10.1109/HICSS.2010.58 | System Sciences |
Keywords | Field | DocType |
svm classifier,genre label,distance function classification model,extraordinary growth,n-gram representation,web page genre,multi-labeled data,classification result,multi-labeled web page genre,web page,world wide web,classification algorithms,internet,support vector machines,web pages,html,distance function,machine learning | Web page,Computer science,Popularity,Knowledge management,Metric (mathematics),Natural language processing,Artificial intelligence,n-gram,The Internet,Support vector machine,Svm classifier,Statistical classification,Machine learning | Conference |
ISSN | ISBN | Citations |
1530-1605 | 978-1-4244-5510-2 | 2 |
PageRank | References | Authors |
0.37 | 11 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Jane E. Mason | 1 | 8 | 1.56 |
Michael Shepherd | 2 | 159 | 10.56 |
Jack Duffy | 3 | 101 | 7.57 |
Vlado Keselj | 4 | 343 | 39.11 |
Carolyn R. Watters | 5 | 970 | 107.76 |