Title
Blog, Forum or Newspaper? Web Genre Detection Using SVMs
Abstract
In recent years, blogs have become a very popular way to publish information, express opinions and hold discussions. Researchers and industry are therefore interested in analyzing the blogosphere. Because blog usage is increasingly diverse, categorizing websites into web genres is a necessary first step before any further analysis. In this research, we focus on the distinction between traditional blogs, news portals, forums and miscellaneous websites. In particular, the new distinction between news portals and blogs allows analyses to adapt to the network-specific characteristics of traditional media with high journalistic effort on the one hand and more personal weblogs and their authors on the other. We present a set of 80 features and experiment extensively with feature combinations and SVM parameters to identify the best configuration for categorization into the four web genres. Our experiments show a maximum overall accuracy of 83.5%. This accuracy was reached using a combination of trained n-grams, structural properties (e.g. Twitter links) and quantitative properties such as the text's length and the number of dates.
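The three feature families named in the abstract (n-grams, structural properties, quantitative properties) can be illustrated with a minimal sketch. The abstract does not define the 80 features, so everything here is an assumption made for illustration: the regular expressions, the use of character trigrams, and the names `char_ngrams` and `extract_features` are hypothetical and not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical patterns: the paper's actual feature definitions are not given.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b")
TWITTER_LINK_RE = re.compile(
    r'href="https?://(?:www\.)?twitter\.com/[^"]*"', re.IGNORECASE
)
TAG_RE = re.compile(r"<[^>]+>")


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def extract_features(html, vocabulary, n=3):
    """Map one HTML page to a flat numeric feature vector.

    Layout (an assumption of this sketch):
    [text length, number of dates, number of Twitter links, n-gram counts...]
    """
    # Strip tags and collapse whitespace to get the visible text.
    text = re.sub(r"\s+", " ", TAG_RE.sub(" ", html)).strip()
    grams = char_ngrams(text, n)

    quantitative = [len(text), len(DATE_RE.findall(text))]
    structural = [len(TWITTER_LINK_RE.findall(html))]
    ngram_counts = [grams[g] for g in vocabulary]
    return quantitative + structural + ngram_counts


page = (
    '<p>Posted 2015-12-06 by me</p>'
    '<a href="https://twitter.com/someone">follow</a>'
)
vector = extract_features(page, vocabulary=["pos", "fol"])
```

Vectors of this kind, one per page, could then be passed with their genre labels to an SVM implementation of the reader's choice. In a real setup the n-gram vocabulary would be learned from training pages rather than fixed by hand.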
Year
2015
DOI
10.1109/WI-IAT.2015.59
Venue
2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)
Keywords
web categorization, machine learning, blogs
Field
Publication, Data mining, Categorization, World Wide Web, Information retrieval, Computer science, Support vector machine, Newspaper, Feature extraction, Constellation, Blogosphere
DocType
Conference
Volume
3
Citations
2
PageRank
0.37
References
12
Authors
4
Name
Order
Citations
PageRank
Philipp Berger1178.14
Patrick Hennig2147.38
Martin Schönberg320.37
Christoph Meinel42341319.90