Abstract | ||
---|---|---|
In recent years, blogs have become a very popular way to publish information, express opinions, and hold discussions. Hence, researchers and industry are interested in analyzing the blogosphere. Because blog usage is increasingly diverse, an initial categorization into web genres is a necessary first step before any analysis. In this research, we focus on the distinction between traditional blogs, news portals, forums, and miscellaneous websites. In particular, the new distinction between news portals and blogs allows analyses to adapt to the network-specific characteristics of traditional media, which involve high journalistic effort, and of more personal weblogs and their authors. We present a set of 80 features and experiment extensively with possible combinations and SVM parameters to identify the best configuration for categorization into the four web genres. Our experiments show a maximum overall accuracy of 83.5%. This accuracy was reached using a combination of trained n-grams, structural properties (e.g., Twitter links), and quantitative properties such as the text's length and the number of dates. |
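
The abstract names three kinds of features: trained n-grams, structural properties such as Twitter links, and quantitative properties such as text length and number of dates. As a rough illustration only, the sketch below computes one structural and two quantitative features from a page's HTML. The regular expressions, the ISO date format, the token-based length measure, and the `extract_features` helper are all assumptions made for this example; the abstract does not specify how the paper defines its 80 features.

```python
import re

# Hypothetical patterns: the abstract does not give the actual feature definitions.
TWITTER_LINK = re.compile(
    r'href=["\']https?://(?:www\.)?twitter\.com/[^"\']*["\']', re.IGNORECASE
)
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # assumes dates written as YYYY-MM-DD
TAG = re.compile(r"<[^>]+>")


def extract_features(html: str) -> dict:
    """Return a few illustrative structural and quantitative features for one page."""
    text = TAG.sub(" ", html)  # crude tag stripping, sufficient for a sketch
    return {
        "twitter_links": len(TWITTER_LINK.findall(html)),  # structural property
        "text_length": len(text.split()),  # length in whitespace-separated tokens
        "date_count": len(ISO_DATE.findall(text)),  # quantitative property
    }


page = (
    "<html><body><h1>My trip</h1>"
    "<p>Posted on 2015-03-01, updated 2015-03-04.</p>"
    '<a href="https://twitter.com/example">Follow me</a>'
    "</body></html>"
)
features = extract_features(page)
print(features)
```

In a pipeline like the one the abstract describes, values of this kind would presumably be combined with n-gram features into one vector per page and passed to an SVM classifier.
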
Year | DOI | Venue |
---|---|---|
2015 | 10.1109/WI-IAT.2015.59 | 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) |

Keywords | Field | DocType |
---|---|---|
web categorization, machine learning, blogs | Publication, Data mining, Categorization, World Wide Web, Information retrieval, Computer science, Support vector machine, Newspaper, Feature extraction, Constellation, Blogosphere | Conference |

Volume | Citations | PageRank |
---|---|---|
3 | 2 | 0.37 |

References | Authors |
---|---|
12 | 4 |

Name | Order | Citations | PageRank |
---|---|---|---|
Philipp Berger | 1 | 17 | 8.14 |
Patrick Hennig | 2 | 14 | 7.38 |
Martin Schönberg | 3 | 2 | 0.37 |
Christoph Meinel | 4 | 2341 | 319.90 |