Analyzing the Impact of Unbalanced Data on Web Spam Classification - Citegraph

Paper Info

Title
Analyzing the Impact of Unbalanced Data on Web Spam Classification

Abstract
Web spam remains a serious threat to search engines because illegitimate pages can severely degrade the quality of their results. Several works have tried to reduce the impact of spam content. Regardless of the approach taken, every proposal has had to deal with a corpus in which legitimate pages vastly outnumber web sites with spam content. Unbalanced data is a well-known problem in many practical applications of machine learning and significantly affects the performance of standard classifiers. Focusing on web spam detection, this work has two objectives: to evaluate the effect of the class imbalance ratio on popular classifiers such as Naive Bayes, SVM and C5.0, and to assess how their performance can be improved when different types of techniques are combined in an unbalanced scenario.
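
The abstract refers to a class imbalance ratio and, through the keywords, to sampling techniques. As a rough illustration only, the sketch below computes an imbalance ratio and applies random undersampling, one generic sampling technique. The record does not say which techniques the authors used, so the function names, the toy label counts, and the choice of undersampling are all assumptions made for this example.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_undersample(samples, labels, seed=0):
    """Randomly keep only as many items per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    kept = []
    for cls in counts:
        indices = [i for i, y in enumerate(labels) if y == cls]
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original ordering of the retained items
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy corpus (not from the paper): 95 legitimate pages, 5 spam pages.
labels = ["legit"] * 95 + ["spam"] * 5
samples = list(range(len(labels)))

print(imbalance_ratio(labels))  # 95 / 5 = 19.0
bal_samples, bal_labels = random_undersample(samples, labels)
print(Counter(bal_labels))
```

Undersampling discards majority-class examples, so it trades training data for balance; oversampling the minority class or combining classifiers in an ensemble are alternative routes that a study of this kind might compare.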

Field	Value
Year	2015
DOI	10.1007/978-3-319-19638-1_28
Venue	Distributed Computing and Artificial Intelligence, 12th International Conference
Keywords	web spam detection, unbalanced data, sampling techniques, ensemble of classifiers
Field	Data mining, Search engine, Naive Bayes classifier, Information retrieval, Computer science, Support vector machine, Artificial intelligence, Machine learning, Spamdexing
DocType	Conference
Volume	373
ISSN	2194-5357
Citations	1
PageRank	0.34
References	10
Authors	6

Authors (6 rows)

Name	Order	Citations	PageRank
J. Fdez-Glez	1	26	3.28
David Ruano-Ordás	2	94	9.32
Florentino Fdez-Riverola	3	464	57.16
José Ramon Méndez	4	254	17.69
Reyes Pavón	5	57	8.08
Rosalia Laza	6	131	14.52
