Title
Analyzing the Impact of Unbalanced Data on Web Spam Classification
Abstract
Web spam is a serious problem that continues to threaten search engines, because the presence of illegitimate pages can severely degrade the quality of their results. To fight web spam, several works have tried to reduce the impact of spam content. Regardless of the type of approach developed, all the proposals have faced the difficulty of dealing with a corpus in which the difference between the number of legitimate pages and the number of web sites with spam content is extremely high. Unbalanced data is a well-known problem in many practical applications of machine learning, with significant effects on the performance of standard classifiers. Focusing on web spam detection, the objective of this work is two-fold: to evaluate the effect of the class imbalance ratio on popular classifiers such as Naive Bayes, SVM and C5.0, and to assess how their performance can be improved when different types of techniques are combined in an unbalanced scenario.
Year
2015
DOI
10.1007/978-3-319-19638-1_28
Venue
DISTRIBUTED COMPUTING AND ARTIFICIAL INTELLIGENCE, 12TH INTERNATIONAL CONFERENCE
Keywords
web spam detection, unbalanced data, sampling techniques, ensemble of classifiers
Field
Data mining, Search engine, Naive Bayes classifier, Information retrieval, Computer science, Support vector machine, Artificial intelligence, Machine learning, Spamdexing
DocType
Conference
Volume
373
ISSN
2194-5357
Citations
1
PageRank
0.34
References
10
Authors
6
Name                        Order  Citations  PageRank
J. Fdez-Glez                1      26         3.28
David Ruano-Ordás           2      94         9.32
Florentino Fdez-Riverola    3      464        57.16
José Ramon Méndez           4      254        17.69
Reyes Pavón                 5      57         8.08
Rosalia Laza                6      131        14.52