Abstract |
---|
Web spam remains a serious threat to search engines because illegitimate pages can severely degrade the quality of their results. Several works have tried to reduce the impact of spam content. Regardless of the approach taken, every proposal has had to deal with a corpus in which the gap between the number of legitimate pages and the number of web sites with spam content is extremely large. Unbalanced data is a well-known problem in many practical applications of machine learning and has a significant effect on the performance of standard classifiers. Focusing on web spam detection, the objective of this work is two-fold: to evaluate the effect of the class imbalance ratio on popular classifiers such as Naive Bayes, SVM and C5.0, and to assess how their performance can be improved when different types of techniques are combined in an unbalanced scenario. |
Year | DOI | Venue |
---|---|---|
2015 | 10.1007/978-3-319-19638-1_28 | DISTRIBUTED COMPUTING AND ARTIFICIAL INTELLIGENCE, 12TH INTERNATIONAL CONFERENCE |
Keywords | Field | DocType |
web spam detection, unbalanced data, sampling techniques, ensemble of classifiers | Data mining, Search engine, Naive Bayes classifier, Information retrieval, Computer science, Support vector machine, Artificial intelligence, Machine learning, Spamdexing | Conference |
Volume | ISSN | Citations |
373 | 2194-5357 | 1 |
PageRank | References | Authors |
0.34 | 10 | 6 |
Name | Order | Citations | PageRank |
---|---|---|---|
J. Fdez-Glez | 1 | 26 | 3.28 |
David Ruano-Ordás | 2 | 94 | 9.32 |
Florentino Fdez-Riverola | 3 | 464 | 57.16 |
José Ramon Méndez | 4 | 254 | 17.69 |
Reyes Pavón | 5 | 57 | 8.08 |
Rosalia Laza | 6 | 131 | 14.52 |