Title
Vandalism Detection in Wikidata
Abstract
Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation which can be edited by anyone. Its knowledge is increasingly used within Wikipedia as well as in all kinds of information systems, which imposes high demands on its integrity. Nevertheless, Wikidata frequently gets vandalized, exposing all its users to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detect vandalism in Wikidata. We engineer 47 features that exploit both content and context information, and we report on 4 classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015 and achieves an area under curve of the receiver operating characteristic (ROC-AUC) of 0.991, thereby significantly outperforming the state of the art represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia (0.868 ROC-AUC).
Year
2016
DOI
10.1145/2983323.2983740
Venue
ACM International Conference on Information and Knowledge Management
Keywords
Data Quality, Knowledge Base, Vandalism
Field
Information system, Data mining, World Wide Web, Data quality, Information retrieval, Computer science, Exploit, Knowledge base, Detector
DocType
Conference
Citations
16
PageRank
0.96
References
28
Authors
4
Name             Order  Citations  PageRank
Stefan Heindorf  1      32         5.87
Martin Potthast  2      871        87.94
Benno Stein      3      1499       148.97
Gregor Engels    4      2245       420.50