Abstract | ||
---|---|---|
Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation, which can be edited by anyone. Its knowledge is increasingly used within Wikipedia as well as in all kinds of information systems, which imposes high demands on its integrity. Nevertheless, Wikidata frequently gets vandalized, exposing all its users to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detect vandalism in Wikidata. We engineer 47 features that exploit both content and context information, and we report on 4 classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015 and achieves an area under the curve of the receiver operating characteristic (ROC-AUC) of 0.991, thereby significantly outperforming the state of the art represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia (0.868 ROC-AUC). |
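
The ROC-AUC figures quoted in the abstract can be read as the probability that a detector scores a randomly chosen vandalism revision higher than a randomly chosen benign one. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code: the function name `roc_auc` and the toy `labels` and `scores` are assumptions made for illustration and are not drawn from WDVC-2015.

```python
def roc_auc(labels, scores):
    """Return ROC-AUC as the fraction of (positive, negative) pairs
    in which the positive example receives the higher score.
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = vandalism, 0 = benign edit.
labels = [0, 0, 1, 0, 1, 1]
# Hypothetical classifier scores (higher = more likely vandalism).
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.9]

auc = roc_auc(labels, scores)
print(f"ROC-AUC = {auc:.3f}")
```

The quadratic pairwise loop is chosen for readability; on a corpus-sized dataset a rank-based computation would be the usual choice.
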
Year | DOI | Venue |
---|---|---|
2016 | 10.1145/2983323.2983740 | ACM International Conference on Information and Knowledge Management |
Keywords | Field | DocType |
Data Quality, Knowledge Base, Vandalism | Information system, Data mining, World Wide Web, Data quality, Information retrieval, Computer science, Exploit, Knowledge base, Detector | Conference |
Citations | PageRank | References |
16 | 0.96 | 28 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Stefan Heindorf | 1 | 32 | 5.87 |
Martin Potthast | 2 | 871 | 87.94 |
Benno Stein | 3 | 1499 | 148.97 |
Gregor Engels | 4 | 2245 | 420.50 |