Title
A Crowdsourcing Approach To Construct Mono-Lingual Plagiarism Detection Corpus
Abstract
Plagiarism detection deals with detecting plagiarized fragments among textual documents. The availability of digital documents in online libraries makes plagiarism easier, but it also makes plagiarized text easier to detect with automatic plagiarism detection systems. Large-scale plagiarism corpora with a wide variety of plagiarism cases are needed to evaluate and tune detection methods in different languages. Despite their importance, few corpora have been developed for low-resource languages. In this paper, we propose HAMTA, a Persian plagiarism detection corpus. To simulate real cases of plagiarism, manually paraphrased texts are used to compile the corpus. To obtain the manual plagiarism cases, a crowdsourcing platform is developed and crowd workers are asked to paraphrase fragments of text. Moreover, artificial methods are used to scale up the proposed corpus by automatically generating cases of text re-use. The evaluation results indicate a high correlation between the proposed corpus and the PAN state-of-the-art English plagiarism detection corpus.
Year
2021
DOI
10.1007/s00799-020-00294-4
Venue
INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES
Keywords
Persian corpus, Crowdsourcing, Plagiarism detection, Text re-use detection, Low resource languages
DocType
Journal
Volume
22
Issue
1
ISSN
1432-5012
Citations
0
PageRank
0.34
References
0
Authors
4
Name                Order  Citations  PageRank
Habibollah Asghari  1      10         4.92
Omid Fatemi         2      78         15.71
Salar Mohtaj        3      0          0.34
Heshaam Faili       4      104        28.10