A Crowdsourcing Approach To Construct Mono-Lingual Plagiarism Detection Corpus - Citegraph

Paper Info

Title
A Crowdsourcing Approach To Construct Mono-Lingual Plagiarism Detection Corpus

Abstract
Plagiarism detection deals with detecting plagiarized fragments among textual documents. The availability of digital documents in online libraries makes plagiarism easier but, on the other hand, also makes it easier to detect with automatic plagiarism detection systems. Large-scale plagiarism corpora with a wide variety of plagiarism cases are needed to evaluate different detection methods in different languages. Plagiarism detection corpora play an important role in evaluating and tuning plagiarism detection systems. Despite their importance, few corpora have been developed for low-resource languages. In this paper, we propose HAMTA, a Persian plagiarism detection corpus. To simulate real cases of plagiarism, manually paraphrased texts are used to compile the corpus. To obtain the manual plagiarism cases, a crowdsourcing platform is developed and crowd workers are asked to paraphrase fragments of text. Moreover, artificial methods are used to scale up the proposed corpus by automatically generating cases of text re-use. The evaluation results indicate a high correlation between the proposed corpus and the PAN state-of-the-art English plagiarism detection corpus.

Field	Value
Year	2021
DOI	10.1007/s00799-020-00294-4
Venue	International Journal on Digital Libraries
Keywords	Persian corpus, Crowdsourcing, Plagiarism detection, Text re-use detection, Low resource languages
DocType	Journal
Volume	22
Issue	1
ISSN	1432-5012
Citations	0
PageRank	0.34
References	0
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Habibollah Asghari	1	10	4.92
Omid Fatemi	2	78	15.71
Salar Mohtaj	3	0	0.34
Heshaam Faili	4	104	28.10
