Title
Debugging a Crowdsourced Task with Low Inter-Rater Agreement
Abstract
In this paper, we describe the process we used to debug a crowdsourced labeling task with low inter-rater agreement. In the labeling task, the workers' subjective judgment was used to detect high-quality social media content (interesting tweets), with the ultimate aim of building a classifier that would automatically curate Twitter content. We describe the effects of varying the genre and recency of the dataset, of testing the reliability of the workers, and of recruiting workers from different crowdsourcing platforms. We also examined the effect of redesigning the work itself, both to make it easier and to potentially improve inter-rater agreement. As a result of the debugging process, we have developed a framework for diagnosing similar efforts and a technique to evaluate worker reliability. The technique for evaluating worker reliability, Human Intelligence Data-Driven Enquiries (HIDDENs), differs from other such schemes in that it has the potential to produce useful secondary results and enhance performance on the main task. HIDDEN subtasks pivot around the same data as the main task, but ask workers questions with greater expected inter-rater agreement. Both the framework and the HIDDENs are currently in use in a production environment.
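Note: the paper itself does not include code. As a reference point for what "inter-rater agreement" quantifies in a labeling task like this one, the following is a minimal Python sketch of Fleiss' kappa, a standard agreement statistic for multiple raters; the function name and sample data are illustrative assumptions, not taken from the paper.

def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned category j to item i.
    Every row must sum to the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_total = n_items * n_raters

    # Per-item observed agreement, then its mean across items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the marginal category proportions
    n_categories = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / n_total for j in range(n_categories)]
    p_e = sum(p * p for p in p_cat)

    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 5 tweets, 3 workers each,
# categories = (interesting, not interesting)
ratings = [
    [3, 0],
    [2, 1],
    [1, 2],
    [0, 3],
    [2, 1],
]
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")  # ~0.20, i.e. low agreement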
Year
2015
DOI
10.1145/2756406.2757741
Venue
ACM/IEEE Joint Conference on Digital Libraries
Keywords
Crowdsourcing, labeling, inter-rater agreement, relevance judgment, debugging, Captchas, worker reliability
Field
Data mining, Computer science, Crowdsourcing, Artificial intelligence, Classifier (linguistics), Inter-rater reliability, Social media, Ask price, Information retrieval, Human intelligence, CAPTCHA, Machine learning, Debugging
DocType
Conference
Citations
5
PageRank
0.47
References
12
Authors
3
Name                   Order  Citations  PageRank
Omar Alonso            1      855        65.44
Catherine C. Marshall  2      2382       287.21
Marc A. Najork         3      2538       278.16