Title
Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning
Abstract
Social media has led to the democratisation of opinion sharing. A wealth of information about public opinions, current events, and authors' insights into specific topics can be gained by understanding the text written by users. However, the language used by different authors in different contexts on the web varies widely, which makes interpretation an extremely challenging task. Crowdsourcing presents an opportunity to interpret the sentiment, or topic, of free text. However, the subjectivity and bias of human interpreters raise challenges in inferring the semantics expressed by the text. To overcome this problem, we present a novel Bayesian approach to language understanding that relies on aggregated crowdsourced judgements. Our model encodes the relationships between labels and text features in documents, such as tweets, web articles, and blog posts, while accounting for the varying reliability of human labellers. It allows inference of annotations that scales to arbitrarily large pools of documents. Our evaluation on two challenging crowdsourcing datasets shows that, by efficiently exploiting language models learnt from aggregated crowdsourced labels, we can improve classifications by up to 25% when only a small portion (less than 4%) of the documents has been labelled. Compared with six state-of-the-art methods, we reduce the number of crowd responses required to achieve comparable accuracy by up to 67%. Our method was a joint winner of the CrowdFlower - CrowdScale 2013 Shared Task challenge at the Conference on Human Computation and Crowdsourcing (HCOMP 2013).
Year
2015
DOI
10.1145/2736277.2741689
Venue
WWW
Field
Data science, Data mining, Computer science, Crowdsourcing, Artificial intelligence, Language model, World Wide Web, Social media, Sentiment analysis, Inference, Interpreter, Semantics, Machine learning, Bayesian probability
DocType
Conference
Citations
9
PageRank
0.55
References
19
Authors
7
Name                  Order  Citations  PageRank
Edwin Simpson         1      65         8.50
Matteo Venanzi        2      251        16.27
Steven Reece          3      133        11.51
Pushmeet Kohli        4      7398       332.84
John Guiver           5      482        21.48
Stephen J. Roberts    6      1244       174.70
Nicholas R. Jennings  7      19348      1564.35