Title
Unbiased online active learning in data streams
Abstract
Unlabeled samples can be intelligently selected for labeling to minimize classification error. In many real-world applications, a large number of unlabeled samples arrive in a streaming manner, making it impossible to maintain all the data in a candidate pool. In this work, we focus on binary classification problems and study selective labeling in data streams where a decision is required on each sample sequentially. We consider the unbiasedness property in the sampling process, and design optimal instrumental distributions to minimize the variance in the stochastic process. Meanwhile, Bayesian linear classifiers with weighted maximum likelihood are optimized online to estimate parameters. In empirical evaluation, we collect a data stream of user-generated comments on a commercial news portal over 30 consecutive days, and carry out an offline evaluation to compare various sampling strategies, including unbiased active learning, biased variants, and random sampling. Experimental results verify the usefulness of online active learning, especially in non-stationary settings with concept drift.
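The unbiasedness property mentioned in the abstract can be illustrated with a minimal sketch of importance-weighted selective labeling. This is not the paper's method: the query rule in `query_probability`, the `floor` parameter, and the synthetic stream from `make_stream` are all assumptions made for illustration, and the paper's optimal instrumental distributions are not reproduced here. The sketch only shows the general idea that labeling a sample with probability q and weighting it by 1/q keeps an error estimate unbiased.

```python
import random


def query_probability(score, floor=0.1):
    """Hypothetical query rule: ask for a label more often when the
    classifier score is near the decision boundary at 0.5. The floor keeps
    every probability positive so the 1/q weight is always defined."""
    return max(floor, 1.0 - abs(2.0 * score - 1.0))


def make_stream(n, rng):
    """Synthetic stream (an assumption): each item is (score, label), with
    the label drawn as 1 with probability equal to the score."""
    stream = []
    for _ in range(n):
        score = rng.random()
        label = 1 if rng.random() < score else 0
        stream.append((score, label))
    return stream


def estimate_error_rate(stream, rng):
    """Make one pass over the stream, deciding on each sample as it arrives.

    A sample is labeled with probability q; a labeled mistake contributes
    1/q and everything else contributes 0. The expected contribution of each
    sample equals its true 0/1 error, so the estimate is unbiased."""
    weighted_errors = 0.0
    n_seen = 0
    n_labeled = 0
    for score, label in stream:
        n_seen += 1
        q = query_probability(score)
        if rng.random() < q:
            n_labeled += 1
            predicted = 1 if score >= 0.5 else 0
            if predicted != label:
                weighted_errors += 1.0 / q
    return weighted_errors / n_seen, n_labeled
```

Averaging the estimate over many independent passes should approach the error rate computed from the fully labeled stream, while each pass requests only a fraction of the labels. Dropping the 1/q weight would give a biased variant of the kind the abstract compares against.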
Year
2011
DOI
10.1145/2020408.2020444
Venue
KDD
Keywords
unlabeled sample, sampling process, offline evaluation, binary classification problem, classification error, unbiased online active learning, random sampling, online active learning, empirical evaluation, various sampling strategy, data stream, active learning, importance sampling, maximum likelihood, concept drift, stochastic process, design optimization, binary classification
Field
Data mining, Data stream mining, Active learning, Binary classification, Computer science, Data stream, Stochastic process, Concept drift, Sampling (statistics), Artificial intelligence, Machine learning, Bayesian probability
DocType
Conference
Citations
41
PageRank
1.07
References
16
Authors
5
Name | Order | Citations | PageRank
Wei Chu | 1 | 2589 | 139.79
Martin Zinkevich | 2 | 1893 | 160.99
Lihong Li | 3 | 2390 | 128.53
Achint Oommen Thomas | 4 | 74 | 3.63
Belle Tseng | 5 | 721 | 48.61