Title
Sampling dilemma: towards effective data sampling for click prediction in sponsored search
Abstract
Accurately predicting the probability that users click on ads plays a key role in sponsored search. State-of-the-art sponsored search systems typically employ machine learning for click prediction. While paying much attention to extracting useful features and building effective models, previous studies have overlooked less obvious but important challenges in data sampling. To fulfill the learning objective of click prediction, the sampled training data must not only follow an input distribution similar to that of the real world, but also yield a conditional output distribution, i.e. click-through rate (CTR), consistent with the real-world data. However, because clicks are sparse in sponsored search, these two requirements are difficult to satisfy simultaneously. In this paper, we first present a theoretical analysis that reveals this sampling dilemma, followed by a thorough data analysis demonstrating that straightforward random sampling may not effectively balance the two kinds of consistency. To address this problem, we propose a new sampling algorithm that retains consistency between the sampled data and the real world in terms of both input distribution and conditional output distribution. Large-scale evaluations on click-through logs from a commercial search engine demonstrate that the new sampling algorithm effectively addresses the sampling dilemma. Further experiments show that a model trained on data obtained by the new sampling algorithm achieves much higher accuracy in click prediction.
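The two kinds of consistency named in the abstract can be made concrete with a small simulation. The sketch below is not the paper's algorithm or data: it builds a synthetic click log (the ad popularity weights, the CTR range, and all function names are assumptions made for illustration), draws a uniform random sample, and measures how far the sample drifts from the full log in input distribution (total variation distance over ad frequencies) and in conditional output distribution (mean absolute gap in per-ad CTR).

```python
import random
from collections import Counter


def make_log(n_impressions, n_ads, seed=0):
    """Build a synthetic log of (ad_id, clicked) pairs. All parameters are illustrative."""
    rng = random.Random(seed)
    # Skewed ad popularity: the ad at rank r is shown with weight 1 / (r + 1).
    weights = [1.0 / (rank + 1) for rank in range(n_ads)]
    # Hypothetical per-ad click-through rates, kept low so that clicks are sparse.
    true_ctr = [rng.uniform(0.005, 0.05) for _ in range(n_ads)]
    ads = rng.choices(range(n_ads), weights=weights, k=n_impressions)
    return [(ad, 1 if rng.random() < true_ctr[ad] else 0) for ad in ads]


def input_distribution(log):
    """Empirical share of impressions per ad (the input distribution)."""
    counts = Counter(ad for ad, _ in log)
    return {ad: count / len(log) for ad, count in counts.items()}


def per_ad_ctr(log):
    """Empirical CTR per ad (the conditional output distribution)."""
    shows, clicks = Counter(), Counter()
    for ad, clicked in log:
        shows[ad] += 1
        clicks[ad] += clicked
    return {ad: clicks[ad] / shows[ad] for ad in shows}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def mean_ctr_gap(full_ctr, sample_ctr):
    """Mean absolute per-ad CTR difference; ads absent from the sample count as CTR 0."""
    gaps = [abs(ctr - sample_ctr.get(ad, 0.0)) for ad, ctr in full_ctr.items()]
    return sum(gaps) / len(gaps)


def uniform_sample(log, rate, seed=1):
    """Keep each impression independently with probability `rate`."""
    rng = random.Random(seed)
    return [row for row in log if rng.random() < rate]


if __name__ == "__main__":
    log = make_log(100_000, 200)
    full_inputs, full_ctr = input_distribution(log), per_ad_ctr(log)
    for rate in (0.5, 0.05):
        sample = uniform_sample(log, rate)
        tv = total_variation(full_inputs, input_distribution(sample))
        gap = mean_ctr_gap(full_ctr, per_ad_ctr(sample))
        print(f"rate={rate}: input TV distance={tv:.4f}, mean CTR gap={gap:.4f}")
```

Running the script prints both measures for two sampling rates, which gives a rough way to see how each kind of consistency behaves as the sample shrinks on this toy data. It says nothing about the results reported in the paper.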
Year
2014
DOI
10.1145/2556195.2556242
Venue
WSDM
Keywords
click prediction, thorough data analysis, effective data, real world data, commercial search engine, conditional output distribution, straightforward random sampling method, new sampling algorithm, real world, sampling dilemma, training data, online advertising
Field
Training set, Data mining, Search engine, Information retrieval, Computer science, Online advertising, Sampling (statistics), Artificial intelligence, Dilemma, Data sampling, Machine learning
DocType
Conference
Citations
1
PageRank
0.39
References
13
Authors
6
Name            Order   Citations   PageRank
Jun Feng        1       29          3.20
Jiang Bian      2       897         61.74
Taifeng Wang    3       179         13.33
Wei Chen        4       3416        170.71
Xiaoyan Zhu     5       2125        141.16
Tie-yan Liu     6       4662        256.32