Abstract
---

We consider a structured multi-armed bandit problem in which the mean rewards of different arms are related through a hidden parameter. We propose an approach that generalizes classical bandit algorithms such as UCB and Thompson sampling to the structured bandit setting. Our approach exploits the structure in the problem to identify some arms as sub-optimal and pull them only O(1) times. This results in a significant reduction in cumulative regret, and in fact our algorithm achieves bounded (i.e., O(1)) regret whenever possible. We empirically demonstrate the superiority of our algorithms via simulations and experiments on the MovieLens dataset. Moreover, the problem setting we study in this paper subsumes several previously studied frameworks, such as Global, Regional, and Structured bandits with linear rewards.
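The abstract's central idea, using the known mean-reward functions of a hidden parameter to eliminate arms that cannot be optimal and running UCB only on the surviving "competitive" set, can be sketched as follows. The specific mean functions, the grid-based plug-in estimate of the hidden parameter, and the confidence margin are illustrative assumptions for this sketch, not the paper's exact algorithm.

```python
import math
import random

# Assumed structure: each arm's mean reward is a known function of a
# hidden parameter theta in [0, 1] (hypothetical mean functions).
MU = [lambda t: t, lambda t: 1.0 - t, lambda t: 0.5]

def structured_ucb(theta_star, horizon, seed=0):
    """UCB restricted to a 'competitive' set of arms.

    Arms whose mean function cannot be near-optimal for the estimated
    theta are excluded from UCB's argmax, so they are pulled only a
    bounded number of times (a crude sketch of the O(1)-pulls idea).
    """
    rng = random.Random(seed)
    K = len(MU)
    counts = [0] * K
    sums = [0.0] * K
    # Initialise by pulling each arm once (Gaussian noise is assumed).
    for k in range(K):
        counts[k] += 1
        sums[k] += MU[k](theta_star) + rng.gauss(0, 0.1)
    for n in range(K, horizon):
        # Plug-in estimate of theta: match the most-pulled arm's
        # empirical mean over a grid (purely illustrative).
        best = max(range(K), key=lambda k: counts[k])
        grid = [i / 100 for i in range(101)]
        theta_hat = min(
            grid, key=lambda t: abs(MU[best](t) - sums[best] / counts[best])
        )
        # Competitive set: arms within a shrinking confidence margin of
        # the best predicted mean at theta_hat.
        margin = math.sqrt(2 * math.log(n + 1) / counts[best])
        top = max(MU[k](theta_hat) for k in range(K))
        competitive = [k for k in range(K) if MU[k](theta_hat) >= top - margin]
        # Standard UCB index, but only over the competitive arms.
        arm = max(
            competitive,
            key=lambda k: sums[k] / counts[k]
            + math.sqrt(2 * math.log(n + 1) / counts[k]),
        )
        counts[arm] += 1
        sums[arm] += MU[arm](theta_star) + rng.gauss(0, 0.1)
    return counts

# With theta_star = 0.9 the first arm (mean 0.9) is optimal, and the
# other two arms should receive only a small fraction of the pulls.
counts = structured_ucb(theta_star=0.9, horizon=2000)
```

As the estimate of theta sharpens, the margin shrinks and sub-optimal arms leave the competitive set, which is what drives the bounded-regret behaviour described in the abstract.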
Field | Value
---|---
Year | 2018
Venue | arXiv: Machine Learning
Keywords | Mathematical optimization, Regret, MovieLens, Thompson sampling, Correlation, Mathematics, Bounded function
DocType | Journal
Volume | abs/1810.08164
Citations | 0
PageRank | 0.34
References | 0
Authors | 3
Name | Order | Citations | PageRank
---|---|---|---
Samarth Gupta | 1 | 13 | 5.60 |
Gauri Joshi | 2 | 308 | 29.70 |
Osman Yagan | 3 | 430 | 43.65 |