Abstract |
---|
We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds $O(t)$ i.i.d. pseudo-rewards to its history in round $t$ and then pulls the arm with the highest estimated value in its perturbed history. Therefore, we call it perturbed-history exploration (PHE). The pseudo-rewards are designed to offset the underestimated values of arms in round $t$ with a sufficiently high probability. We analyze PHE in a $K$-armed bandit and prove an $O(K \Delta^{-1} \log n)$ bound on its $n$-round regret, where $\Delta$ is the minimum gap between the expected rewards of the optimal and suboptimal arms. The key to our analysis is a novel argument showing that randomized Bernoulli rewards lead to optimism. We compare PHE empirically to several baselines and show that it is competitive with the best of them. |
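The abstract's description of PHE (perturb each arm's history with i.i.d. Bernoulli pseudo-rewards, then pull the arm with the highest perturbed mean estimate) can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the Bernoulli(1/2) pseudo-reward distribution, the perturbation scale `a`, and the round-robin initialization are assumptions for the Bernoulli-reward setting.

```python
import math
import random

def phe(means, n, a=1.1, seed=0):
    """Sketch of perturbed-history exploration (PHE) for a Bernoulli bandit.

    Each round, an arm pulled s times gets ceil(a * s) i.i.d. Bernoulli(1/2)
    pseudo-rewards added to its history (an assumed design; `a` is a tuning
    parameter), and the arm with the highest perturbed mean is pulled.
    Returns the cumulative pseudo-regret over n rounds.
    """
    rng = random.Random(seed)
    K = len(means)
    pulls = [0] * K      # s_i: number of pulls of arm i
    rewards = [0] * K    # V_i: cumulative observed reward of arm i
    best = max(means)
    regret = 0.0
    for t in range(n):
        if t < K:
            arm = t      # initialization: pull every arm once
        else:
            def perturbed(i):
                z = math.ceil(a * pulls[i])                     # pseudo-reward count
                u = sum(rng.random() < 0.5 for _ in range(z))   # Binomial(z, 1/2) draw
                return (rewards[i] + u) / (pulls[i] + z)        # perturbed mean estimate
            arm = max(range(K), key=perturbed)
        reward = 1 if rng.random() < means[arm] else 0          # Bernoulli reward
        pulls[arm] += 1
        rewards[arm] += reward
        regret += best - means[arm]
    return regret
```

Because the pseudo-rewards occasionally push an underestimated arm's perturbed estimate above the incumbent's, the algorithm keeps exploring without an explicit confidence bound or posterior; its regret on a two-armed instance should sit well below that of uniform random play.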
Year | DOI | Venue |
---|---|---|
2019 | 10.24963/ijcai.2019/386 | IJCAI |
Field | DocType | Volume
---|---|---
Binary logarithm, Online algorithm, Mathematical optimization, Regret, Regret minimization, Mathematics, Offset (computer science), Bernoulli's principle | Journal | abs/1902.10089
Citations | PageRank | References
---|---|---
0 | 0.34 | 2
Authors |
---|
4
Name | Order | Citations | PageRank |
---|---|---|---|
Branislav Kveton | 1 | 455 | 49.32 |
Csaba Szepesvári | 2 | 1774 | 157.06 |
Mohammad Ghavamzadeh | 3 | 814 | 67.73 |
Craig Boutilier | 4 | 6864 | 621.05 |