Title
Exploiting the Natural Exploration In Contextual Bandits.
Abstract
The contextual bandit literature has traditionally focused on algorithms that address the exploration-exploitation trade-off. In particular, greedy policies that exploit current estimates without any exploration may be sub-optimal in general. However, exploration-free greedy policies are desirable in many practical settings where exploration may be prohibitively costly or unethical (e.g. clinical trials). We prove that, for a general class of context distributions, the greedy policy benefits from the natural exploration provided by the varying contexts and becomes asymptotically rate-optimal for the two-armed contextual bandit. Through simulations, we also demonstrate that these results generalize to more than two arms if the dimension of contexts is large enough. Motivated by these results, we introduce Greedy-First, a new algorithm that uses only observed contexts and rewards to determine whether to follow a greedy policy or to explore. We prove that this algorithm is asymptotically optimal without any additional assumptions on the distribution of contexts or the number of arms. Extensive simulations demonstrate that Greedy-First successfully reduces experimentation and outperforms existing (exploration-based) contextual bandit algorithms such as Thompson sampling, UCB, or $\epsilon$-greedy.
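The following is a hedged illustrative sketch, not the authors' implementation: a Python/NumPy simulation of the exploration-free greedy policy the abstract describes, assuming a linear reward model (reward = context · beta_arm + noise), per-arm ridge-regression estimates, and Gaussian contexts; all parameter values and variable names are assumptions chosen for illustration. The stochastic contexts supply the "natural exploration" discussed above.

import numpy as np

# Illustrative sketch (not the paper's code): greedy policy for a two-armed
# linear contextual bandit with ridge-regression arm estimates and no forced exploration.

rng = np.random.default_rng(0)
d, K, T, lam = 5, 2, 2000, 1.0              # context dimension, arms, horizon, ridge penalty (assumed values)
beta = rng.normal(size=(K, d))              # true arm parameters, unknown to the policy

A = [lam * np.eye(d) for _ in range(K)]     # per-arm regularized Gram matrices
b = [np.zeros(d) for _ in range(K)]         # per-arm response vectors
regret = 0.0

for t in range(T):
    x = rng.normal(size=d)                                  # fresh stochastic context each round
    est = [np.linalg.solve(A[i], b[i]) for i in range(K)]   # current ridge estimates
    arm = int(np.argmax([x @ est[i] for i in range(K)]))    # purely greedy choice
    reward = x @ beta[arm] + rng.normal(scale=0.5)
    A[arm] += np.outer(x, x)                                 # update only the pulled arm's statistics
    b[arm] += reward * x
    regret += max(x @ beta[i] for i in range(K)) - x @ beta[arm]

print(f"cumulative regret after {T} rounds: {regret:.1f}")

Under the varying-context assumption, the cumulative regret printed by this simulation grows slowly with T, which is the behavior the paper proves for the two-armed case.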
Year
2017
Venue
arXiv: Machine Learning
Field
Thompson sampling, Exploit, Artificial intelligence, Asymptotically optimal algorithm, Mathematics, Machine learning
DocType
Volume
abs/1704.09011
Citations
0
Journal
PageRank
0.34
References
7
Authors
3
Name, Order, Citations, PageRank
Hamsa Bastani, 1, 16, 2.72
Mohsen Bayati, 2, 678, 44.93
Khashayar Khosravi, 3, 6, 2.45