Abstract |
---|
We consider a stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time. To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. This involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions. As a by-product, we obtain novel deviation results that can be used beyond non-stationary environments. We provide theoretical guarantees on the behavior of D-LinUCB in both slowly-varying and abruptly-changing environments. We obtain an upper bound on the dynamic regret that is of order d^{2/3} B_T^{1/3} T^{2/3}, where B_T is a measure of non-stationarity (d and T being, respectively, the dimension and the horizon). This rate is known to be optimal. We also illustrate the empirical performance of D-LinUCB and compare it with recently proposed alternatives in simulated environments. |
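The discounted linear regression and optimistic selection described in the abstract can be sketched as follows. This is an illustrative Python sketch only: the class and parameter names are hypothetical, and the fixed exploration width `alpha` is a simplification standing in for the paper's deviation-based confidence bonus.

```python
import numpy as np

class DiscountedLinUCBSketch:
    """Hypothetical sketch of a discounted-least-squares bandit.

    Past observations are down-weighted by a discount factor gamma,
    so the estimator tracks a slowly-varying regression parameter.
    """

    def __init__(self, d, gamma=0.99, lam=1.0, alpha=1.0):
        self.gamma = gamma          # discount factor in (0, 1): exponentially forgets the past
        self.lam = lam              # ridge regularization level
        self.alpha = alpha          # exploration width (simplified confidence bonus)
        self.V = lam * np.eye(d)    # discounted, regularized design matrix
        self.b = np.zeros(d)        # discounted reward-weighted feature sum

    def update(self, x, r):
        d = len(x)
        # Discount past statistics, add the new observation; the extra
        # (1 - gamma) * lam * I term keeps the regularization fixed at lam * I.
        self.V = self.gamma * self.V + np.outer(x, x) + (1 - self.gamma) * self.lam * np.eye(d)
        self.b = self.gamma * self.b + r * x

    def select(self, arms):
        # Weighted least-squares estimate, then optimism in the face of uncertainty.
        theta = np.linalg.solve(self.V, self.b)
        Vinv = np.linalg.inv(self.V)
        ucb = [x @ theta + self.alpha * np.sqrt(x @ Vinv @ x) for x in arms]
        return int(np.argmax(ucb))
```

In a stationary regime the discounting merely slows concentration slightly, while after an abrupt change the exponential weights let stale observations decay geometrically instead of biasing the estimate indefinitely.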
Year | Venue | Field |
---|---|---|
2019 | ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019) | Mathematical optimization,Exponential function,Regret,Regression,Upper and lower bounds,Computer science,Horizon,Estimator,Linear regression |
DocType | Volume | ISSN
---|---|---|
Conference | 32 | 1049-5258
Citations | PageRank | References
---|---|---|
0 | 0.34 | 0
Authors |
---|
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Russac, Yoan | 1 | 0 | 1.35 |
Vernade, Claire | 2 | 10 | 1.93 |
O. Cappe | 3 | 2112 | 207.95 |