Abstract |
---|
When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a 'universal off-policy estimator' (UnO), one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts. |
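
The abstract does not say how the return distribution is estimated. As a rough illustration only, the sketch below assumes per-trajectory importance ratios are available and uses a self-normalized weighted empirical CDF, from which a mean and a quantile are read off. The function names (`weighted_cdf`, `cdf_mean`, `cdf_quantile`) and the choice of self-normalization are assumptions made for this example, not the paper's stated procedure, and the sketch gives point estimates only, without confidence bounds.

```python
import numpy as np

def weighted_cdf(returns, weights):
    """Self-normalized, importance-weighted empirical CDF of returns.

    returns: returns observed under the data-collecting policy.
    weights: hypothetical non-negative importance ratios, one per return.
    """
    returns = np.asarray(returns, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    order = np.argsort(returns)
    support = returns[order]
    # Normalizing by the weight sum makes the CDF end at exactly 1.
    cdf = np.cumsum(weights[order]) / total
    return support, cdf

def cdf_mean(support, cdf):
    """Mean of the discrete distribution described by (support, cdf)."""
    mass = np.diff(cdf, prepend=0.0)  # probability mass at each support point
    return float(np.sum(support * mass))

def cdf_quantile(support, cdf, alpha):
    """Smallest support point whose CDF value is at least alpha."""
    idx = int(np.searchsorted(cdf, alpha, side="left"))
    return float(support[min(idx, len(support) - 1)])
```

Other parameters listed in the abstract (variance, inter-quantile range, CVaR) could be computed from the same `(support, cdf)` pair in a similar way; the point of the sketch is that one estimated distribution feeds every downstream statistic.
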
Year | Venue | DocType |
---|---|---|
2021 | Annual Conference on Neural Information Processing Systems | Conference |
Citations | PageRank | References |
---|---|---|
0 | 0.34 | 0 |
Authors |
---|
6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yash Chandak | 1 | 0 | 1.01 |
S. Niekum | 2 | 165 | 23.73 |
Bruno Castro da Silva | 3 | 0 | 0.34 |
Erik Learned-Miller | 4 | 0 | 0.34 |
Emma Brunskill | 5 | 0 | 0.34 |
Philip S. Thomas | 6 | 22 | 3.27 |