Safe Policy Improvement with an Estimated Baseline Policy - Citegraph

Paper Info

Title
Safe Policy Improvement with an Estimated Baseline Policy

Abstract
Previous work has shown the unreliability of existing algorithms in the batch Reinforcement Learning setting and proposed the theoretically grounded Safe Policy Improvement with Baseline Bootstrapping (SPIBB) fix: reproduce the baseline policy in the uncertain state-action pairs in order to control the variance of the trained policy's performance. However, in many real-world applications such as dialogue systems, pharmaceutical tests, or crop management, data is collected under human supervision and the baseline remains unknown. In this paper, we apply SPIBB algorithms with a baseline estimate built from the data. We formally show safe policy improvement guarantees over the true baseline even without direct access to it. Our empirical experiments on finite- and continuous-state tasks support the theoretical findings. The approach shows little loss of performance in comparison with SPIBB when the baseline policy is given and, more importantly, drastically and significantly outperforms competing algorithms both in safe policy improvement and in average performance.
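
The two ideas in the abstract (estimating the baseline from the data, then reproducing it on uncertain state-action pairs) can be illustrated for a finite state space. The following is a minimal sketch, not the authors' implementation. It assumes a count-based maximum-likelihood baseline estimate, a count threshold `n_min` to decide which pairs are "uncertain", and Q-values supplied by the caller. The function names `estimate_baseline` and `spibb_greedy_step` are hypothetical.

```python
from collections import Counter


def estimate_baseline(pairs, actions):
    """Count-based estimate of the baseline: pi_hat(a|s) = N(s, a) / N(s).

    `pairs` is a list of (state, action) tuples observed in the batch.
    Returns the estimated policy and the raw state-action counts.
    """
    sa_counts = Counter(pairs)
    s_counts = Counter(s for s, _ in pairs)
    pi_hat = {
        s: {a: sa_counts[(s, a)] / n for a in actions}
        for s, n in s_counts.items()
    }
    return pi_hat, sa_counts


def spibb_greedy_step(pi_b_s, q_s, counts_s, n_min):
    """Policy improvement in one state, in the spirit of SPIBB (assumed form).

    Actions seen fewer than `n_min` times keep their baseline probability;
    the remaining probability mass goes to the best well-observed action.
    """
    uncertain = {a for a in pi_b_s if counts_s.get(a, 0) < n_min}
    new_pi = {a: (pi_b_s[a] if a in uncertain else 0.0) for a in pi_b_s}
    trusted = [a for a in pi_b_s if a not in uncertain]
    if trusted:
        best = max(trusted, key=lambda a: q_s[a])
        new_pi[best] += sum(pi_b_s[a] for a in trusted)
    return new_pi


# Toy batch: in state "s0", a0 was taken 8 times, a1 6 times, a2 once.
actions = ["a0", "a1", "a2"]
batch = [("s0", "a0")] * 8 + [("s0", "a1")] * 6 + [("s0", "a2")]
pi_hat, sa_counts = estimate_baseline(batch, actions)
counts_s0 = {a: sa_counts[("s0", a)] for a in actions}
q_s0 = {"a0": 0.1, "a1": 0.5, "a2": 0.9}  # illustrative values only
new_pi_s0 = spibb_greedy_step(pi_hat["s0"], q_s0, counts_s0, n_min=5)
```

In this toy batch, `a2` has the highest illustrative Q-value but was observed only once, so it keeps its estimated baseline probability, and the rest of the mass moves to `a1`, the best action among those observed at least `n_min` times.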

Year	2020
DOI	10.5555/3398761.3398908
Venue	AAMAS '20: International Conference on Autonomous Agents and Multiagent Systems, Auckland, New Zealand, May 2020
DocType	Conference
ISBN	978-1-4503-7518-4
Citations	0
PageRank	0.34
References	0
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Thiago D. Simão	1	0	2.70
Thiago D. Simão	2	0	2.70
Romain Laroche	3	110	17.35
Remi Tachet des Combes	4	28	7.42

Cited by (0 rows)

References (0 rows)
