Abstract
Safe Policy Improvement (SPI) aims at provable guarantees that a learned policy is at least approximately as good as a given baseline policy. Building on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we identify theoretical issues in their approach, provide a corrected theory, and derive a new algorithm that is provably safe on finite Markov Decision Processes (MDP). Additionally, we provide a heuristic algorithm that exhibits the best performance among many state of the art SPI algorithms on two different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms and empirically show an interesting property of two classes of SPI algorithms: while the mean performance of algorithms that incorporate the uncertainty as a penalty on the action-value is higher, actively restricting the set of policies more consistently produces good policies and is, thus, safer.
Dokumententyp: | Konferenzbeitrag (Paper) |
---|---|
Fakultät: | Mathematik, Informatik und Statistik > Mathematik |
Themengebiete: | 500 Naturwissenschaften und Mathematik > 510 Mathematik |
Ort: | Setúbal |
Sprache: | Englisch |
Dokumenten ID: | 110117 |
Datum der Veröffentlichung auf Open Access LMU: | 25. Mrz. 2024, 13:48 |
Letzte Änderungen: | 13. Aug. 2024, 12:47 |