Adaptive Reward-Free Exploration.

Authors
  • Emilie Kaufmann
  • Pierre Ménard
  • Omar Darwiche Domingues
  • Anders Jonsson
  • Edouard Leurent
  • Michal Valko
Publication date
2020
Publication type
Other
Summary
Reward-free exploration is a reinforcement learning setting recently studied by Jin et al., who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more adaptive approach to reward-free exploration that directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994 [11], originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs O((SAH^4/ε^2) log(1/δ)) episodes to output, with probability 1 − δ, an ε-approximation of the optimal policy for any reward function. We empirically compare it to oracle strategies that use a generative model.
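The summary describes the algorithm only at a high level. The sketch below illustrates the general idea in a tabular, finite-horizon MDP: explore by acting greedily with respect to an upper bound on the model estimation error, and stop when that bound at the initial state is small. The environment interface (`env.reset`, `env.step`), the confidence bonus `beta`, and the stopping threshold are illustrative assumptions, not the paper's exact quantities.

```python
import numpy as np

def reward_free_ucrl_sketch(env, S, A, H, epsilon, delta, max_episodes=10_000):
    """Reward-free exploration sketch in the spirit of RF-UCRL.

    Assumed (hypothetical) interface: env.reset() returns an initial state
    index, env.step(a) returns the next state index; the MDP is tabular
    with S states, A actions and horizon H. The bonus and stopping rule
    are simplified stand-ins for the quantities used in the paper.
    """
    counts = np.zeros((S, A), dtype=int)      # visit counts n(s, a)
    trans = np.zeros((S, A, S), dtype=int)    # transition counts n(s, a, s')
    p_hat = np.full((S, A, S), 1.0 / S)       # empirical transition model

    for episode in range(max_episodes):
        # Empirical transition model (uniform where no data has been seen).
        n = np.maximum(counts, 1)[:, :, None]
        p_hat = np.where(counts[:, :, None] > 0, trans / n, 1.0 / S)

        # Upper bound E[h, s, a] on the estimation error, by backward
        # induction; beta is a simplified confidence bonus.
        beta = np.sqrt(np.log(2 * S * A * H / delta) / np.maximum(counts, 1))
        E = np.zeros((H + 2, S, A))
        for h in range(H, 0, -1):
            next_val = E[h + 1].max(axis=-1)          # max_a' E[h+1, s', a']
            E[h] = np.minimum(H, H * beta + p_hat @ next_val)

        s = env.reset()
        # Adaptive stopping: quit once the error bound at the initial state
        # is below a fraction of the target accuracy.
        if E[1][s].max() <= epsilon / 2:
            return p_hat, counts, episode

        # Explore by acting greedily with respect to the error upper bound.
        for h in range(1, H + 1):
            a = int(E[h][s].argmax())
            s_next = env.step(a)
            counts[s, a] += 1
            trans[s, a, s_next] += 1
            s = s_next

    return p_hat, counts, max_episodes
```

After exploration stops, an ε-approximation of the optimal policy for any given reward function can then be obtained by planning (e.g. backward induction) in the estimated model `p_hat` with that reward, in line with the guarantee stated in the summary.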