Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited.

Authors
Publication date
2021
Publication type
Proceedings Article
Summary In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a lower bound of Ω((H 3 SA/ε 2) log(1/δ)) on the sample complexity of an (ε, δ)-PAC algorithm for best policy identification in a non-stationary MDP, relying on a construction of "hard MDPs" which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the Ω(√ H 3 SAT) regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.
Topics of the publication
Themes detected by scanR from retrieved publications. For more information, see https://scanr.enseignementsup-recherche.gouv.fr