Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Authors:	Jin, Chi; Jin, Tiancheng; Luo, Haipeng; Sra, Suvrit; Yu, Tiancheng
Year of publication:	2019
Subject:	Computer Science - Machine Learning; Statistics - Machine Learning; I.2.6
Document type:	Working Paper
Description:	We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting; in fact, it achieves the same regret bound as (Rosenberg & Mansour, 2019a), which considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an $\textit{upper occupancy bound}$. Comment: Fix a bug
Database:	arXiv
External link:	http://arxiv.org/abs/1912.01192