Model Selection for Average Reward RL with Application to Utility Maximization in Repeated Games

Authors:	Masoumian, Alireza; Wright, James R.
Publication year:	2024
Subject:	Computer Science - Machine Learning; Computer Science - Computer Science and Game Theory; Statistics - Machine Learning
Document type:	Working Paper
Description:	In standard RL, a learner attempts to learn an optimal policy for a Markov Decision Process (MDP) whose structure (e.g. state space) is known. In online model selection, a learner attempts to learn an optimal policy for an MDP knowing only that it belongs to one of $M > 1$ model classes of varying complexity. Recent results have shown that this can be feasibly accomplished in episodic online RL. In this work, we propose $\mathsf{MRBEAR}$, an online model selection algorithm for the average reward RL setting. The regret of the algorithm is in $\tilde O(M C_{m^*}^2 \mathsf{B}_{m^*}(T,\delta))$, where $C_{m^*}$ represents the complexity of the simplest well-specified model class and $\mathsf{B}_{m^*}(T,\delta)$ is its corresponding regret bound. This result shows that in average reward RL, as in episodic online RL, the additional cost of model selection scales only linearly in $M$, the number of model classes. We apply $\mathsf{MRBEAR}$ to the interaction between a learner and an opponent in a two-player simultaneous general-sum repeated game, where the opponent follows a fixed unknown limited-memory strategy. The learner's goal is to maximize its utility without knowing the opponent's utility function. The interaction is over $T$ rounds with no episodes or discounting, which leads us to measure the learner's performance by average reward regret. In this application, our algorithm enjoys an opponent-complexity-dependent regret in $\tilde O(M(\mathsf{sp}(h^*) B^{m^*} A^{m^*+1})^{\frac{3}{2}} \sqrt{T})$, where $m^* \le M$ is the unknown memory limit of the opponent, $\mathsf{sp}(h^*)$ is the unknown span of the optimal bias induced by the opponent, and $A$ and $B$ are the number of actions for the learner and opponent respectively. We also show that the exponential dependency on $m^*$ is inevitable by proving a lower bound on the learner's regret.
Database:	arXiv
External link:	http://arxiv.org/abs/2411.06069
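
The factor $B^{m^*} A^{m^*+1}$ in the repeated-game bound from the description can be read as a count of state-action pairs. The sketch below is illustrative only and rests on an assumption not stated in the record: that the MDP induced by an opponent with memory $m$ uses the last $m$ joint actions as its state, giving $(AB)^m$ states and $A$ learner actions per state. The function names are hypothetical and are not taken from the paper.

```python
from itertools import product


def enumerate_states(num_learner_actions, num_opponent_actions, memory):
    """List every history of the last `memory` joint actions.

    Assumption for illustration: each such history is one state of the MDP
    induced by an opponent with memory limit `memory`.
    """
    joint_actions = list(product(range(num_learner_actions),
                                 range(num_opponent_actions)))
    return list(product(joint_actions, repeat=memory))


def induced_state_count(num_learner_actions, num_opponent_actions, memory):
    """Closed form for the number of histories: (A * B) ** memory."""
    return (num_learner_actions * num_opponent_actions) ** memory


def complexity_term(num_learner_actions, num_opponent_actions, memory):
    """The factor B^m * A^(m+1) that appears inside the regret bound."""
    return (num_opponent_actions ** memory
            * num_learner_actions ** (memory + 1))


if __name__ == "__main__":
    A, B = 2, 3  # hypothetical action counts for learner and opponent
    for m in range(1, 5):
        states = induced_state_count(A, B, m)
        print(f"m={m}: states={states}, B^m * A^(m+1)={complexity_term(A, B, m)}")
```

Under this assumption the factor equals the number of states times the number of learner actions, so it grows exponentially in the memory limit, which matches the exponential dependency on $m^*$ that the description says is unavoidable.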