Showing 1 - 10 of 65 for the search: '"Menard, Pierre"'
Author:
Scheid, Antoine, Boursier, Etienne, Durmus, Alain, Jordan, Michael I., Ménard, Pierre, Moulines, Eric, Valko, Michal
Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to align language models (LMs) with human preferences. This method involves collecting a large dataset of human pairwise preferences across various text generations and …
External link:
http://arxiv.org/abs/2410.17055
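The RLHF recipe summarized above typically starts by fitting a reward model to the collected pairwise preferences with a Bradley-Terry likelihood before any policy optimization. Below is a minimal sketch of that preference-fitting step only, assuming a linear reward model on toy feature vectors; it is not the paper's implementation.

# Minimal sketch of Bradley-Terry reward-model fitting on pairwise preferences,
# as commonly used in RLHF pipelines. Features and data are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
d = 8
chosen = rng.normal(size=(100, d))            # features of preferred responses
rejected = rng.normal(size=(100, d)) - 0.5    # rejected responses score lower on average

theta = np.zeros(d)                            # linear reward model r(x) = theta @ x
lr = 0.1
for _ in range(200):
    margin = chosen @ theta - rejected @ theta
    p = 1.0 / (1.0 + np.exp(-margin))          # Bradley-Terry win probability
    grad = ((p - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
    theta -= lr * grad                          # gradient step on the preference log-likelihood

print("mean preferred-response margin:", float((chosen @ theta - rejected @ theta).mean()))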
Author:
Perrault, Pierre, Belomestny, Denis, Ménard, Pierre, Moulines, Éric, Naumov, Alexey, Tiapkin, Daniil, Valko, Michal
In this paper, we introduce a novel approach for bounding the cumulant generating function (CGF) of a Dirichlet process (DP) $X \sim \text{DP}(\alpha \nu_0)$, using superadditivity. In particular, our key technical contribution is the demonstration …
External link:
http://arxiv.org/abs/2409.18621
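For context on the quantity bounded above: for a test function $f$ and $X \sim \text{DP}(\alpha \nu_0)$, the cumulant generating function is usually taken to be (a standard definition; the paper's exact functional may differ)
$$K_{X,f}(\lambda) = \log \mathbb{E}\bigl[\exp\bigl(\lambda \langle f, X \rangle\bigr)\bigr], \qquad \langle f, X \rangle = \int f \,\mathrm{d}X,$$
and a bound on $K_{X,f}$ yields concentration of $\langle f, X \rangle$ through the usual Chernoff argument.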
Author:
Tiapkin, Daniil, Belomestny, Denis, Calandriello, Daniele, Moulines, Eric, Munos, Remi, Naumov, Alexey, Perrault, Pierre, Valko, Michal, Menard, Pierre
In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior …
External link:
http://arxiv.org/abs/2310.18186
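RandQL's defining ingredient, per the entry above, is randomization inside a model-free Q-learning update rather than explicit exploration bonuses. The sketch below only illustrates that general idea on a toy chain MDP with a randomly drawn step size per update; the environment, constants, and noise distribution are assumptions, not the algorithm from the paper.

# Tabular episodic Q-learning where each update uses a random Beta-distributed step size,
# illustrating the spirit of randomized model-free exploration on a toy chain MDP.
import numpy as np

rng = np.random.default_rng(1)
S, A, H, episodes = 5, 2, 10, 500   # states, actions, horizon, training episodes
Q = np.full((H, S, A), float(H))    # optimistic initialization

def step(s, a):
    """Toy chain: action 1 moves right; staying at the last state yields reward 1."""
    if a == 1 and s < S - 1:
        return s + 1, 0.0
    return s, (1.0 if s == S - 1 else 0.0)

for _ in range(episodes):
    s = 0
    for h in range(H):
        a = int(np.argmax(Q[h, s]))
        s_next, r = step(s, a)
        target = r + (np.max(Q[h + 1, s_next]) if h + 1 < H else 0.0)
        w = rng.beta(1.0, 3.0)                               # randomized step size
        Q[h, s, a] = (1 - w) * Q[h, s, a] + w * target       # noisy Q-learning backup
        s = s_next

print("greedy return estimate from the start state:", float(Q[0, 0].max()))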
Author:
Tiapkin, Daniil, Belomestny, Denis, Calandriello, Daniele, Moulines, Eric, Naumov, Alexey, Perrault, Pierre, Valko, Michal, Menard, Pierre
Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we …
External link:
http://arxiv.org/abs/2310.17303
We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes …
External link:
http://arxiv.org/abs/2309.00656
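For reference, $\epsilon$-optimality in the zero-sum setting above is conventionally measured through exploitability (a standard definition, not specific to this paper): a strategy profile $(\mu, \nu)$ with game value $V$ is $\epsilon$-optimal when
$$\max_{\mu'} V(\mu', \nu) - \min_{\nu'} V(\mu, \nu') \le \epsilon,$$
so neither player can gain more than $\epsilon$ by deviating unilaterally to a best response.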
Author:
Kitamura, Toshinori, Kozuno, Tadashi, Tang, Yunhao, Vieillard, Nino, Valko, Michal, Yang, Wenhao, Mei, Jincheng, Ménard, Pierre, Azar, Mohammad Gheshlaghi, Munos, Rémi, Pietquin, Olivier, Geist, Matthieu, Szepesvári, Csaba, Kumagai, Wataru, Matsuo, Yutaka
Mirror descent value iteration (MDVI), an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL), has served as the basis for recent high-performing practical RL algorithms. However, despite the use of function approximation …
External link:
http://arxiv.org/abs/2305.13185
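Schematically, the KL- and entropy-regularized scheme that MDVI abstracts alternates a value backup with a mirror-descent (multiplicative-weights) policy improvement; one common tabular form, with the regularization temperatures folded into $\eta$ and $\beta$, is
$$Q_{k+1}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\bigl[V_k(s')\bigr], \qquad \pi_{k+1}(a \mid s) \propto \pi_k(a \mid s)^{\beta} \exp\bigl(\eta\, Q_{k+1}(s,a)\bigr),$$
where $\beta \in [0,1]$ interpolates between pure entropy regularization ($\beta = 0$, a softmax over $Q$) and pure KL regularization ($\beta = 1$). The exact coefficients used in the paper may differ.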
In this work, we derive sharp non-asymptotic deviation bounds for weighted sums of Dirichlet random variables. These bounds are based on a novel integral representation of the density of a weighted Dirichlet sum. This representation allows us to …
External link:
http://arxiv.org/abs/2304.03056
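Concretely, the object treated above is a weighted Dirichlet sum: for weights $w \in \mathbb{R}^n$ and $(X_1, \dots, X_n) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_n)$,
$$S_w = \sum_{i=1}^{n} w_i X_i,$$
and a non-asymptotic deviation bound controls $\mathbb{P}\bigl(|S_w - \mathbb{E}[S_w]| \ge t\bigr)$ for every deviation level $t$, rather than only in the limit. (The notation here is generic; the paper's precise statement may differ.)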
Author:
Vieyra, Mariana Vargas, Ménard, Pierre
We present a novel, alternative framework for learning generative models with goal-conditioned reinforcement learning. We define two agents, a goal-conditioned agent (GC-agent) and a supervised agent (S-agent). Given a user-input initial state, the …
External link:
http://arxiv.org/abs/2303.14811
Author:
Tiapkin, Daniil, Belomestny, Denis, Calandriello, Daniele, Moulines, Eric, Munos, Remi, Naumov, Alexey, Perrault, Pierre, Tang, Yunhao, Valko, Michal, Menard, Pierre
We address the challenge of exploration in reinforcement learning (RL) when the agent operates in an unknown environment with sparse or no rewards. In this work, we study the maximum entropy exploration problem of two different types. The first type …
External link:
http://arxiv.org/abs/2303.08059
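One standard formalization of maximum entropy exploration (given here for orientation; the two problem types studied in the paper are not spelled out in this snippet) maximizes the entropy of the state-visitation distribution $d^{\pi}$ induced by the policy:
$$\max_{\pi}\; H(d^{\pi}), \qquad H(d^{\pi}) = -\sum_{s} d^{\pi}(s) \log d^{\pi}(s),$$
which rewards policies that spread their visits as uniformly as possible over the state space even when no external reward is available.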
Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $\epsilon$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent …
External link:
http://arxiv.org/abs/2212.12567