Showing 1 - 10 of 108 for search: '"Pires, Bernardo A."'
Author:
Khetarpal, Khimya, Guo, Zhaohan Daniel, Pires, Bernardo Avila, Tang, Yunhao, Lyle, Clare, Rowland, Mark, Heess, Nicolas, Borsa, Diana, Guez, Arthur, Dabney, Will
Learning a good representation is a crucial challenge for Reinforcement Learning (RL) agents. Self-predictive learning provides a means to jointly learn a latent representation and dynamics model by bootstrapping from future latent representations (BYO…
External link:
http://arxiv.org/abs/2406.02035
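Self-predictive objectives of the kind described in the entry above are commonly implemented by having an online encoder and a latent transition model predict the next-step latent produced by a slowly updated target copy of the encoder. The snippet below is a minimal, hypothetical sketch of that style of objective; the network shapes, names, cosine loss, and EMA update are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a self-predictive (BYOL-style) latent objective, illustrative only.
# An online encoder plus a latent transition model predict the TARGET encoder's
# embedding of the next observation; the target is an EMA copy of the encoder.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, latent_dim = 16, 4, 32

encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
transition = nn.Sequential(nn.Linear(latent_dim + act_dim, 64), nn.ReLU(),
                           nn.Linear(64, latent_dim))
target_encoder = copy.deepcopy(encoder)  # bootstrap target, no gradient updates
for p in target_encoder.parameters():
    p.requires_grad_(False)

def self_predictive_loss(obs, action, next_obs):
    z = encoder(obs)
    z_pred = transition(torch.cat([z, action], dim=-1))
    with torch.no_grad():
        z_target = target_encoder(next_obs)
    # Negative cosine similarity between predicted and bootstrapped next latents.
    return -F.cosine_similarity(z_pred, z_target, dim=-1).mean()

@torch.no_grad()
def update_target(tau=0.01):
    # Exponential moving average of the online encoder into the target encoder.
    for p, p_t in zip(encoder.parameters(), target_encoder.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)
```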
Author:
Richemond, Pierre Harvey, Tang, Yunhao, Guo, Daniel, Calandriello, Daniele, Azar, Mohammad Gheshlaghi, Rafailov, Rafael, Pires, Bernardo Avila, Tarassov, Eugene, Spangher, Lucas, Ellsworth, Will, Severyn, Aliaksei, Mallinson, Jonathan, Shani, Lior, Shamir, Gil, Joshi, Rishabh, Liu, Tianqi, Munos, Remi, Piot, Bilal
The dominant framework for alignment of large language models (LLMs), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is…
External link:
http://arxiv.org/abs/2405.19107
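The snippet above is cut off just as it begins to describe what an element of a preference dataset contains. As a generic point of reference only (this is a hypothetical schema, not the paper's), such an element typically bundles a prompt with a preferred and a dispreferred response:

```python
# Generic, illustrative structure of one preference-data element
# (hypothetical field names; not the schema used in the paper above).
from dataclasses import dataclass

@dataclass
class PreferenceExample:
    prompt: str     # the input the model responded to
    chosen: str     # response judged better by the annotator or reward model
    rejected: str   # response judged worse for the same prompt

example = PreferenceExample(
    prompt="Explain gradient descent in one sentence.",
    chosen="Gradient descent iteratively updates parameters against the loss gradient.",
    rejected="Gradient descent is when the computer guesses randomly until it works.",
)
```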
Author:
Tang, Yunhao, Guo, Daniel Zhaohan, Zheng, Zeyu, Calandriello, Daniele, Cao, Yuan, Tarassov, Eugene, Munos, Rémi, Pires, Bernardo Ávila, Valko, Michal, Cheng, Yong, Dabney, Will
Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF. Within the context of rewar…
External link:
http://arxiv.org/abs/2405.08448
Author:
Calandriello, Daniele, Guo, Daniel, Munos, Remi, Rowland, Mark, Tang, Yunhao, Pires, Bernardo Avila, Richemond, Pierre Harvey, Lan, Charline Le, Valko, Michal, Liu, Tianqi, Joshi, Rishabh, Zheng, Zeyu, Piot, Bilal
Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently, and several methods such as Reinforcement Learnin…
External link:
http://arxiv.org/abs/2403.08635
We introduce off-policy distributional Q($\lambda$), a new addition to the family of off-policy distributional evaluation algorithms. Off-policy distributional Q($\lambda$) does not apply importance sampling for off-policy learning, which introduces…
External link:
http://arxiv.org/abs/2402.05766
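For orientation on the entry above: the classical, expected-value Q($\lambda$) evaluation operator already avoids importance-sampling ratios by weighting temporal-difference errors along behaviour trajectories with a plain $(\gamma\lambda)^t$ trace; convergence of this uncorrected operator is typically only guaranteed when $\lambda$ is small relative to the mismatch between the target policy $\pi$ and the behaviour policy $\mu$. A hedged sketch of that standard (non-distributional) operator, which the paper extends to return distributions, is

$$ \mathcal{T}_\lambda Q(x,a) \;=\; Q(x,a) \;+\; \mathbb{E}_\mu\!\left[\sum_{t \ge 0} (\gamma\lambda)^t \Big(r_t + \gamma\,\mathbb{E}_{b \sim \pi}\, Q(x_{t+1}, b) - Q(x_t, a_t)\Big)\right], $$

with $(x_0, a_0) = (x, a)$ and subsequent actions drawn from $\mu$; the distributional version proposed in the paper itself is not reproduced here.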
Author:
Tang, Yunhao, Guo, Zhaohan Daniel, Zheng, Zeyu, Calandriello, Daniele, Munos, Rémi, Rowland, Mark, Richemond, Pierre Harvey, Valko, Michal, Pires, Bernardo Ávila, Piot, Bilal
Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a ge…
External link:
http://arxiv.org/abs/2402.05749
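The GPO entry above describes a family of offline preference losses indexed by a choice of loss-shaping function. A minimal, hypothetical sketch of such a loss is given below, written as a convex function applied to the scaled difference of policy-vs-reference log-ratios between the chosen and rejected responses; the exact parameterization in the paper may differ, and the names and default logistic choice are assumptions.

```python
# Hypothetical sketch of a GPO-style offline preference loss: a convex function f
# applied to the scaled log-ratio margin between chosen and rejected responses.
import torch
import torch.nn.functional as F

def gpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, f=lambda t: F.softplus(-t)):
    """logp_* : sequence log-probabilities under the policy being trained;
    ref_logp_* : the same quantities under a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return f(beta * margin).mean()

# Different convex choices of f recover familiar offline losses, e.g.:
logistic = lambda t: F.softplus(-t)   # log(1 + exp(-t)), DPO-style
hinge    = lambda t: F.relu(1.0 - t)  # SLiC-style hinge
```

With the logistic choice the sketch reduces to a DPO-style objective; swapping f changes how strongly the loss saturates once a pair is confidently ranked.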
Author:
Chen, Zhaoxi, Moon, Gyeongsik, Guo, Kaiwen, Cao, Chen, Pidhorskyi, Stanislav, Simon, Tomas, Joshi, Rohan, Dong, Yuan, Xu, Yichen, Pires, Bernardo, Wen, He, Evans, Lucas, Peng, Bo, Buffalini, Julia, Trimble, Autumn, McPhail, Kevyn, Schoeller, Melissa, Yu, Shoou-I, Romero, Javier, Zollhöfer, Michael, Sheikh, Yaser, Liu, Ziwei, Saito, Shunsuke
Existing photorealistic relightable hand models require extensive identity-specific observations in different views, poses, and illuminations, and face challenges in generalizing to natural illuminations and novel identities. To bridge this gap, we p…
External link:
http://arxiv.org/abs/2401.05334
Author:
Tang, Yunhao, Kozuno, Tadashi, Rowland, Mark, Harutyunyan, Anna, Munos, Rémi, Pires, Bernardo Ávila, Valko, Michal
Multi-step learning applies lookahead over multiple time steps and has proved valuable in policy evaluation settings. However, in the optimal control case, the impact of multi-step learning has been relatively limited despite a number of prior effort…
External link:
http://arxiv.org/abs/2305.18501
Author:
Lyle, Clare, Zheng, Zeyu, Nikishin, Evgenii, Pires, Bernardo Avila, Pascanu, Razvan, Dabney, Will
Plasticity, the ability of a neural network to quickly change its predictions in response to new information, is essential for the adaptability and robustness of deep reinforcement learning systems. Deep neural networks are known to lose plasticity o…
External link:
http://arxiv.org/abs/2303.01486
Author:
Pires, Bernardo Avila, Behbahani, Feryal, Soyer, Hubert, Nikiforou, Kyriacos, Keck, Thomas, Singh, Satinder
Hierarchical Reinforcement Learning (HRL) agents have the potential to demonstrate appealing capabilities such as planning and exploration with abstraction, transfer, and skill reuse. Recent successes with HRL across different domains provide evidenc…
External link:
http://arxiv.org/abs/2302.14451