Showing 1 - 10 of 18,582 for search: '"Geist A"'
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimisation (DPO) and Identity Preference Optimisation (IPO), have emerged as alternatives to online Reinforcement Learning from Human Feedback (RLHF) algorithms such as Proximal Policy Optimisation (PPO) …
External link:
http://arxiv.org/abs/2410.11677
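For context, the DPO objective this snippet refers to is a logistic loss on reference-normalised log-probability differences; a minimal sketch, assuming per-sequence log-probabilities are already summed over tokens (tensor names and the value of beta are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimisation loss on per-sequence log-probabilities.

    Each argument is a 1-D tensor of summed token log-probs; `beta` scales the
    implicit KL regularisation toward the frozen reference model.
    """
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # The preferred completion should out-score the rejected one.
    return -F.logsigmoid(chosen - rejected).mean()
```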
Designing control policies whose performance level is guaranteed to remain above a given threshold in a span of environments is a critical feature for the adoption of reinforcement learning (RL) in real-world applications. The search for such robust policies …
External link:
http://arxiv.org/abs/2410.06212
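In standard notation (not taken from this abstract), the kind of guarantee described here is usually expressed as a max-min problem over an uncertainty set of environments, with $\mathcal{P}$, $\delta$, and $J$ as illustrative symbols:

```latex
% Robust objective: maximise the worst-case return over an uncertainty set
% \mathcal{P} of environment models, keeping it above a threshold \delta.
\[
\max_{\pi} \; \min_{p \in \mathcal{P}} J(\pi, p)
\quad \text{with} \quad
J(\pi, p) \ge \delta \;\; \forall p \in \mathcal{P},
\qquad
J(\pi, p) = \mathbb{E}_{\pi, p}\!\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r(s_t, a_t)\Big].
\]
```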
Authors:
Wulfmeier, Markus, Bloesch, Michael, Vieillard, Nino, Ahuja, Arun, Bornschein, Jorg, Huang, Sandy, Sokolov, Artem, Barnes, Matt, Desjardins, Guillaume, Bewley, Alex, Bechtle, Sarah Maria Elisabeth, Springenberg, Jost Tobias, Momchev, Nikola, Bachem, Olivier, Geist, Matthieu, Riedmiller, Martin
The majority of language model training builds on imitation learning. It covers pretraining, supervised fine-tuning, and affects the starting conditions for reinforcement learning from human feedback (RLHF). The simplicity and scalability of maximum likelihood …
External link:
http://arxiv.org/abs/2409.01369
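The maximum-likelihood imitation objective referred to here reduces to next-token cross-entropy; a minimal sketch, with tensor shapes chosen purely for illustration:

```python
import torch.nn.functional as F

def next_token_mle_loss(logits, tokens):
    """Maximum-likelihood (imitation) objective for language-model training.

    logits: (batch, seq_len, vocab) model outputs.
    tokens: (batch, seq_len) integer token ids from the training corpus.
    Position t is trained to predict token t+1, i.e. the model imitates the
    data distribution one token at a time.
    """
    pred = logits[:, :-1].reshape(-1, logits.size(-1))  # drop the last position
    target = tokens[:, 1:].reshape(-1)                  # shift targets by one
    return F.cross_entropy(pred, target)
```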
The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. However, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) …
External link:
http://arxiv.org/abs/2407.06121
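The model dependence mentioned in the snippet comes from the Bayes-filter update behind the belief-state MDP; a textbook sketch, assuming known tabular transition and observation models (array names are illustrative):

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step of the belief-state MDP construction.

    belief: (S,) probability distribution over hidden states.
    T[a]:   (S, S) transition matrix, T[a][s, s2] = P(s2 | s, a).
    O[a]:   (S, num_obs) observation matrix, O[a][s2, o] = P(o | s2, a).
    Both models must be known, which is what makes this construction
    problematic in a model-free RL setting.
    """
    predicted = T[action].T @ belief                    # predict step
    posterior = predicted * O[action][:, observation]   # weight by observation likelihood
    return posterior / posterior.sum()                  # normalise
```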
The manipulation of deformable linear objects (DLOs) via model-based control requires an accurate and computationally efficient dynamics model. Yet, data-driven DLO dynamics models require large training data sets while their predictions often do not …
External link:
http://arxiv.org/abs/2407.03476
Authors:
Grinsztajn, Nathan, Flet-Berliac, Yannis, Azar, Mohammad Gheshlaghi, Strub, Florian, Wu, Bill, Choi, Eugene, Cremer, Chris, Ahmadian, Arash, Chandak, Yash, Pietquin, Olivier, Geist, Matthieu
To better align Large Language Models (LLMs) with human judgment, Reinforcement Learning from Human Feedback (RLHF) learns a reward model and then optimizes it using regularized RL. Recently, direct alignment methods were introduced to learn such a fine-tuned model …
External link:
http://arxiv.org/abs/2406.19188
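The regularized RL step mentioned here is commonly written as reward maximisation with a KL penalty toward the reference policy; generic notation, not taken from this abstract ($r_\phi$ is the learned reward model, $\beta$ the regularisation weight):

```latex
% KL-regularised RLHF objective: maximise the learned reward while staying
% close to the reference policy \pi_{\mathrm{ref}}.
\[
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big].
\]
```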
Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion
Authors:
Flet-Berliac, Yannis, Grinsztajn, Nathan, Strub, Florian, Choi, Eugene, Cremer, Chris, Ahmadian, Arash, Chandak, Yash, Azar, Mohammad Gheshlaghi, Pietquin, Olivier, Geist, Matthieu
Reinforcement Learning (RL) has been used to finetune Large Language Models (LLMs) using a reward model trained from preference data, to better align with human judgment. The recently introduced direct alignment methods, which are often simpler, more …
External link:
http://arxiv.org/abs/2406.19185
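As a rough illustration of optimising sequence-level scores with a policy gradient (a generic REINFORCE-style surrogate with a batch-mean baseline, not the exact estimator proposed in the paper):

```python
def sequence_policy_gradient_loss(seq_logps, seq_scores):
    """Generic sequence-level policy-gradient surrogate.

    seq_logps:  (batch,) log-probabilities log pi_theta(y | x), summed over tokens.
    seq_scores: (batch,) scalar sequence-level scores, e.g. from a reward model.
    Sequences scoring above the batch mean get their log-probability increased,
    the rest decreased.
    """
    advantages = (seq_scores - seq_scores.mean()).detach()
    return -(advantages * seq_logps).mean()
```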
Authors:
Gallagher, J. S., Kotulla, R., Laufman, L., Geist, E., Aalto, S., Falstad, N., König, S., Krause, J., Privon, G., Wethers, C., Evans, A. S., Gorski, M.
Zw 049.057 is a moderate-mass, dusty, early-type galaxy that hosts a powerful compact obscured nucleus (CON, $L_{\mathrm{FIR,CON}} \geq 10^{11}\,L_{\odot}$). The resolution of HST enabled measurements of the stellar light distribution and characterization of …
External link:
http://arxiv.org/abs/2406.12126
Robust reinforcement learning is the problem of learning control policies that provide optimal worst-case performance against a span of adversarial environments. It is a crucial ingredient for deploying algorithms in real-world scenarios with prevalent …
External link:
http://arxiv.org/abs/2406.08406
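The worst-case performance this snippet optimises can be estimated by evaluating a policy on every environment in the span and taking the minimum; a sketch assuming a Gymnasium-style environment interface, with function and variable names chosen for illustration:

```python
import numpy as np

def worst_case_return(policy, environments, episodes=10):
    """Estimate the worst-case mean return of `policy` over a set of environments.

    `environments` is an iterable of Gymnasium-style envs; robust RL aims to
    maximise exactly this worst-case quantity rather than the average one.
    """
    def mean_return(env):
        totals = []
        for _ in range(episodes):
            obs, _ = env.reset()
            done, total = False, 0.0
            while not done:
                obs, reward, terminated, truncated, _ = env.step(policy(obs))
                total += reward
                done = terminated or truncated
            totals.append(total)
        return np.mean(totals)

    return min(mean_return(env) for env in environments)
```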
Robust reinforcement learning is essential for deploying reinforcement learning algorithms in real-world scenarios where environmental uncertainty predominates. Traditional robust reinforcement learning often depends on rectangularity assumptions, which …
External link:
http://arxiv.org/abs/2406.08395
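The rectangularity assumption mentioned above is usually the $(s,a)$-rectangular factorisation of the uncertainty set; in standard notation, not taken from this abstract:

```latex
% (s,a)-rectangularity: the adversary can choose the worst transition kernel
% independently at each state-action pair.
\[
\mathcal{P} \;=\; \bigotimes_{(s,a) \in \mathcal{S} \times \mathcal{A}} \mathcal{P}_{s,a},
\qquad
\mathcal{P}_{s,a} \subseteq \Delta(\mathcal{S}).
\]
```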