Výsledky vyhledávání - "LINDNER, DAVID"

Report

MISR: Measuring Instrumental Self-Reasoning in Frontier Models

We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risk

Externí odkaz: http://arxiv.org/abs/2412.03904

Zobrazit plný text záznamu

Report

ViSTa Dataset: Do vision-language models understand sequential tasks?

Autor: Wybitul, Evžen, Gunter, Evan Ryan, Seleznyov, Mikhail, Lindner, David

Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final

Externí odkaz: http://arxiv.org/abs/2411.13211

Zobrazit plný text záznamu

Report

Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework

Autor: Metz, Yannick, Lindner, David, Baur, Raphaël, El-Assady, Mennatallah

Reinforcement Learning from Human feedback (RLHF) has become a powerful tool to fine-tune or train agentic machine learning models. Similar to how humans interact in social contexts, we can use many types of feedback to communicate our preferences, i

Externí odkaz: http://arxiv.org/abs/2411.11761

Zobrazit plný text záznamu

Report

Towards evaluations-based safety cases for AI scheming

Autor: Balesni, Mikita, Hobbhahn, Marius, Lindner, David, Meinke, Alexander, Korbak, Tomek, Clymer, Joshua, Shlegeris, Buck, Scheurer, Jérémy, Stix, Charlotte, Shah, Rusheb, Goldowsky-Dill, Nicholas, Braun, Dan, Chughtai, Bilal, Evans, Owain, Kokotajlo, Daniel, Bushnaq, Lucius

We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model where AI systems could pu

Externí odkaz: http://arxiv.org/abs/2411.03336

Zobrazit plný text záznamu

Report

On scalable oversight with weak LLMs judging strong LLMs

Autor: Kenton, Zachary, Siegel, Noah Y., Kramár, János, Brown-Cohen, Jonah, Albanie, Samuel, Bulian, Jannis, Agarwal, Rishabh, Lindner, David, Tang, Yunhao, Goodman, Noah D., Shah, Rohin

Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AI's compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and comp

Externí odkaz: http://arxiv.org/abs/2407.04622

Zobrazit plný text záznamu

Report

Evaluating Frontier Models for Dangerous Capabilities

To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four

Externí odkaz: http://arxiv.org/abs/2403.13793

Zobrazit plný text záznamu

Report

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Autor: Rocamonde, Juan, Montesinos, Victoriano, Nava, Elvis, Perez, Ethan, Lindner, David

Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternativ

Externí odkaz: http://arxiv.org/abs/2310.12921

Zobrazit plný text záznamu

Report

RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback

Autor: Metz, Yannick, Lindner, David, Baur, Raphaël, Keim, Daniel, El-Assady, Mennatallah

Publikováno v: ICML2023 Interactive Learning from Implicit Human Feedback Workshop

To use reinforcement learning from human feedback (RLHF) in practical applications, it is crucial to learn reward models from diverse sources of human feedback and to consider human factors involved in providing feedback of different types. However,

Externí odkaz: http://arxiv.org/abs/2308.04332

Zobrazit plný text záznamu

Report

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there

Externí odkaz: http://arxiv.org/abs/2307.15217

Zobrazit plný text záznamu

Report

Learning Safety Constraints from Demonstrations with Unknown Rewards

Autor: Lindner, David, Chen, Xin, Tschiatschek, Sebastian, Hofmann, Katja, Krause, Andreas

We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. W

Externí odkaz: http://arxiv.org/abs/2305.16147

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání