Showing 1 - 10 of 235
for search: "LINDNER, DAVID"
Author:
Fronsdal, Kai, Lindner, David
We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risk…
External link:
http://arxiv.org/abs/2412.03904
Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final…
External link:
http://arxiv.org/abs/2411.13211
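As a rough sketch of the VLM-as-reward-model idea in these entries (not the papers' exact method), the snippet below scores an environment observation against a natural-language task description with CLIP and treats the similarity as a reward. The model checkpoint, prompt, and RL-loop hookup are illustrative assumptions.

```python
# Minimal sketch: a pretrained VLM (CLIP) used as a reward model.
# The task description and the way the reward is consumed by the RL loop
# are assumptions for illustration only.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def vlm_reward(image, task_description: str) -> float:
    """Cosine similarity between the task text and the current observation."""
    inputs = processor(text=[task_description], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (text_emb * img_emb).sum().item()

# Inside an RL loop this score would stand in for a hand-specified reward, e.g.:
# reward = vlm_reward(env.render(), "a robot standing upright")
```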
Reinforcement Learning from Human Feedback (RLHF) has become a powerful tool to fine-tune or train agentic machine learning models. Similar to how humans interact in social contexts, we can use many types of feedback to communicate our preferences…
External link:
http://arxiv.org/abs/2411.11761
Author:
Balesni, Mikita, Hobbhahn, Marius, Lindner, David, Meinke, Alexander, Korbak, Tomek, Clymer, Joshua, Shlegeris, Buck, Scheurer, Jérémy, Stix, Charlotte, Shah, Rusheb, Goldowsky-Dill, Nicholas, Braun, Dan, Chughtai, Bilal, Evans, Owain, Kokotajlo, Daniel, Bushnaq, Lucius
We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model where AI systems could pursue…
External link:
http://arxiv.org/abs/2411.03336
Author:
Kenton, Zachary, Siegel, Noah Y., Kramár, János, Brown-Cohen, Jonah, Albanie, Samuel, Bulian, Jannis, Agarwal, Rishabh, Lindner, David, Tang, Yunhao, Goodman, Noah D., Shah, Rohin
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare…
External link:
http://arxiv.org/abs/2407.04622
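A minimal sketch of a debate-style protocol of the kind this paper studies: two debaters argue for opposing answers over a few rounds and a judge picks one. The `ask_model` helper is a hypothetical stand-in for an LLM API call; the paper's actual protocols, prompts, and judging setup are more involved.

```python
# Sketch of a two-debater, one-judge debate protocol (illustrative only).
def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: plug in any LLM API call here.
    raise NotImplementedError

def debate(question: str, answer_a: str, answer_b: str, rounds: int = 2) -> str:
    transcript = f"Question: {question}\n"
    for r in range(rounds):
        arg_a = ask_model(f"{transcript}\nArgue that the answer is {answer_a!r}.")
        arg_b = ask_model(f"{transcript}\nArgue that the answer is {answer_b!r}.")
        transcript += f"\nRound {r + 1}\nDebater A: {arg_a}\nDebater B: {arg_b}\n"
    verdict = ask_model(
        f"{transcript}\nAs the judge, answer 'A' or 'B': which debater argued "
        f"for the correct answer?")
    return answer_a if verdict.strip().upper().startswith("A") else answer_b
```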
Author:
Phuong, Mary, Aitchison, Matthew, Catt, Elliot, Cogan, Sarah, Kaskasoli, Alexandre, Krakovna, Victoria, Lindner, David, Rahtz, Matthew, Assael, Yannis, Hodkinson, Sarah, Howard, Heidi, Lieberum, Tom, Kumar, Ramana, Raad, Maria Abi, Webson, Albert, Ho, Lewis, Lin, Sharon, Farquhar, Sebastian, Hutter, Marcus, Deletang, Gregoire, Ruoss, Anian, El-Sayed, Seliem, Brown, Sasha, Dragan, Anca, Shah, Rohin, Dafoe, Allan, Shevlane, Toby
To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four…
External link:
http://arxiv.org/abs/2403.13793
Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative…
External link:
http://arxiv.org/abs/2310.12921
Published in:
ICML 2023 Interactive Learning from Implicit Human Feedback Workshop
To use reinforcement learning from human feedback (RLHF) in practical applications, it is crucial to learn reward models from diverse sources of human feedback and to consider human factors involved in providing feedback of different types. However…
External link:
http://arxiv.org/abs/2308.04332
Author:
Casper, Stephen, Davies, Xander, Shi, Claudia, Gilbert, Thomas Krendl, Scheurer, Jérémy, Rando, Javier, Freedman, Rachel, Korbak, Tomasz, Lindner, David, Freire, Pedro, Wang, Tony, Marks, Samuel, Segerie, Charbel-Raphaël, Carroll, Micah, Peng, Andi, Christoffersen, Phillip, Damani, Mehul, Slocum, Stewart, Anwar, Usman, Siththaranjan, Anand, Nadeau, Max, Michaud, Eric J., Pfau, Jacob, Krasheninnikov, Dmitrii, Chen, Xin, Langosco, Lauro, Hase, Peter, Bıyık, Erdem, Dragan, Anca, Krueger, David, Sadigh, Dorsa, Hadfield-Menell, Dylan
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there…
External link:
http://arxiv.org/abs/2307.15217
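For context, the reward-modeling step at the core of the RLHF pipeline discussed in these entries is typically fit with a Bradley-Terry preference loss: the reward model is trained so that the preferred response in each human comparison scores higher than the rejected one. A minimal sketch, assuming precomputed scalar rewards for each pair:

```python
# Minimal sketch of the Bradley-Terry preference loss used to fit an RLHF
# reward model. Tensor shapes and example values are assumptions.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """reward_chosen, reward_rejected: shape (batch,) scalar reward-model outputs."""
    # Negative log-probability that the chosen response beats the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example: a batch of three comparisons.
loss = preference_loss(torch.tensor([1.2, 0.3, 2.0]),
                       torch.tensor([0.4, 0.5, 1.0]))
```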
We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions…
External link:
http://arxiv.org/abs/2305.16147
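A minimal sketch of the convex-hull intuition behind CoCoRL (not the full algorithm): under linear constraints on expected feature counts, any policy whose feature expectations are a convex combination of the safe demonstrations' feature expectations satisfies the same constraints. The feature vectors below are made up for illustration.

```python
# Sketch: check whether a candidate policy's feature expectations lie in the
# convex hull of safe demonstrations' feature expectations via a small LP.
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point: np.ndarray, demos: np.ndarray) -> bool:
    """True if `point` is a convex combination of the rows of `demos`."""
    n = demos.shape[0]
    # Find weights w >= 0 with demos.T @ w = point and sum(w) = 1.
    A_eq = np.vstack([demos.T, np.ones((1, n))])
    b_eq = np.append(point, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success

# Hypothetical 2-dimensional feature expectations.
safe_demo_features = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])
candidate_policy_features = np.array([0.4, 0.6])
print(in_convex_hull(candidate_policy_features, safe_demo_features))  # True
```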