Showing 1 - 10 of 3,376 for the search: '"RUSSELL, STUART"'
Pretrained language models (LMs) can generalize to implications of facts that they are finetuned on. For example, if finetuned on "John Doe lives in Tokyo," LMs can correctly answer "What language do the people in John Doe's city speak?" with "Japanese." …
External link:
http://arxiv.org/abs/2412.04614
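A minimal sketch of the setup this abstract describes, with hypothetical data (not the paper's pipeline): a finetuning fact and an "implication" probe that is never seen verbatim during finetuning.

```python
# Hypothetical finetuning example: a single fact about an entity.
finetune_example = {
    "prompt": "Where does John Doe live?",
    "completion": "John Doe lives in Tokyo.",
}

# Held-out evaluation probe: answering it requires composing the finetuned
# fact (John Doe -> Tokyo) with background knowledge (Tokyo -> Japanese).
implication_probe = {
    "prompt": "What language do the people in John Doe's city speak?",
    "expected": "Japanese",
}

def is_correct(model_answer: str) -> bool:
    """Loose string match, used only for this illustration."""
    return implication_probe["expected"].lower() in model_answer.lower()

print(is_correct("They speak Japanese."))
```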
A wide variety of goals could cause an AI to disable its off switch because "you can't fetch the coffee if you're dead" (Russell 2019). Prior theoretical work on this shutdown problem assumes that humans know everything that AIs do. In practice, however, …
External link:
http://arxiv.org/abs/2411.17749
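For context, a toy numerical sketch of the classic off-switch incentive (in the spirit of Hadfield-Menell et al.'s off-switch game, with illustrative numbers, not this paper's partial-observability setting): a robot uncertain about the human's utility does better by leaving the off switch usable.

```python
import numpy as np

# The robot is uncertain about the human's utility u for its proposed action.
rng = np.random.default_rng(0)
u_samples = rng.normal(loc=0.2, scale=1.0, size=100_000)  # robot's belief over u

ev_act = u_samples.mean()                      # act now, ignoring the off switch
ev_off = 0.0                                   # switch itself off immediately
ev_defer = np.maximum(u_samples, 0.0).mean()   # defer: human vetoes when u < 0

print(f"act: {ev_act:.3f}  off: {ev_off:.3f}  defer: {ev_defer:.3f}")
# Under uncertainty, deferring dominates acting or shutting down, which is the
# incentive structure the shutdown problem revolves around.
```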
Author:
Bengio, Yoshua, Mindermann, Sören, Privitera, Daniel, Besiroglu, Tamay, Bommasani, Rishi, Casper, Stephen, Choi, Yejin, Goldfarb, Danielle, Heidari, Hoda, Khalatbari, Leila, Longpre, Shayne, Mavroudis, Vasilios, Mazeika, Mantas, Ng, Kwan Yee, Okolo, Chinasa T., Raji, Deborah, Skeadas, Theodora, Tramèr, Florian, Adekanmbi, Bayo, Christiano, Paul, Dalrymple, David, Dietterich, Thomas G., Felten, Edward, Fung, Pascale, Gourinchas, Pierre-Olivier, Jennings, Nick, Krause, Andreas, Liang, Percy, Ludermir, Teresa, Marda, Vidushi, Margetts, Helen, McDermid, John A., Narayanan, Arvind, Nelson, Alondra, Oh, Alice, Ramchurn, Gopal, Russell, Stuart, Schaake, Marietje, Song, Dawn, Soto, Alvaro, Tiedrich, Lee, Varoquaux, Gaël, Yao, Andrew, Zhang, Ya-Qin
This is the interim publication of the first International Scientific Report on the Safety of Advanced AI. The report synthesises the scientific understanding of general-purpose AI -- AI that can perform a wide variety of tasks -- with a focus on understanding …
External link:
http://arxiv.org/abs/2412.05282
Learning from human feedback has gained traction in fields like robotics and natural language processing in recent years. While prior works mostly rely on human feedback in the form of comparisons, language is a preferable modality that provides more …
External link:
http://arxiv.org/abs/2410.06401
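As background for the comparison-based feedback this abstract contrasts with language feedback, a minimal sketch of the standard Bradley-Terry preference loss used to fit a reward model to pairwise comparisons (illustrative numbers, not this paper's method):

```python
import numpy as np

def bradley_terry_loss(r_preferred: np.ndarray, r_rejected: np.ndarray) -> float:
    """Negative log-likelihood that the preferred item beats the rejected one
    under a Bradley-Terry model: P(pref > rej) = sigmoid(r_pref - r_rej)."""
    logits = r_preferred - r_rejected
    return float(np.mean(np.log1p(np.exp(-logits))))  # mean of -log sigmoid(logits)

# Toy reward-model scores for three comparison pairs (hypothetical values).
print(bradley_terry_loss(np.array([1.2, 0.4, 2.0]), np.array([0.3, 0.9, 1.1])))
```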
In reinforcement learning, if the agent's reward differs from the designers' true utility, even only rarely, the state distribution resulting from the agent's policy can be very bad, in theory and in practice. When RL policies would devolve into unde…
External link:
http://arxiv.org/abs/2410.06213
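A tiny illustration of the failure mode named in the first sentence, with hypothetical numbers: a proxy reward that disagrees with true utility only in one rare state, yet the proxy-optimal policy steers the state distribution toward exactly that state.

```python
# Proxy reward vs. designers' true utility, disagreeing only on a rare state.
states = ["normal", "rare"]
true_utility = {"normal": 1.0, "rare": -100.0}   # the rare state is catastrophic
proxy_reward = {"normal": 1.0, "rare": 2.0}      # ...but the proxy prefers it

# Two policies, summarized by the state distributions they induce.
policies = {
    "ignores_rare_state": {"normal": 0.999, "rare": 0.001},
    "seeks_rare_state":   {"normal": 0.100, "rare": 0.900},
}

for name, dist in policies.items():
    proxy = sum(dist[s] * proxy_reward[s] for s in states)
    true = sum(dist[s] * true_utility[s] for s in states)
    print(f"{name:20s} proxy={proxy:7.3f} true={true:8.3f}")
# The policy with the higher proxy return has far lower true utility.
```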
Intrinsic motivation (IM) and reward shaping are common methods for guiding the exploration of reinforcement learning (RL) agents by adding pseudo-rewards. Designing these rewards is challenging, however, and they can counter-intuitively harm performance …
External link:
http://arxiv.org/abs/2409.05358
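For reference, the standard safe form of such pseudo-rewards is potential-based reward shaping (Ng et al., 1999), which leaves the optimal policy unchanged; a minimal sketch with a hypothetical gridworld potential:

```python
def shaped_reward(r: float, s, s_next, potential, gamma: float = 0.99) -> float:
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s).
    This additive pseudo-reward does not change which policies are optimal."""
    return r + gamma * potential(s_next) - potential(s)

# Hypothetical potential: negative Manhattan distance to a goal cell.
goal = (3, 3)
potential = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))

# Moving one step toward the goal earns a small positive pseudo-reward.
print(shaped_reward(0.0, s=(0, 0), s_next=(0, 1), potential=potential))
```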
Author:
Noël, Jean-Christophe
Published in:
Politique étrangère, 2020 Dec 01. 85(4), 202-203.
External link:
https://www.jstor.org/stable/48643822
Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize …
External link:
http://arxiv.org/abs/2406.19501
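One common way to "interpret internal states" is to train a linear probe on hidden activations; a minimal sketch with random placeholder activations standing in for real model states (not this paper's specific method):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: in practice these would be hidden-layer activations from
# the language model, labeled faithful (0) vs. unfaithful (1).
rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 64))          # 200 examples, 64-dim states
labels = (activations[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print("probe accuracy:", probe.score(activations, labels))
```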
Author:
Jenner, Erik, Kapur, Shreyas, Georgiev, Vasil, Allen, Cameron, Emmons, Scott, Russell, Stuart
Do neural networks learn to implement algorithms such as look-ahead or search "in the wild"? Or do they rely purely on collections of simple heuristics? We present evidence of learned look-ahead in the policy network of Leela Chess Zero, the currently …
External link:
http://arxiv.org/abs/2406.00877
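A sketch of the style of causal test used to look for such algorithms: activation patching, shown here on a toy MLP as a stand-in (not Leela Chess Zero's network or this paper's exact procedure). An intermediate activation from a "clean" input is spliced into a run on a "corrupted" input, and the output is compared against the clean output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
layer = model[1]  # intervene on the hidden activation after the ReLU

cache = {}
def save_hook(_module, _inputs, output):
    cache["clean"] = output.detach().clone()

def patch_hook(_module, _inputs, output):
    patched = output.clone()
    patched[:, :8] = cache["clean"][:, :8]  # splice in half of the clean units
    return patched

clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)

handle = layer.register_forward_hook(save_hook)
clean_out = model(clean)          # cache the clean activation
handle.remove()

baseline_out = model(corrupted)   # corrupted run, no intervention

handle = layer.register_forward_hook(patch_hook)
patched_out = model(corrupted)    # corrupted run with the clean splice
handle.remove()

print("corrupted vs clean:", torch.dist(baseline_out, clean_out).item())
print("patched   vs clean:", torch.dist(patched_out, clean_out).item())
```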
Large language models generate code one token at a time. Their autoregressive generation process lacks the feedback of observing the program's output. Training LLMs to suggest edits directly can be challenging due to the scarcity of rich edit data. …
External link:
http://arxiv.org/abs/2405.20519
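To make the "one token at a time" point concrete, a minimal greedy decoding loop with a stub scoring function standing in for a real LLM (hypothetical vocabulary and logits, not this paper's system):

```python
import numpy as np

VOCAB = ["print", "(", "'hi'", ")", "\n", "<eos>"]

def next_token_logits(prefix: list[str]) -> np.ndarray:
    """Stub for an LLM's next-token distribution (hypothetical)."""
    rng = np.random.default_rng(len(prefix))
    return rng.normal(size=len(VOCAB))

def generate(max_tokens: int = 10) -> list[str]:
    tokens: list[str] = []
    for _ in range(max_tokens):
        # One token at a time; the partial program is never executed,
        # which is the missing feedback signal the abstract refers to.
        tok = VOCAB[int(np.argmax(next_token_logits(tokens)))]
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

print("".join(generate()))
```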