Showing 1 - 10 of 10 results for search: '"Pfau, Jacob"'
Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given most model queries are u…
External link:
http://arxiv.org/abs/2406.15518
Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that a…
External link:
http://arxiv.org/abs/2404.15758
Large language models (LLMs) that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency, e.g., question-answering, explanations, etc. Our work presents an evaluation benchmark for self-con…
External link:
http://arxiv.org/abs/2310.13439
Authors:
Casper, Stephen, Davies, Xander, Shi, Claudia, Gilbert, Thomas Krendl, Scheurer, Jérémy, Rando, Javier, Freedman, Rachel, Korbak, Tomasz, Lindner, David, Freire, Pedro, Wang, Tony, Marks, Samuel, Segerie, Charbel-Raphaël, Carroll, Micah, Peng, Andi, Christoffersen, Phillip, Damani, Mehul, Slocum, Stewart, Anwar, Usman, Siththaranjan, Anand, Nadeau, Max, Michaud, Eric J., Pfau, Jacob, Krasheninnikov, Dmitrii, Chen, Xin, Langosco, Lauro, Hase, Peter, Bıyık, Erdem, Dragan, Anca, Krueger, David, Sadigh, Dorsa, Hadfield-Menell, Dylan
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there…
External link:
http://arxiv.org/abs/2307.15217
We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For…
External link:
http://arxiv.org/abs/2105.14111
Interpretability methods for image classification assess model trustworthiness by attempting to expose whether the model is systematically biased or attending to the same cues as a human would. Saliency methods for feature attribution dominate the in…
External link:
http://arxiv.org/abs/2104.02768
In high-stakes applications of machine learning models, interpretability methods provide guarantees that models are right for the right reasons. In medical imaging, saliency maps have become the standard tool for determining whether a neural model ha…
External link:
http://arxiv.org/abs/1910.07604
Published in:
In Journal of Investigative Dermatology August 2020 140(8):1504-1512
Academic article
Published in:
Current Dermatology Reports; Sep2019, Vol. 8 Issue 3, p85-90, 6p