Showing 1 - 10 of 981 results for search: '"A. McCandlish"'
Biological sequences do not come at random. Instead, they appear with particular frequencies that reflect properties of the associated system or phenomenon. Knowing how biological sequences are distributed in sequence space is thus a natural first step…
External link:
http://arxiv.org/abs/2404.11228
Author:
Livesey, Benjamin J., Badonyi, Mihaly, Dias, Mafalda, Frazer, Jonathan, Kumar, Sushant, Lindorff-Larsen, Kresten, McCandlish, David M., Orenbuch, Rose, Shearer, Courtney A., Muffley, Lara, Foreman, Julia, Glazer, Andrew M., Lehner, Ben, Marks, Debora S., Roth, Frederick P., Rubin, Alan F., Starita, Lea M., Marsh, Joseph A.
Computational methods for assessing the likely impacts of mutations, known as variant effect predictors (VEPs), are widely used in the assessment and interpretation of human genetic variation, as well as in other applications like protein engineering…
External link:
http://arxiv.org/abs/2404.10807
Author:
Kundu, Sandipan, Bai, Yuntao, Kadavath, Saurav, Askell, Amanda, Callahan, Andrew, Chen, Anna, Goldie, Anna, Balwit, Avital, Mirhoseini, Azalia, McLean, Brayden, Olsson, Catherine, Evraets, Cassie, Tran-Johnson, Eli, Durmus, Esin, Perez, Ethan, Kernion, Jackson, Kerr, Jamie, Ndousse, Kamal, Nguyen, Karina, Elhage, Nelson, Cheng, Newton, Schiefer, Nicholas, DasSarma, Nova, Rausch, Oliver, Larson, Robin, Yang, Shannon, Kravec, Shauna, Telleen-Lawton, Timothy, Liao, Thomas I., Henighan, Tom, Hume, Tristan, Hatfield-Dodds, Zac, Mindermann, Sören, Joseph, Nicholas, McCandlish, Sam, Kaplan, Jared
Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing…
External link:
http://arxiv.org/abs/2310.13798
Author:
Sharma, Mrinank, Tong, Meg, Korbak, Tomasz, Duvenaud, David, Askell, Amanda, Bowman, Samuel R., Cheng, Newton, Durmus, Esin, Hatfield-Dodds, Zac, Johnston, Scott R., Kravec, Shauna, Maxwell, Timothy, McCandlish, Sam, Ndousse, Kamal, Rausch, Oliver, Schiefer, Nicholas, Yan, Da, Zhang, Miranda, Perez, Ethan
Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose…
External link:
http://arxiv.org/abs/2310.13548
Author:
Grosse, Roger, Bae, Juhan, Anil, Cem, Elhage, Nelson, Tamkin, Alex, Tajdini, Amirhossein, Steiner, Benoit, Li, Dustin, Durmus, Esin, Perez, Ethan, Hubinger, Evan, Lukošiūtė, Kamilė, Nguyen, Karina, Joseph, Nicholas, McCandlish, Sam, Kaplan, Jared, Bowman, Samuel R.
When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions…
External link:
http://arxiv.org/abs/2308.03296
Author:
Radhakrishnan, Ansh, Nguyen, Karina, Chen, Anna, Chen, Carol, Denison, Carson, Hernandez, Danny, Durmus, Esin, Hubinger, Evan, Kernion, Jackson, Lukošiūtė, Kamilė, Cheng, Newton, Joseph, Nicholas, Schiefer, Nicholas, Rausch, Oliver, McCandlish, Sam, Showk, Sheer El, Lanham, Tamera, Maxwell, Tim, Chandrasekaran, Venkatesa, Hatfield-Dodds, Zac, Kaplan, Jared, Brauner, Jan, Bowman, Samuel R., Perez, Ethan
As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate…
External link:
http://arxiv.org/abs/2307.11768
Author:
Lanham, Tamera, Chen, Anna, Radhakrishnan, Ansh, Steiner, Benoit, Denison, Carson, Hernandez, Danny, Li, Dustin, Durmus, Esin, Hubinger, Evan, Kernion, Jackson, Lukošiūtė, Kamilė, Nguyen, Karina, Cheng, Newton, Joseph, Nicholas, Schiefer, Nicholas, Rausch, Oliver, Larson, Robin, McCandlish, Sam, Kundu, Sandipan, Kadavath, Saurav, Yang, Shannon, Henighan, Thomas, Maxwell, Timothy, Telleen-Lawton, Timothy, Hume, Tristan, Hatfield-Dodds, Zac, Kaplan, Jared, Brauner, Jan, Bowman, Samuel R., Perez, Ethan
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its…
External link:
http://arxiv.org/abs/2307.13702
Author:
Durmus, Esin, Nguyen, Karina, Liao, Thomas I., Schiefer, Nicholas, Askell, Amanda, Bakhtin, Anton, Chen, Carol, Hatfield-Dodds, Zac, Hernandez, Danny, Joseph, Nicholas, Lovitt, Liane, McCandlish, Sam, Sikder, Orowa, Tamkin, Alex, Thamkul, Janel, Kaplan, Jared, Clark, Jack, Ganguli, Deep
Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset…
External link:
http://arxiv.org/abs/2306.16388
Published in:
Nature Communications, Vol 15, Iss 1, Pp 1-14 (2024)
Abstract Cells possess multiple mitochondrial DNA (mtDNA) copies, which undergo semi-autonomous replication and stochastic inheritance. This enables mutant mtDNA variants to arise and selfishly compete with cooperative (wildtype) mtDNA. Selfish mitochondrial…
External link:
https://doaj.org/article/fa80ab2a5a5d4fbd893919cdafd231ec
Author:
Ganguli, Deep, Askell, Amanda, Schiefer, Nicholas, Liao, Thomas I., Lukošiūtė, Kamilė, Chen, Anna, Goldie, Anna, Mirhoseini, Azalia, Olsson, Catherine, Hernandez, Danny, Drain, Dawn, Li, Dustin, Tran-Johnson, Eli, Perez, Ethan, Kernion, Jackson, Kerr, Jamie, Mueller, Jared, Landau, Joshua, Ndousse, Kamal, Nguyen, Karina, Lovitt, Liane, Sellitto, Michael, Elhage, Nelson, Mercado, Noemi, DasSarma, Nova, Rausch, Oliver, Lasenby, Robert, Larson, Robin, Ringer, Sam, Kundu, Sandipan, Kadavath, Saurav, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Lanham, Tamera, Telleen-Lawton, Timothy, Henighan, Tom, Hume, Tristan, Bai, Yuntao, Hatfield-Dodds, Zac, Mann, Ben, Amodei, Dario, Joseph, Nicholas, McCandlish, Sam, Brown, Tom, Olah, Christopher, Clark, Jack, Bowman, Samuel R., Kaplan, Jared
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support…
External link:
http://arxiv.org/abs/2302.07459