Zobrazeno 1 - 10
of 166
pro vyhledávání: '"Johnson, Eli"'
Autor:
Kundu, Sandipan, Bai, Yuntao, Kadavath, Saurav, Askell, Amanda, Callahan, Andrew, Chen, Anna, Goldie, Anna, Balwit, Avital, Mirhoseini, Azalia, McLean, Brayden, Olsson, Catherine, Evraets, Cassie, Tran-Johnson, Eli, Durmus, Esin, Perez, Ethan, Kernion, Jackson, Kerr, Jamie, Ndousse, Kamal, Nguyen, Karina, Elhage, Nelson, Cheng, Newton, Schiefer, Nicholas, DasSarma, Nova, Rausch, Oliver, Larson, Robin, Yang, Shannon, Kravec, Shauna, Telleen-Lawton, Timothy, Liao, Thomas I., Henighan, Tom, Hume, Tristan, Hatfield-Dodds, Zac, Mindermann, Sören, Joseph, Nicholas, McCandlish, Sam, Kaplan, Jared
Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing
Externí odkaz:
http://arxiv.org/abs/2310.13798
Autor:
Ganguli, Deep, Askell, Amanda, Schiefer, Nicholas, Liao, Thomas I., Lukošiūtė, Kamilė, Chen, Anna, Goldie, Anna, Mirhoseini, Azalia, Olsson, Catherine, Hernandez, Danny, Drain, Dawn, Li, Dustin, Tran-Johnson, Eli, Perez, Ethan, Kernion, Jackson, Kerr, Jamie, Mueller, Jared, Landau, Joshua, Ndousse, Kamal, Nguyen, Karina, Lovitt, Liane, Sellitto, Michael, Elhage, Nelson, Mercado, Noemi, DasSarma, Nova, Rausch, Oliver, Lasenby, Robert, Larson, Robin, Ringer, Sam, Kundu, Sandipan, Kadavath, Saurav, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Lanham, Tamera, Telleen-Lawton, Timothy, Henighan, Tom, Hume, Tristan, Bai, Yuntao, Hatfield-Dodds, Zac, Mann, Ben, Amodei, Dario, Joseph, Nicholas, McCandlish, Sam, Brown, Tom, Olah, Christopher, Clark, Jack, Bowman, Samuel R., Kaplan, Jared
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in suppo
Externí odkaz:
http://arxiv.org/abs/2302.07459
Autor:
Perez, Ethan, Ringer, Sam, Lukošiūtė, Kamilė, Nguyen, Karina, Chen, Edwin, Heiner, Scott, Pettit, Craig, Olsson, Catherine, Kundu, Sandipan, Kadavath, Saurav, Jones, Andy, Chen, Anna, Mann, Ben, Israel, Brian, Seethor, Bryan, McKinnon, Cameron, Olah, Christopher, Yan, Da, Amodei, Daniela, Amodei, Dario, Drain, Dawn, Li, Dustin, Tran-Johnson, Eli, Khundadze, Guro, Kernion, Jackson, Landis, James, Kerr, Jamie, Mueller, Jared, Hyun, Jeeyoon, Landau, Joshua, Ndousse, Kamal, Goldberg, Landon, Lovitt, Liane, Lucas, Martin, Sellitto, Michael, Zhang, Miranda, Kingsland, Neerav, Elhage, Nelson, Joseph, Nicholas, Mercado, Noemí, DasSarma, Nova, Rausch, Oliver, Larson, Robin, McCandlish, Sam, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Lanham, Tamera, Telleen-Lawton, Timothy, Brown, Tom, Henighan, Tom, Hume, Tristan, Bai, Yuntao, Hatfield-Dodds, Zac, Clark, Jack, Bowman, Samuel R., Askell, Amanda, Grosse, Roger, Hernandez, Danny, Ganguli, Deep, Hubinger, Evan, Schiefer, Nicholas, Kaplan, Jared
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which
Externí odkaz:
http://arxiv.org/abs/2212.09251
Autor:
Bai, Yuntao, Kadavath, Saurav, Kundu, Sandipan, Askell, Amanda, Kernion, Jackson, Jones, Andy, Chen, Anna, Goldie, Anna, Mirhoseini, Azalia, McKinnon, Cameron, Chen, Carol, Olsson, Catherine, Olah, Christopher, Hernandez, Danny, Drain, Dawn, Ganguli, Deep, Li, Dustin, Tran-Johnson, Eli, Perez, Ethan, Kerr, Jamie, Mueller, Jared, Ladish, Jeffrey, Landau, Joshua, Ndousse, Kamal, Lukosuite, Kamile, Lovitt, Liane, Sellitto, Michael, Elhage, Nelson, Schiefer, Nicholas, Mercado, Noemi, DasSarma, Nova, Lasenby, Robert, Larson, Robin, Ringer, Sam, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Fort, Stanislav, Lanham, Tamera, Telleen-Lawton, Timothy, Conerly, Tom, Henighan, Tom, Hume, Tristan, Bowman, Samuel R., Hatfield-Dodds, Zac, Mann, Ben, Amodei, Dario, Joseph, Nicholas, McCandlish, Sam, Brown, Tom, Kaplan, Jared
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only hum
Externí odkaz:
http://arxiv.org/abs/2212.08073
Autor:
Bowman, Samuel R., Hyun, Jeeyoon, Perez, Ethan, Chen, Edwin, Pettit, Craig, Heiner, Scott, Lukošiūtė, Kamilė, Askell, Amanda, Jones, Andy, Chen, Anna, Goldie, Anna, Mirhoseini, Azalia, McKinnon, Cameron, Olah, Christopher, Amodei, Daniela, Amodei, Dario, Drain, Dawn, Li, Dustin, Tran-Johnson, Eli, Kernion, Jackson, Kerr, Jamie, Mueller, Jared, Ladish, Jeffrey, Landau, Joshua, Ndousse, Kamal, Lovitt, Liane, Elhage, Nelson, Schiefer, Nicholas, Joseph, Nicholas, Mercado, Noemí, DasSarma, Nova, Larson, Robin, McCandlish, Sam, Kundu, Sandipan, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Fort, Stanislav, Telleen-Lawton, Timothy, Brown, Tom, Henighan, Tom, Hume, Tristan, Bai, Yuntao, Hatfield-Dodds, Zac, Mann, Ben, Kaplan, Jared
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this proble
Externí odkaz:
http://arxiv.org/abs/2211.03540
Autor:
Ganguli, Deep, Lovitt, Liane, Kernion, Jackson, Askell, Amanda, Bai, Yuntao, Kadavath, Saurav, Mann, Ben, Perez, Ethan, Schiefer, Nicholas, Ndousse, Kamal, Jones, Andy, Bowman, Sam, Chen, Anna, Conerly, Tom, DasSarma, Nova, Drain, Dawn, Elhage, Nelson, El-Showk, Sheer, Fort, Stanislav, Hatfield-Dodds, Zac, Henighan, Tom, Hernandez, Danny, Hume, Tristan, Jacobson, Josh, Johnston, Scott, Kravec, Shauna, Olsson, Catherine, Ringer, Sam, Tran-Johnson, Eli, Amodei, Dario, Brown, Tom, Joseph, Nicholas, McCandlish, Sam, Olah, Chris, Kaplan, Jared, Clark, Jack
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming
Externí odkaz:
http://arxiv.org/abs/2209.07858