Showing 1 - 10 of 113 for search '"Hendrycks, Dan"'
Author:
Li, Qinbin, Hong, Junyuan, Xie, Chulin, Tan, Jeffrey, Xin, Rachel, Hou, Junyi, Yin, Xavier, Wang, Zhun, Hendrycks, Dan, Wang, Zhangyang, Li, Bo, He, Bingsheng, Song, Dawn
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis. Their profound capabilities in processing and interpreting complex language data, however, bring to light…
External link:
http://arxiv.org/abs/2408.12787
Author:
Tamirisa, Rishub, Bharathi, Bhrugu, Phan, Long, Zhou, Andy, Gatti, Alice, Suresh, Tarun, Lin, Maxwell, Wang, Justin, Wang, Rowan, Arel, Ron, Zou, Andy, Song, Dawn, Li, Bo, Hendrycks, Dan, Mazeika, Mantas
Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that…
External link:
http://arxiv.org/abs/2408.00761
Author:
Ren, Richard, Basart, Steven, Khoja, Adam, Gatti, Alice, Phan, Long, Yin, Xuwang, Mazeika, Mantas, Pan, Alexander, Mukobi, Gabriel, Kim, Ryan H., Fitz, Stephen, Hendrycks, Dan
As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion…
External link:
http://arxiv.org/abs/2407.21792
Author:
Zou, Andy, Phan, Long, Wang, Justin, Duenas, Derek, Lin, Maxwell, Andriushchenko, Maksym, Wang, Rowan, Kolter, Zico, Fredrikson, Matt, Hendrycks, Dan
AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that interrupts the models as they respond with harmful outputs with "circuit breakers"…
External link:
http://arxiv.org/abs/2406.04313
Author:
Hong, Junyuan, Duan, Jinhao, Zhang, Chenhui, Li, Zhangheng, Xie, Chulin, Lieberman, Kelsey, Diffenderfer, James, Bartoldson, Brian, Jaiswal, Ajay, Xu, Kaidi, Kailkhura, Bhavya, Hendrycks, Dan, Song, Dawn, Wang, Zhangyang, Li, Bo
Compressing high-capability Large Language Models (LLMs) has emerged as a favored strategy for resource-efficient inferences. While state-of-the-art (SoTA) compression methods boast impressive advancements in preserving benign task performance, the potential…
External link:
http://arxiv.org/abs/2403.15447
Author:
Li, Nathaniel, Pan, Alexander, Gopal, Anjali, Yue, Summer, Berrios, Daniel, Gatti, Alice, Li, Justin D., Dombrowski, Ann-Kathrin, Goel, Shashwat, Phan, Long, Mukobi, Gabriel, Helm-Burger, Nathan, Lababidi, Rassin, Justen, Lennart, Liu, Andrew B., Chen, Michael, Barrass, Isabelle, Zhang, Oliver, Zhu, Xiaoyuan, Tamirisa, Rishub, Bharathi, Bhrugu, Khoja, Adam, Zhao, Zhenqi, Herbert-Voss, Ariel, Breuer, Cort B., Marks, Samuel, Patel, Oam, Zou, Andy, Mazeika, Mantas, Wang, Zifan, Oswal, Palash, Lin, Weiran, Hunt, Adam A., Tienken-Harder, Justin, Shih, Kevin Y., Talley, Kemper, Guan, John, Kaplan, Russell, Steneker, Ian, Campbell, David, Jokubaitis, Brad, Levinson, Alex, Wang, Jean, Qian, William, Karmakar, Kallol Krishna, Basart, Steven, Fitz, Stephen, Levine, Mindy, Kumaraguru, Ponnurangam, Tupakula, Uday, Varadharajan, Vijay, Wang, Ruoyu, Shoshitaishvili, Yan, Ba, Jimmy, Esvelt, Kevin M., Wang, Alexandr, Hendrycks, Dan
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions…
External link:
http://arxiv.org/abs/2403.03218
Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations. Our initial finding reveals that, without any prompt engineering…
External link:
http://arxiv.org/abs/2402.11777
Author:
Mazeika, Mantas, Phan, Long, Yin, Xuwang, Zou, Andy, Wang, Zifan, Mu, Norman, Sakhaee, Elham, Li, Nathaniel, Basart, Steven, Li, Bo, Forsyth, David, Hendrycks, Dan
Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address…
External link:
http://arxiv.org/abs/2402.04249
Author:
Mu, Norman, Chen, Sarah, Wang, Zifan, Chen, Sizhe, Karamardian, David, Aljeraisy, Lulwa, Alomair, Basel, Hendrycks, Dan, Wagner, David
As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the…
External link:
http://arxiv.org/abs/2311.04235
Author:
Zou, Andy, Phan, Long, Chen, Sarah, Campbell, James, Guo, Phillip, Ren, Richard, Pan, Alexander, Yin, Xuwang, Mazeika, Mantas, Dombrowski, Ann-Kathrin, Goel, Shashwat, Li, Nathaniel, Byun, Michael J., Wang, Zifan, Mallen, Alex, Basart, Steven, Koyejo, Sanmi, Song, Dawn, Fredrikson, Matt, Kolter, J. Zico, Hendrycks, Dan
In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations…
External link:
http://arxiv.org/abs/2310.01405