Showing 1 - 10 of 113 for search '"Hendrycks, Dan"'
Author:
Li, Qinbin, Hong, Junyuan, Xie, Chulin, Tan, Jeffrey, Xin, Rachel, Hou, Junyi, Yin, Xavier, Wang, Zhun, Hendrycks, Dan, Wang, Zhangyang, Li, Bo, He, Bingsheng, Song, Dawn
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis. Their profound capabilities in processing and interpreting complex language data, however, bring to light…
External link:
http://arxiv.org/abs/2408.12787
Author:
Tamirisa, Rishub, Bharathi, Bhrugu, Phan, Long, Zhou, Andy, Gatti, Alice, Suresh, Tarun, Lin, Maxwell, Wang, Justin, Wang, Rowan, Arel, Ron, Zou, Andy, Song, Dawn, Li, Bo, Hendrycks, Dan, Mazeika, Mantas
Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that…
External link:
http://arxiv.org/abs/2408.00761
Author:
Ren, Richard, Basart, Steven, Khoja, Adam, Gatti, Alice, Phan, Long, Yin, Xuwang, Mazeika, Mantas, Pan, Alexander, Mukobi, Gabriel, Kim, Ryan H., Fitz, Stephen, Hendrycks, Dan
As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion…
External link:
http://arxiv.org/abs/2407.21792
Author:
Zou, Andy, Phan, Long, Wang, Justin, Duenas, Derek, Lin, Maxwell, Andriushchenko, Maksym, Wang, Rowan, Kolter, Zico, Fredrikson, Matt, Hendrycks, Dan
AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that interrupts the models as they respond with harmful outputs with "circuit breakers"…
External link:
http://arxiv.org/abs/2406.04313
Author:
Hong, Junyuan, Duan, Jinhao, Zhang, Chenhui, Li, Zhangheng, Xie, Chulin, Lieberman, Kelsey, Diffenderfer, James, Bartoldson, Brian, Jaiswal, Ajay, Xu, Kaidi, Kailkhura, Bhavya, Hendrycks, Dan, Song, Dawn, Wang, Zhangyang, Li, Bo
Compressing high-capability Large Language Models (LLMs) has emerged as a favored strategy for resource-efficient inferences. While state-of-the-art (SoTA) compression methods boast impressive advancements in preserving benign task performance, the potential…
External link:
http://arxiv.org/abs/2403.15447
Author:
Li, Nathaniel, Pan, Alexander, Gopal, Anjali, Yue, Summer, Berrios, Daniel, Gatti, Alice, Li, Justin D., Dombrowski, Ann-Kathrin, Goel, Shashwat, Phan, Long, Mukobi, Gabriel, Helm-Burger, Nathan, Lababidi, Rassin, Justen, Lennart, Liu, Andrew B., Chen, Michael, Barrass, Isabelle, Zhang, Oliver, Zhu, Xiaoyuan, Tamirisa, Rishub, Bharathi, Bhrugu, Khoja, Adam, Zhao, Zhenqi, Herbert-Voss, Ariel, Breuer, Cort B., Marks, Samuel, Patel, Oam, Zou, Andy, Mazeika, Mantas, Wang, Zifan, Oswal, Palash, Lin, Weiran, Hunt, Adam A., Tienken-Harder, Justin, Shih, Kevin Y., Talley, Kemper, Guan, John, Kaplan, Russell, Steneker, Ian, Campbell, David, Jokubaitis, Brad, Levinson, Alex, Wang, Jean, Qian, William, Karmakar, Kallol Krishna, Basart, Steven, Fitz, Stephen, Levine, Mindy, Kumaraguru, Ponnurangam, Tupakula, Uday, Varadharajan, Vijay, Wang, Ruoyu, Shoshitaishvili, Yan, Ba, Jimmy, Esvelt, Kevin M., Wang, Alexandr, Hendrycks, Dan
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions…
External link:
http://arxiv.org/abs/2403.03218
Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations. Our initial finding reveals that, without any prompt engineering…
External link:
http://arxiv.org/abs/2402.11777
Author:
Mazeika, Mantas, Phan, Long, Yin, Xuwang, Zou, Andy, Wang, Zifan, Mu, Norman, Sakhaee, Elham, Li, Nathaniel, Basart, Steven, Li, Bo, Forsyth, David, Hendrycks, Dan
Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address…
External link:
http://arxiv.org/abs/2402.04249
Author:
Mu, Norman, Chen, Sarah, Wang, Zifan, Chen, Sizhe, Karamardian, David, Aljeraisy, Lulwa, Alomair, Basel, Hendrycks, Dan, Wagner, David
As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the…
External link:
http://arxiv.org/abs/2311.04235
Author:
Zou, Andy, Phan, Long, Chen, Sarah, Campbell, James, Guo, Phillip, Ren, Richard, Pan, Alexander, Yin, Xuwang, Mazeika, Mantas, Dombrowski, Ann-Kathrin, Goel, Shashwat, Li, Nathaniel, Byun, Michael J., Wang, Zifan, Mallen, Alex, Basart, Steven, Koyejo, Sanmi, Song, Dawn, Fredrikson, Matt, Kolter, J. Zico, Hendrycks, Dan
In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations…
External link:
http://arxiv.org/abs/2310.01405