Search results - "Mażeika, P."

Report

Tamper-Resistant Safeguards for Open-Weight LLMs

Authors: Tamirisa, Rishub, Bharathi, Bhrugu, Phan, Long, Zhou, Andy, Gatti, Alice, Suresh, Tarun, Lin, Maxwell, Wang, Justin, Wang, Rowan, Arel, Ron, Zou, Andy, Song, Dawn, Li, Bo, Hendrycks, Dan, Mazeika, Mantas

Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks th

External link: http://arxiv.org/abs/2408.00761

Report

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Authors: Ren, Richard, Basart, Steven, Khoja, Adam, Gatti, Alice, Phan, Long, Yin, Xuwang, Mazeika, Mantas, Pan, Alexander, Mukobi, Gabriel, Kim, Ryan H., Fitz, Stephen, Hendrycks, Dan

As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to con

External link: http://arxiv.org/abs/2407.21792

Report

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Authors: Li, Nathaniel, Pan, Alexander, Gopal, Anjali, Yue, Summer, Berrios, Daniel, Gatti, Alice, Li, Justin D., Dombrowski, Ann-Kathrin, Goel, Shashwat, Phan, Long, Mukobi, Gabriel, Helm-Burger, Nathan, Lababidi, Rassin, Justen, Lennart, Liu, Andrew B., Chen, Michael, Barrass, Isabelle, Zhang, Oliver, Zhu, Xiaoyuan, Tamirisa, Rishub, Bharathi, Bhrugu, Khoja, Adam, Zhao, Zhenqi, Herbert-Voss, Ariel, Breuer, Cort B., Marks, Samuel, Patel, Oam, Zou, Andy, Mazeika, Mantas, Wang, Zifan, Oswal, Palash, Lin, Weiran, Hunt, Adam A., Tienken-Harder, Justin, Shih, Kevin Y., Talley, Kemper, Guan, John, Kaplan, Russell, Steneker, Ian, Campbell, David, Jokubaitis, Brad, Levinson, Alex, Wang, Jean, Qian, William, Karmakar, Kallol Krishna, Basart, Steven, Fitz, Stephen, Levine, Mindy, Kumaraguru, Ponnurangam, Tupakula, Uday, Varadharajan, Vijay, Wang, Ruoyu, Shoshitaishvili, Yan, Ba, Jimmy, Esvelt, Kevin M., Wang, Alexandr, Hendrycks, Dan

The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government ins

External link: http://arxiv.org/abs/2403.03218

Report

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Authors: Mazeika, Mantas, Phan, Long, Yin, Xuwang, Zou, Andy, Wang, Zifan, Mu, Norman, Sakhaee, Elham, Li, Nathaniel, Basart, Steven, Li, Bo, Forsyth, David, Hendrycks, Dan

Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To ad

External link: http://arxiv.org/abs/2402.04249

Report

Representation Engineering: A Top-Down Approach to AI Transparency

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representatio

External link: http://arxiv.org/abs/2310.01405

Report

An Overview of Catastrophic AI Risks

Authors: Hendrycks, Dan, Mazeika, Mantas, Woodside, Thomas

Rapid advancements in artificial intelligence (AI) have sparked growing concerns among experts, policymakers, and world leaders regarding the potential for increasingly advanced AI systems to pose catastrophic risks. Although numerous risks have been

External link: http://arxiv.org/abs/2306.12001

Report

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Authors: Wang, Boxin, Chen, Weixin, Pei, Hengzhi, Xie, Chulin, Kang, Mintong, Zhang, Chenhui, Xu, Chejian, Xiong, Zidi, Dutta, Ritik, Schaeffer, Rylan, Truong, Sang T., Arora, Simran, Mazeika, Mantas, Hendrycks, Dan, Lin, Zinan, Cheng, Yu, Koyejo, Sanmi, Song, Dawn, Li, Bo

Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, prac

External link: http://arxiv.org/abs/2306.11698

Report

The antiferromagnetic phase transition in the layered Cu$_{0.15}$Fe$_{0.85}$PS$_3$ semiconductor: experiment and DFT modelling

Authors: Pashchenko, V., Bludov, O., Baltrunas, D., Mazeika, K., Motria, S., Glukhov, K., Vysochanskii, Yu.

Published in: Condensed Matter Physics, 2022, vol. 25, No. 4, 43701

The experimental studies of the paramagnetic-antiferromagnetic phase transition through Mössbauer spectroscopy and measurements of temperature and field dependencies of magnetic susceptibility in the layered Cu$_{0.15}$Fe$_{0.85}$PS$_3$ crystal a

External link: http://arxiv.org/abs/2301.01338

Report

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

Authors: Mazeika, Mantas, Tang, Eric, Zou, Andy, Basart, Steven, Chan, Jun Shern, Song, Dawn, Forsyth, David, Steinhardt, Jacob, Hendrycks, Dan

In recent years, deep neural networks have demonstrated increasingly strong abilities to recognize objects and activities in videos. However, as video understanding becomes widely used in real-world applications, a key consideration is developing hum

External link: http://arxiv.org/abs/2210.10039
