Showing 1 - 10 of 153 for search: '"Wang, Tony T"'
Author:
Wang, Tony T., Hughes, John, Sleight, Henry, Schaeffer, Rylan, Agrawal, Rajashree, Barez, Fazl, Sharma, Mrinank, Mu, Jesse, Shavit, Nir, Perez, Ethan
Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-
External link:
http://arxiv.org/abs/2412.02159
Author:
Halawi, Danny, Wei, Alexander, Wallace, Eric, Wang, Tony T., Haghtalab, Nika, Steinhardt, Jacob
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we
External link:
http://arxiv.org/abs/2406.20053
Prior work found that superhuman Go AIs can be defeated by simple adversarial strategies, especially "cyclic" attacks. In this paper, we study whether adding natural countermeasures can achieve robustness in Go, a favorable domain for robustness sinc
External link:
http://arxiv.org/abs/2406.12843
LLMs often face competing pressures (for example helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factu
External link:
http://arxiv.org/abs/2312.08793
We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improve
External link:
http://arxiv.org/abs/2302.07348
Author:
Wang, Tony T., Gleave, Adam, Tseng, Tom, Pelrine, Kellin, Belrose, Nora, Miller, Joseph, Dennis, Michael D., Duan, Yawen, Pogrebniak, Viktor, Levine, Sergey, Russell, Stuart
We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo
External link:
http://arxiv.org/abs/2211.00241
Author:
Wang, Tony T.
In this thesis we explore adversarial examples for simple model families and simple data distributions, focusing in particular on linear and kernel classifiers. On the theoretical front we find evidence that natural accuracy and robust accuracy are m
External link:
https://hdl.handle.net/1721.1/139041
Academic article
Published in:
Cell Reports, 13 September 2022, 40(11)